AWS Glue custom classifiers enable you to catalog data the way you want when the AWS Glue built-in classifiers cannot. Cataloging data correctly is important, and the classifier plays a key role in identifying the structure of the underlying data.
If the built-in classifiers do not catalog the data as you need, there are a few options to consider (though you are not limited to these). The first option is to prepare the source data in a format that the AWS Glue built-in classifiers support. The second is to catalog the data manually, if that is feasible. The third is to create a custom classifier that parses the structure of the data the way we want. Sometimes even the custom classifier fails; in that case, consider changing the format of the data via some transformations.
Let’s consider S3 as the data source (it could be any supported source). The typical way to catalog the data is to create a crawler, which scans the underlying data, determines the potential columns and their data types, and then populates the table definition as a part of the Data Catalog.
So how does the crawler determine the schema? The answer is: it uses classifiers. The role of a classifier is to read the data, identify its structure, and help catalog the data correctly.
A short definition: a classifier is a configuration, either built-in or custom, that a crawler uses to read the source data and infer the structure or schema of the underlying data.
While creating the crawler, you can attach one or more custom classifiers to infer the schema; for built-in classifiers, no additional configuration is required. Assuming the crawler has one or more custom classifiers attached, this is how it will choose a classifier:
- The crawler will start matching with the custom classifiers first and fall back to other classifiers if required
- The crawler decides whether a classifier is a good fit based on its certainty score; each classifier’s match returns one. If any classifier returns a certainty score of 1.0, the crawler concludes that this classifier can create the correct table definition and stops matching with other classifiers
- If none of the custom classifiers returns a certainty score of 1.0, the crawler initiates the match using the AWS Glue built-in classifiers. A built-in classifier returns either 1.0 (if there is a match) or 0.0 (if no match is found)
- If no classifier (custom or built-in) returns a certainty score of 1.0, the crawler picks the classifier with the highest certainty score
- If no classifier returned a certainty score greater than 0.0, AWS Glue returns the default classification string of UNKNOWN
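The selection rules above can be sketched in Python. This is only an illustrative model of the documented behavior; the function and names are hypothetical, not an AWS API:

```python
# Illustrative sketch of how a crawler picks a classifier by certainty score.
def pick_classifier(custom_results, builtin_results):
    """custom_results / builtin_results: lists of (name, certainty) pairs.

    Returns the chosen classifier name, or "UNKNOWN" if nothing matched.
    """
    best_name, best_score = "UNKNOWN", 0.0
    # Custom classifiers are tried first; a score of 1.0 short-circuits.
    for name, score in custom_results:
        if score == 1.0:
            return name
        if score > best_score:
            best_name, best_score = name, score
    # Built-in classifiers return either 1.0 (match) or 0.0 (no match).
    for name, score in builtin_results:
        if score == 1.0:
            return name
        if score > best_score:
            best_name, best_score = name, score
    # No classifier scored above 0.0 -> default classification UNKNOWN.
    return best_name
```

For example, `pick_classifier([("my-grok", 0.5)], [("csv", 0.0)])` would fall through to the highest non-zero score and return `"my-grok"`.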
There are two types of classifiers: AWS Glue built-in classifiers and custom classifiers. Here is the list of built-in and custom classifiers that AWS Glue supports as of today. For built-in classifiers, there is no additional configuration required to parse the structure of the data, because AWS Glue internally figures out which built-in classifier to use based on the certainty score.
For custom classifiers, there are four types available: grok, JSON, XML, and CSV.
Classifiers can also classify files in the ZIP, BZIP, GZIP, LZ4, and Snappy compression formats.
When the AWS Glue built-in classifiers are unable to create the expected or required table definition, consider creating and using a custom classifier.
Grok is a tool for parsing textual data using grok patterns. A grok pattern is a named set of regular expressions. For example, you might define a regular expression that matches email addresses and give that pattern a name such as EMAILPARSER; the pair (EMAILPARSER regular-expression) is a named pattern.
Syntax of a grok pattern: %{PATTERNNAME:field-name:data-type}
- PATTERNNAME is the name of the pattern that will match the text. (PATTERNNAME could be EMAILPARSER from the previous example)
- field-name is the alias name and can be anything
- data-type is optional. By default, the data type will be a string if not mentioned. The supported data types are byte, boolean, double, short, int, long, string, and float
For example, if you want to match the month number from a piece of text:
- Define a regular expression to match it, then give that pattern a name
- Named pattern:
- In this case, the name we have given the pattern is MONTHNUM, followed by its regular expression
- To define the grok pattern using the syntax above, reference the defined pattern name followed by the alias name: the pattern name is MONTHNUM, the alias can be anything, and finally it is cast to the int data type
- Grok pattern:
AWS Glue provides built-in patterns, and custom patterns can be defined when required. For example, if AWS Glue has MONTHNUM defined as a built-in pattern, we can use that name directly when defining a grok pattern. But if you want to match or parse something that is not covered by AWS Glue’s built-in patterns, you have to define a custom pattern: a name such as MONTHNUM followed by a regular expression. Hence, a grok pattern can be built from both AWS Glue’s built-in patterns and custom patterns.
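The idea of naming a regular expression and reusing it can be demonstrated with plain Python regular expressions. The MONTHNUM regex below mirrors the standard grok definition of a month number; the `expand` helper is a simplified, illustrative stand-in for what the grok engine does (it ignores the optional data-type block):

```python
import re

# A "named pattern" is just a name bound to a regular expression,
# e.g. MONTHNUM for a month number (01-12, leading zero optional).
PATTERNS = {"MONTHNUM": r"(?:0?[1-9]|1[0-2])"}

def expand(grok_pattern: str) -> str:
    """Expand %{NAME:alias} references into named regex groups (simplified)."""
    return re.sub(
        r"%\{(\w+):(\w+)\}",
        lambda m: f"(?P<{m.group(2)}>{PATTERNS[m.group(1)]})",
        grok_pattern,
    )

regex = expand(r"month=%{MONTHNUM:month}")
match = re.search(regex, "month=09 day=14")
print(match.group("month"))  # -> 09
```

The alias (`month` here) becomes the column name in the resulting table definition, which is why it can be anything you like.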
Grok custom classifier example
Consider a log file (as shown in the Log data image). Each line contains a UUID, a Windows MAC address, and an Employee Id. The requirement is to catalog this structure and create the table definition via a grok custom classifier.
Below is a snapshot showing a couple of the AWS Glue built-in patterns that we will use; the full list of AWS Glue built-in patterns can be referred to here. The keyword at the start is the pattern name (highlighted in the blue box), followed by the regular expression (highlighted in the purple box).
Starting with the UUID: the pattern to parse a UUID is already defined as an AWS Glue built-in pattern (highlighted in the red box), and the same is true for the Windows MAC address. But for the Employee Id, which is a 7-digit number, there is no built-in pattern, so a custom pattern is required. In simple words, whenever we have to define both the pattern name and the regular expression to match, it is a custom pattern.
Putting all the patterns together gives the grok pattern shown as the Final grok pattern in the above table.
As part of the hands-on, we will parse a log file that looks like the one below. I have modified this log file to include random dummy email addresses.
Breaking down the structure of a log line: it starts with a timestamp in square braces, then the log type (which could be error or notice), followed by the email, and finally the message. Now let’s see which built-in and custom patterns will be used to define the grok pattern.
Timestamp: A custom pattern is defined using the AWS Glue built-in patterns to infer Day, Month, Monthday, Time & Year as a single entity. And using the custom pattern the grok pattern is defined.
Log type: AWS Glue built-in pattern WORD is used. WORD pattern will match any alphanumeric characters including underscore.
Email: There is no built-in pattern to parse email. Hence, the custom pattern is defined with the name GETEMAIL.
Message: The GREEDYDATA built-in pattern is used. GREEDYDATA matches zero or more characters, up to but not including a newline.
The overall pattern is defined as a part of the Final grok pattern.
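The pattern tables themselves appear as images in the original post, so as a rough sketch, the custom patterns and final grok pattern could look like the following. The names GETTIMESTAMP and GETEMAIL come from the post; the regular expressions and field aliases are illustrative reconstructions, not the exact ones from the tables:

```text
# Custom patterns (one per line)
GETTIMESTAMP %{DAY} %{MONTH} %{MONTHDAY} %{TIME} %{YEAR}
GETEMAIL [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

# Final grok pattern
\[%{GETTIMESTAMP:timestamp}\] \[%{WORD:logtype}\] %{GETEMAIL:email} %{GREEDYDATA:message}
```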
Go to the AWS Management Console → S3 Management Console → Create a new bucket or use an existing one and upload the log file.
As a next step, go to the AWS Glue console and create the database. The table definition will be created under this database.
Create custom classifier
I would encourage you to first create and run the crawler without a custom classifier and check the results, to see whether the AWS Glue built-in classifiers are able to populate the correct structure. After that, go to Classifiers (under Crawlers) → Add classifier, configure the details as shown below, and create it.
Classifier name: The name of the custom classifier
Classifier type: Grok
Classification: A classification string that will be used as a label
Grok pattern: A grok pattern to match the structure (from the Grok pattern table above). All fields will have the string data type by default; if you want to cast a field to another data type, add the optional data-type block (separated by a colon, as per the syntax discussed earlier)
Custom patterns: In the above table, two custom patterns (GETTIMESTAMP & GETEMAIL) were created because they are not available among the AWS Glue built-in patterns; enter those here
Go to crawlers → Create crawler → Configure crawler name (Step 1) → Configure data source & add custom classifier(s) as shown below (Step 2) → Select IAM role (Step 3) → Select the database created earlier & provide the optional table prefix. In my case, it is custom- (Step 4) → Review the configuration (Step 5) → Create crawler.
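The crawler steps above map onto the boto3 Glue `create_crawler` API as well. This is a hedged sketch: the bucket path, IAM role, and classifier name are placeholders (the database name and `custom-` prefix are the ones used in this walkthrough):

```python
# Sketch: the crawler configuration from the steps above as a boto3 call.
# Bucket path, role, and classifier name are placeholders.
crawler_params = {
    "Name": "apache-log-crawler",
    "Role": "AWSGlueServiceRole-demo",           # IAM role (Step 3)
    "DatabaseName": "apache-demo-db",            # database created earlier (Step 4)
    "TablePrefix": "custom-",                    # optional table prefix (Step 4)
    "Targets": {"S3Targets": [{"Path": "s3://your-bucket/logs/"}]},  # data source (Step 2)
    "Classifiers": ["apache-log-classifier"],    # custom classifier(s) (Step 2)
}

# Requires AWS credentials; uncomment to run:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_params)
# glue.start_crawler(Name=crawler_params["Name"])
```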
For detailed steps on creating a crawler, please refer to this video
As the next step, run the crawler; if the configuration and the grok pattern are correct, it will populate the table under the said database.
Query the table via Athena
Navigate to Amazon Athena → Select Query Editor → Select AwsDataCatalog as Data Source → Select the database (In my case it’s apache-demo-db) → That should list all the tables under that database → Query the table
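A query along the following lines can then be run in the editor. The table name is an assumption here (it depends on your source folder name plus the table prefix), and the column names come from the aliases you used in the grok pattern:

```sql
-- Example only: table name depends on your prefix and source path,
-- column names come from the grok pattern aliases.
SELECT logtype, email, message
FROM "apache-demo-db"."custom-logs"
WHERE logtype = 'error'
LIMIT 10;
```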
I hope you gained good insights and hands-on knowledge about custom classifiers, and especially the grok custom classifier. If you would like to follow along with me step by step in detail, you can refer to this video.
If you have any questions, comments, or feedback then please leave them below. Subscribe to my channel for more.