Why does the AWS Glue crawler classify my fixed-width data file as UNKNOWN when I use a built-in classifier to parse the file?

When I parse a fixed-width .dat file with a built-in classifier, my AWS Glue crawler classifies the file as UNKNOWN.

Short description

Built-in classifiers can't parse fixed-width data files. Use a grok custom classifier instead.

Resolution

Create the grok custom classifier

1.    Open the AWS Glue console.

2.    In the navigation pane, choose Classifiers.

3.    Choose Add classifier, and then enter the following:
For Classifier name, enter a unique name.
For Classifier type, choose Grok.
For Classification, enter a description of the format or type of data that is classified, such as "special-logs."
For Grok pattern, enter the pattern that you want AWS Glue to use to find matches in your data. A fixed-width .dat file has no delimiter between fields. However, because each field has a known length, you can use a regex pattern that matches each field by its character count.
Example (four columns that are 7, 8, 14, and 52 characters wide):

(?<col0>.{7})(?<col1>.{8})(?<col2>.{14})(?<col3>.{52})

(Optional) For Custom patterns, enter any custom patterns that you want to use. These patterns are referenced by the grok pattern that classifies your data. Each custom pattern must be on a separate line. For more information, see Custom classifier values in AWS Glue.

4.    Choose Create.
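
Before you run a crawler, it can help to check the pattern against a line of your data locally. The following is a minimal sketch, not part of AWS Glue: it assumes Python's re module, which writes named groups as (?P<name>...) rather than the (?<name>...) form used in the grok pattern, so the syntax is translated first. The sample record and the parse_fixed_width helper are hypothetical and only illustrate the column widths.

```python
import re
from typing import Optional

# Grok pattern from the example above: four columns of 7, 8, 14, and 52 characters.
GROK_PATTERN = "(?<col0>.{7})(?<col1>.{8})(?<col2>.{14})(?<col3>.{52})"

# Python spells named groups (?P<name>...), so translate the syntax.
# This naive replace is only safe because the pattern contains no lookbehinds.
PY_PATTERN = re.compile(GROK_PATTERN.replace("(?<", "(?P<"))


def parse_fixed_width(line: str) -> Optional[dict]:
    """Return the column values by name, or None if the line is shorter than 81 characters."""
    match = PY_PATTERN.match(line.rstrip("\n"))
    return match.groupdict() if match else None


# Hypothetical sample record, padded to the expected widths (7 + 8 + 14 + 52 = 81).
sample = (
    "ID00001"
    + "20240115"
    + "ACME-WIDGET".ljust(14)
    + "free-text description".ljust(52)
)

row = parse_fixed_width(sample)
print(row)
```

If parse_fixed_width returns None for a real line from your file, the field widths in the pattern probably don't add up to the record length, and the crawler is unlikely to match the file either.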

Create and run the crawler

1.    In the navigation pane, choose Crawlers.

2.    Choose Add crawler.

3.    For Crawler name, enter a unique name.

4.    Choose the arrow next to the Tags, description, security configuration, and classifiers (optional) section, and then find the Custom classifiers section.

5.    Choose Add next to the custom classifier that you created earlier, and then choose Next.

6.    On the Specify crawler source type page, choose Data stores, and then choose Next.

7.    On the Add a data store page, enter the following:
For Choose data store, choose your preferred data store.
For Include path, enter the path to your .dat file.

8.    Choose Next, and then confirm whether you want to add another data store.

9.    On the Choose an IAM role page, select an existing AWS Identity and Access Management (IAM) role or create a new one. Choose Next.

10.    For Frequency, choose Run on demand, and then choose Next.

11.    On the Configure the crawler's output page, for Database, choose the database that you want the table to be created in. Choose Next.

12.    Choose Finish to create the crawler.

13.    When the crawler status changes to Ready, select the crawler name, and then choose Run crawler.

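14.    Wait for the crawler to finish, and then choose Tables in the navigation pane. Confirm that the Classification of the new table matches the classification that you entered for the grok custom classifier (for example, "special-logs"). If the classification is still UNKNOWN, then the grok pattern didn't match the data.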
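
The console steps in both sections can also be scripted. The following is a minimal sketch that uses the boto3 Glue client calls create_classifier, create_crawler, and start_crawler. It assumes that the data store is Amazon S3 and that your credentials and Region are already configured. The classifier name, crawler name, IAM role, database, and bucket path are all hypothetical placeholders to replace with your own values.

```python
GROK_PATTERN = "(?<col0>.{7})(?<col1>.{8})(?<col2>.{14})(?<col3>.{52})"


def classifier_request(name, classification, grok_pattern):
    """Build the arguments for glue.create_classifier (grok type)."""
    return {
        "GrokClassifier": {
            "Name": name,
            "Classification": classification,
            "GrokPattern": grok_pattern,
        }
    }


def crawler_request(name, role, database, s3_path, classifier_name):
    """Build the arguments for glue.create_crawler.

    Leaving out Schedule means that the crawler runs only on demand.
    """
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Classifiers": [classifier_name],
    }


def create_and_run(classifier, crawler):
    """Create the classifier and the crawler, and then start the crawler."""
    import boto3  # imported here so that the request builders work without boto3

    glue = boto3.client("glue")
    glue.create_classifier(**classifier)
    glue.create_crawler(**crawler)
    glue.start_crawler(Name=crawler["Name"])


# Hypothetical names: replace every value with your own.
classifier = classifier_request("fixed-width-dat", "special-logs", GROK_PATTERN)
crawler = crawler_request(
    name="fixed-width-crawler",
    role="AWSGlueServiceRole-example",
    database="example_db",
    s3_path="s3://example-bucket/fixed-width/",
    classifier_name="fixed-width-dat",
)

# Uncomment to make the API calls against your account:
# create_and_run(classifier, crawler)
```

The request dictionaries are built separately from the API calls so that you can review them before anything is created in your account.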


Related information

Working with classifiers on the AWS Glue Console

Writing grok custom classifiers

Adding classifiers to a crawler

AWS OFFICIAL, updated 3 years ago