Why does the Glue crawler omit some of my Parquet-file columns?


I have a bunch of parquet files in a flat S3 folder, no partitions:

s3://my-bucket/my_folder1/my_table/file1.parquet
s3://my-bucket/my_folder1/my_table/file2.parquet
...
s3://my-bucket/my_folder1/my_table/file4000.parquet

The Parquet file schemas are not consistent - many of them have different columns than others: maybe 300 of them have columnA and 2,000 of them have columnB. I do not know what the columns will be ahead of time and would like to rely on the crawler to add new ones. I want the crawler to create a table with ALL the columns of ALL the Parquet files. I created a crawler that looks like this:

$ aws glue get-crawler --name "my-crawler"
{
    "Crawler": {
        "Name": "my-crawler",
        "Role": "my-role",
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://my-bucket/my_folder1/my_table/",
                    "Exclusions": []
                }
            ],
            "JdbcTargets": [],
            "MongoDBTargets": [],
            "DynamoDBTargets": [],
            "CatalogTargets": [],
            "DeltaTargets": [],
            "IcebergTargets": [],
            "HudiTargets": []
        },
        "DatabaseName": "my-db",
        "Classifiers": [],
        "RecrawlPolicy": {
            "RecrawlBehavior": "CRAWL_EVERYTHING"
        },
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DEPRECATE_IN_DATABASE"
        },
        "LineageConfiguration": {
            "CrawlerLineageSettings": "DISABLE"
        },
        "State": "READY",
        "CrawlElapsedTime": 0,
        "CreationTime": "2024-07-29T17:44:32-07:00",
        "LastUpdated": "2024-07-29T17:44:32-07:00",
        "LastCrawl": {
            "Status": "SUCCEEDED",
            "LogGroup": "/aws-glue/crawlers",
            "LogStream": "my-crawler",
            "MessagePrefix": "xxxx",
            "StartTime": "2024-07-30T09:03:17-07:00"
        },
        "Version": 1,
        "LakeFormationConfiguration": {
            "UseLakeFormationCredentials": false,
            "AccountId": ""
        }
    }
}

The crawler creates the table my_table with a schema containing SOME but not ALL of the columns. For example, it contains columnB but not columnA.

My questions:

  • Does the Glue crawler look at only a subset of the 4,000 Parquet files (even though I didn't specify any sampling in the configuration)? I can't find any documentation about how many Parquet files are examined (I have found the 1,000-record / 1 MB limit for JSON and CSV files, but nothing about Parquet).
    • If so, is there a log I can find of which of the Parquets were inspected for schema?
    • Also if so, where can I find documentation of which files Glue decides to inspect?
    • If not, why is it not adding some of the columns into the table schema?
ecmons
asked 2 months ago · 177 views
1 Answer

Hello,

Thank you very much for your questions. Please find below the answers to your questions:

  1. Does the Glue crawler look at a subset of the 4,000 Parquet files? No, the Glue crawler does not sample or look at a subset of Parquet files when inferring the schema. According to the AWS Glue documentation, the crawler examines all the Parquet files in the specified path to infer the schema.

  2. Is there a log I can find of which Parquet files were inspected for schema? Unfortunately, there is no specific log that lists the individual Parquet files inspected by the crawler for schema inference. The crawler logs only provide high-level information about the crawl process and any errors encountered.

  3. Where can I find documentation on which files Glue decides to inspect? The AWS Glue documentation does not provide explicit details on how the crawler selects files for schema inference. However, it states that the crawler examines all files in the specified path and infers the schema based on the data types found in those files.

  4. If not sampling, why is it not adding some of the columns into the table schema? If the Glue crawler is not adding certain columns to the table schema, it could be due to one of the following reasons:

    a. Inconsistent data types: If the same column name has different data types across the Parquet files, the crawler may choose one data type over the others, potentially excluding some columns from the schema.

    b. Nested or complex data types: If the columns have nested or complex data types (e.g., structs, arrays), the crawler may not handle them correctly, leading to missing columns in the schema.

    c. Crawler configuration: Ensure that the crawler configuration, such as the SchemaChangePolicy, is set correctly to include new columns in the schema.
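For point (c), one setting worth checking is the crawler's Configuration JSON, which controls how the crawler groups schemas. Setting `TableGroupingPolicy` to `CombineCompatibleSchemas` asks the crawler to merge compatible per-file schemas into a single table rather than splitting them. A minimal sketch (the crawler name is taken from the question; whether this resolves the missing columns depends on why they were excluded):

```python
import json

# Crawler Configuration JSON (see "Setting crawler configuration options"
# in the AWS Glue Developer Guide). CombineCompatibleSchemas tells the
# crawler to merge compatible schemas into one table.
configuration = json.dumps({
    "Version": 1.0,
    "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
})

# Apply it with the CLI, for example:
#   aws glue update-crawler --name my-crawler --configuration "$CONFIGURATION"
print(configuration)
```

After updating the configuration, re-run the crawler (RecrawlBehavior is already CRAWL_EVERYTHING in your setup) and compare the resulting table schema.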

More information: https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html

AWS
answered 2 months ago
