Hello,
Thank you for your questions. Please find the answers below:
- Does the Glue crawler look at a subset of the 4,000 Parquet files? No. By default, the Glue crawler does not sample or look at a subset of the Parquet files when inferring the schema. According to the AWS Glue documentation, the crawler examines all the Parquet files in the specified path. The exception is when a sample size is set on the crawler's S3 target, which limits how many files in each leaf folder are crawled.
- Is there a log I can find of which Parquet files were inspected for schema? Unfortunately, there is no specific log that lists the individual Parquet files the crawler inspected for schema inference. The crawler logs (written to the /aws-glue/crawlers log group in CloudWatch Logs) only provide high-level information about the crawl process and any errors encountered.
- Where can I find documentation on which files Glue decides to inspect? The AWS Glue documentation does not provide explicit details on how the crawler selects files for schema inference. However, it states that the crawler examines all files in the specified path and infers the schema based on the data types found in those files.
- If not sampling, why is it not adding some of the columns into the table schema? If the Glue crawler is not adding certain columns to the table schema, it could be due to one of the following reasons:
a. Inconsistent data types: If the same column name has different data types across the Parquet files, the crawler may choose one data type over the others, potentially excluding some columns from the schema.
b. Nested or complex data types: If the columns have nested or complex data types (e.g., structs, arrays), the crawler may not handle them correctly, leading to missing columns in the schema.
c. Crawler configuration: Ensure that the crawler configuration, such as the `SchemaChangePolicy`, is set correctly to include new columns in the schema.
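For reason (c), here is a minimal sketch of setting the schema change policy with boto3. The helper names `build_crawler_update` and `apply_crawler_update` and the crawler name are hypothetical. `UPDATE_IN_DATABASE` and `LOG` are only one possible combination of values, so please adjust them to your needs.

```python
def build_crawler_update(crawler_name: str) -> dict:
    """Build keyword arguments for the Glue UpdateCrawler API call."""
    return {
        "Name": crawler_name,
        "SchemaChangePolicy": {
            # Update the Data Catalog table when the crawler detects a schema change.
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            # Only log objects that have disappeared instead of deleting tables.
            "DeleteBehavior": "LOG",
        },
    }


def apply_crawler_update(crawler_name: str) -> None:
    """Send the update to Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above has no dependencies

    glue = boto3.client("glue")
    glue.update_crawler(**build_crawler_update(crawler_name))
```

Calling `apply_crawler_update("my-crawler")` (an assumed crawler name) would apply the policy. The crawler needs to be run again afterwards for the table schema to change.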
More information: https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html
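To check reason (a) yourself, you could compare the schemas of the individual files outside of Glue. The sketch below is not how the crawler works internally. It only reports columns that are absent from some files or that appear with more than one data type. The function name `find_schema_differences`, the file names, and the example schemas are made up for illustration.

```python
from collections import defaultdict


def find_schema_differences(schemas):
    """Compare per-file schemas given as {file_name: {column: type_name}}.

    Returns (missing, conflicting):
      missing     -> {column: sorted list of files that lack the column}
      conflicting -> {column: sorted list of distinct type names seen}
    """
    types_by_column = defaultdict(set)
    files_by_column = defaultdict(set)
    for file_name, columns in schemas.items():
        for column, type_name in columns.items():
            types_by_column[column].add(type_name)
            files_by_column[column].add(file_name)

    all_files = set(schemas)
    missing = {
        column: sorted(all_files - present)
        for column, present in files_by_column.items()
        if present != all_files
    }
    conflicting = {
        column: sorted(types)
        for column, types in types_by_column.items()
        if len(types) > 1
    }
    return missing, conflicting


# Hypothetical schemas for three files (illustrative data only).
example = {
    "part-0001.parquet": {"id": "int64", "price": "double"},
    "part-0002.parquet": {"id": "int64", "price": "string"},
    "part-0003.parquet": {"id": "int64", "price": "double", "discount": "double"},
}
missing, conflicting = find_schema_differences(example)
print(missing)
print(conflicting)
```

If pyarrow is available, the input dictionary could be built from `pyarrow.parquet.read_schema(path)` for each file, for example with `{name: str(typ) for name, typ in zip(schema.names, schema.types)}`.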