Enforce column type in Glue Crawler


Hello, I have Parquet files in S3 that I parse with a Glue Crawler and query in Athena. I found that some files have two columns, "x" and "y", typed as int64, while other files have them as double, so querying in Athena returns a type-mismatch error. Is there a way to enforce type double on these fields when using Glue? I have thousands of files and I can't know which ones have int64 for these columns.
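One way to find which files carry int64, without downloading them in full, is to scan just each file's Parquet footer, for example with pyarrow. A minimal sketch, assuming a placeholder bucket my-bucket and prefix data (the columns x and y are from the question):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem()  # uses the standard AWS credential chain

# Placeholder bucket/prefix -- replace with your own layout.
selector = fs.FileSelector("my-bucket/data", recursive=True)

for info in s3.get_file_info(selector):
    if not info.path.endswith(".parquet"):
        continue
    # read_schema fetches only the Parquet footer, so this stays cheap
    # even across thousands of files.
    schema = pq.read_schema(info.path, filesystem=s3)
    for col in ("x", "y"):
        if col in schema.names and schema.field(col).type == pa.int64():
            print(f"s3://{info.path}: column {col} is int64")
```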

  • I submitted an answer, I hope you accept it!

Mehdi
asked 12 days ago · 197 views
1 Answer

Yes, you can enforce a specific column type during the crawling process using AWS Glue. You can achieve this by creating a custom schema for your table in the Glue Data Catalog. Here's how you can do it:

  1. Create a Custom Schema:

    • Go to the AWS Glue console and navigate to the Tables section.
    • Find the table corresponding to your Parquet files and click on it to view its details.
    • On the table details page, click the "Edit schema" button.
    • Edit the schema to enforce the desired data types for the "x" and "y" columns. You can specify "double" as the data type for these columns.
  2. Re-run the Glue Crawler:

    • Before re-running the crawler, set its schema change policy so it does not overwrite your manual edits: in the crawler's output settings, choose "Ignore the change and don't update the table in the data catalog".
    • Then re-run the Glue Crawler to re-crawl your S3 data; it will pick up new files and partitions while keeping the corrected data types for the "x" and "y" columns in the Glue Data Catalog (a scripted version of steps 1 and 2 is sketched after this list).
  3. Query in Athena:

    • Once the crawl completes, you can query the data in Athena. The "x" and "y" columns will now be typed as "double" in the table schema, consistently across all files.
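If you'd rather script steps 1 and 2, here is a minimal boto3 sketch; the database, table, and crawler names are placeholders, and the "LOG" update behavior is the API equivalent of the console's "Ignore the change and don't update the table" option:

```python
import boto3

glue = boto3.client("glue")

# Placeholder names -- substitute your own database, table, and crawler.
DATABASE = "my_db"
TABLE = "my_table"
CRAWLER = "my_crawler"

# Step 1: fetch the current table definition and coerce x/y to double.
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
for col in table["StorageDescriptor"]["Columns"]:
    if col["Name"] in ("x", "y"):
        col["Type"] = "double"

# update_table accepts only a subset of the fields get_table returns.
allowed = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}
glue.update_table(
    DatabaseName=DATABASE,
    TableInput={k: v for k, v in table.items() if k in allowed},
)

# Step 2: stop the crawler from overwriting the schema, then re-run it.
glue.update_crawler(
    Name=CRAWLER,
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
)
glue.start_crawler(Name=CRAWLER)
```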

By enforcing the column types in the Glue Data Catalog, you keep the types consistent when querying in Athena, even if individual Parquet files vary in their schema, without having to inspect each file manually. One caveat: Athena's Parquet reader also checks the physical type stored in each file, so files that physically store int64 can still raise HIVE_BAD_DATA (as in the comment below); in that case the offending files themselves need to be rewritten with the column as double (see the sketch below).
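A minimal pyarrow sketch for rewriting one such file in place; the path is a placeholder, and you should keep a backup before overwriting anything:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem()

# Placeholder path of one offending file -- back it up before overwriting.
path = "my-bucket/data/part-0001.parquet"

table = pq.read_table(path, filesystem=s3)
schema = table.schema
# Swap int64 x/y fields for float64 (Athena's double); other columns
# are left untouched.
for col in ("x", "y"):
    i = schema.get_field_index(col)
    if i != -1 and schema.field(i).type == pa.int64():
        schema = schema.set(i, pa.field(col, pa.float64()))
table = table.cast(schema)
pq.write_table(table, path, filesystem=s3)
```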

Mustafa
answered 12 days ago
  • I did that, but I still get the error "HIVE_BAD_DATA: Field location_id's type INT64 in parquet file s3://xxxxxxxxxxxx.parquet is incompatible with type double defined in table schema"
