Enforce column type in Glue Crawler


Hello, I have Parquet files in S3 that I parse with a Glue Crawler and query in Athena. I found that some files have two columns, "x" and "y", typed as int64, while other files have them as double, so querying in Athena returns a type-mismatch error. Is there a way to enforce type double on these fields when using Glue? I have thousands of files and I can't know which ones have int64 for these columns.
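One way to find which files carry int64, without downloading them in full, is to scan just each file's Parquet footer, for example with pyarrow. A minimal sketch, assuming a placeholder bucket my-bucket and prefix data (the columns x and y are from the question):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem()  # uses the standard AWS credential chain

# Placeholder bucket/prefix -- replace with your own layout.
selector = fs.FileSelector("my-bucket/data", recursive=True)

for info in s3.get_file_info(selector):
    if not info.path.endswith(".parquet"):
        continue
    # read_schema fetches only the Parquet footer, so this stays cheap
    # even across thousands of files.
    schema = pq.read_schema(info.path, filesystem=s3)
    for col in ("x", "y"):
        if col in schema.names and schema.field(col).type == pa.int64():
            print(f"s3://{info.path}: column {col} is int64")
```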

  • I submitted an answer, I hope you accept it!

Mehdi
asked 12 days ago · 197 views
1 Answer

Yes, you can enforce a specific column type during the crawling process using AWS Glue. You can achieve this by creating a custom schema for your table in the Glue Data Catalog. Here's how you can do it:

  1. Create a Custom Schema:

    • Go to the AWS Glue console and navigate to the Tables section.
    • Find the table corresponding to your Parquet files and click on it to view its details.
    • On the table details page, click the "Edit schema" button.
    • Edit the schema to enforce the desired data types for the "x" and "y" columns. You can specify "double" as the data type for these columns.
  2. Re-run the Glue Crawler:

    • Before re-running the crawler, set its schema change policy so it does not overwrite your manual edits: in the crawler's output settings, choose "Ignore the change and don't update the table in the data catalog".
    • Then re-run the Glue Crawler to re-crawl your S3 data; it will pick up new files and partitions while keeping the corrected data types for the "x" and "y" columns in the Glue Data Catalog (a scripted version of steps 1 and 2 is sketched after this list).
  3. Query in Athena:

    • Once the crawl completes, you can query the data in Athena. The "x" and "y" columns will now be typed as "double" in the table schema, consistently across all files.
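If you'd rather script steps 1 and 2, here is a minimal boto3 sketch; the database, table, and crawler names are placeholders, and the "LOG" update behavior is the API equivalent of the console's "Ignore the change and don't update the table" option:

```python
import boto3

glue = boto3.client("glue")

# Placeholder names -- substitute your own database, table, and crawler.
DATABASE = "my_db"
TABLE = "my_table"
CRAWLER = "my_crawler"

# Step 1: fetch the current table definition and coerce x/y to double.
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
for col in table["StorageDescriptor"]["Columns"]:
    if col["Name"] in ("x", "y"):
        col["Type"] = "double"

# update_table accepts only a subset of the fields get_table returns.
allowed = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}
glue.update_table(
    DatabaseName=DATABASE,
    TableInput={k: v for k, v in table.items() if k in allowed},
)

# Step 2: stop the crawler from overwriting the schema, then re-run it.
glue.update_crawler(
    Name=CRAWLER,
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
)
glue.start_crawler(Name=CRAWLER)
```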

By enforcing the column types in the Glue Data Catalog, you keep the types consistent when querying in Athena, even if individual Parquet files vary in their schema, without having to inspect each file manually. One caveat: Athena's Parquet reader also checks the physical type stored in each file, so files that physically store int64 can still raise HIVE_BAD_DATA (as in the comment below); in that case the offending files themselves need to be rewritten with the column as double (see the sketch below).
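A minimal pyarrow sketch for rewriting one such file in place; the path is a placeholder, and you should keep a backup before overwriting anything:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem()

# Placeholder path of one offending file -- back it up before overwriting.
path = "my-bucket/data/part-0001.parquet"

table = pq.read_table(path, filesystem=s3)
schema = table.schema
# Swap int64 x/y fields for float64 (Athena's double); other columns
# are left untouched.
for col in ("x", "y"):
    i = schema.get_field_index(col)
    if i != -1 and schema.field(i).type == pa.int64():
        schema = schema.set(i, pa.field(col, pa.float64()))
table = table.cast(schema)
pq.write_table(table, path, filesystem=s3)
```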

Mustafa
answered 12 days ago
  • I did that, but I still get the error "HIVE_BAD_DATA: Field location_id's type INT64 in parquet file s3://xxxxxxxxxxxx.parquet is incompatible with type double defined in table schema"
