Enforce column type in Glue Crawler

0

Hello, I have parquets files in S3 that i parse using Glue Crawler and query in Athena. I found that some files have two columns "x" and "y" that have a type int64 while other files have them as double. So when I query in Athena it returns an error for type mismatch. Is there a way to enforce a type double on these fields when using Glue ? I have thousands of files and I can't know which ones have the type int64 for these columns.

  • submitted an answer, i hope you accept it !

Mehdi
質問済み 1ヶ月前232ビュー
1回答
0

Yes, you can enforce a specific column type during the crawling process using AWS Glue. You can achieve this by creating a custom schema for your table in the Glue Data Catalog. Here's how you can do it:

  1. Create a Custom Schema:

    • Go to the AWS Glue console and navigate to the Tables section.
    • Find the table corresponding to your Parquet files and click on it to view its details.
    • In the table details page, click on the "Edit schema" button.
    • Edit the schema to enforce the desired data types for the "x" and "y" columns. You can specify "double" as the data type for these columns.
  2. Re-run the Glue Crawler:

    • After updating the schema, re-run the Glue Crawler to re-crawl your S3 data and update the metadata in the Glue Data Catalog.
    • The Glue Crawler will now enforce the specified data types for the "x" and "y" columns during the crawling process.
  3. Query in Athena:

    • Once the crawling process is complete, you can query the data in Athena. The columns "x" and "y" will now have the data type "double" enforced, ensuring consistency across all files.

By enforcing the column types in the Glue Data Catalog, you ensure consistency in the data types when querying the data in Athena, even if individual Parquet files have variations in their schema. This approach allows you to handle variations in the data seamlessly without needing to manually inspect each file.

Mustafa
回答済み 1ヶ月前
  • I did that but I still have the error "HIVE_BAD_DATA: Field location_id's type INT64 in parquet file s3://xxxxxxxxxxxx.parquet is incompatible with type double defined in table schema"

ログインしていません。 ログイン 回答を投稿する。

優れた回答とは、質問に明確に答え、建設的なフィードバックを提供し、質問者の専門分野におけるスキルの向上を促すものです。

質問に答えるためのガイドライン

関連するコンテンツ