1 Answer
Hi - If the compressed file size is less than 140 MB, you can use the "Query with S3 Select" option in the S3 console. Alternatively, use the s3api CLI (https://docs.aws.amazon.com/cli/latest/reference/s3api/select-object-content.html) to check whether the Parquet file is valid:
aws s3api select-object-content \
--bucket my-bucket \
--key my-data-file.parquet \
--expression "select * from s3object limit 100" \
--expression-type 'SQL' \
--input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "output.json"
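If the object is readable, the matching records land in output.json. If the CLI only prints its usage help (often a quoting problem with the inline JSON), the same request can be made from Python via boto3; a minimal sketch, reusing the placeholder bucket and key from the command above:
import boto3

s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="my-bucket",          # placeholder bucket from the command above
    Key="my-data-file.parquet",  # placeholder key
    Expression="select * from s3object limit 100",
    ExpressionType="SQL",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"JSON": {}},
)
# The response payload is an event stream; print only the record events.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())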
Another option is to use an AWS Glue Crawler to crawl the Parquet file and then query it via Athena - https://docs.aws.amazon.com/glue/latest/ug/tutorial-add-crawler.html
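A minimal boto3 sketch of that Glue option; the crawler name, IAM role ARN, database name, and S3 path below are placeholders, not values from this thread:
import boto3

glue = boto3.client("glue")

# Create a crawler pointed at the prefix that holds the Parquet file.
glue.create_crawler(
    Name="parquet-validation-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="parquet_validation_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/prefix/"}]},
)
glue.start_crawler(Name="parquet-validation-crawler")
# Once the crawler finishes, the table it registers can be queried from Athena.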
Thanks Gokul. But I'm not able to read the Parquet file using S3 Select, either in the console or from the API. In S3 Select it says "Successfully returned 0 records" (the file size is 40 MB). In the AWS CLI, the command only prints its usage help; there is no output and no error. Since neither case shows an error, how do I figure out whether the file is invalid? Why is the file not being read?
This is the error we get -
An error was encountered: An error occurred while calling o91.parquet. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7) (ip-xx.xx.xxx.awscorp.siriusxm.com executor 11): org.apache.spark.sql.AnalysisException: Parquet type not yet supported: INT32 (TIME_MILLIS) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.typeNotImplemented$1(ParquetSchemaConverter.scala:104)
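The exception itself names the root cause: the file contains an INT32 column with the TIME_MILLIS logical type, which Spark's Parquet reader does not support, so the file can be perfectly valid and still fail to load. A minimal pyarrow sketch to confirm this, assuming the file has been downloaded locally (the path is a placeholder); reading the schema only touches the file footer, so it also doubles as a quick validity check, since a corrupt file raises an exception here:
import pyarrow.parquet as pq

# Print the footer schema; look for time32[ms] columns, which is how
# Arrow surfaces the Parquet INT32 (TIME_MILLIS) logical type.
schema = pq.read_schema("my-data-file.parquet")
print(schema)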