1 Answer
Hi - If the compressed file size is less than 140 MB, you can use the "Query with S3 Select" option in the S3 console. Alternatively, use the s3api CLI (https://docs.aws.amazon.com/cli/latest/reference/s3api/select-object-content.html) to check whether the Parquet file is valid:
aws s3api select-object-content \
--bucket my-bucket \
--key my-data-file.parquet \
--expression "select * from s3object limit 100" \
--expression-type 'SQL' \
--input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "output.json"
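If the object is readable, the matching records land in output.json. If the CLI only prints its usage help (often a quoting problem with the inline JSON), the same request can be made from Python via boto3; a minimal sketch, reusing the placeholder bucket and key from the command above:
import boto3

s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="my-bucket",          # placeholder bucket from the command above
    Key="my-data-file.parquet",  # placeholder key
    Expression="select * from s3object limit 100",
    ExpressionType="SQL",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"JSON": {}},
)
# The response payload is an event stream; print only the record events.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())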
Another option is to use an AWS Glue Crawler to crawl the Parquet file and then query it via Athena - https://docs.aws.amazon.com/glue/latest/ug/tutorial-add-crawler.html
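A minimal boto3 sketch of that Glue option; the crawler name, IAM role ARN, database name, and S3 path below are placeholders, not values from this thread:
import boto3

glue = boto3.client("glue")

# Create a crawler pointed at the prefix that holds the Parquet file.
glue.create_crawler(
    Name="parquet-validation-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="parquet_validation_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/prefix/"}]},
)
glue.start_crawler(Name="parquet-validation-crawler")
# Once the crawler finishes, the table it registers can be queried from Athena.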
Thanks Gokul. But I'm not able to read the Parquet file using S3 Select, either in the console or from the API. In S3 Select it says "Successfully returned 0 records" (the file size is 40 MB). In the AWS CLI, the command only prints its usage help; there is no output and no error. Since neither case shows an error, how do I figure out whether the file is invalid? Why is the file not being read?
This is the error we get -
An error was encountered: An error occurred while calling o91.parquet. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7) (ip-xx.xx.xxx.awscorp.siriusxm.com executor 11): org.apache.spark.sql.AnalysisException: Parquet type not yet supported: INT32 (TIME_MILLIS) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.typeNotImplemented$1(ParquetSchemaConverter.scala:104)
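The exception itself names the root cause: the file contains an INT32 column with the TIME_MILLIS logical type, which Spark's Parquet reader does not support, so the file can be perfectly valid and still fail to load. A minimal pyarrow sketch to confirm this, assuming the file has been downloaded locally (the path is a placeholder); reading the schema only touches the file footer, so it also doubles as a quick validity check, since a corrupt file raises an exception here:
import pyarrow.parquet as pq

# Print the footer schema; look for time32[ms] columns, which is how
# Arrow surfaces the Parquet INT32 (TIME_MILLIS) logical type.
schema = pq.read_schema("my-data-file.parquet")
print(schema)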