Not able to read S3 Parquet file


Hi Team, I'm trying to read Parquet files in S3, but I get the following error. Please help. I'm not sure whether the data inside the Parquet file is corrupt or whether I'm unable to read the file due to a datatype mismatch. Any help would be much appreciated.

df = spark.read.parquet("s3://xxxxxxx/edo_sms_replica_us_stg/event_t/TESTFILES/LOAD00000CAD.parquet")

An error was encountered: Invalid status code '404' from http://ip-xx.xx.xx..awscorp.siriusxm.com:8998/sessions/168 with error payload: {"msg":"Session '168' not found."}

Mayura
Asked 2 years ago · 4030 views
1 Answer

Hi - You can either use the "Query with S3 Select" option in the S3 console if the compressed file size is less than 140 MB, or use the s3api CLI (https://docs.aws.amazon.com/cli/latest/reference/s3api/select-object-content.html) to validate that the Parquet file is a valid one.

aws s3api select-object-content \
    --bucket my-bucket \
    --key my-data-file.parquet \
    --expression "select * from s3object limit 100" \
    --expression-type 'SQL' \
    --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' \
    --output-serialization '{"JSON": {}}' "output.json"

Another option is to use an AWS Glue crawler to catalog the Parquet file and query it via Athena - https://docs.aws.amazon.com/glue/latest/ug/tutorial-add-crawler.html

AWS
EXPERT
Gokul
Answered 2 years ago
  • Thanks Gokul. But I'm not able to read the Parquet file using S3 Select in the console or from the API. In S3 Select it says "Successfully returned 0 records" (the file size is 40 MB). In the AWS CLI, the output is always the "aws command usage" text; there is no output or error. No error is displayed in either case. How do I figure out if the file is invalid? Why is the file not being read?

  • This is the error we get -

    An error was encountered: An error occurred while calling o91.parquet. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7) (ip-xx.xx.xxx.awscorp.siriusxm.com executor 11): org.apache.spark.sql.AnalysisException: Parquet type not yet supported: INT32 (TIME_MILLIS) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.typeNotImplemented$1(ParquetSchemaConverter.scala:104)
