Not able to read S3 Parquet file


Hi Team, I'm trying to read Parquet files in S3, but I get the following error. Please help. I'm not sure whether the data inside the Parquet file is corrupt or whether I'm unable to read the file due to a datatype mismatch. Any help would be much appreciated.

df = spark.read.parquet("s3://xxxxxxx/edo_sms_replica_us_stg/event_t/TESTFILES/LOAD00000CAD.parquet")

An error was encountered: Invalid status code '404' from http://ip-xx.xx.xx..awscorp.siriusxm.com:8998/sessions/168 with error payload: {"msg":"Session '168' not found."}

Mayura
Asked 2 years ago · 4030 views
1 Answer

Hi - You can either use the "Query with S3 Select" option in the S3 console if the compressed file size is less than 140 MB, or use the s3api CLI (https://docs.aws.amazon.com/cli/latest/reference/s3api/select-object-content.html) to validate that the Parquet file is a valid one.

aws s3api select-object-content \
    --bucket my-bucket \
    --key my-data-file.parquet \
    --expression "select * from s3object limit 100" \
    --expression-type 'SQL' \
    --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' \
    --output-serialization '{"JSON": {}}' "output.json"

Another option is to use an AWS Glue crawler to catalog the Parquet file and query it via Athena - https://docs.aws.amazon.com/glue/latest/ug/tutorial-add-crawler.html

AWS
EXPERT
Gokul
Answered 2 years ago
  • Thanks Gokul. But I'm not able to read the Parquet file using S3 Select in the console or from the API. In S3 Select it says "Successfully returned 0 records" (the file size is 40 MB). In the AWS CLI, the output is always the "aws command usage" text; there is no output or error. No error is displayed in either case. How do I figure out if the file is invalid? Why is the file not being read?

  • This is the error we get -

    An error was encountered: An error occurred while calling o91.parquet. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7) (ip-xx.xx.xxx.awscorp.siriusxm.com executor 11): org.apache.spark.sql.AnalysisException: Parquet type not yet supported: INT32 (TIME_MILLIS) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.typeNotImplemented$1(ParquetSchemaConverter.scala:104)
