How to query large parquet file


Hi ,

We have a large parquet file which is stored in S3 storage. It contains the data for 6 months of logs or even more. As as service we need to run a query to fetch the user data from parquet file. It has to be done end-to-end using aws data apis.

I have explored the redshift but the challenge is data loading from s3 to redshift needs additional effort such as provisioning a cluster and copying data from S3 to redshift data base.

I have the following query.

  1. Redshift provisioning can be done using aws data APIs ?
  2. How to run the query to fetch user data from redshift database. Is any data apis available for this?
  3. Is there any better aws service is available other than redshift ?

Regards, Ashok

asked 2 months ago42 views
1 Answer

You should run a Glue crawler against the S3 location of your parquet dataset. Then you will be able to query the data using Athena. You can use the console for Athena or APIs to run your queries.

answered 2 months ago
profile picture
reviewed 2 months ago
  • Thank you for reply. What are the challenge in using the redshift serverless.?

  • Redshift Serverless is an option, you could use it to query the data in s3 by running the crawler and then create an external schema. If you need better performance you could COPY it in the serverless endpoint. There are cost differences with Athena and Redshift Serverless that should be compared.

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions