In-place query of S3 data without provisioning DB or creating tables

0

We are exploring usecases where we want to achieve in-place transformation and querying of S3 data lake data. We don't want to provision database and create tables (so we are not keen to consider Redshift or Athena) and we want the querying to be most cost-efficient. While we can use S3 Select to directly query S3 data, it has its own limitations such as we can query single object using S3 Select, etc. Are there any alternatives to achieve this? Please guide

Mayura
asked 2 years ago887 views
5 Answers
1

In response to your comment on my previous answer I would like to point out that it is not the case that you need to create an Athena table for each individual S3 object.

As per https://docs.aws.amazon.com/athena/latest/ug/tables-location-format.html, the LOCATION refers to an S3 folder, and (at query time)

Athena reads all data stored in the Amazon S3 folder that you specify.

To use Athena in a practical way, you declare tables once per folder (of files with identical type and schema) and add (or remove) data at any time by writing (or removing) S3 objects in that folder.

SMP
answered 2 years ago
  • the answer is correct, if the user is worried about the need to create the schema definition for the table, probably AWS Glue Crawlers could be an answer to that enabling them to automatically discover the data from the CSV , JSON, and Parquet Files.

0

Seems like you have deprived yourselves of all the options. Athena would be an obvious choice, although you have not disclosed any details around the format and structure of the files you will be querying. What is most cost-efficient depends on the size of data and your usage-pattern, none of which you have included any information on.

What are your concerns around the necessity to maintain tables in Athena (Glue)? You will not incur any cost for these. Think of them as a trade-off that gives you a higher-level access to your data, i.e. not having to re-declare the data structure as well as being able to query without consideration to any single S3 objects.

SMP
answered 2 years ago
AWS
EXPERT
reviewed 2 years ago
  • Thank you very much SMP. I will explore more on Athena (Glue). In Athena, we need to create table for every S3 object, hence we were exploring if there are other options. But I get your point.

    Our use case is to generate reports on S3 data lake data. S3 may contain CSV, Parquet and highly nested JSON data. Client wants to make this reporting cost-efficient and simple. (Simple in the sense that they want to explore options where there is no need provision DB/create tables). I'm not sure of the latency requirements. Client is also keen to use Hudi wherever possible in many of their modules.

0

in addition to what have already been commented, you can also query S3 data using Spark for example, so it would give you additional options such as Glue, EMR ( EMR Serverless) or EMR on EKS.

AWS
Alex_T
answered 2 years ago
  • Thank you Alex. Yes, right, Spark is a useful option.

    As my original question was vague, I will try to complete it below - Our use case is to generate reports on S3 data lake data. S3 may contain CSV, Parquet and highly nested JSON data. Client wants to make this reporting cost-efficient and simple. (Simple in the sense that they want to explore options where there is no need provision DB/create tables). I'm not sure of the latency requirements. Client is also keen to use Hudi wherever possible in many of their modules.

0

If I understand the requirements, you are looking for:

  1. a cost effective solution
  2. ability to query the data in place on the data lake (Amazon S3)
  3. ability to transform the data in place on the data lake (Amazon S3)
  4. generate reports
  5. simplicity no need to maintain a database neither the schema design for the tables.

Based on that principle a serverless service that allow you to query the data and even transform it would be the best option, from an execution perspective.

So as already mentioned Amazon Athena would actually match perfectly your requirements ( see Athena supported file formats , it also support HUDI). On top of that if you do not want to maintain the tables schema nor create the tables you could use AWS Glue Crawlers to do that for you automatically. The Tables discovered by AWS Glue are generated in the AWS Glue Catalog that is accessed transparently by Amazon Athena.

To Query Nested JSON with Athena you can look at this blog post, if you used Glue to crawl the data you would not need to create the table manually.

If you are trying to generate report easily for the users, query complex nested structure could be too difficult in that case you could use Glue transformations to relationalize the data.

hope this helps

AWS
EXPERT
answered 2 years ago
0

Thank you Fabrizio. I agree and understand. Our client is somehow still expecting us to come up with options other than Athena/Redshift Spectrum, that's the reason we are exploring. Do you know about S3 connectors? I know that S3 connector integrated with Quicksight/Tableau cannot replace Athena/Redshift, but are there any use cases where we could use S3 data directly using reporting tools like Tableau or Qlik sense using connectors?

Mayura
answered 2 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions