Finding the right big data solution


I work for a company that generates large amounts of business and sensor data and stores it in different databases (Prometheus, InfluxDB, Postgres, Timestream). Currently, querying for analytics poses challenges because the data is extracted from production, and very large queries can cause performance issues (we are also separately working on making the infrastructure more scalable).

I am looking to solve this data problem but also create a data infrastructure that is future-ready for when we want to do a lot more with the data (AI, ML, etc.). The backups for these databases are stored in S3 buckets, and for now I created a separate database server that restores these backups and can be queried, which works for now. But I see a lot of solutions online which I want to research and try out. For now I have:

  1. Athena + Glue: can I ETL the backup files and query them, or do I still need to set up Glue jobs from the EC2 instance?
  2. AWS QuickSight
  3. AWS + Databricks
  4. AWS + Snowflake
  5. AWS Redshift

It is worth mentioning that the data is both structured and unstructured, and I am not a data analyst. I just want to set up a prototype for a usable structure, so for now ease of implementation is also nice to have. I will be doing my own research and have already done some, but what setup would you recommend? Do you have any other recommendations that are worth researching? Thanks in advance.

2 Answers

Hi D Joe,

I understand you are working with datasets generated from multiple sources and face challenges with data extracts and ongoing production workloads. It sounds like you've tried one pattern of loading the database dumps into a newly provisioned server to perform analytics. Though this works for now, it comes with constraints in performance and access management when considering scalability.

As you've mentioned that your company generates large amounts of data in both structured and unstructured formats, this seems like a good use case for building a data lake on AWS. I can't speak to options 3 or 4, but a common pattern I have seen leverages 1, 2, and 5 to implement a lake house architecture (I've added a picture and links below for reference). I understand you are looking to prototype a pattern, so not all of the services shown need to be used.

To answer your question on option 1: assuming these are Postgres backups, it would depend on what file format the backups were configured as. You can find the file formats Glue can work with here. I'm not sure whether your DBs are on RDS or on-premises, but there is an extension for RDS Postgres for exporting to S3 that also lets you control the output file format. Once the files are in a valid format, an AWS Glue crawler can scan and catalog the data into a table, which is then queryable with Amazon Athena (more scalable than a single RDS instance). This works for both unstructured and structured data, and to optimize queries you can use Glue again to transform and partition the data if needed. Also, Glue is completely serverless, so you won't need to use EC2 for developing the jobs; it can all be done in the Glue console, which includes visual low-code options too.
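As a rough sketch of that crawl-then-query flow (all names below are placeholders, and it assumes the exported files are already sitting in S3 in a supported format), the two steps can be driven with boto3:

    import time
    import boto3

    # Placeholder names -- replace with your own crawler, catalog database, and results bucket.
    CRAWLER_NAME = "backup-files-crawler"
    GLUE_DATABASE = "datalake_db"
    ATHENA_OUTPUT = "s3://my-athena-results-bucket/queries/"

    glue = boto3.client("glue")
    athena = boto3.client("athena")

    # 1. Run the crawler so the exported files in S3 get cataloged as a table.
    glue.start_crawler(Name=CRAWLER_NAME)
    while glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"] != "READY":
        time.sleep(30)  # poll until the crawl finishes

    # 2. Query the cataloged table with Athena.
    query = athena.start_query_execution(
        QueryString="SELECT COUNT(*) FROM sensor_readings",  # placeholder table name
        QueryExecutionContext={"Database": GLUE_DATABASE},
        ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
    )
    print("Athena query started:", query["QueryExecutionId"])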

Amazon Redshift is typically introduced once you have a data warehousing need, e.g. thousands of queries being fired, complex ETL, stored procedures, etc. Depending on the number of queries and their complexity, Redshift may be a more cost-effective approach, as Athena charges by the data scanned in each query. Redshift Spectrum is another capability that allows you to query structured and semi-structured data in S3 without needing to load it into the cluster.
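If you do go the Spectrum route, the usual first step is to register the Glue Data Catalog database as an external schema in Redshift; a minimal sketch (cluster, database, user, role, and table names are placeholders) using the Redshift Data API could look like this:

    import boto3

    redshift_data = boto3.client("redshift-data")

    # Expose the Glue Data Catalog database to Redshift as an external (Spectrum) schema,
    # so the S3-backed tables can be queried without loading them into the cluster.
    create_schema_sql = """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_lake
    FROM DATA CATALOG
    DATABASE 'datalake_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';
    """

    redshift_data.execute_statement(
        ClusterIdentifier="analytics-cluster",  # placeholder cluster name
        Database="dev",
        DbUser="awsuser",
        Sql=create_schema_sql,
    )

    # Spectrum tables can then be joined with local Redshift tables, e.g.
    # SELECT ... FROM spectrum_lake.sensor_readings s JOIN local_dim d ON ...;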

If you have a need for data visualization and reporting, Amazon QuickSight is a great tool to use. As it's fully integrated with the other services, a common pattern I have seen is QuickSight used with Athena to extract/query data from the tables created in the Glue Data Catalog (after data in S3 has been crawled).
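The Athena data source in QuickSight is usually set up once in the QuickSight console, but for completeness it can also be created programmatically; a minimal sketch (account ID and names are placeholders) looks like this:

    import boto3

    quicksight = boto3.client("quicksight")

    # Register Athena as a QuickSight data source; datasets and dashboards built on it
    # then query the same Glue Data Catalog tables that Athena sees.
    quicksight.create_data_source(
        AwsAccountId="123456789012",     # placeholder account ID
        DataSourceId="athena-datalake",
        Name="Data lake (Athena)",
        Type="ATHENA",
        DataSourceParameters={"AthenaParameters": {"WorkGroup": "primary"}},
    )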

Let me know if you have any questions around this. Additional links to reference:

  1. https://docs.aws.amazon.com/whitepapers/latest/best-practices-building-data-lake-for-games/lake-house-architecture.html
  2. https://aws.amazon.com/blogs/big-data/harmonize-query-and-visualize-data-from-various-providers-using-aws-glue-amazon-athena-and-amazon-quicksight/
  3. https://aws.amazon.com/blogs/big-data/build-and-automate-a-serverless-data-lake-using-an-aws-glue-trigger-for-the-data-catalog-and-etl-jobs/

Regards,
Loc
Data Architect, AWS ProServe

[Image: lake house architecture reference diagram]

AWS
Loc
answered a year ago
  • Thanks for your reply! I am figuring out Glue and Athena as we speak, and it is coming along quite nicely. For now, though, I fail to understand the workflow.

    I have created a crawler that crawls my postgres schema with JDBC and this is working quite nicely. Now I am confused about the next steps:

    1. Do I create an ETL job that writes my Postgres content to a bucket daily? If so, how do I query it with Athena?
    2. Do I use Athena with a Postgres data source?
    3. Do I use Athena with an AWS Glue Data Catalog?

    Also, for my next step in the near future: I am wondering what the best course of action is to set up an ETL job for an InfluxDB2 server running on an EC2 instance. I am not finding much online, and I am guessing crawlers won't work in discovering any "schema". Thanks for your detailed answer by the way, I appreciate it.


Hey D Joe,

Glad to hear you made some progress and successfully hooked up Glue with Postgres.

Regarding your first three questions, I would recommend using Athena with the AWS Glue Data Catalog, which is now populated with one table (your Postgres). This allows you to easily populate and update the catalog whenever you kick off the crawlers for new or updated sources. Once the Glue Data Catalog is updated, you can start querying with Athena. You can use Athena with that Postgres data source directly, but to avoid analytics queries competing with your other production workloads, it would be better to write a Glue ETL job that extracts data from Postgres into S3. Now that your Postgres connection is already added, you can do this easily with the visual editor in the Glue console too. To query this S3 data with Athena, you can configure a crawler to hit this S3 path and populate the catalog. If you'd like to orchestrate these steps together, take a look at Glue workflows (this blog might give you some ideas).
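To make the extract step concrete, a Glue ETL script for Postgres-to-S3 generally follows the shape below; this is only a sketch with placeholder database, table, column, and bucket names (the visual editor generates an equivalent script for you):

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the table the crawler cataloged from Postgres (placeholder names).
    source = glue_context.create_dynamic_frame.from_catalog(
        database="datalake_db",
        table_name="public_example",
    )

    # Optionally rename/cast columns before landing them in the lake.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("id", "int", "id", "int"),
            ("created_at", "timestamp", "created_at", "timestamp"),
        ],
    )

    # Write Parquet to S3; a crawler pointed at this path registers it for Athena.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://my-data-lake/example/"},
        format="parquet",
    )

    job.commit()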

Unfortunately, I haven't tried setting up crawlers with InfluxDB2, but perhaps try creating a Glue JDBC connection to it and test? I don't see it listed in the documentation, however: https://docs.aws.amazon.com/glue/latest/dg/crawler-data-stores.html

Keep me posted, and feel free to give this a thumbs up if it works for you :)

Thanks, Loc

AWS
Loc
answered a year ago
  • Hey Loc! Coming along further every day.

    So I have CDKed a pretty basic Glue setup. We have a PostgreSQL database, and in the CDK code I do the following:

    1. Import VPC

    2. Create CfnConnection with JDBC

    3. Create CfnCrawler using the connection

    This works nicely: I can run the crawler and it succeeds, writing all the PostgreSQL tables and metadata to a Data Catalog database (a rough CDK sketch of these steps is below). Cool! All good.
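    Roughly, the CDK (Python) code for steps 2 and 3 looks like this; all names, ARNs, credentials, and network IDs are placeholders, and the VPC import from step 1 is left out:

        from aws_cdk import Stack, aws_glue as glue
        from constructs import Construct

        class GlueIngestStack(Stack):
            def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
                super().__init__(scope, construct_id, **kwargs)

                # 2. JDBC connection to the PostgreSQL database (placeholder endpoint/credentials).
                glue.CfnConnection(
                    self, "PostgresConnection",
                    catalog_id=self.account,
                    connection_input=glue.CfnConnection.ConnectionInputProperty(
                        name="postgres-connection",
                        connection_type="JDBC",
                        connection_properties={
                            "JDBC_CONNECTION_URL": "jdbc:postgresql://db.example.internal:5432/appdb",
                            "USERNAME": "glue_user",
                            "PASSWORD": "change-me",  # prefer Secrets Manager in real code
                        },
                        physical_connection_requirements=glue.CfnConnection.PhysicalConnectionRequirementsProperty(
                            subnet_id="subnet-0123456789abcdef0",
                            security_group_id_list=["sg-0123456789abcdef0"],
                            availability_zone="eu-west-1a",
                        ),
                    ),
                )

                # 3. Crawler that uses the connection and writes into a catalog database.
                glue.CfnCrawler(
                    self, "PostgresCrawler",
                    role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
                    database_name="datalake_db",
                    targets=glue.CfnCrawler.TargetsProperty(
                        jdbc_targets=[
                            glue.CfnCrawler.JdbcTargetProperty(
                                connection_name="postgres-connection",
                                path="appdb/public/%",
                            )
                        ]
                    ),
                )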

    Next step, which I am doing manually for now: I want to create an ETL flow that writes from the database to my S3 data lake. I create an ETL job via Glue: Data Source (table 'Example') > ApplyMapping > S3 bucket as Parquet. This also works nicely! I see data in the bucket, I am even able to see it as a table in Athena, and the query shows the correct count and data. Awesome.

    Now I am having trouble CDKing the last part. If I understand correctly, the crawler generates a Data Catalog: if my colleagues add an extra table to Postgres today, the crawler will notice and update the Data Catalog, and since the rest of my flow uses the Data Catalog, all good. I CDKed a Glue job today with glue.CfnJob, but I can only attach a script location, which works with a local .py script. That script won't get updated if a colleague adds a new table, though.
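    For reference, the CfnJob part looks roughly like this (placeholder names and paths; the script itself ends up in S3):

        # Inside the same Stack as the connection/crawler sketch above.
        glue.CfnJob(
            self, "PostgresToS3Job",
            role="arn:aws:iam::123456789012:role/GlueJobRole",
            command=glue.CfnJob.JobCommandProperty(
                name="glueetl",
                python_version="3",
                script_location="s3://my-glue-assets/scripts/postgres_to_s3.py",
            ),
            connections=glue.CfnJob.ConnectionsListProperty(
                connections=["postgres-connection"],
            ),
            glue_version="4.0",
        )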

    So it seems to me it is either a static .py script with CDK, or a manual ETL job that gets updated dynamically. Or am I missing something? I need to write the data to my S3 data lake, right? Or is Glue not the right service for this?
