Hi D Joe,
I understand you are working with datasets generated from multiple sources and face challenges with data extracts and ongoing production workloads. It sounds like you've tried one pattern: loading the database dumps into a newly provisioned server to perform analytics. Though this works for now, it brings performance and access-management constraints as you scale.
As you've mentioned that your company generates large amounts of data in both structured and unstructured formats, this seems like a good use case for building a data lake on AWS. I cannot speak for options 3 or 4, but a common pattern I have seen leverages 1, 2, and 5 to implement a lake house architecture (I've added a picture and links below for reference). I understand you are looking to prototype a pattern, so not all of the services shown need to be used.
To answer your question on Option 1: assuming these are Postgres backups, it would depend on what file format these backups were configured as. You can find the file formats Glue can work with here. I'm not sure if your DBs are on RDS or on premises, but there is an extension for RDS Postgres for exporting to S3, where you can also control the output file format. Once the files are in a valid format, an AWS Glue crawler can scan and catalog the data into a table that is queryable by Amazon Athena (more scalable than a single RDS instance). This works for both unstructured and structured data, and to optimize queries you can leverage Glue again to transform and partition the data if needed. Also, Glue is completely serverless, so you won't need the EC2 console for developing the jobs; it can all be done in the Glue console, which includes visual low-code options too.
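As a rough sketch of that crawl-then-query flow using boto3 — the crawler name, catalog database, bucket paths, and role ARN below are all placeholders you would swap for your own:

```python
# Hypothetical names throughout -- swap in your own crawler, catalog
# database, bucket, and IAM role.

def crawler_config(name: str, database: str, s3_path: str, role_arn: str) -> dict:
    """Arguments for boto3's glue.create_crawler()."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # catalog DB the discovered tables land in
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Re-running the crawler picks up new columns/partitions
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }

def athena_query_config(sql: str, database: str, output_s3: str) -> dict:
    """Arguments for boto3's athena.start_query_execution()."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

# With credentials in place, the calls would look like:
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler_config("pg-dump-crawler", "lake_db",
#       "s3://my-lake/raw/", "arn:aws:iam::123456789012:role/GlueCrawlerRole"))
#   glue.start_crawler(Name="pg-dump-crawler")
#   boto3.client("athena").start_query_execution(**athena_query_config(
#       "SELECT count(*) FROM raw_table", "lake_db", "s3://my-lake/athena-results/"))
```

The live calls are left in comments since they need credentials and real resources; once the crawler finishes, the table shows up in Athena under the catalog database you named.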
Amazon Redshift is typically introduced once you have a data warehousing need, i.e. thousands of queries being fired, complex ETL and stored procedures, etc. Depending on the number and complexity of queries, Redshift may be the more cost-effective approach, as Athena charges by the data scanned per query. Redshift Spectrum is another capability that allows you to query structured and semistructured data in S3 without needing to load it into the cluster.
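For illustration, pointing Spectrum at an existing Glue Data Catalog database is mostly a one-time DDL statement run in the cluster; a small helper that builds it (the schema name, catalog database, and role ARN are placeholders):

```python
def spectrum_external_schema_sql(schema: str, glue_db: str, iam_role_arn: str) -> str:
    """Build the CREATE EXTERNAL SCHEMA statement that points Redshift
    Spectrum at a Glue Data Catalog database, so the S3-backed tables
    become queryable from the cluster without loading them."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG DATABASE '{glue_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )

# e.g. run once in the cluster:
#   print(spectrum_external_schema_sql(
#       "spectrum", "lake_db", "arn:aws:iam::123456789012:role/RedshiftSpectrumRole"))
```

After that, the same tables Athena sees are available in Redshift as `spectrum.<table>`.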
If you have a need for data visualization and reporting, Amazon QuickSight is a great tool to use. As it's fully integrated with the other services, a common pattern I have seen is QuickSight used with Athena to extract/query data from the tables created in the Glue Data Catalog (after data in S3 has been crawled).
Let me know if you have any questions around this. Additional links to reference:
- https://docs.aws.amazon.com/whitepapers/latest/best-practices-building-data-lake-for-games/lake-house-architecture.html
- https://aws.amazon.com/blogs/big-data/harmonize-query-and-visualize-data-from-various-providers-using-aws-glue-amazon-athena-and-amazon-quicksight/
- https://aws.amazon.com/blogs/big-data/build-and-automate-a-serverless-data-lake-using-an-aws-glue-trigger-for-the-data-catalog-and-etl-jobs/
Regards,
Loc
Data Architect, AWS ProServe
Hey D Joe,
Glad to hear you made some progress and successfully hooked up Glue with Postgres.
Regarding your first three questions, I would recommend using Athena with the AWS Glue Data Catalog, which is now populated with one table (your Postgres). This allows you to easily populate and update the Catalog whenever you kick off the crawlers for new/updated sources. Once the Glue Data Catalog is updated, you can start querying with Athena. You can use Athena against that Postgres data source, but to avoid analytics queries competing with your other production workloads, it would be better to write up that Glue ETL job to extract data from Postgres into S3. Now that your Postgres connection is already added, you can do this easily with the visual editor in the Glue console too. To query this S3 data with Athena, configure the crawler to hit that S3 path and populate the catalog. If you'd like to orchestrate these steps together, take a look at Glue workflows (this blog might give you some ideas).
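As a sketch of how that crawl → ETL → re-crawl chain could be wired into a Glue workflow — these are the payloads you would pass to boto3's `glue.create_trigger` after creating the workflow; every name here is a placeholder:

```python
def workflow_triggers(workflow: str, src_crawler: str, etl_job: str,
                      out_crawler: str) -> list:
    """Trigger definitions (arguments for glue.create_trigger) chaining:
    source crawler -> ETL job (Postgres -> S3) -> crawler over the S3 output."""
    return [
        {   # kicked off manually (or swap Type/Schedule for SCHEDULED)
            "Name": f"{workflow}-start",
            "WorkflowName": workflow,
            "Type": "ON_DEMAND",
            "Actions": [{"CrawlerName": src_crawler}],
        },
        {   # fires once the source crawler succeeds
            "Name": f"{workflow}-run-etl",
            "WorkflowName": workflow,
            "Type": "CONDITIONAL",
            "StartOnCreation": True,
            "Predicate": {"Conditions": [{
                "LogicalOperator": "EQUALS",
                "CrawlerName": src_crawler,
                "CrawlState": "SUCCEEDED",
            }]},
            "Actions": [{"JobName": etl_job}],
        },
        {   # re-crawl the S3 output so Athena sees new tables/partitions
            "Name": f"{workflow}-crawl-output",
            "WorkflowName": workflow,
            "Type": "CONDITIONAL",
            "StartOnCreation": True,
            "Predicate": {"Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": etl_job,
                "State": "SUCCEEDED",
            }]},
            "Actions": [{"CrawlerName": out_crawler}],
        },
    ]

# usage sketch:
#   glue = boto3.client("glue")
#   glue.create_workflow(Name="pg-to-lake")
#   for t in workflow_triggers("pg-to-lake", "pg-crawler", "pg-to-s3-job", "s3-crawler"):
#       glue.create_trigger(**t)
```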
Unfortunately, I haven't tried setting up crawlers with InfluxDB2, but perhaps try creating a Glue JDBC connection to it and test? I don't see it listed in the documentation, however: https://docs.aws.amazon.com/glue/latest/dg/crawler-data-stores.html
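If you do experiment, the connection itself is generic JDBC; here is a sketch of the `ConnectionInput` payload for boto3's `glue.create_connection`. The InfluxDB JDBC URL format below is an assumption on my part — whether a crawler can actually introspect it is exactly what you'd be testing:

```python
def jdbc_connection_input(name: str, jdbc_url: str, username: str, password: str,
                          subnet_id: str, security_group_id: str, az: str) -> dict:
    """ConnectionInput for glue.create_connection(). All values are
    placeholders; the URL scheme for an Influx JDBC driver is unverified."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
        },
        # Glue needs network placement to reach a DB on EC2 in your VPC
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": [security_group_id],
            "AvailabilityZone": az,
        },
    }

# usage sketch:
#   boto3.client("glue").create_connection(
#       ConnectionInput=jdbc_connection_input(
#           "influx-conn", "jdbc:influxdb://ec2-host:8086/mydb",  # URL is a guess
#           "user", "pass", "subnet-...", "sg-...", "eu-west-1a"))
```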
Keep me posted, and feel free to give me a thumbs up if this works for you :)
Thanks, Loc
Hey Loc! Coming along further every day.
So I have CDKed a pretty basic Glue setup. We have a PostgreSQL database; in the CDK code I do the following:
- Import VPC
- Create CfnConnection with JDBC
- Create CfnCrawler using the connection
This works nicely, I can run the crawler and it succeeds. It writes down all the PostgreSQL tables and metadata to a data catalog/database. Cool! All good.
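Roughly, those three steps sketch out like this (Python-flavoured CDK; the host, ports, names, and credentials are placeholders, and the live constructs are left in comments since they only resolve inside a CDK app):

```python
def postgres_jdbc_url(host: str, port: int, database: str) -> str:
    """JDBC URL in the form Glue's CfnConnection expects for Postgres."""
    return f"jdbc:postgresql://{host}:{port}/{database}"

# Inside the Stack (construct names are the same in TypeScript):
#
#   vpc = ec2.Vpc.from_lookup(self, "Vpc", vpc_id="vpc-...")     # 1. import VPC
#
#   glue.CfnConnection(self, "PgConnection",                     # 2. JDBC connection
#       catalog_id=self.account,
#       connection_input=glue.CfnConnection.ConnectionInputProperty(
#           name="pg-connection",
#           connection_type="JDBC",
#           connection_properties={
#               "JDBC_CONNECTION_URL": postgres_jdbc_url("db-host", 5432, "appdb"),
#               "USERNAME": "...", "PASSWORD": "...",
#           }))
#
#   glue.CfnCrawler(self, "PgCrawler",                           # 3. crawler
#       role=crawler_role.role_arn,
#       database_name="lake_db",
#       targets=glue.CfnCrawler.TargetsProperty(
#           jdbc_targets=[glue.CfnCrawler.JdbcTargetProperty(
#               connection_name="pg-connection", path="appdb/%")]))
```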
Next step, which I am doing manually for now: I want to create an ETL flow that writes from the database to my S3 data lake. I create an ETL job via Glue: Data Source (table 'Example') > ApplyMapping > S3 Bucket as Parquet. Also works nicely! I see data in the bucket, and I am even able to see it as a table in Athena, and the query shows the correct count and data. Awesome.
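The visual job boils down to a script along these lines — a hand-written sketch, not the actual generated code; the catalog names, bucket path, and column mapping are placeholders, and `awsglue` only exists inside the Glue runtime, so it is imported lazily:

```python
# Placeholder column mapping: (source col, source type, target col, target type)
APPLY_MAPPING = [
    ("id", "int", "id", "int"),
    ("created_at", "timestamp", "created_at", "timestamp"),
    ("payload", "string", "payload", "string"),
]

def run_job():
    """Source (catalog table) -> ApplyMapping -> S3 as Parquet."""
    import sys
    from awsglue.context import GlueContext    # only available in the Glue runtime
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_ctx = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_ctx)
    job.init(args["JOB_NAME"], args)

    # Data source: the catalog table the crawler created from Postgres
    src = glue_ctx.create_dynamic_frame.from_catalog(
        database="lake_db", table_name="public_example")

    # ApplyMapping step
    mapped = src.apply_mapping(APPLY_MAPPING)

    # Sink: Parquet into the data lake
    glue_ctx.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://my-lake/curated/example/"},
        format="parquet")
    job.commit()
```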
Now I have trouble CDKing the last part. If I understand correctly, the crawler generates a Data Catalog. If my colleagues add an extra table to Postgres today, the crawler will notice and update the Data Catalog. The rest of my flow uses the Data Catalog, so all good. I CDKed a Glue job today with glue.CfnJob, but I can only attach a script location, which works with a local .py script. But this script won't get updated if a colleague adds a new table.
So it seems to me it is either a static .py script with CDK, or a manual ETL job that gets updated dynamically. Or am I missing something? I need to write the data to my S3 data lake, right? Or is Glue not the right service for this?
Thanks for your reply! I am figuring out Glue and Athena as we speak and am coming along quite nicely, though I don't fully understand the workflow yet.
I have created a crawler that crawls my postgres schema with JDBC and this is working quite nicely. Now I am confused about the next steps:
Also, for my next step in the near future: I am wondering what the best course of action is to set up an ETL job for an InfluxDB2 server running on an EC2 instance. I am not finding much online, and I am guessing crawlers won't work in discovering any "schema". Thanks for your detailed answer, by the way; I appreciate it.