RDS Data ETL & data transfer

0

I'm seeking guidance on a specific requirement and the recommended approach to achieve it.

My data is currently stored in an AWS Aurora SQL database (let's say host1/db1) . The objective is to duplicate all the tables (the entire database) into a new Aurora SQL database (host1/db2) on the same host. Before transferring the data, there is a need for certain data transformations, updating certain data values.

Should I opt for AWS Glue to facilitate this process, extracting data from RDS and subsequently importing it back into RDS? Would the workflow be as follows:

RDS -> Extract Data from RDS using AWS Glue -> Perform ETL -> Insert Data from AWS Glue into RDS (Will the data extracted from RDS being temporarily stored in Glue)

Alternatively, do I need to utilize an intermediary service like S3? Would the process look like this:

RDS -> Extract Data from RDS using AWS Glue -> Store the extracted data in S3 -> Execute ETL -> Import Data from S3 into RDS.

Should I need to consider employing a script for exporting data via 'mysqldump,' transferring it to the new database, and subsequently executing data updates using SQL queries?

Or a different approach is required

Your guidance on the best approach would be greatly appreciated.

  • Is this a one-time ETL job, or an ongoing one?

  • The task will be performed occasionally to transfer data from one database to another on the same host after transformation.

2 Answers
0

If you merely want to copy the full database to a new host, and then change some of the data values and such, you may be best served by using mysqldump and dumping the source database, then importing it to the new host, where you run a query to make the transformations and update the values you want. this task?

AWS
EXPERT
answered a year ago
  • A new host will not be established; instead, we will utilize an existing host. The existing database will serve as the foundation for creating a new database after the necessary transformations, and both databases will coexist on the same host.

0

From a cost and performance perspective, I would suggest you follow an approach where "everything" happens within the host. Serverless models like AWS Glue have a pay-per-use billing model where you pay for each byte of data that is processed. Moving data between RDS, EC2 and S3 would also involve data transfer cost.

If I were you, I would:

  1. Create a Stored Procedure in db1 that does the data transformation and load the data in db2.
  2. If the data load needs to happen instantly when data changes in db1, then write triggers in source tables to invoke the stored procedure.
  3. If the data needs to be updated periodically, then find a way to invoke the stored procedure at fixed intervals. For example: a) using an ETL package running on an EC2 instance or b) a Lambda function (triggered by an EventBridge Schedueler) that executes the stored procedure.
Shameel
answered a year ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions