Incrementally update Redshift data warehouse from zero-ETL Redshift database

I have 100+ Aurora MySQL schemas that all have a Users table. These use zero-ETL integration to replicate the data into a Redshift zero-ETL database.

  1. What is the most efficient and trouble-proof way to append all the Users data into one table in another Redshift database?

  2. I would like to incrementally insert and update the data from the zero-ETL Redshift database to the Data Warehouse database. (From there, we will be using dbt to incrementally load our warehouse tables.) Is there a built-in process to do this?

4 Answers

The overall process I can think of goes through these points:

  • Create a Unified View across all schemas to act as a single table containing all Users data.
  • Incrementally Load Data using the INSERT INTO ... SELECT * FROM ... SQL command, filtering out rows that have already been inserted.
  • Update Existing Data using the UPDATE SQL command, joining the source and destination tables on the primary key.
  • Automate the Process with a stored procedure or script running at regular intervals, scheduled via AWS Glue, AWS Lambda, or another job scheduling service. (A SQL sketch of these steps follows the list.)
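
A minimal SQL sketch of those steps, assuming the zero-ETL destination database is called zeroetl_db, the source schemas are schema_001, schema_002, ..., and the warehouse table is dw.users_unified keyed on (schema_id, original_id). All object and column names here are placeholders, and the updated_at filter assumes the source rows carry a modification timestamp.

```sql
-- Unified view over all schemas. The zero-ETL destination database is
-- read-only, so the view is created in the warehouse database and reads the
-- other database with three-part names; a late-binding view is used so the
-- cross-database references are resolved at query time.
CREATE VIEW dw.users_all AS
SELECT 'schema_001' AS schema_id, id AS original_id, email, updated_at
FROM zeroetl_db.schema_001.users
UNION ALL
SELECT 'schema_002' AS schema_id, id AS original_id, email, updated_at
FROM zeroetl_db.schema_002.users
-- ...one branch per schema (generate this statement, e.g. from SVV_ALL_TABLES)
WITH NO SCHEMA BINDING;

-- Insert rows that are not yet in the unified table.
INSERT INTO dw.users_unified (schema_id, original_id, email, updated_at)
SELECT s.schema_id, s.original_id, s.email, s.updated_at
FROM dw.users_all s
LEFT JOIN dw.users_unified t
  ON t.schema_id = s.schema_id AND t.original_id = s.original_id
WHERE t.original_id IS NULL;

-- Update rows that changed in the source since they were last copied.
UPDATE dw.users_unified AS t
SET email = s.email,
    updated_at = s.updated_at
FROM dw.users_all s
WHERE t.schema_id = s.schema_id
  AND t.original_id = s.original_id
  AND s.updated_at > t.updated_at;
```

With 100+ schemas, the UNION ALL view is best generated by a script rather than maintained by hand.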

Let me know if this is helpful, and accept the answer if it works for you.

EXPERT
answered 5 months ago

Thank you. I have tested an INSERT and UPDATE stored procedure, but that process is not optimal with 500M records in each Users table. Also, in the unified table, the records take on a new primary key to avoid conflicts, so the UPDATE command has to do a lookup on 2 columns (original_id, schema_id) across 50B rows. It's a very costly procedure.

Zero-ETL appends 2 additional columns to the Redshift tables (padb_internal_txn_seq_col, padb_internal_txn_id_col). Does anyone know how I can use those values to filter the inserts and updates down to only the latest record changes in each schema when modifying the unified table?

answered 5 months ago

First, you can identify the latest records in each schema by using these internal columns. You'll need a tracking table to store the last processed transaction ID for each schema.
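
For example, a minimal tracking table might look like this (the table and column names are hypothetical):

```sql
-- One row per source schema, holding the highest zero-ETL transaction
-- sequence already applied to the unified table.
CREATE TABLE IF NOT EXISTS dw.etl_sync_state (
    schema_id    VARCHAR(128) NOT NULL,
    last_txn_seq BIGINT       NOT NULL DEFAULT 0,
    synced_at    TIMESTAMP    DEFAULT SYSDATE
);

-- Seed it with one row per source schema, e.g.:
INSERT INTO dw.etl_sync_state (schema_id) VALUES ('schema_001');
```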

By comparing the current transaction IDs with those stored in the tracking table, you can filter out only the new changes.
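
For a single schema, that filter could look like the following; the schema and column names are placeholders, and it assumes the internal zero-ETL columns can be referenced by name in the SELECT list:

```sql
-- Only rows whose zero-ETL transaction sequence is newer than the last one
-- recorded for this schema in the tracking table.
SELECT id, email, updated_at, padb_internal_txn_seq_col
FROM zeroetl_db.schema_001.users
WHERE padb_internal_txn_seq_col >
      (SELECT COALESCE(MAX(last_txn_seq), 0)
       FROM dw.etl_sync_state
       WHERE schema_id = 'schema_001');
```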

For inserting new records, you add only those that aren't already in the unified table.

For updating existing records, you match records based on (original_id, schema_id) and update the necessary fields.
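
A sketch of that apply step for one schema: the filtered delta is staged in a temporary table first, so the 50B-row unified table is only joined against the small set of changed rows, and the tracking table is advanced at the end (all names are illustrative).

```sql
-- Stage only the rows that changed since the last sync of this schema.
CREATE TEMP TABLE users_delta AS
SELECT 'schema_001' AS schema_id,
       id AS original_id,
       email,
       updated_at,
       padb_internal_txn_seq_col AS txn_seq
FROM zeroetl_db.schema_001.users
WHERE padb_internal_txn_seq_col >
      (SELECT COALESCE(MAX(last_txn_seq), 0)
       FROM dw.etl_sync_state
       WHERE schema_id = 'schema_001');

-- Update the rows that already exist in the unified table.
UPDATE dw.users_unified AS t
SET email = d.email,
    updated_at = d.updated_at
FROM users_delta d
WHERE t.schema_id = d.schema_id
  AND t.original_id = d.original_id;

-- Insert the rows that are missing.
INSERT INTO dw.users_unified (schema_id, original_id, email, updated_at)
SELECT d.schema_id, d.original_id, d.email, d.updated_at
FROM users_delta d
LEFT JOIN dw.users_unified t
  ON t.schema_id = d.schema_id AND t.original_id = d.original_id
WHERE t.original_id IS NULL;

-- Advance the high-water mark for this schema.
UPDATE dw.etl_sync_state AS s
SET last_txn_seq = d.max_seq,
    synced_at    = SYSDATE
FROM (SELECT MAX(txn_seq) AS max_seq FROM users_delta) d
WHERE s.schema_id = 'schema_001'
  AND d.max_seq IS NOT NULL;

DROP TABLE users_delta;
```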

You can automate this by creating a stored procedure or a script to handle the insertions and updates. Schedule this script using AWS Glue, AWS Lambda, or a similar job scheduling service to run at regular intervals and you're done.
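
A skeleton of such a stored procedure is shown below. Looping over 100+ schemas needs dynamic SQL (EXECUTE), because schema names can't be bound as parameters in static statements; the per-schema work is the stage/update/insert/advance sequence sketched above, and all names remain hypothetical.

```sql
CREATE OR REPLACE PROCEDURE dw.sync_users()
AS $$
DECLARE
    rec RECORD;
BEGIN
    -- Iterate over the schemas registered in the tracking table.
    FOR rec IN SELECT schema_id FROM dw.etl_sync_state LOOP
        RAISE INFO 'Syncing schema %', rec.schema_id;
        -- EXECUTE the per-schema stage / UPDATE / INSERT / high-water-mark
        -- statements here, substituting rec.schema_id into the SQL text.
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Run on a schedule, e.g. every 15 minutes, from AWS Glue, a Lambda triggered
-- by an EventBridge rule, or a similar scheduler:
CALL dw.sync_users();
```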

EXPERT
answered 5 months ago

When a table has to be resynced, do the txids continue where they left off? I may need to add something to the process that clears the tracking table when a table goes into a resync state. Are there any other issues that you think I will need to consider?

answered 5 months ago
