- Newest
- Most votes
- Most comments
The overall process I can think of goes through this points:
- Create a Unified View across all schemas to act as a single table containing all Users data.
- Incrementally Load Data using the INSERT INTO ... SELECT * FROM ... SQL command, filtering out already inserted rows.
- Update Existing Data using the UPDATE SQL command, joining source and destination tables on the primary key.
- Automate the Process with a stored procedure or script running at regular intervals, scheduled via AWS Glue, AWS Lambda, or another job scheduling service.
Let me know if this is helpful and accept the answer if ok.
Thank you. I have tested an INSERT and UPDATE SP, but that process is not optimal with 500M records in each Users table. Also, in the unified table, the records take on a new primary key to avoid conflicts. So, the UPDATE command has to do a lookup on 2 columns (original_id,schema_id) on 50B rows. It's a very costly procedure.
Zero-ETL appends 2 additional columns to the redshift tables (padb_internal_txn_seq_col, padb_internal_txn_id_col). Does anyone know how I can use those values to filter the inserts and updates to only the latest record changes in each schema to modify the unified table?
First, you can identify the latest records in each schema by using these internal columns. You'll need a tracking table to store the last processed transaction ID for each schema.
By comparing the current transaction IDs with those stored in the tracking table, you can filter out only the new changes.
For inserting new records, you add only those that aren't already in the unified table.
For updating existing records, you match records based on (original_id, schema_id) and update the necessary fields.
You can automate this by creating a stored procedure or a script to handle the insertions and updates. Schedule this script using AWS Glue, AWS Lambda, or a similar job scheduling service to run at regular intervals and you're done.
When a table has to be resynced, do the txids continue where they left off? I may need to add something to the process that clears the tracking table when a table goes into a resync state. Is there any other issues that you think I will need to consider?
Relevant content
- asked 9 months ago
- asked 2 years ago
- AWS OFFICIALUpdated a year ago
- AWS OFFICIALUpdated 4 months ago
- AWS OFFICIALUpdated 2 years ago