Approach to periodically pull data (>1M rows) from a third-party application's database for BI use cases within Redshift



I am considering AWS Glue to connect to a third-party application's Oracle database and extract a data set (in excess of 1M rows) produced by joining multiple tables at the source. The destination will be an AWS Redshift table. Further aggregation will be performed within Redshift for BI/analytics.

I have two questions: (1) Is the above approach cost-effective in the long run? The process will run daily.
(2) Is it advisable to instead bring data from the source tables as-is, without performing any joins during extraction, and then perform the joins and aggregations within Redshift?
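For context, the join-at-source variant I'm considering would push the join down to Oracle by wrapping it as a JDBC subquery. A minimal sketch follows; all table and column names are illustrative placeholders, not the real schema:

```python
# Sketch of the join-at-source option: wrap the multi-table join as a
# subquery so Oracle executes it and only the joined result set is
# transferred over JDBC. Names below are placeholders.

def pushdown_subquery(join_sql: str, alias: str = "src") -> str:
    """Wrap a join query so it can be passed as the JDBC 'dbtable' option."""
    return f"({join_sql}) {alias}"

join_sql = (
    "SELECT o.order_id, o.order_date, c.customer_name, i.amount "
    "FROM orders o "
    "JOIN customers c ON c.customer_id = o.customer_id "
    "JOIN order_items i ON i.order_id = o.order_id"
)

dbtable = pushdown_subquery(join_sql)

# In the Glue PySpark job this string would be the JDBC read target, e.g.:
# df = (spark.read.format("jdbc")
#       .option("url", "jdbc:oracle:thin:@//host:1521/service")
#       .option("dbtable", dbtable)
#       .option("fetchsize", 10000)  # a larger fetch size helps with >1M rows
#       .load())
print(dbtable)
```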

Thanks in advance,

S Ray

asked 3 months ago · 323 views
1 Answer

It depends a lot on your situation; all three options are valid.
Joining and aggregating in Oracle is the most efficient if that result is all you need, as long as the database has spare capacity for it.
If you can use a Zero-ETL integration to move the tables into Redshift, that avoids adding a tool, and you can then do the joins there.
And if you already have an ETL tool like Glue, or need it for something else, you could do the transformation there and avoid pushing the workload onto Redshift.
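As a sketch of the load-as-is variant, one common pattern is to land the raw source tables in Redshift staging tables and then run the join/aggregation there. The SQL below is illustrative only; every schema, table, and column name is an assumption:

```python
# Illustrative only: build the Redshift-side join/aggregation SQL that would
# run after the raw tables are landed. All names are placeholders.

def build_aggregation_sql(target: str = "bi.daily_sales") -> str:
    """Return an INSERT..SELECT that joins staged tables and aggregates."""
    return f"""
INSERT INTO {target} (order_date, customer_name, total_amount)
SELECT o.order_date, c.customer_name, SUM(i.amount)
FROM staging.orders o
JOIN staging.customers c ON c.customer_id = o.customer_id
JOIN staging.order_items i ON i.order_id = o.order_id
GROUP BY o.order_date, c.customer_name;
""".strip()

sql = build_aggregation_sql()
print(sql)
```

In practice this statement could be issued from Glue as a Redshift post-action, or scheduled inside Redshift itself, so the heavy join runs on the warehouse rather than on the source Oracle server.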

AWS
answered 3 months ago
  • Thanks Gonzalo. As we already use Glue for other ETLs, I think the third option is the most feasible. I agree the first is the best, but there are limits to how much leverage we have to run expensive queries on the third-party-owned Oracle DB server.
