DMS CDC to Redshift consuming too much Redshift CPU


I have a customer who is building their analytical platform on Redshift. One of their data sources is Oracle RDS, and they want to replicate all the data and changes from its 900 tables to Redshift. The initial design is based on DMS CDC: all 900 tables were split across 13 DMS tasks and started simultaneously on a single DMS instance. However, very high CPU consumption has been observed on Redshift (a single ra3.xlplus node, as the data size is expected to be only about 1 TB), and we are actively working with them on optimization.

Would the CPU usage be reduced (drastically) if we kept only, say, the 100 high-volume tables in the DMS CDC task, and used a periodic (say, hourly) DMS truncate-and-full-load task for the other 800?
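For illustration, here is a sketch of what that split could look like when created through boto3. Every identifier, ARN, and table name below is a placeholder, and the table mappings use the standard DMS selection-rule format:

```python
import json

import boto3  # assumption: AWS SDK for Python, with DMS permissions configured

dms = boto3.client("dms")

# Hypothetical names -- replace with the real schema and the 100 high-volume tables.
HIGH_VOLUME_TABLES = ["ORDERS", "ORDER_ITEMS", "PAYMENTS"]

cdc_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": str(i + 1),
            "rule-name": f"include-{table}",
            "object-locator": {"schema-name": "APP", "table-name": table},
            "rule-action": "include",
        }
        for i, table in enumerate(HIGH_VOLUME_TABLES)
    ]
}

# CDC-only task for the high-volume tables (ARNs are placeholders).
dms.create_replication_task(
    ReplicationTaskIdentifier="high-volume-cdc",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="cdc",
    TableMappings=json.dumps(cdc_mappings),
)

# Full-load task for the remaining tables, truncating targets before each load;
# it would be restarted on an hourly schedule (e.g. via EventBridge).
dms.create_replication_task(
    ReplicationTaskIdentifier="low-volume-hourly-reload",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load",
    TableMappings=json.dumps({"rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-rest",
        "object-locator": {"schema-name": "APP", "table-name": "%"},
        "rule-action": "include",
        # in practice you would also exclude the 100 CDC tables here
    }]}),
    ReplicationTaskSettings=json.dumps(
        {"FullLoadSettings": {"TargetTablePrepMode": "TRUNCATE_BEFORE_LOAD"}}
    ),
)
```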

AWS
Lei
asked 9 months ago · 295 views
1 Answer
Accepted Answer

Redshift's ra3.xlplus node type supports up to 32 TB of managed storage per node, but storage capacity is not the only factor to consider when sizing a cluster.

Redshift nodes come with a fixed amount of provisioned vCPU and memory, which play an important role in cluster sizing.

The following page lists the available configurations for all node types and sizes as of today: Redshift Instances

Note that your "pseudo-cluster" has only 4 vCPUs, which might not be enough to handle all the incoming changes from DMS given the number of tables, especially if those changes include updates. Redshift has a hard time dealing with updates: every time a record is updated, it rewrites the entire 1 MB data block in which the updated column value resides. Tables that are frequently updated or deleted require frequent maintenance (VACUUM and ANALYZE) to reclaim space and restore read performance.
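A minimal sketch of that maintenance loop, assuming the redshift_connector driver and placeholder connection details; the 20% thresholds are illustrative, not a recommendation:

```python
import redshift_connector  # assumption: the AWS redshift_connector driver

conn = redshift_connector.connect(
    host="my-cluster.xxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    database="dev",
    user="admin",
    password="...",
)
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()

# Find tables with a high unsorted percentage or stale statistics.
cur.execute("""
    SELECT "schema", "table"
    FROM svv_table_info
    WHERE unsorted > 20 OR stats_off > 20
    ORDER BY unsorted DESC
""")

for schema, table in cur.fetchall():
    # DELETE ONLY reclaims the space left behind by updates/deletes.
    cur.execute(f'VACUUM DELETE ONLY "{schema}"."{table}"')
    cur.execute(f'ANALYZE "{schema}"."{table}"')
```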

All that said, now to your question: probably not. It's highly likely that your 100 high-volume tables are responsible for most of the changes in the stream, so turning the other 800 into hourly overwrites will not drastically reduce CPU usage. If, for example, 80% of the CPU load is caused by those 100 tables, then even completely removing the others won't help much.
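Before re-architecting, it may be worth validating that assumption by checking where the CPU time is actually going. One way to do it (a sketch, again assuming redshift_connector and that the DMS endpoint logs in as its own database user) is to attribute CPU seconds per user over a recent window:

```python
import redshift_connector  # same driver assumption as above

conn = redshift_connector.connect(
    host="my-cluster.xxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    database="dev",
    user="admin",
    password="...",
)
cur = conn.cursor()

# Sum CPU seconds per database user over the last 24 hours; if DMS
# connects as a dedicated user, its share of the load stands out here.
cur.execute("""
    SELECT u.usename,
           SUM(m.query_cpu_time) AS cpu_seconds
    FROM svl_query_metrics_summary m
    JOIN stl_query q ON q.query = m.query
    JOIN pg_user u ON u.usesysid = q.userid
    WHERE q.starttime > DATEADD(hour, -24, GETDATE())
    GROUP BY u.usename
    ORDER BY cpu_seconds DESC
""")

for usename, cpu_seconds in cur.fetchall():
    print(usename, cpu_seconds)
```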

If this is their only data source, and given that the data volume is not that high, I strongly suggest leveraging a read replica for their analytical workload.
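If they go that route, the replica is a single API call; identifiers below are placeholders, and note that a readable (open-read-only) replica for RDS for Oracle has edition/licensing prerequisites (Active Data Guard):

```python
import boto3  # assumption: AWS SDK for Python, with RDS permissions configured

rds = boto3.client("rds")

# Instance identifiers and class are placeholders; the replica inherits
# the source's Oracle engine and serves read-only analytical queries.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="analytics-replica",
    SourceDBInstanceIdentifier="oracle-prod",
    DBInstanceClass="db.r5.xlarge",
    ReplicaMode="open-read-only",  # readable, rather than mounted
)
```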

answered 8 months ago
