I am mainly concerned about full load so I will follow up only on full load. I am migrating only 1 schema.
- The first point is correct but it is very high level. By setting the parallel level to 2 and MaxFileSize to 512 MB, I have made sure I am migrating only two tables at a time. This is the work I need to size for. Even if there are 10,000 tables, it will only work on 2 tables at a time. DMS writes the full load in batches, and the batch size is determined by MaxFileSize, so I only need to size for that work: 2 tables and 512 MB of data. When I tried this with CSV it was fine on t3.medium, but it did not work with Parquet; that combination works fine on t3.large. I am using small instances because my concern is not speed but cost.
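To make that concrete, this is the rough back-of-envelope calculation I am doing. The Parquet and CSV overhead factors are just my assumptions (Parquet output is buffered, encoded, and compressed in memory before the file is flushed), not AWS-documented numbers:

```python
# Rough full-load sizing sketch for the settings described above.
# ASSUMPTION: the per-format overhead factors are illustrative guesses,
# not AWS-documented numbers.

max_full_load_sub_tasks = 2    # tables loaded in parallel
max_file_size_mb = 512         # MaxFileSize per output file

# Hypothetical overhead: CSV is streamed out almost as-is, while Parquet
# is buffered, columnar-encoded and compressed in memory before the file
# is written to S3, so its working set per file tends to be larger.
overhead_factor = {"csv": 1.2, "parquet": 2.5}   # illustrative only

for fmt, factor in overhead_factor.items():
    working_set_mb = max_full_load_sub_tasks * max_file_size_mb * factor
    print(f"{fmt:8s} ~{working_set_mb:.0f} MB task working set "
          f"(plus the replication engine's own baseline memory)")
```

With these assumed factors the Parquet working set comes out around 2.5 GB, which would explain why a 4 GB t3.medium runs out of memory once the engine baseline is added, while an 8 GB t3.large does not.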
- I am not sure how this works with S3. When the data is already written to S3, how can the LOB be applied later? In any case, I do not have LOBs.
- Load frequency and transaction size: the DMS documentation mentions swapping when CDC is happening. Does it also happen during full load? In my case, I am working with a DB that doesn't have high load.
- This looks to be CDC specific.
Next, coming to CSV vs Parquet files, we would need to check multiple things, such as whether you migrated the same table with both CSV and Parquet formats, whether the task settings were the same, and whether the DML and DDL activity during the migrations was the same.
This database doesn't have high-frequency transactions (not even 10/second). DDL is done via deployments, and there was no deployment going on during either migration.
Thanks for contacting AWS.
I would like to answer your questions individually.
#1 I understand that you are using AWS DMS to migrate MySQL to S3 (Parquet format) with full load + CDC, MaxFullLoadSubTasks=2, and MaxFileSize=512 MB, and that you are getting an OOM error on a 4 GB instance when free memory reaches 1 GB, though the same settings work fine with CSV format, so you would like to know how to size the DMS instance for Parquet format given these parameters. A. Memory and disk space are key factors in selecting an appropriate replication instance for your use case. Below is a discussion of the use case characteristics to analyze when choosing a replication instance.
- Database and table size: Data volume helps determine the task configuration to optimize full load performance. For example, for two 1 TB schemas, you can partition tables into four tasks of 500 GB each and run them in parallel. The possible parallelism depends on the CPU resources available on the replication instance. That's why it's a good idea to understand the size of your database and tables; it helps determine the number of tasks you can run.
- Large objects: The data types that are present in your migration scope can affect performance. In particular, large objects (LOBs) impact performance and memory consumption. To migrate a LOB value, AWS DMS performs a two-step process. First, AWS DMS inserts the row into the target without the LOB value. Second, AWS DMS updates the row with the LOB value. This has an impact on memory, so it's important to identify LOB columns in the source and analyze their size.
- Load frequency and transaction size: Load frequency and transactions per second (TPS) influence memory usage. A high number of TPS or data manipulation language (DML) activities leads to high usage of memory. This happens because DMS caches the changes until they are applied to the target. During CDC, this leads to swapping (writing to the physical disk due to memory overflow), which causes latency.
- Table keys and referential integrity: Information about the keys of the table determines the CDC mode (batch apply or transactional apply) that you use to migrate data. In general, transactional apply is slower than batch apply. For long-running transactions, there can be many changes to migrate. When you use transactional apply, AWS DMS might require more memory to store the changes compared to batch apply. If you migrate tables without primary keys, batch apply will fail and the DMS task moves to transactional apply mode. When referential integrity is active between tables during CDC, AWS DMS uses transactional apply by default.
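For reference, here is a minimal sketch of where the settings discussed above live in the replication task settings JSON. The values are illustrative only, and the boto3 call is shown with a placeholder ARN:

```python
import json

# Minimal sketch of the ReplicationTaskSettings sections discussed above.
# Values are illustrative only; see the DMS task settings documentation
# for the full list of keys and their defaults.
task_settings = {
    "FullLoadSettings": {
        "MaxFullLoadSubTasks": 2       # number of tables loaded in parallel
    },
    "TargetMetadata": {
        "SupportLobs": False,          # no LOB columns in this migration
        "LimitedSizeLobMode": True,    # relevant only if LOBs are enabled
        "LobMaxSize": 32,              # KB, cap for limited LOB mode
        "BatchApplyEnabled": True      # toggles batch apply vs. transactional apply during CDC
    }
}

# The settings JSON is passed as a string when creating or modifying a task,
# e.g. with boto3 (placeholder ARN shown):
# import boto3
# dms = boto3.client("dms")
# dms.modify_replication_task(
#     ReplicationTaskArn="arn:aws:dms:...",
#     ReplicationTaskSettings=json.dumps(task_settings),
# )
print(json.dumps(task_settings, indent=2))
```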
Next, coming to CSV vs parquet files, we would need to check multiple things like if you were same table with both csv and parquet format and is task settings same. Were the DML and DDL activity during migration same.
#2 Similarly, is there any deterministic way to size for the CDC load? A. During the CDC phase, if the memory in a replication instance becomes insufficient, DMS writes data to the disk. Reading from the disk can cause latency, which you can avoid by sizing the replication instance with enough memory.
Running multiple tasks or tasks with high parallelism affects CPU consumption of the replication instance. This slows down the processing of the tasks and results in latency.
The memory settings of the task also determine the CDC load. I request you to refer to the following:
[+] How does AWS DMS use memory for migration? : https://repost.aws/knowledge-center/dms-memory-optimization
[+] How do I troubleshoot an AWS DMS "last error replication task out of memory" error? : https://repost.aws/knowledge-center/dms-troubleshoot-errors
[+] Change processing tuning settings : https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.ChangeProcessingTuning.html
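Of the settings in the last link above, the ChangeProcessingTuning memory limits are the ones most directly tied to CDC memory sizing. Here is a minimal sketch with what I understand to be the default values; please confirm them against the task settings reference:

```python
import json

# Minimal sketch of the ChangeProcessingTuning section referenced in the
# links above; these knobs control how much memory the task can use for
# cached CDC changes before it starts swapping to disk.
# The values below are the defaults as I understand them; treat them as
# illustrative and confirm against the task settings documentation.
change_processing_tuning = {
    "ChangeProcessingTuning": {
        "MemoryLimitTotal": 1024,     # MB of memory for transactions before swapping to disk
        "MemoryKeepTime": 60,         # seconds a transaction is kept in memory
        "BatchApplyMemoryLimit": 500  # MB available when batch apply mode is used
    }
}
print(json.dumps(change_processing_tuning, indent=2))
```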
