We use AWS DMS to export data from MySQL to S3, after which we run ETLs. The Glue ETLs use job bookmarks, so each run reads only what has changed since the last run. However, the raw data keeps growing as a very large number of kilobyte-sized files.
My plan is to:
1. Write a Glue job that reads all these small files and compacts them into ~256 MB files.
2. Create an S3 lifecycle (retention) rule on the DMS endpoint bucket to delete files older than 90 days.
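For the lifecycle part, the 90-day expiration can be expressed as a rule like the one below. This is only a sketch: the rule ID and the empty prefix are placeholders, and you'd scope the prefix to the DMS output path so you don't expire unrelated objects. It also assumes the compaction job has consumed a file well before day 90.

```json
{
  "Rules": [
    {
      "ID": "expire-raw-dms-output-after-90-days",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 90 }
    }
  ]
}
```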
The reasons:
1. For choosing 256 MB: I read somewhere that 256 MB is the preferred file size for Athena. Is that right?
2. For compacting the raw files: it makes the data easier for any other application to consume, i.e., reading a small number of 256 MB files instead of millions of KB-sized files.
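One way to size the compaction output is to count total input bytes and divide by the 256 MB target to get the number of partitions to `coalesce()`/`repartition()` to in the Glue job. A minimal sketch of that arithmetic (the constant and helper name are my own, not a Glue API):

```python
import math

# Assumed target object size; the question's 256 MB figure.
TARGET_FILE_BYTES = 256 * 1024 * 1024

def target_partitions(total_input_bytes: int) -> int:
    """Number of output files needed so each is roughly 256 MB."""
    return max(1, math.ceil(total_input_bytes / TARGET_FILE_BYTES))

# e.g. 10 GiB of small KB-sized raw files compact into 40 files
print(target_partitions(10 * 1024**3))  # → 40
```

In the Glue job you would sum the sizes of the bookmarked input files (e.g. via an S3 listing), then pass this count to the DataFrame's repartitioning call before writing.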