Given the characteristics you've described, it's challenging to provide an exact time estimate without more detailed information about your specific workload and infrastructure. However, I can provide a rough estimate based on the information given.
For processing 1 TB of data with simple transformations on the EMR cluster you've described, you could expect the processing time to be in the range of 1-3 hours. This estimate takes into account several factors:
- The cluster size is substantial, with 10 nodes, 40 total cores, and 160 GB total RAM, which should provide good processing power.
- The transformations are described as simple and stateless, which suggests they shouldn't be too time-consuming.
- The file sizes vary greatly (10 KB to 1 GB), which could impact processing efficiency. Smaller files may introduce some overhead.
- The custom parsers in Java for different file formats could add some complexity and processing time.
- The total daily data size of 2 TB and close to a million files suggests a high volume of smaller files, which could increase processing time.
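As a sanity check on the 1-3 hour range, the figures above can be turned into an implied per-core throughput and an average file size. This is a back-of-the-envelope sketch, not a benchmark: the function names are illustrative, it assumes decimal units (1 TB = 1,000,000 MB), and it assumes all 40 cores stay busy for the whole run, which real jobs rarely achieve.

```python
def per_core_throughput_mb_s(data_tb: float, hours: float, cores: int) -> float:
    """Implied MB/s each core must sustain to finish data_tb in the given hours.

    Assumption: decimal units (1 TB = 1,000,000 MB) and perfectly even
    utilisation of every core for the full duration.
    """
    total_mb = data_tb * 1_000_000
    return total_mb / (hours * 3600) / cores


def average_file_size_mb(total_tb: float, file_count: int) -> float:
    """Mean file size in MB, using the same decimal-unit assumption."""
    return total_tb * 1_000_000 / file_count


if __name__ == "__main__":
    # 1 TB on 40 cores, at both ends of the 1-3 hour estimate.
    for hours in (1, 3):
        rate = per_core_throughput_mb_s(1, hours, 40)
        print(f"{hours} h -> {rate:.2f} MB/s per core")

    # 2 TB per day spread over roughly one million files.
    print(f"average file size: {average_file_size_mb(2, 1_000_000):.1f} MB")
```

Under these assumptions the estimate corresponds to roughly 2.3-6.9 MB/s per core, and the average file is about 2 MB, which supports the concern that per-file overhead rather than raw compute may dominate.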
After processing, storing the data in Redshift and S3 would add some additional time. Writing to S3 is generally quick, but loading into Redshift could take longer, potentially adding another 30 minutes to an hour to the total time.
So a rough end-to-end estimate, covering EMR batch processing of 1 TB and storing the results in Redshift and S3, is in the range of 2-4 hours.
Keep in mind that this is a very rough estimate. Actual processing time can vary significantly based on many factors including the specific nature of your transformations, the efficiency of your custom parsers, network speeds, and the current load on your Redshift cluster. It's recommended to run some tests with a subset of your data to get a more accurate estimate for your specific use case.
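One simple way to turn such a test run into an estimate is linear extrapolation. The sketch below is only illustrative: the function name and all sample numbers are hypothetical placeholders, and it assumes runtime scales linearly with data volume after subtracting a fixed startup overhead. That assumption may not hold if the sample's file-size mix differs from the full dataset.

```python
def estimate_full_runtime_hours(
    sample_gb: float,
    sample_seconds: float,
    total_gb: float,
    fixed_overhead_seconds: float = 0.0,
) -> float:
    """Linearly extrapolate a sample run to the full data volume.

    Assumption: the fixed overhead (e.g. job startup) is paid once,
    and the remaining time scales in proportion to data size.
    """
    if sample_gb <= 0:
        raise ValueError("sample_gb must be positive")
    variable_seconds = sample_seconds - fixed_overhead_seconds
    if variable_seconds < 0:
        raise ValueError("fixed overhead exceeds the measured sample time")
    seconds_per_gb = variable_seconds / sample_gb
    total_seconds = fixed_overhead_seconds + seconds_per_gb * total_gb
    return total_seconds / 3600


if __name__ == "__main__":
    # Hypothetical measurement: a 50 GB sample took 600 s, of which
    # about 120 s was startup. Extrapolate to 1 TB (1,000 GB).
    hours = estimate_full_runtime_hours(
        sample_gb=50,
        sample_seconds=600,
        total_gb=1000,
        fixed_overhead_seconds=120,
    )
    print(f"estimated full run: {hours:.1f} h")
```

Running the test on two or three sample sizes and checking that the per-GB rate stays roughly constant is a cheap way to see whether the linear assumption is reasonable for your workload.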
