
Approximately how long should EMR batch processing and loading into Redshift take for 2 TB of data with simple transformations?


Approximately how long should EMR batch processing and loading into Redshift take for 2 TB of data with simple transformations? The data has the following characteristics:

  • File sizes vary from 10 KB to 1 GB
  • File formats need custom parsers, which are built in Java
  • There are 400 such parsers for different file formats
  • Data keeps arriving throughout the day
  • I have a persistent cluster
  • Transformations are simple and stateless
  • Data needs to be stored in Redshift and S3
  • Total data size in a day is 2 TB
  • Total number of files in a day is close to a million

Cluster size:
  • 10 nodes
  • 8 cores per node (40 total cores)
  • 32 GB RAM per node (160 GB total RAM)
1 Answer

Given the characteristics you've described, it's challenging to provide an exact time estimate without more detailed information about your specific workload and infrastructure. However, I can provide a rough estimate based on the information given.

For processing 2 TB of data with simple transformations on the EMR cluster you've described, you could expect the processing time to be roughly in the range of 1-3 hours. This estimate takes several factors into account:

  1. The cluster size is substantial, with 10 nodes, 40 total cores, and 160 GB total RAM, which should provide good processing power.

  2. The transformations are described as simple and stateless, which suggests they shouldn't be too time-consuming.

  3. The file sizes vary greatly (10 KB to 1 GB), which could impact processing efficiency. Smaller files may introduce some overhead.

  4. The custom parsers in Java for different file formats could add some complexity and processing time.

  5. The total daily data size of 2 TB and close to a million files suggests a high volume of smaller files, which could increase processing time.
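As a sanity check on that range, here is a back-of-envelope sketch. The per-core throughput and efficiency figures below are assumptions chosen for illustration, not measurements from your workload:

```python
def estimate_hours(total_bytes, cores, mb_per_sec_per_core, efficiency=0.5):
    """Rough wall-clock estimate: total data volume divided by effective
    aggregate cluster throughput. `efficiency` discounts the ideal rate
    for scheduling overhead, stragglers, and small-file handling."""
    effective_bytes_per_sec = cores * mb_per_sec_per_core * 1e6 * efficiency
    return total_bytes / effective_bytes_per_sec / 3600

# 2 TB across 40 cores, assuming ~25 MB/s per core after custom-parser
# overhead and 50% effective utilization -- roughly 1.1 hours
hours = estimate_hours(2e12, cores=40, mb_per_sec_per_core=25, efficiency=0.5)
print(round(hours, 1))
```

With those assumed rates the ideal case lands near the low end of the 1-3 hour range; heavier parser overhead or small-file scheduling costs push it toward the high end.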

After processing, storing the data in Redshift and S3 would add some additional time. Writing to S3 is generally quick, but loading into Redshift could take longer, potentially adding another 30 minutes to an hour to the total time.
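A common pattern here is to stage the transformed output in S3 and then issue a single parallel COPY into Redshift rather than row-by-row inserts, since COPY loads many S3 files in parallel across the cluster. A minimal sketch of building such a statement (the table name, bucket prefix, and IAM role below are placeholders, not values from your setup):

```python
def build_copy_statement(table, s3_prefix, iam_role, fmt="PARQUET"):
    """Build a Redshift COPY statement that loads staged files from S3.

    Redshift parallelizes COPY across the files under the prefix, so
    staging the transformed output in S3 first is usually much faster
    than issuing individual INSERT statements."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS {fmt};"
    )

stmt = build_copy_statement(
    "analytics.events",                                  # hypothetical table
    "s3://my-bucket/staged/2024-01-01/",                 # hypothetical prefix
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",   # placeholder role
)
print(stmt)
```

The statement itself would be executed against Redshift through your driver of choice; the helper just makes the pieces explicit.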

So, a rough estimate for the entire process of EMR batch processing 2 TB of data and storing it in Redshift and S3 could be in the range of 2-4 hours.

Keep in mind that this is a very rough estimate. Actual processing time can vary significantly based on many factors including the specific nature of your transformations, the efficiency of your custom parsers, network speeds, and the current load on your Redshift cluster. It's recommended to run some tests with a subset of your data to get a more accurate estimate for your specific use case.
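If testing shows that per-file overhead from the million small files dominates, one mitigation to experiment with is grouping files into evenly sized batches before processing, so each task handles a comparable volume of data instead of paying full scheduling overhead per tiny file. A minimal, framework-agnostic sketch of that idea:

```python
import heapq

def pack_files(file_sizes, num_batches):
    """Greedily assign files (largest first) to whichever batch is
    currently lightest, producing batches of roughly equal total size.
    The unique batch id in each heap tuple breaks ties between batches."""
    heap = [(0, i, []) for i in range(num_batches)]  # (batch_bytes, id, files)
    heapq.heapify(heap)
    for size in sorted(file_sizes, reverse=True):
        total, i, files = heapq.heappop(heap)
        files.append(size)
        heapq.heappush(heap, (total + size, i, files))
    return [files for _, _, files in heap]

# One 1 GB file, one 500 MB file, and two 10 KB files split into 2 batches:
# the big file gets its own batch, everything else shares the other.
batches = pack_files([1_000_000_000, 10_000, 10_000, 500_000_000], num_batches=2)
```

In practice the same idea maps onto your framework's own mechanisms (for example, input-split tuning or repartitioning), but batching by size is worth measuring against the naive one-task-per-file layout.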
Sources
AWS Open Data Analytics
Unlock scalability, cost-efficiency, and faster insights with large-scale data migration to Amazon Redshift | AWS Big Data Blog
Cloud Data Warehouse – Amazon Redshift Pricing– AWS

answered 7 months ago

