EMR Hive read/write performance issues when using S3 as storage layer


We recently migrated part of our Big Data platform from Cloudera to AWS EMR. We are seeing significant read/write performance degradation when running Hive with S3 as the storage layer. Performance is acceptable when we copy all the data from S3 onto HDFS local block storage on the EMR cluster, but this has increased our overall platform costs. Any suggestions or tips for improving EMR Hive performance on S3 would be appreciated.

asked 2 years ago, 756 views
1 Answer

I understand that you are seeing performance degradation when running Hive with S3 as the storage layer. Because S3 is a storage layer that lives outside of your VPC, reads and writes incur additional network traffic, which adds latency and cost. That said, I hope the following helps you minimize the degradation, if not prevent it.

  1. Please have a look at the Operational differences and considerations, which detail the drawbacks of using versioned buckets and the mitigation: update the S3 bucket lifecycle policy so that old object versions in the "/tmp" directory are deleted frequently.
  2. Enable the Hive EMRFS S3 optimized committer to take advantage of the performance enhancements built into the EMRFS library for S3 on EMR.
  3. Please check CloudTrail for S3 503 "Slow Down" errors from your Hive jobs. If you find any, follow the recommendations in the Knowledge Center article on that error.
  4. In the worst case, the recommendations above may still not yield any performance improvement. The case I am referring to involves S3's internal partitions, which should not be confused with data partitioning. The S3 limit is 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned prefix. If your Hive queries read a very large number of objects in the bucket, they may trigger a GetObject request for each of those objects, resulting in 503 "Slow Down" errors. In most cases retrying will help, but in extreme cases even retries may not succeed, because many worker nodes may be requesting objects at the same time. In that situation you will have to reach out to AWS Support to have this capacity increased for your bucket, which should give you better S3 read/write performance.
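
To make the arithmetic in point 4 concrete, here is a rough Python sketch. It only uses the per-prefix limits quoted above; the workload figures (40 worker nodes, 300 GET requests per second each), the function names, and the backoff parameters are hypothetical, and the estimate assumes requests are spread evenly across prefixes, which a real Hive job may not do. The second function sketches the general "exponential backoff with full jitter" retry pattern; it is not how EMRFS itself retries.

```python
import math
import random

# Per-prefix request limits quoted in the answer above (requests per second).
S3_WRITE_LIMIT_PER_PREFIX = 3500  # PUT/COPY/POST/DELETE
S3_READ_LIMIT_PER_PREFIX = 5500   # GET/HEAD


def min_prefixes_needed(requests_per_second, limit_per_prefix):
    """Smallest number of prefixes that keeps each prefix at or under its
    limit, assuming requests are spread evenly across the prefixes."""
    if requests_per_second <= 0:
        return 1
    return max(1, math.ceil(requests_per_second / limit_per_prefix))


def backoff_delays(max_retries, base_seconds=0.1, cap_seconds=20.0,
                   rng=random.random):
    """Exponential backoff with full jitter: the delay before retry i is
    rng() * min(cap_seconds, base_seconds * 2**i), with rng() in [0, 1)."""
    return [rng() * min(cap_seconds, base_seconds * 2 ** i)
            for i in range(max_retries)]


if __name__ == "__main__":
    # Hypothetical workload: 40 worker nodes, each issuing 300 GETs per second.
    peak_reads = 40 * 300  # 12,000 requests per second in total
    print("prefixes needed for reads:",
          min_prefixes_needed(peak_reads, S3_READ_LIMIT_PER_PREFIX))
    print("example retry delays (s):", backoff_delays(5))
```

With these assumed numbers, 12,000 GET requests per second against a 5,500 per-prefix limit would need the objects spread over at least 3 prefixes. If your real request rate sits well above what your current prefix layout supports, retries alone are unlikely to be enough.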

I believe the above will help you improve the performance of Hive on S3. Please reply to this thread if you have any follow-up questions.

answered 2 years ago
