What planning steps can I take when upgrading my Amazon EMR cluster?
I need to plan an Amazon EMR upgrade to keep pace with rapidly changing technology.
Short description
To keep up with the rapid changes in big data, you must upgrade your version of Amazon EMR. Migrating to a new version of Amazon EMR improves the operational excellence and efficacy of your workload. However, before you upgrade Amazon EMR, you must plan and prepare. There's information that you must review and procedures that you must follow.
Benefits of Amazon EMR version upgrades
Benefits of upgrading Amazon EMR include:
- The newest features increase productivity and lower costs.
- Updated applications run faster.
- Up-to-date bug fixes provide a stable infrastructure.
- The latest security patches strengthen security.
- New releases give you access to current open-source software features.
For example, with Amazon EMR version 6.6 and later, Log4j 1.x and Log4j 2.x are upgraded to Log4j 1.2.17 and Log4j 2.17.1 (or higher), respectively. In these later versions, you don't need bootstrap actions to mitigate the related common vulnerabilities and exposures (CVEs).
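As a rough illustration of a version check like this, the following Python sketch tests whether a release label falls at or above the 6.6 threshold mentioned above. It assumes labels follow the `emr-x.y.z` pattern. The function names are hypothetical. The check only encodes the 6.6 threshold from this article and says nothing about whether other release lines were patched.

```python
import re
from typing import Tuple


def parse_release_label(label: str) -> Tuple[int, int, int]:
    """Turn a label such as 'emr-6.9.0' into the tuple (6, 9, 0)."""
    match = re.fullmatch(r"emr-(\d+)\.(\d+)\.(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized release label: {label!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def has_upgraded_log4j(label: str) -> bool:
    """Return True when the label is 6.6.0 or later (threshold from the text)."""
    # Tuples compare element by element, so (6, 10, 0) > (6, 6, 0) as intended.
    return parse_release_label(label) >= (6, 6, 0)


if __name__ == "__main__":
    for candidate in ("emr-6.5.0", "emr-6.6.0", "emr-6.9.0"):
        print(candidate, has_upgraded_log4j(candidate))
```

Comparing integer tuples rather than strings avoids the common mistake of ranking "6.10" below "6.6".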
Resolution
Amazon EMR performance optimization features for open-source applications
Amazon EMR offers performance optimization features for many open-source applications.
Spark:
- Adaptive query execution
- Dynamic partition pruning
- Flattening scalar subqueries
- DISTINCT before INTERSECT
- Bloom filter join
- Optimized join reorder
- Improved Spark performance with Amazon Simple Storage Service (Amazon S3)
- Spark release history: Before deciding to upgrade Amazon EMR, check the version of Spark and its installed components in Amazon EMR releases.
Delta Lake:
- Using a Delta Lake cluster with Spark
- Using a Delta Lake cluster with Trino
- Delta release history: Before deciding to upgrade Amazon EMR, check the version of Delta Lake and its installed components in Amazon EMR releases.
Flink:
- Flink supported as a YARN application
- Flink release history: Before deciding to upgrade Amazon EMR, check the version of Flink and its installed components in Amazon EMR releases.
Hadoop:
- Transparent encryption in Hadoop Distributed File System (HDFS)
- Non-uniform memory access awareness for YARN containers
- Hadoop version history: Before deciding to upgrade Amazon EMR, check the version of Hadoop and its installed components in Amazon EMR releases.
HBase:
- HBase on Amazon S3
- HBase read replica clusters
- HBase snapshots
- HBase release history: Before deciding to upgrade Amazon EMR, check the version of HBase and its installed components in Amazon EMR releases.
HCatalog:
- Integrations with Amazon EMR releases
- Using the AWS Glue Data Catalog as the metastore for Apache Hive
- HCatalog release history: Before deciding to upgrade Amazon EMR, check the version of HCatalog and its installed components in Amazon EMR releases.
Hive:
- ACID transactions and Amazon S3
- Hive Live Long and Process (LLAP)
- Improve Hive performance
- Start the Hive EMR File System (EMRFS) S3 Optimized Committer
- Using S3 Select with Hive to improve performance
- Metastore check command (MSCK) optimization
- Hive release history: Before deciding to upgrade Amazon EMR, check the version of Hive and its installed components in Amazon EMR releases.
Hudi:
- Integrations with Amazon EMR releases
- Hudi release history: Before deciding to upgrade Amazon EMR, check the version of Hudi and its installed components in Amazon EMR releases.
Iceberg:
- Integrations with Amazon EMR releases
- Iceberg release history: Before deciding to upgrade Amazon EMR, check the version of Iceberg and its installed components in Amazon EMR releases.
Presto and Trino:
- Integrations with Amazon EMR releases
- Using S3 Select Pushdown with Presto to improve performance
- Adding database connectors
- Activating Presto strict mode
- Exchange Manager
- Using Presto automatic scaling with Graceful Decommission
- Presto release history and Trino release notes: Before deciding to upgrade Amazon EMR, check the version of Presto or Trino and its installed components in Amazon EMR releases.
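The version checks recommended in each list above can be scripted. The sketch below compares the application versions of two releases and reports what changed. It assumes each input dictionary has an `Applications` list of `Name` and `Version` pairs, which is the shape that the boto3 EMR `describe_release_label` call is expected to return. The release labels and version numbers in the sample data are made-up placeholders, not real release contents.

```python
from typing import Dict, List, Tuple


def application_versions(release: dict) -> Dict[str, str]:
    """Map application name to version for one release description."""
    return {app["Name"]: app["Version"] for app in release.get("Applications", [])}


def diff_releases(current: dict, target: dict) -> List[Tuple[str, str, str]]:
    """Return (name, current_version, target_version) for each app that differs."""
    old = application_versions(current)
    new = application_versions(target)
    changes = []
    for name in sorted(set(old) | set(new)):
        before = old.get(name, "absent")
        after = new.get(name, "absent")
        if before != after:
            changes.append((name, before, after))
    return changes


# Placeholder data only: labels and versions are invented for illustration.
current_release = {
    "ReleaseLabel": "emr-old",
    "Applications": [
        {"Name": "Spark", "Version": "1.0"},
        {"Name": "Hive", "Version": "2.0"},
        {"Name": "Ganglia", "Version": "3.0"},
    ],
}
target_release = {
    "ReleaseLabel": "emr-new",
    "Applications": [
        {"Name": "Spark", "Version": "1.1"},
        {"Name": "Hive", "Version": "2.0"},
        {"Name": "Trino", "Version": "4.0"},
    ],
}

if __name__ == "__main__":
    for name, before, after in diff_releases(current_release, target_release):
        print(f"{name}: {before} -> {after}")
```

Applications marked "absent" on either side deserve the closest look, because a component that is added or dropped between releases can change how existing jobs run.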
Planning for Amazon EMR version upgrades
Follow these steps to prepare for an Amazon EMR version upgrade:
- Research the issues that you're facing in your current Amazon EMR version.
- Isolate a small subset of applications or queries that you want to use to test your EMR cluster's performance.
- Set up an A/B testing strategy to decide the Amazon EMR version that's best for your solution. In A/B testing for Amazon EMR, you test two different versions of the service to compare how they perform in your environment.
- Gradually migrate the workload to the new version of Amazon EMR. If you discover major problems on the production version of Amazon EMR, you can end the migration process here.
- After migration is complete, terminate the old Amazon EMR cluster.
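The A/B testing step above can be made concrete with a small comparison script. This is a minimal sketch: it assumes that you already ran the same subset of queries on both clusters and recorded each runtime in seconds. The query names, runtimes, and the 10% tolerance are hypothetical values chosen for illustration.

```python
from typing import Dict, List


def compare_runtimes(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.10,
) -> Dict[str, List[str]]:
    """Sort queries into faster, similar, regressed, or missing on the candidate.

    Runtimes are assumed to be positive numbers of seconds.
    """
    report: Dict[str, List[str]] = {
        "faster": [], "similar": [], "regressed": [], "missing": [],
    }
    for query, old_seconds in sorted(baseline.items()):
        new_seconds = candidate.get(query)
        if new_seconds is None:
            # The query did not complete, or was not run, on the candidate version.
            report["missing"].append(query)
            continue
        change = (new_seconds - old_seconds) / old_seconds
        if change > tolerance:
            report["regressed"].append(query)
        elif change < -tolerance:
            report["faster"].append(query)
        else:
            report["similar"].append(query)
    return report


# Hypothetical measurements from the current and the candidate EMR version.
baseline_runs = {"q1": 100.0, "q2": 50.0, "q3": 200.0}
candidate_runs = {"q1": 80.0, "q2": 52.0, "q3": 260.0}

if __name__ == "__main__":
    print(compare_runtimes(baseline_runs, candidate_runs))
```

A nonempty "regressed" or "missing" list is a reasonable signal to pause the gradual migration and investigate before you move more of the workload.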
Fixing issues related to Amazon EMR version upgrades
Follow these steps to fix issues that you encounter when upgrading your Amazon EMR version:
- Reconfigure the application. Observe whether the changes improve your application's performance.
- Check whether the issues have been resolved by a newer version of the application.
- Change the application or queries to see whether you can avoid the issues.
- Check open defects and workarounds to improve the application. Contact AWS Premium Support to find out if there's a workaround.
- Stop the Amazon EMR migration until the issue is fixed or a workaround exists.
Considerations for Amazon EMR version upgrades
When you upgrade your version of Amazon EMR, performance regression might cause issues. Upgrades might change the API, which might affect your code's ability to run on a newer interface. Application slowness and failures might occur after an Amazon EMR version upgrade.
When you're thinking of upgrading your version of Amazon EMR, it's a best practice to read the What's new? section of the release guide. The What's new? section includes information about Amazon EMR release versions and dates, along with solutions to common issues with open-source applications.
Research open-source application changes and outstanding issues
Check the following release notes and open defects before deciding to migrate to a new Amazon EMR version. The following list of applications is based on Amazon EMR version 6.9.
Note: These hyperlinks take you to the third-party application websites, GitHub, or the Apache website.
- Flink release notes under Upgrade Flink and issue tracking
- Ganglia release notes and issue tracking
- Hadoop release notes and issue tracking
- HBase release notes and issue tracking
- HCatalog release notes and issue tracking
- Hive release notes and issue tracking
- Hue release notes and issue tracking
- JupyterEnterpriseGateway release notes and issue tracking
- JupyterHub release notes and issue tracking
- Livy release notes and issue tracking
- MXNet release notes and issue tracking
- Oozie release notes and issue tracking
- Phoenix release notes and issue tracking
- Pig release notes and issue tracking
- Presto release notes and issue tracking
- Spark release notes and issue tracking
- Sqoop release notes under Releases and issue tracking
- TensorFlow release notes and issue tracking
- Tez release notes and issue tracking
- Trino release notes and issue tracking
- Zeppelin release notes and issue tracking
- ZooKeeper release notes and issue tracking