AWS EMR on EC2 Managed Scaling stops working after 2 weeks of use


We use AWS EMR 7.2.0 on EC2 with instance fleets (only Primary and Core, no Spot Instances) and managed scaling for long-term use (weeks). On each of the 3 clusters we have started so far, we observed the following: after about 2 weeks of use with no problems, managed scaling up stops working and the cluster gets stuck at its minimum size. One can still manually increase the size via the minimum setting in managed scaling.

Is there a way to find specific error logs for managed scaling? Is the scaling logic executed on the cluster itself, or is it external?

We noticed that the list of terminated EC2 instances recorded for the cluster gets longer over time. At the point when scaling stops working, the list of core nodes is usually around 600. The CloudWatch metrics seem to be available, but after some time the standard plots show an error because too many metrics are involved.

The relevant metrics seem to be available. Here is an example of a situation (maximum 12 Core nodes, minimum 1 Core node) that stopped scaling before 10/6:

[Cluster Status plot]
[Node Status plot]

One can see in the cluster status that there are lots of containers pending in the periods right up to 10/6, but the cluster is not scaling up. In the event list there are no events between 10/5 and 10/7 (when we scaled up manually).

  • Finding the cause of the scaling not happening will require a deep dive with access to cluster details and cluster logs, which is only possible through a support case.

asked 2 months ago · 134 views
2 Answers

Hello,

Thank you for writing on re:Post.

I see that you have some questions regarding how EMR Managed Scaling works.

You can verify the metrics associated with Managed Scaling mentioned in [1]. Check whether any of these metrics crossed its threshold and whether the cluster scaled in response, because Managed Scaling only scales up if one of the metrics defined there crosses its defined threshold.
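
If it helps, here is a minimal sketch of pulling two of those cluster metrics from CloudWatch so you can compare them against the times the cluster actually scaled. It assumes boto3 with credentials configured; the cluster ID is a placeholder, and ContainerPendingRatio and YARNMemoryAvailablePercentage are metrics EMR publishes under the AWS/ElasticMapReduce namespace.

```python
import boto3
from datetime import datetime, timedelta

CLUSTER_ID = "j-XXXXXXXXXXXXX"  # placeholder -- replace with your cluster ID

cloudwatch = boto3.client("cloudwatch")

# Two of the cluster metrics EMR publishes; both carry the JobFlowId dimension.
for metric in ("ContainerPendingRatio", "YARNMemoryAvailablePercentage"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElasticMapReduce",
        MetricName=metric,
        Dimensions=[{"Name": "JobFlowId", "Value": CLUSTER_ID}],
        StartTime=datetime.utcnow() - timedelta(days=3),
        EndTime=datetime.utcnow(),
        Period=3600,                 # one datapoint per hour
        Statistics=["Average"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], round(point["Average"], 2))
```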

Unfortunately, we do not have logs specific to Managed Scaling, but on the EMR console, in the Events tab of the cluster details, you can see scaling events with their details.

The scaling logic runs externally to the cluster, on the AWS side, and the cluster load/health does not impact the scaling logic.

If there is very frequent scaling on the cluster, the list of cluster instances will grow, as it contains all the instances used by the cluster over its lifetime.
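
To get a sense of how much scaling churn has occurred, a short sketch like the one below can count the terminated core instances over the cluster's lifetime. It assumes boto3; the cluster ID is a placeholder, and since your cluster uses instance fleets the filter is on the core fleet.

```python
import boto3

CLUSTER_ID = "j-XXXXXXXXXXXXX"  # placeholder -- replace with your cluster ID

emr = boto3.client("emr")

# Page through every core-fleet instance that has already been terminated.
terminated = 0
paginator = emr.get_paginator("list_instances")
for page in paginator.paginate(
    ClusterId=CLUSTER_ID,
    InstanceFleetType="CORE",
    InstanceStates=["TERMINATED"],
):
    terminated += len(page["Instances"])

print(f"Terminated core instances so far: {terminated}")
```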

If you notice very frequent scaling, I would suggest fixing the number of core nodes and adding task nodes with scaling, as a best practice. This is recommended because scaling core nodes requires frequently moving HDFS data, which is risky for both the node and the data.
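
As an illustration of that pattern, here is a sketch that caps the core capacity so managed scaling only grows task capacity. It uses the PutManagedScalingPolicy API via boto3; the cluster ID and the capacity numbers are placeholders, and for an instance-fleets cluster the UnitType would typically be InstanceFleetUnits rather than Instances.

```python
import boto3

CLUSTER_ID = "j-XXXXXXXXXXXXX"  # placeholder cluster ID
emr = boto3.client("emr")

# Example limits only: keep core capacity fixed at 3 units and let managed
# scaling add or remove task capacity up to a total of 15 units.
emr.put_managed_scaling_policy(
    ClusterId=CLUSTER_ID,
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",          # use "InstanceFleetUnits" for instance fleets
            "MinimumCapacityUnits": 3,
            "MaximumCapacityUnits": 15,
            "MaximumCoreCapacityUnits": 3,    # core stays fixed; growth goes to task nodes
            "MaximumOnDemandCapacityUnits": 15,
        }
    },
)
```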

I will list below the Best Practices guide [2] for Managed Scaling for your reference. It also contains best practices for the EMR cluster as a whole.

I hope I managed to address your concern. If you have any further questions or concerns, a support case with more details will be really helpful in resolving your queries.

Thanks and have a great day ahead!

Reference Links:
[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/managed-scaling-metrics.html
[2] https://aws.github.io/aws-emr-best-practices/docs/bestpractices/Features/Managed%20Scaling/best_practices

AWS
SUPPORT ENGINEER
answered 2 months ago
  • Thanks for your clarification concerning the logs and the location of the managed scaling logic. We have already started a cluster with fixed core nodes and variable task nodes a few days ago. We will see how this works out over the next weeks.

    The relevant metrics seem to be available. I updated the question with plots of a particular situation where the scaling stopped working, and one can see a large number of pending containers without an increase in nodes.


The issue you're experiencing with AWS EMR on EC2 Managed Scaling stopping after about two weeks of use is not typical behavior and could be related to several factors. Here are some insights and suggestions to help troubleshoot the problem:

  1. Log Management: EMR has recently improved its on-cluster log management daemon and enabled log rotation for long-running clusters. This enhancement aims to minimize disk, memory, and CPU over-utilization, which could potentially impact scaling operations. You may want to check if your clusters are running on the latest EMR release that includes these improvements.

  2. Cluster Metrics: Amazon EMR continuously evaluates cluster metrics to make scaling decisions. It's possible that after two weeks, certain metrics are causing the scaling algorithm to behave unexpectedly. You can monitor managed scaling with CloudWatch metrics to understand how your cluster responds to changes in job load over time (see the alarm sketch after this list).

  3. Scaling Logic: The managed scaling logic is primarily executed externally to the cluster, not on the cluster itself. This means that even if there are issues with the cluster, the scaling service should still be operational.

  4. Terminated Instances: The growing list of terminated EC2 instances (around 600 core nodes) you've observed could be related to the issue. This might indicate that the cluster is repeatedly trying to scale but failing, possibly due to capacity issues or other constraints.

  5. Error Logs: To find specific error logs for managed scaling, you should check the following:

  • EMR console: Look for any error messages or warnings in the cluster details.
  • CloudWatch and S3 logs: EMR publishes cluster metrics to CloudWatch and, when a log URI is configured, archives cluster logs to S3; both might contain information relevant to scaling operations.
  • EMR Step Logs: If you're running any steps that might impact scaling, check these logs as well.
  6. Recent Improvements: AWS has made several enhancements to EMR Managed Scaling, including improved cluster utilization, minimized interruptions, and improved cluster resiliency. Upgrading to the latest EMR version might resolve your issue.

  7. Provisioning Timeout: A recent enhancement allows customers to specify a provisioning timeout for Spot Instances during cluster scaling operations. This feature provides better transparency and flexibility when scaling up using Spot Instances and might help in your situation.

  8. Task Node Optimization: If you're not already doing so, consider adding task nodes to your cluster configuration. Recent improvements have optimized task node scale-up for EMR clusters launched with instance groups.

  9. Manual Intervention: The fact that you can still manually increase the size via the minimum settings in managed scaling suggests that the issue might be with the automatic scaling decision-making process rather than a fundamental problem with adding instances.
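
Following up on the CloudWatch suggestion in point 2, here is a sketch of an alarm that fires when pending containers stay high, which is roughly the "lots of pending containers but no scale-up" situation you described. It assumes boto3; the cluster ID, SNS topic, and thresholds are placeholders to adjust for your workload.

```python
import boto3

CLUSTER_ID = "j-XXXXXXXXXXXXX"                       # placeholder cluster ID
SNS_TOPIC_ARN = "arn:aws:sns:region:account:topic"   # hypothetical notification topic

cloudwatch = boto3.client("cloudwatch")

# Alarm when the ratio of pending to allocated YARN containers stays above 0.75
# for three consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName=f"emr-{CLUSTER_ID}-containers-pending-no-scaleup",
    Namespace="AWS/ElasticMapReduce",
    MetricName="ContainerPendingRatio",
    Dimensions=[{"Name": "JobFlowId", "Value": CLUSTER_ID}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=0.75,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[SNS_TOPIC_ARN],
)
```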

To resolve this issue, consider upgrading to the latest EMR version, reviewing your CloudWatch metrics and logs for any anomalies, and potentially reaching out to AWS support for a more in-depth investigation of your specific cluster behavior.
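
As a concrete starting point for the log review, the sketch below lists the log prefixes EMR has archived for the cluster. It assumes boto3, a placeholder cluster ID, and the standard <log-uri>/<cluster-id>/ layout EMR uses when a log URI is configured.

```python
import boto3
from urllib.parse import urlparse

CLUSTER_ID = "j-XXXXXXXXXXXXX"  # placeholder cluster ID

emr = boto3.client("emr")
s3 = boto3.client("s3")

# Find the S3 location this cluster archives its logs to (if one was configured).
cluster = emr.describe_cluster(ClusterId=CLUSTER_ID)["Cluster"]
log_uri = cluster.get("LogUri")
if not log_uri:
    raise SystemExit("No log URI configured for this cluster.")

parsed = urlparse(log_uri.replace("s3n://", "s3://"))
bucket = parsed.netloc
base_prefix = parsed.path.lstrip("/")
if base_prefix and not base_prefix.endswith("/"):
    base_prefix += "/"

# List the top-level log folders (containers/, node/, steps/, ...) for the cluster.
resp = s3.list_objects_v2(
    Bucket=bucket,
    Prefix=f"{base_prefix}{CLUSTER_ID}/",
    Delimiter="/",
)
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])
```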

Sources
Improved scalability and resiliency for Amazon EMR on EC2 clusters | AWS Big Data Blog
Capacity Management and Amazon EMR Managed Scaling improvements for Amazon EMR on EC2 clusters | AWS Big Data Blog
4.1 - Managed Scaling | AWS Open Data Analytics
Amazon EMR release 6.3.1 - Amazon EMR

answered 2 months ago
AWS
SUPPORT ENGINEER
reviewed 2 months ago
