Hello,
Thank you for reaching out on re:Post.
I understand that you have some questions about how EMR Managed Scaling works.
You can verify the metrics associated with Managed Scaling that are listed in [1]. Check whether any of these metrics crossed its threshold and whether the cluster scaled in response, because Managed Scaling only scales up when one of these metrics crosses its defined threshold.
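For illustration, here is a minimal sketch of pulling two of the metrics from [1] out of CloudWatch with boto3; the cluster ID, region, and time window are placeholders to adapt:

```python
# Sketch: read recent values of two Managed Scaling related metrics for an
# EMR cluster from CloudWatch. Assumes boto3 credentials are configured;
# the cluster ID and region below are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

CLUSTER_ID = "j-XXXXXXXXXXXX"  # placeholder cluster ID
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # adjust region

end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

for metric in ("YARNMemoryAvailablePercentage", "ContainerPendingRatio"):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElasticMapReduce",
        MetricName=metric,
        Dimensions=[{"Name": "JobFlowId", "Value": CLUSTER_ID}],
        StartTime=start,
        EndTime=end,
        Period=300,          # 5-minute datapoints
        Statistics=["Average"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], round(point["Average"], 2))
```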
Unfortunately, there are no logs specific to Managed Scaling, but you can see scaling events with details in the Events tab of the cluster details page in the EMR console.
The scaling logic runs externally to the cluster, on the AWS side, so cluster load or health does not affect the scaling logic itself.
If the cluster scales very frequently, the list of cluster instances will grow, since it contains every instance used by the cluster over its lifetime.
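If it helps to quantify this, a rough sketch of counting the instances a cluster has cycled through (placeholder cluster ID and region, assuming boto3 credentials are configured) could look like:

```python
# Sketch: count running vs. terminated CORE/TASK instances for a cluster,
# to see how much churn Managed Scaling has produced over its lifetime.
import boto3

CLUSTER_ID = "j-XXXXXXXXXXXX"  # placeholder cluster ID
emr = boto3.client("emr", region_name="eu-west-1")  # adjust region

paginator = emr.get_paginator("list_instances")
for group_type in ("CORE", "TASK"):
    for state in ("RUNNING", "TERMINATED"):
        total = 0
        for page in paginator.paginate(
            ClusterId=CLUSTER_ID,
            InstanceGroupTypes=[group_type],
            InstanceStates=[state],
        ):
            total += len(page["Instances"])
        print(f"{group_type:>4} {state:<10} {total}")
```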
If you notice very frequent scaling, I suggest fixing the number of core nodes and scaling with task nodes instead, as a best practice. Scaling core nodes requires frequently moving HDFS data, which puts both the nodes and the data at risk.
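As a sketch of that setup, assuming an instance-group cluster whose capacity is measured in instances and placeholder limits you would tune to your workload, a managed scaling policy that keeps the core group fixed and lets only task nodes scale could look like:

```python
# Sketch: pin core capacity and let Managed Scaling vary only task nodes.
# The cluster ID, region, and capacity numbers are placeholders.
import boto3

CLUSTER_ID = "j-XXXXXXXXXXXX"  # placeholder cluster ID
emr = boto3.client("emr", region_name="eu-west-1")  # adjust region

emr.put_managed_scaling_policy(
    ClusterId=CLUSTER_ID,
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 5,        # lower bound for total capacity (here the fixed core size)
            "MaximumCapacityUnits": 50,       # overall ceiling (core + task)
            "MaximumCoreCapacityUnits": 5,    # keeps the core group at its fixed size
            "MaximumOnDemandCapacityUnits": 50,
        }
    },
)
```

The idea is that with MaximumCoreCapacityUnits equal to the core group's size, all scaling above that limit happens on task nodes only.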
I have listed the Best Practices guide [2] for Managed Scaling below for your reference. It also contains best practices for the EMR cluster as a whole.
I hope this addresses your concern. If you have any further questions or concerns, a Support case with more details will be very helpful in resolving them.
Thanks and have a great day ahead!
Reference Links:
[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/managed-scaling-metrics.html
[2] https://aws.github.io/aws-emr-best-practices/docs/bestpractices/Features/Managed%20Scaling/best_practices
Thanks for your clarification concerning the logs and the location of the managed scaling logic. We already started a cluster with fixed core nodes and variable task nodes a few days ago, and we will see how this works out over the coming weeks.
The relevant metrics seem to be available. I updated the question with plots of a particular situation where scaling stopped working and a large number of pending containers can be seen without any increase in nodes.
The issue you're experiencing, with AWS EMR on EC2 Managed Scaling stopping after about two weeks of use, is not typical behavior and could be related to several factors. Here are some insights and suggestions to help troubleshoot the problem:
- Log Management: EMR has recently improved its on-cluster log management daemon and enabled log rotation for long-running clusters. This enhancement aims to minimize disk, memory, and CPU over-utilization, which could potentially impact scaling operations. You may want to check if your clusters are running on the latest EMR release that includes these improvements.
- Cluster Metrics: Amazon EMR continuously evaluates cluster metrics to make scaling decisions. It's possible that after two weeks, certain metrics are causing the scaling algorithm to behave unexpectedly. You can monitor managed scaling with CloudWatch metrics to understand how your cluster responds to changes in job load over time (see the sketch after this list).
- Scaling Logic: The managed scaling logic is primarily executed externally to the cluster, not on the cluster itself. This means that even if there are issues with the cluster, the scaling service should still be operational.
- Terminated Instances: The growing list of terminated EC2 instances (around 600 core nodes) you've observed could be related to the issue. This might indicate that the cluster is repeatedly trying to scale but failing, possibly due to capacity issues or other constraints.
- Error Logs: To find specific error logs for managed scaling, check the following:
  - EMR console: Look for any error messages or warnings in the cluster details.
  - CloudWatch: EMR publishes cluster metrics and events to CloudWatch, which might contain information about scaling operations.
  - EMR step logs: If you're running any steps that might impact scaling, check these logs as well.
- Recent Improvements: AWS has made several enhancements to EMR Managed Scaling, including improved cluster utilization, minimized interruptions, and improved cluster resiliency. Upgrading to the latest EMR version might resolve your issue.
- Provisioning Timeout: A recent enhancement allows customers to specify a provisioning timeout for Spot Instances during cluster scaling operations. This feature provides better transparency and flexibility when scaling up using Spot Instances and might help in your situation.
- Task Node Optimization: If you're not already doing so, consider adding task nodes to your cluster configuration. Recent improvements have optimized task node scale-up for EMR clusters launched with instance groups.
- Manual Intervention: The fact that you can still manually increase the size via the minimum settings in managed scaling suggests that the issue might be with the automatic scaling decision-making process rather than a fundamental problem with adding instances.
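As mentioned in the Cluster Metrics item above, one rough way to spot a cluster that is stuck asking for capacity is to compare the capacity Managed Scaling has requested with what is actually running. The sketch below assumes boto3 credentials, a placeholder cluster ID and region, and the metric names from the managed scaling metrics documentation:

```python
# Sketch: compare how much capacity Managed Scaling has requested versus what
# is actually running, and how many containers are pending, to spot a cluster
# that appears stuck. Cluster ID, region, and time window are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

CLUSTER_ID = "j-XXXXXXXXXXXX"  # placeholder cluster ID
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # adjust region

end = datetime.now(timezone.utc)
start = end - timedelta(hours=12)

def latest_average(metric_name):
    """Return the most recent 5-minute average for an EMR cluster metric."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElasticMapReduce",
        MetricName=metric_name,
        Dimensions=[{"Name": "JobFlowId", "Value": CLUSTER_ID}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else None

requested = latest_average("TotalUnitsRequested")
running = latest_average("TotalUnitsRunning")
pending = latest_average("ContainerPending")

print(f"requested={requested} running={running} pending_containers={pending}")
if requested is not None and running is not None and requested > running:
    print("Managed Scaling has asked for more capacity than is running - "
          "check the Events tab and consider a Support case.")
```

If requested capacity stays above running capacity for a long time while containers are pending, that is exactly the kind of detail worth attaching to a Support case.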
To resolve this issue, consider upgrading to the latest EMR version, reviewing your CloudWatch metrics and logs for any anomalies, and potentially reaching out to AWS support for a more in-depth investigation of your specific cluster behavior.
Sources
Improved scalability and resiliency for Amazon EMR on EC2 clusters | AWS Big Data Blog
Capacity Management and Amazon EMR Managed Scaling improvements for Amazon EMR on EC2 clusters | AWS Big Data Blog
4.1 - Managed Scaling | AWS Open Data Analytics
Amazon EMR release 6.3.1 - Amazon EMR
Finding the cause of the scaling not happening will require a deep dive with access to cluster details and cluster logs, which is only possible through a Support case.