Why was my EMR cluster terminated?
My Amazon EMR cluster terminated unexpectedly.
Resolution
Review Amazon EMR provisioning logs stored in Amazon S3
Amazon EMR cluster logs are stored in an Amazon Simple Storage Service (Amazon S3) bucket that's specified at cluster launch. The logs are stored at s3://example-log-location/example-cluster-ID/node/example-EC2-instance-ID/.
Note: Replace example-log-location, example-cluster-ID, and example-EC2-instance-ID with the values for your cluster.
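As a complement to browsing the S3 logs, you can also read a terminated cluster's state-change reason and its configured log location through the DescribeCluster API. The following is a minimal sketch that assumes boto3, a placeholder Region, and a placeholder cluster ID:

# Minimal sketch (boto3): look up a terminated cluster's state-change reason
# and its configured S3 log location. The Region and cluster ID are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

cluster = emr.describe_cluster(ClusterId="j-2YJXXXXXXX")["Cluster"]

# The state-change reason summarizes why the cluster terminated.
reason = cluster["Status"]["StateChangeReason"]
print("Termination reason:", reason.get("Code"), "-", reason.get("Message"))

# LogUri is the s3://example-log-location/ prefix where the node logs are written.
print("Log location:", cluster.get("LogUri"))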
The following is a list of common errors:
- SHUTDOWN_STEP_FAILED (USER_ERROR)
- NO_SLAVES_LEFT (SYSTEM_ERROR)
- The master failed: Error occurred: <html><head><title>502 Bad Gateway</title></head><body><center><h1>502 Bad Gateway</h1></center><hr><center>nginx/1.16.1</center></body></html>
- KMS_ISSUE (USER_ERROR)
- Terminated with errors, The master node was terminated by user.
Note: The preceding are the most common termination errors. EMR clusters might be terminated due to errors other than those listed. For more information, see Resource errors.
SHUTDOWN_STEP_FAILED (USER_ERROR)
When you submit a step job in your EMR cluster, you can specify the step failure behavior in the ActionOnFailure parameter. If you select TERMINATE_CLUSTER or TERMINATE_JOB_FLOW for the ActionOnFailure parameter and the step fails, then the EMR cluster terminates. For more information, see StepConfig.
The following is an example error message from AWS CloudTrail:
{ "severity": "ERROR", "actionOnFailure": "TERMINATE_JOB_FLOW", "stepId": "s-2I0GXXXXXXXX", "name": "Example Step", "clusterId": "j-2YJXXXXXXX", "state": "FAILED", "message": "Step s-2I0GXXXXXXXX (Example Step) in Amazon EMR cluster j-2YJXXXXXXX failed at 202X-1X-0X 0X:XX UTC." }
To avoid this error, use the CONTINUE or CANCEL_AND_WAIT option in the ActionOnFailure parameter when you submit the step job.
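The following is a minimal sketch of submitting a step with ActionOnFailure set to CONTINUE, assuming boto3; the Region, cluster ID, step name, and JAR arguments are placeholders:

# Minimal sketch (boto3): submit a step that does not terminate the cluster
# on failure. The Region, cluster ID, step name, and arguments are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-2YJXXXXXXX",
    Steps=[
        {
            "Name": "Example Step",
            # CONTINUE (or CANCEL_AND_WAIT) instead of TERMINATE_CLUSTER /
            # TERMINATE_JOB_FLOW keeps the cluster running if the step fails.
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/example-app.py"],
            },
        }
    ],
)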
NO_SLAVES_LEFT (SYSTEM_ERROR)
This error occurs when:
- Termination protection is turned off in the EMR cluster.
- All core nodes exceed disk storage capacity as specified by a maximum utilization threshold in the yarn-site configuration classification. The default maximum utilization threshold is 90%.
- The core nodes are Spot Instances, and the Spot Instances are terminated with TERMINATED_BY_SPOT_DUE_TO_NO_CAPACITY.
For information on Spot Instance termination, see Why did Amazon EC2 interrupt my Spot Instance?
For more information on the NO_SLAVE_LEFT error, see Cluster terminated with NO_SLAVE_LEFT and core nodes FAILED_BY_MASTER.
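For reference, the disk-utilization threshold described in the preceding list comes from the yarn-site configuration classification. The following is a minimal sketch, assuming the standard YARN property name, of how that classification is expressed (for example, in the Configurations parameter when you create a cluster):

# Sketch: the yarn-site classification that carries the disk-utilization
# threshold (shown here with the 90 percent default). It can be passed in the
# Configurations parameter when a cluster is created.
yarn_site_classification = [
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage": "90.0"
        },
    }
]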
The following is an example error message from the instance-controller:
202X-0X-0X 1X:5X:5X,968 INFO Poller: InstanceJointStatusMap contains X entries (DD:5 R:3):
  i-0e336xxxxxxxxxxxx 25d21h R 25d21h ig-22 ip-1x-2xx-xx-1xx.local.xxx.com I: 52s Y:U 98s c: 0 am: 0 H:R 1.1%
  Yarn unhealthy Reason : 1/4 local-dirs usable space is below configured utilization percentage/no more usable space [ /mnt/yarn : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /var/log/hadoop-yarn/containers : used space above threshold of 90.0% ]
To resolve this error:
- Keep termination protection turned on for your clusters. For more information, see Termination protection and unhealthy YARN nodes.
- Use Amazon EMR scaling policies (automatic scaling and managed scaling) to scale core nodes based on your requirements. For more information, see Use cluster scaling.
- Add more Amazon Elastic Block Store (Amazon EBS) capacity to your cluster. For more information, see How can I resolve "Exit status: -100. Diagnostics: Container released on a *lost* node" errors in Amazon EMR?
- Create an alarm for the MRUnhealthyNodes Amazon CloudWatch metric. You can set up a notification for this alarm to warn you of unhealthy nodes before the 45-minute timeout is reached, as shown in the sketch after this list. For more information, see Create a CloudWatch alarm based on a static threshold.
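The following is a minimal sketch of an MRUnhealthyNodes alarm, assuming boto3; the Region, cluster ID, and SNS topic ARN are placeholders:

# Minimal sketch (boto3): alarm when any node in the cluster reports as
# unhealthy. The Region, cluster ID, and SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="emr-unhealthy-nodes-j-2YJXXXXXXX",
    Namespace="AWS/ElasticMapReduce",
    MetricName="MRUnhealthyNodes",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-2YJXXXXXXX"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    # Notify before the unhealthy node reaches the 45-minute timeout.
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:example-topic"],
)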
502 Bad Gateway
The 502 Bad Gateway error occurs when Amazon EMR internal systems can't reach the primary node for a period of time. The cluster is terminated if termination protection is turned off. When the instance-controller service is down, check the latest instance-controller logs and instance state logs. If the instance-controller standard output shows that the service terminated because of insufficient memory, then the cluster's primary node is low on memory.
The following is an example error message from the instance state log:
# dump instance controller stdout
tail -n 100 /emr/instance-controller/log/instance-controller.out
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fb46c7c8000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hs_err_pid16110.log

# whats memory usage look like
free -m
              total        used        free      shared  buff/cache   available
Mem:          15661       15346         147           0         167          69
Swap:             0           0           0
To avoid the preceding error, launch the EMR cluster with a larger instance type so that the primary node has enough memory for your cluster's requirements. Also, clean up disk space to avoid memory issues in long-running clusters. For more information, see How do I troubleshoot primary node failure with error "502 Bad Gateway" or "504 Gateway Time-out" in Amazon EMR?
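The following is a minimal sketch of launching a cluster with a larger primary (master) instance type, assuming boto3; the Region, release label, instance types, roles, and log URI are placeholders that you would adjust for your environment:

# Minimal sketch (boto3): launch a cluster with a larger primary node so that
# the instance-controller and other primary-node services have more memory.
# Region, release label, instance types, roles, and log URI are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="example-cluster",
    ReleaseLabel="emr-6.15.0",
    LogUri="s3://example-log-location/",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.2xlarge",   # larger primary node for more memory
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": True,         # see the NO_SLAVES_LEFT section
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole_V2",
)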
KMS_ISSUE (USER_ERROR)
When you use an Amazon EMR security configuration to encrypt an Amazon EBS root device and storage volumes, the Amazon EMR service role must have the required AWS KMS permissions. If the necessary permissions are missing, then you receive the KMS_ISSUE error.
The following is an example error message from AWS CloudTrail:
The EMR Service Role must have the kms:GenerateDataKey* and kms:ReEncrypt* permission for the KMS key configuration when you enabled EBS encryption by default. You can retrieve that KMS key's ID by using the ec2:GetEbsDefaultKmsKeyId API.
To avoid the preceding error, make sure that security configurations that are used to encrypt the Amazon EBS root device and storage volumes have the necessary permissions. For these configurations, be sure that the Amazon EMR service role (EMR_DefaultRole_V2) has permissions to use the specified AWS Key Management Service (AWS KMS) key.
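The following is a minimal sketch of one way to grant those permissions, assuming boto3, the default EMR service role name, and the account's default EBS encryption key; the Region and policy name are hypothetical placeholders:

# Minimal sketch (boto3): find the default EBS encryption key and attach an
# inline policy with the KMS permissions named in the error message to the
# EMR service role. The Region and policy name are placeholders.
import json
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
kms = boto3.client("kms", region_name="us-east-1")
iam = boto3.client("iam")

# KMS key used when EBS encryption by default is turned on.
key_id = ec2.get_ebs_default_kms_key_id()["KmsKeyId"]
key_arn = kms.describe_key(KeyId=key_id)["KeyMetadata"]["Arn"]

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["kms:GenerateDataKey*", "kms:ReEncrypt*"],
            "Resource": key_arn,
        }
    ],
}

iam.put_role_policy(
    RoleName="EMR_DefaultRole_V2",
    PolicyName="example-emr-ebs-kms-access",
    PolicyDocument=json.dumps(policy),
)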
Terminated with errors, The master node was terminated by user
When the EMR cluster primary node stops for any reason, the cluster terminates with the error The master node was terminated by user.
The following is an example error message from AWS CloudTrail:
eventTime": "2023-01-18T08:07:02Z", "eventSource": "ec2.amazonaws.com", "eventName": "StopInstances", "awsRegion": "us-east-1", "sourceIPAddress": "52.xx.xx.xx", "userAgent": "AWS Internal", "requestParameters": { "instancesSet": { "items": [ { "instanceId": "i-xxf6c5xxxxxxxxxxx" } ] }, "force": false },
Because stopping the EMR primary node or all core nodes leads to cluster termination, avoid stopping or rebooting cluster nodes.
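If you need to identify what stopped the node, you can search CloudTrail for the StopInstances call against the instance. The following is a minimal sketch, assuming boto3, a placeholder Region, and a placeholder instance ID:

# Minimal sketch (boto3): search recent CloudTrail events for the instance and
# report StopInstances calls. The Region and instance ID are placeholders.
import boto3

cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "ResourceName", "AttributeValue": "i-xxf6c5xxxxxxxxxxx"}
    ],
    MaxResults=50,
)

for event in events["Events"]:
    if event["EventName"] == "StopInstances":
        print(event["EventTime"], event.get("Username"), event["EventName"])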