EMR Cluster failure with "On the master instance, application provisioning failed"

7 minute read
Content level: Advanced

This article helps you investigate an EMR cluster that terminated with the error "On the master instance, application provisioning failed".

When an EMR cluster fails to provision with the exception shown below, first consult the logs in the configured S3 log destination, provided they were published successfully before termination. Alternatively, enabling development mode gives you direct login access to the primary node so that you can investigate the root cause of the failure.


On the master instance, application provisioning failed


In the console, go to the instance groups in the Instances (Hardware) tab, and note the EC2 instance ID of the primary node, or of any node whose bootstrap action failed. Then, in the S3 log destination of the failed cluster, open the master log for the primary node's instance ID at this path: S3-Bucket/<EMR-Cluster>/node/<Primary node EC2 InstanceID>/bootstrap-actions/master.log.gz

2024-04-09 17:34:12,690 INFO i-xxxxxxxxxxxxxx1: new instance started
2024-04-09 17:34:12,695 ERROR i-xxxxxxxxxxxxxx1: failed to start. bootstrap action 1 failed with non-zero exit code.
2024-04-09 17:34:30,236 INFO i-xxxxxxxxxxxxxx2: new instance started
2024-04-09 17:34:32,257 INFO i-xxxxxxxxxxxxxx2: all bootstrap actions complete and instance ready

The master log identifies which instance failed and the number of the bootstrap action (BA) script that failed. In the example above, "bootstrap action 1" failed even though no BA scripts were supplied during cluster provisioning, so this refers to a BA script provisioned by EMR itself. If you supply three BA scripts and all three succeed, but the EMR-provisioned BA script then fails, the log reports "bootstrap action 4" as the point of failure.
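The failed action number can be pulled out of the log with a quick grep. A minimal sketch follows; the heredoc stands in for the real master.log, which you would first download from S3 (for example with `aws s3 cp`) and gunzip, substituting your own bucket, cluster, and instance IDs:

```shell
# Sample line standing in for the downloaded, gunzipped master.log.
cat > master.log <<'EOF'
2024-04-09 17:34:12,695 ERROR i-xxxxxxxxxxxxxx1: failed to start. bootstrap action 1 failed with non-zero exit code.
EOF

# Extract which bootstrap action failed.
grep -o 'bootstrap action [0-9]*' master.log
# -> bootstrap action 1
```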

In this example, "bootstrap action 1" (the EMR-provisioned BA script) failed, so the next step is to check the provision-node log. Go to S3-Bucket/<EMR-Cluster>/node/<Primary node EC2 InstanceID>/provision-node/apps-phase/0/<uuid>/. Three files are available there: puppet.log.gz, stderr.gz, and stdout.gz. When you log in to the primary node to troubleshoot, the same logs are available under the /var/log/provision-node directory.

In puppet.log, check whether any error is reported. Error entries are tagged with (err).

In the example below, the spark-redshift package failed to install:

2024-04-10 15:02:49 +0000 /Stage[main]/Spark::Spark_redshift/Package[spark-redshift]/ensure (err): change from 'purged' to 'latest' failed: Could not update: Execution of '/bin/yum -d 0 -e 0 -y list spark-redshift' returned 1: 
Existing lock /var/run/yum.pid: another copy is running as pid 20416.
Another app is currently holding the yum lock; waiting for it to exit...
  The other application is: python2.7
    Memory : 218 M RSS (439 MB VSZ)
    Started: Wed Apr 10 15:02:24 2024 - 00:11 ago
    State  : Running, pid: 20416

The next example shows that Puppet was unable to configure the embedded metastore for the Spark application:

2024-04-04 16:13:00 +0000 /Stage[main]/Spark::Common/Exec[Add livy permission for embedded metastore dir] (info): Starting to evaluate the resource (394 of 669)
2024-04-04 16:13:00 +0000 Puppet (err): sh: line 1: setfacl: command not found 
2024-04-04 16:13:00 +0000 /Stage[main]/Spark::Common/Exec[Add livy permission for embedded metastore dir]/returns (err): change from 'notrun' to ['0'] failed: sh: line 1: setfacl: command not found 
2024-04-04 16:13:00 +0000 /Stage[main]/Spark::Common/Exec[Add livy permission for embedded metastore dir] (info): Evaluated in 0.01 seconds
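Scanning puppet.log for (err) entries can be scripted; a minimal sketch, with sample lines standing in for the real (gunzipped) log content:

```shell
# Sample entries standing in for the real puppet.log.
cat > puppet.log <<'EOF'
2024-04-04 16:13:00 +0000 /Stage[main]/Spark::Common/Exec[Add livy permission for embedded metastore dir] (info): Starting to evaluate the resource (394 of 669)
2024-04-04 16:13:00 +0000 Puppet (err): sh: line 1: setfacl: command not found
EOF

# Print only the error entries; -F matches the literal string "(err)".
grep -F '(err)' puppet.log
```

On a live node, the same scan works directly against /var/log/provision-node/puppet.log.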

Also check the stderr and stdout logs for any explicit errors.

In this example, the error "Failed to install packages" appears:

2024-04-10 15:03:30,060 INFO NodeProvisionerWorkflow: Finished processing puppet report for failure message
2024-04-10 15:03:30,163 ERROR Program: Encountered a problem while provisioning
com.amazonaws.emr.node.provisioner.puppet.api.PuppetException: Could not update: Execution of '/bin/yum -d 0 -e 0 -y list spark-redshift' returned 1: Existing lock /var/run/yum.pid: another copy is running as pid 20416.\nAnother app is currently holding the yum lock; waiting for it to exit...\n  The other application is: python2.7\n    Memory : 218 M RSS (43...
	at com.amazonaws.emr.node.provisioner.workflow.NodeProvisionerWorkflow.work(NodeProvisionerWorkflow.java:162) ~[node-provisioner-2.48.0.jar:?]
	at com.amazonaws.emr.node.provisioner.Program.main(Program.java:31) [node-provisioner-2.48.0.jar:?]

2024-04-09 17:49:30,566 ERROR Program: Encountered a problem while provisioning
java.lang.RuntimeException: Amazon-linux-extras topics enabling or yum packages installation failed.
	at com.amazonaws.emr.node.provisioner.install.packaging.AL2CompatibleYumPackageInstaller.install(AL2CompatibleYumPackageInstaller.java:57) ~[node-provisioner-2.54.0.jar:?]
	at com.amazonaws.emr.node.provisioner.install.ComponentInstaller.install(ComponentInstaller.java:42) ~[node-provisioner-2.54.0.jar:?]
	at com.amazonaws.emr.node.provisioner.install.PackageProvisioner.provision(PackageProvisioner.java:46) ~[node-provisioner-2.54.0.jar:?]
	at com.amazonaws.emr.node.provisioner.workflow.NodeProvisionerWorkflow.doWork(NodeProvisionerWorkflow.java:237) ~[node-provisioner-2.54.0.jar:?]
	at com.amazonaws.emr.node.provisioner.workflow.NodeProvisionerWorkflow.work(NodeProvisionerWorkflow.java:129) ~[node-provisioner-2.54.0.jar:?]
	at com.amazonaws.emr.node.provisioner.Program.main(Program.java:31) [node-provisioner-2.54.0.jar:?]
Caused by: java.lang.RuntimeException: Failed to install packages.

If you do not find any (err) entries in the puppet or stderr files in the apps-phase, go to S3-Bucket/<EMR-Cluster>/node/<Primary node EC2 InstanceID>/provision-node/reports/0/<uuid>/<ip-addr>/, open the <timestamp>.yaml.gz or <timestamp>.json.gz file, and confirm whether any package installation failed during package configuration. Search for the keyword below to find any occurrence of a failure:

failed: true

If any component failed, the report lists the package name and error message in key-value format. In the example below, the spark-redshift package failed because of a yum lock:

    - package
    - spark-redshift
    …
    failed: true
    failed_to_restart: false
    changed: false
    out_of_sync: true
    skipped: false
    change_count: 0
    out_of_sync_count: 1
    events:
    - audited: false
      property: ensure
      previous_value: purged
      desired_value: latest
      historical_value: 
      message: |-
        change from 'purged' to 'latest' failed: Could not update: Execution of '/bin/yum -d 0 -e 0 -y list spark-redshift' returned 1: Existing lock /var/run/yum.pid: another copy is running as pid 20416.
        Another app is currently holding the yum lock; waiting for it to exit...
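A report like the one above can be searched with grep's context flags; a sketch, with a trimmed sample standing in for the real (gunzipped) <timestamp>.yaml report:

```shell
# Trimmed sample standing in for the real puppet report.
cat > report.yaml <<'EOF'
    - package
    - spark-redshift
    failed: true
    failed_to_restart: false
EOF

# -B 2 prints two lines of context before each match, which usually
# includes the resource type and package name.
grep -B 2 'failed: true' report.yaml
```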

If no issues are identified in the steps above, proceed to examine the application logs, starting with the Hadoop-hdfs and Hadoop-yarn logs. Navigate to S3-Bucket/<EMR-Cluster>/node/<EC2 InstanceID>/applications/Hadoop-hdfs/ and inspect the hadoop-hdfs-namenode-<ipaddress>.log.gz file. Check this log for any issues with NameNode startup and its registration of the DataNodes. If the DataNodes registered successfully, you see output like the following:

2024-04-10 15:03:02,656 INFO org.apache.hadoop.net.NetworkTopology (IPC Server handler 32 on default port 8020): Adding a new node: /default-rack/xxxxxxxxxxxxx:9866
2024-04-10 15:03:02,656 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager (IPC Server handler 32 on default port 8020): DatanodeManager.addDatanode: node xx.xx.xx.xx:9866 with UUID 75ddcac5-2124-4d34-814c-d10e4db34781 is added to datanodeMap.
2024-04-10 15:03:02,656 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockReportLeaseManager (IPC Server handler 32 on default port 8020): Registered DN 75ddcac5-2124-4d34-814c-d10e4db34781 (xx.xx.xx.xx:9866).

Similarly, navigate to S3-Bucket/<EMR-Cluster>/node/<EC2 InstanceID>/applications/Hadoop-yarn/ and inspect the hadoop-yarn-resourcemanager-<ipaddress>.log.gz file. Check this log for any issues with ResourceManager startup and its registration of the NodeManagers. If the NodeManagers registered successfully, you see output like the following:

2024-04-10 15:03:42,311 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService (IPC Server handler 2 on default port 8025): NodeManager from node ip-xxxxxxxxx.ec2.internal(cmPort: 8041 httpPort: 8042) registered with capability: <memory:12288, vCores:4>, assigned nodeId ip-xxxxxxxxxxxx.ec2.internal:8041
2024-04-10 15:03:42,315 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl (RM Event dispatcher): ip-xxxxxxxxxxxxxx.ec2.internal:8041 Node Transitioned from NEW to RUNNING
2024-04-10 15:03:42,355 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler (SchedulerEventDispatcher:Event Processor): Added node ip-xxxxxxxxxxxxxx.ec2.internal:8041 clusterResource: <memory:12288, vCores:4>
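Registration can be confirmed quickly by counting the relevant log lines; a sketch, with single sample lines standing in for the real NameNode and ResourceManager logs:

```shell
# Sample lines standing in for the real (gunzipped) daemon logs.
cat > hadoop-hdfs-namenode.log <<'EOF'
2024-04-10 15:03:02,656 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockReportLeaseManager (IPC Server handler 32 on default port 8020): Registered DN 75ddcac5-2124-4d34-814c-d10e4db34781 (xx.xx.xx.xx:9866).
EOF
cat > hadoop-yarn-resourcemanager.log <<'EOF'
2024-04-10 15:03:42,311 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService (IPC Server handler 2 on default port 8025): NodeManager from node ip-x.ec2.internal(cmPort: 8041 httpPort: 8042) registered with capability: <memory:12288, vCores:4>
EOF

# Each count should match the number of core/task nodes in the cluster.
echo "DataNodes registered:    $(grep -c 'Registered DN' hadoop-hdfs-namenode.log)"
echo "NodeManagers registered: $(grep -c 'registered with capability' hadoop-yarn-resourcemanager.log)"
```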

In the event of any error, the node registration step fails, and the logs above provide insight for further troubleshooting. Additionally, if other applications are enabled on the cluster, you can inspect their individual logs in the respective application directories, such as hive, livy, presto, or spark. The same logs can be reviewed from the primary node when accessed over SSH. More detailed information about this process and the log locations is available in this document.

If you do not find the logs in S3, enable development mode on the cluster so that you can log in to the primary node and troubleshoot there. When starting your EMR cluster, set the "--additional-info" parameter to

'{"clusterType":"development"}'

When this flag is set and the primary node fails to provision, the EMR service keeps the cluster alive for some time before decommissioning it. This is very useful for probing the various log files before the cluster is terminated. After your investigation is complete, terminate the EC2 instance to avoid additional billing. Note that this parameter can be set only through the AWS CLI or an AWS SDK; it is not available in the EMR console. Refer to the EMR create-cluster CLI reference for details on setting development mode.
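A hedged sketch of such a create-cluster call follows. Every value other than --additional-info (the name, release label, application, instance type, count, and roles) is a placeholder assumption to be replaced with your own configuration:

```shell
# Development mode flag, as documented for EMR.
ADDITIONAL_INFO='{"clusterType":"development"}'

# The call itself is commented out because it would launch (and bill for)
# real instances; uncomment and adjust the placeholders to run it.
# aws emr create-cluster \
#   --name "debug-cluster" \
#   --release-label emr-7.1.0 \
#   --applications Name=Spark \
#   --instance-type m5.xlarge \
#   --instance-count 3 \
#   --use-default-roles \
#   --additional-info "$ADDITIONAL_INFO"

echo "$ADDITIONAL_INFO"
```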

AWS Support Engineer
Published a month ago · 961 views