How do I troubleshoot stage failures in Spark jobs on Amazon EMR?

I want to troubleshoot stage failures in Apache Spark applications on Amazon EMR.

Short description

Stage failures occur when a Spark task in that stage fails repeatedly. They can be caused by hardware issues, incorrect Spark configurations, or code issues. When a stage failure occurs, the Spark driver logs report an exception that's similar to the following:

"org.apache.spark.SparkException: Job aborted due to stage failure: Task XXX in stage YYY failed 4 times, most recent failure: Lost task XXX in stage YYY (TID ZZZ, ip-xxx-xx-x-xxx.compute.internal, executor NNN): ExecutorLostFailure (executor NNN exited caused by one of the running tasks) Reason: (example-reason)"

Resolution

Identify the reason code for Spark jobs that you submit with --deploy-mode client

The reason code is located in the exception that's shown in the terminal.
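If you saved the terminal output to a file, a small filter can isolate the reason text. The following is a minimal sketch, not an official tool: the function name extract_reason is hypothetical, and it assumes the exception line contains the literal text "Reason: " as in the example message above. Stage failures that report a task exception instead of an executor loss might not include that text.

```shell
# Print the text that follows "Reason: " on each stage-failure line read from stdin.
# extract_reason is a hypothetical helper name, not part of Spark or Amazon EMR.
extract_reason() {
  grep "Job aborted due to stage failure" | sed -n 's/.*Reason: //p'
}

# Example that uses the placeholder exception from this article:
echo 'org.apache.spark.SparkException: Job aborted due to stage failure: Task XXX in stage YYY failed 4 times, most recent failure: Lost task XXX in stage YYY (TID ZZZ, ip-xxx-xx-x-xxx.compute.internal, executor NNN): ExecutorLostFailure (executor NNN exited caused by one of the running tasks) Reason: (example-reason)' | extract_reason
```

For the placeholder message, the filter prints (example-reason).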

If you submit the job from Amazon EMR Steps, then the reason code is located in the stderr file on the Amazon EMR console. You can also get the step stderr logs from the Amazon Simple Storage Service (Amazon S3) location that you specified for cluster logging. For example, you can use the s3://example-log-bucket/example-cluster-id/steps/example-step-id/ file path to find the logs.
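To read the step stderr log from Amazon S3 without opening the console, you can stream it with the AWS CLI. The following sketch makes several assumptions: the AWS CLI is installed and configured, the log object under the step prefix is named stderr.gz, and your log path follows the layout in the preceding example. The function names step_log_prefix and show_step_stage_failures are hypothetical.

```shell
# Build the S3 prefix for a step's logs, following the path layout in this article.
# Arguments: bucket name, cluster ID, step ID.
step_log_prefix() {
  bucket="$1"; cluster_id="$2"; step_id="$3"
  printf 's3://%s/%s/steps/%s/\n' "$bucket" "$cluster_id" "$step_id"
}

# Stream the step's stderr from S3 and print each stage failure with 10 lines of context.
# Assumption: the log object is gzip-compressed and named stderr.gz.
show_step_stage_failures() {
  aws s3 cp "$(step_log_prefix "$1" "$2" "$3")stderr.gz" - \
    | gunzip -c \
    | grep -A 10 "Job aborted due to stage failure"
}

# Usage with the placeholders from this article:
# show_step_stage_failures example-log-bucket example-cluster-id example-step-id
```

Replace the placeholder arguments with your log bucket, cluster ID, and step ID. If your cluster logging location includes an additional prefix, adjust step_log_prefix accordingly.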

To identify stage failures in the YARN application logs, run the following command on the primary node:

yarn logs -applicationId example-application-id | grep "Job aborted due to stage failure" -A 10

Note: Replace example-application-id with your Spark application ID.

You can also get the YARN application logs from the Amazon S3 location that you specified for cluster logging. For example, you can use the s3://example-log-bucket/example-cluster-id/containers/example-application-id/ file path. Alternatively, you can get the YARN application logs from the YARN ResourceManager in the application's primary container.
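After you copy the container logs from Amazon S3 to a local directory, you can search all of them for the stage failure message at once. The following is a minimal sketch that assumes the archived logs are gzip-compressed files with a .gz extension. The function name find_stage_failure_logs and the local directory name are hypothetical.

```shell
# List every gzip-compressed log file under a directory that mentions a stage failure.
# Argument: the local directory that contains the downloaded container logs.
find_stage_failure_logs() {
  log_dir="$1"
  find "$log_dir" -type f -name '*.gz' | sort | while IFS= read -r f; do
    # Decompress to stdout and stop at the first match.
    if gzip -dc "$f" | grep -q "Job aborted due to stage failure"; then
      printf '%s\n' "$f"
    fi
  done
}

# Usage (assumes the AWS CLI is installed; ./container-logs is a hypothetical local path):
# aws s3 sync s3://example-log-bucket/example-cluster-id/containers/example-application-id/ ./container-logs/
# find_stage_failure_logs ./container-logs
```

The function prints only the paths of matching files, so you can open the relevant container log directly instead of reading every log.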

Resolve the root cause

After you identify the exception, use one of the following AWS Knowledge Center articles to resolve the issue:

AWS OFFICIAL
Updated 6 months ago