
How do I troubleshoot a failed Spark job in Amazon EMR?


I want to troubleshoot a failed Apache Spark job in Amazon EMR.

Short description

To troubleshoot failed Spark jobs in Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), complete the following steps:

  • For Spark jobs that are submitted with --deploy-mode client, check the step logs to identify the root cause of the step failure.
  • For Spark jobs that are submitted with --deploy-mode cluster, first check the step logs to identify the application ID. Then, check the application master logs to identify the root cause of the step failure.

To troubleshoot failed Spark jobs on Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS), identify the root cause of the Spark job failure. To do this, check the driver logs from Amazon Simple Storage Service (Amazon S3) or Amazon CloudWatch.

To troubleshoot failed Spark jobs on Amazon EMR Serverless, identify the root cause of the Spark job failure. To do this, check the job run details from the Amazon EMR Serverless application console and driver logs.

Resolution

Troubleshoot Amazon EMR on Amazon EC2 failed Spark jobs

Client mode jobs
When a Spark job is deployed in client mode, the step logs provide the job parameters and step error messages. These logs are archived to Amazon S3.

To identify the root cause of a step failure, download the step logs to an Amazon EC2 instance. Then, search for warnings and errors. Complete the following steps:

To decompress the step log file, run the following command:

find . -type f -exec gunzip {} \;

To identify the YARN application ID from the step log, run the following command:

grep "Client: Application report for" * | tail -n 1
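The decompress-and-search workflow above can be sketched end to end with sample data. The log line below is a hypothetical stand-in for a real archived step log; the application ID is taken from the example later in this article:

```shell
# Create a sample gzipped step log, decompress it, and pull out the
# YARN application ID, mirroring the commands above.
DEMO=$(mktemp -d)
printf '%s\n' \
  'INFO Client: Application report for application_1572839353552_0008 (state: RUNNING)' \
  'INFO Client: Application report for application_1572839353552_0008 (state: FINISHED)' \
  > "$DEMO/stderr"
gzip "$DEMO/stderr"                       # mimic the archived .gz log
find "$DEMO" -type f -exec gunzip {} \;   # decompress the step log file
APP_ID=$(grep "Client: Application report for" "$DEMO"/* | tail -n 1 \
  | grep -o 'application_[0-9_]*')
echo "$APP_ID"                            # application_1572839353552_0008
```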

The following example file indicates a memory issue:

"ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Executor memory 134217728 must be at least 471859200. Please increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration."

To resolve the preceding error, run the spark-submit command to submit a job with increased memory. For more information, see Submitting applications on the Apache Spark website.
Example:

spark-submit --deploy-mode client --executor-memory 4g --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar

Cluster mode jobs
To identify the application ID that's associated with the failed Spark step, check the stderr step log. The step logs are archived to Amazon S3. Then, identify the application primary logs. For Spark jobs that run in cluster mode, the Spark driver runs in the application primary container. The application primary container is the first container that runs when a Spark job starts, and its logs are the application primary logs. In the following example, container_1572839353552_0008_01_000001 is the first container of the application.

Example:
s3://aws-logs-111111111111-us-east-1/elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stderr.gz

After you identify the application primary logs, download the logs to an Amazon EC2 instance. Then, search for warnings and errors.
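The example S3 key above encodes both the application ID and the container ID. As a small sketch, you can pull them out of such a key with grep (the key below copies the example path in this article):

```shell
# Extract the application ID and the first container ID from the S3 key.
KEY="elasticmapreduce/j-35PUYZBQVIJNM/containers/application_1572839353552_0008/container_1572839353552_0008_01_000001/stderr.gz"
APP_ID=$(printf '%s' "$KEY" | grep -o 'application_[0-9_]*' | head -n 1)
CONTAINER_ID=$(printf '%s' "$KEY" | grep -o 'container_[0-9_]*')
echo "$APP_ID"        # application_1572839353552_0008
echo "$CONTAINER_ID"  # container_1572839353552_0008_01_000001
```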

To decompress the step log file, run the following command:

find . -type f -exec gunzip {} \;

To search for warnings and errors in the container logs, open the container logs that are in the output of the following command:

egrep -Ril "ERROR|WARN" . | xargs egrep "WARN|ERROR"
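The search pipeline above can be exercised against sample container logs. The log files and their contents below are hypothetical stand-ins for downloaded container logs:

```shell
# Create two sample container logs, then list matching files and the
# matching lines, exactly as the egrep pipeline above does.
DEMO=$(mktemp -d)
printf '%s\n' \
  'INFO ApplicationMaster: Starting' \
  'WARN TaskSetManager: Lost task due to low memory' \
  > "$DEMO/container_01.log"
printf '%s\n' \
  'ERROR SparkContext: Error initializing SparkContext.' \
  > "$DEMO/container_02.log"
MATCHES=$(cd "$DEMO" && egrep -Ril "ERROR|WARN" . | xargs egrep "WARN|ERROR")
echo "$MATCHES"
```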

If a container log indicates a memory issue, then run the following spark-submit command to submit a job with increased memory:

spark-submit --deploy-mode cluster --executor-memory 4g --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar 1000

Troubleshoot Amazon EMR on Amazon EKS failed Spark jobs

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.

When a Spark job is submitted to Amazon EMR on Amazon EKS, logs can be stored on Amazon S3 or CloudWatch. Make sure that you check the driver logs for failed Spark jobs. Also, use kubectl commands to get more details related to the driver and executor logs for the running Spark job.

Note: Kubectl commands work only for active pods. When pods are stopped, you can't run kubectl commands.
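The kubectl logs commands that follow need the driver pod name from the pod listing. As a minimal sketch, you can pick it out with awk; the pod listing below is hypothetical sample output standing in for a live kubectl get pods -n example-spark-namespace --no-headers call:

```shell
# Select the driver pod name from (sample) kubectl get pods output.
PODS=$(printf '%s\n' \
  'spark-exampleid-driver   1/1   Running   0   2m' \
  'pythonpi-exec-1          1/1   Running   0   1m')
DRIVER_POD=$(printf '%s\n' "$PODS" | awk '/driver/ { print $1 }')
echo "$DRIVER_POD"   # spark-exampleid-driver
```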

If you submit a Spark job with the start-job-run command, then use the following kubectl commands:

kubectl get pods -n example-spark-namespace

Note: Replace example-spark-namespace with the Spark namespace that's used to launch the job.

kubectl logs example-pod-driver -n example-spark-namespace -c spark-kubernetes-driver

Note: Replace example-spark-namespace with the Spark namespace that's used to launch the job and example-pod with the pod name.

If you submit a Spark job with the spark-operator command, then use the following kubectl commands:

kubectl get pods -n example-spark-namespace

Note: Replace example-spark-namespace with the Spark namespace that's used to launch the job.

kubectl logs example-pod-driver -n example-spark-namespace

Note: Replace example-pod with the pod name and example-spark-namespace with the Spark namespace that's used to launch the job.

If you submit a Spark job with the spark-submit command, then use the following kubectl commands. For more information, see Submitting applications on the Apache Spark website.

kubectl get pods -n example-spark-namespace

Note: Replace example-spark-namespace with the Spark namespace that's used to launch the job.

kubectl logs example-pod-driver -n example-spark-namespace

Note: Replace example-spark-namespace with the Spark namespace that's used to launch the job and example-pod with the pod name.

Troubleshoot Amazon EMR Serverless failed Spark jobs

When you submit a Spark job in Amazon EMR Serverless, logging is turned on for all job runs by default. Also, you can turn on Amazon S3 logging for your Amazon S3 bucket. To troubleshoot your failed Spark job, view the job run details, and then choose the Driver log files option. Also, you can check the logs that are stored in CloudWatch to identify the root cause of the failed Spark job.
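When S3 logging is turned on, the driver stderr log lands under the job run's prefix in your log bucket. The sketch below constructs that path under the documented applications/jobs layout; the bucket name, application ID, and job run ID are hypothetical placeholders, so verify the exact layout in your own bucket:

```shell
# Build the S3 path to the Spark driver stderr log for a job run.
# All values below are hypothetical placeholders.
S3_LOG_URI="s3://amzn-s3-demo-bucket/emr-serverless-logs"
APPLICATION_ID="00exampleapp1"   # hypothetical application ID
JOB_RUN_ID="00examplerun1"       # hypothetical job run ID
DRIVER_STDERR="${S3_LOG_URI}/applications/${APPLICATION_ID}/jobs/${JOB_RUN_ID}/SPARK_DRIVER/stderr.gz"
echo "$DRIVER_STDERR"
# To view the job run details from the AWS CLI instead of the console:
#   aws emr-serverless get-job-run --application-id "$APPLICATION_ID" --job-run-id "$JOB_RUN_ID"
```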

Related information

Add a Spark step

Running jobs with Amazon EMR on EKS

Logging and monitoring
