Why does my Spark job in Amazon EMR fail?

I want to troubleshoot my Apache Spark job that fails in Amazon EMR.

Resolution

Application failures

"Spark shuffle block fetch" runtime exception

When the executor worker node in Amazon EMR is in an unhealthy state, you might receive the following error:

"ERROR ShuffleBlockFetcherIterator: Failed to get block(s) from ip-192-168-14-250.us-east-2.compute.internal:7337

org.apache.spark.network.client.ChunkFetchFailureException: Failure while fetching StreamChunkId[streamId=842490577174,chunkIndex=0]: java.lang.RuntimeException: Executor is not registered (appId=application_1622819239076_0367, execId=661)"

When disk utilization for a worker node exceeds the 90% utilization threshold, the YARN NodeManager health service identifies the node as UNHEALTHY. Amazon EMR includes unhealthy nodes in deny lists and YARN containers aren't allocated to the unhealthy nodes.
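As a rough illustration of the 90% threshold described above, the following Python sketch checks disk utilization for a path on a node. The function names are hypothetical, and the calculation (used bytes divided by total bytes) is a simplification; the real decision is made by the YARN NodeManager health service, not by this code.

```python
import shutil

# Threshold from the text above: a node above 90% disk utilization is marked UNHEALTHY.
UNHEALTHY_THRESHOLD_PCT = 90.0


def utilization_pct(used_bytes: int, total_bytes: int) -> float:
    """Return disk utilization as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def exceeds_threshold(used_bytes: int, total_bytes: int,
                      threshold_pct: float = UNHEALTHY_THRESHOLD_PCT) -> bool:
    """Return True when utilization is strictly above the threshold."""
    return utilization_pct(used_bytes, total_bytes) > threshold_pct


def check_path(path: str) -> bool:
    """Hypothetical helper: check the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return exceeds_threshold(usage.used, usage.total)
```

You could call check_path on each local directory that YARN uses on the node. Which directories those are depends on your cluster configuration.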

To troubleshoot this issue, take the following actions:

"NoSuchElementException" runtime exception

When there is a problem within the application code and the SparkContext initialization, you might receive the following exception:

"ERROR [Executor task launch worker for task 631836] o.a.s.e.Executor:Exception in task 24.0 in stage 13028.0 (TID 631836) java.util.NoSuchElementException: None.get"

To resolve this issue, make sure that there aren't multiple SparkContext jobs active within the same session. You can have one active SparkContext at a time. If you want to initialize another SparkContext, then you must stop the active job before you create a new one.

For more information, see SparkContext on the Spark website.

"Container exit code 137" error

When the task exceeds its allocated physical memory, a YARN container stops the task and you receive the following error:

"Container killed on request. Exit code is 137"

You might receive this error when you have too few shuffle partitions, inconsistent partition sizes, or a large number of executor cores.

Review error details in the Spark driver logs to determine the cause of the error. For more information, see How do I access Spark driver logs on an Amazon EMR cluster?

Example error from the driver log:

ERROR YarnScheduler: Lost executor 19 on ip-10-109-##-###.aws.com : Container from a bad node: container_1658329343444_0018_01_000020 on host: ip-10-109-##-###.aws.com . Exit status: 137.Diagnostics:Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137.
Killed by external signal
Executor container 'container_1658329343444_0018_01_000020' was killed with exit code 137. To understand the root cause, you can analyze executor container log.
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 23573"...

The preceding error stack trace shows that there isn't enough available memory on the executor to continue to process data. This error might happen in different job stages, in both narrow and wide transformations.
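To find these entries in a large driver log, a small script can help. The following Python sketch scans log lines for exit code 137 and for the OutOfMemoryError marker. The patterns are based only on the example log above, and the function name is hypothetical; real logs might use different wording.

```python
import re

# Patterns based on the example driver log above.
EXIT_137 = re.compile(r"Exit status: 137|[Ee]xit code (?:is )?137")
CONTAINER_ID = re.compile(r"container_\d+_\d+_\d+_\d+")
OOM_MARKER = "java.lang.OutOfMemoryError"


def summarize_137(log_lines):
    """Collect container IDs that exited with code 137 and note any OOM marker."""
    containers = set()
    saw_oom = False
    for line in log_lines:
        if OOM_MARKER in line:
            saw_oom = True
        if EXIT_137.search(line):
            containers.update(CONTAINER_ID.findall(line))
    return {"containers": sorted(containers), "out_of_memory": saw_oom}
```

If out_of_memory is true for the listed containers, then the memory-related actions in this section are the likely place to start.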

To resolve this issue, take the following actions:

  • Increase executor memory.
    Note: Executor memory includes the memory required to run the tasks plus overhead memory. The sum of these must not be greater than the size of the Java Virtual Machine (JVM) and the YARN maximum container size.
  • Add more Spark partitions.
  • Increase the number of shuffle partitions.
  • Reduce the number of executor cores.
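The memory note above can be checked with simple arithmetic before you submit a job. The following Python sketch assumes Spark's documented default for executor memory overhead on YARN (10% of executor memory, with a 384 MiB minimum) and takes the YARN maximum container size as an input. The function names and example values are hypothetical, and the sketch doesn't model how YARN rounds container requests.

```python
MIN_OVERHEAD_MB = 384   # Spark's documented minimum executor memory overhead
OVERHEAD_PERCENT = 10   # Spark's documented default overhead factor (0.10)


def default_overhead_mb(executor_memory_mb: int) -> int:
    """Default overhead: 10% of executor memory, but at least 384 MiB."""
    return max(MIN_OVERHEAD_MB, executor_memory_mb * OVERHEAD_PERCENT // 100)


def container_request_mb(executor_memory_mb: int, overhead_mb=None) -> int:
    """Total memory requested per executor container (heap plus overhead)."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


def fits_in_yarn(executor_memory_mb: int, yarn_max_container_mb: int,
                 overhead_mb=None) -> bool:
    """True when the request does not exceed the YARN maximum container size."""
    return container_request_mb(executor_memory_mb, overhead_mb) <= yarn_max_container_mb
```

For example, a 10,240 MiB executor requests 11,264 MiB with the default overhead, so it fits in a 12,288 MiB maximum container but not in an 11,000 MiB one.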

For more information, see How do I resolve "Container killed on request. Exit code is 137" errors in Spark on Amazon EMR?

Spark jobs are in a hung state and don't complete

Spark jobs might be stuck for multiple reasons. For example, changes to the Spark driver process or loss of executor containers can halt jobs.

Spark jobs might be stuck when you have high disk space utilization, or when you use Spot Instances for cluster nodes and AWS terminates the Spot Instance. For more information, see How do I resolve an "ExecutorLostFailure: Slave lost" error in Spark on Amazon EMR?

To troubleshoot this issue, take the following actions:

  • Review the Spark driver or executor logs for exceptions.
  • Check the YARN node list for unhealthy nodes. When disk utilization exceeds the utilization threshold on a core node, the YARN NodeManager health service marks the node as UNHEALTHY. Amazon EMR adds the unhealthy nodes to deny lists and prevents YARN from allocating containers to those nodes.
  • Monitor disk space utilization and configure Amazon Elastic Block Store (Amazon EBS) volumes to keep utilization below 90% for Amazon EMR cluster worker nodes.
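One way to check the node list programmatically is to query the YARN ResourceManager. The following Python sketch assumes that the ResourceManager REST API is reachable at a base URL that you supply and that the response follows the "nodes" / "node" JSON layout of the /ws/v1/cluster/nodes endpoint. The function names and the base URL are hypothetical.

```python
import json
import urllib.request


def fetch_nodes(rm_base_url: str, timeout_s: float = 10.0) -> dict:
    """Fetch the node list from the ResourceManager (assumed REST endpoint)."""
    request = urllib.request.Request(
        f"{rm_base_url}/ws/v1/cluster/nodes",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


def unhealthy_nodes(payload: dict) -> list:
    """Return the ID and health report of every node in the UNHEALTHY state."""
    nodes = (payload.get("nodes") or {}).get("node") or []
    return [
        {"id": node.get("id"), "healthReport": node.get("healthReport", "")}
        for node in nodes
        if node.get("state") == "UNHEALTHY"
    ]
```

You could run unhealthy_nodes(fetch_nodes(...)) from the primary node with your ResourceManager address. The health report text usually indicates which check failed.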

"Heartbeat communication" error

Spark executors send heartbeat signals to the Spark driver at intervals that the spark.executor.heartbeatInterval property specifies. When long garbage collection pauses occur, executors might not send heartbeat signals. The driver stops executors that fail to send a heartbeat signal for more than the value specified and you receive the following error:

"WARN Executor: Issue communicating with driver in heartbeater org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval"

Memory constraints or out of memory (OOM) issues cause timeout exceptions when the executor processes data. These issues also influence the garbage collection process and might create further delay.

To resolve heartbeat communication errors, use one of the following options:

  • Increase executor memory. Also, depending on the application process, repartition your data.
  • Tune garbage collection. For more information, see Garbage collection tuning on the Apache Spark website.
  • Increase the interval for spark.executor.heartbeatInterval.
  • Specify a longer spark.network.timeout period.
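When you change the last two settings, keep spark.executor.heartbeatInterval well below spark.network.timeout, as the Spark configuration documentation recommends. The following Python sketch builds the matching spark-submit arguments and rejects combinations where the interval isn't shorter than the timeout. The helper names are hypothetical, and the parser handles only a simplified subset of Spark's time suffixes.

```python
import re

# Simplified subset of Spark's time suffixes, in milliseconds.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000}


def to_ms(value: str) -> int:
    """Convert a value such as '10s' or '500ms' to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", value.strip())
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


def heartbeat_conf_args(heartbeat_interval: str, network_timeout: str) -> list:
    """Build spark-submit --conf arguments after a basic consistency check."""
    if to_ms(heartbeat_interval) >= to_ms(network_timeout):
        raise ValueError("heartbeat interval must be shorter than the network timeout")
    return [
        "--conf", f"spark.executor.heartbeatInterval={heartbeat_interval}",
        "--conf", f"spark.network.timeout={network_timeout}",
    ]
```

For example, heartbeat_conf_args("30s", "300s") returns arguments that you can append to a spark-submit command. The values are illustrative, not recommendations.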

"ExecutorLostFailure" error

When high disk utilization causes a core or task node to terminate, you might receive the following error:

"ExecutorLostFailure Exit status: -100. Diagnostics: Container released on a *lost* node"

You might also receive the preceding error when a node becomes unresponsive because of prolonged high CPU utilization or low available memory. For troubleshooting steps, see How do I resolve "Exit status: -100. Diagnostics: Container released on a lost node" error in Amazon EMR?

Note: This error might also occur when you use Spot Instances for cluster nodes, and AWS terminates a Spot Instance. The Amazon EMR cluster provisions an On-Demand Instance to replace the terminated Spot Instance and the application might recover on its own. For more information, see Spark enhancements for elasticity and resiliency on Amazon EMR.

"SQL connection timeout" error

When a database connection attempt fails because of a network timeout, you receive a SQL connection timeout error.

To resolve this issue, verify that the database host can receive incoming connections on port 1433 from your Amazon EMR cluster security groups.
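To test reachability from a cluster node, you can attempt a plain TCP connection to the database port. The following Python sketch is a minimal check that uses only the standard library. The function name is hypothetical, and a successful TCP connection confirms network reachability only, not database authentication.

```python
import socket


def can_connect(host: str, port: int = 1433, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run the check from a node in the cluster with your database hostname. A False result points to security group rules, network ACLs, or routing rather than to Spark itself.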

Also, review the maximum number of parallel database connections configured for the SQL database and the memory allocation for the database instance class. Database connections also consume memory. If utilization is high, then review the database configuration and the number of allowed connections. For more information, see Maximum number of database connections.

Amazon S3 exceptions

HTTP 503 "Slow Down"

HTTP 503 exceptions occur when you exceed the Amazon Simple Storage Service (Amazon S3) request rate for the prefix. A 503 exception doesn't always mean that the job fails. However, if you resolve the exception, then you might improve your application's performance.

For more information, see Why does my Spark or Hive job on Amazon EMR fail with an HTTP 503 "Slow Down" AmazonS3Exception?
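For application code that calls Amazon S3 directly, a common general approach to throttling is to retry with exponential backoff and jitter. The following Python sketch shows that generic pattern. It isn't taken from this article, ThrottledError is a hypothetical stand-in for a 503 "Slow Down" response, and the connectors and SDKs that Spark uses on Amazon EMR have their own retry settings.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for an HTTP 503 "Slow Down" response."""


def call_with_backoff(operation, max_attempts=5, base_delay_s=0.5,
                      max_delay_s=8.0, sleep=time.sleep, jitter=random.random):
    """Call operation(), retrying on ThrottledError with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            # Scale the delay by a random factor so that clients don't retry in lockstep.
            sleep(delay * jitter())
```

The sleep and jitter parameters are injectable so that you can test the retry schedule without waiting.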

HTTP 403 "Access Denied"

HTTP 403 errors are caused by incorrect or invalid credentials or permissions. Check the following:

  • Credentials or roles that you didn't specify in your application code.
  • The policy that's attached to the Amazon Elastic Compute Cloud (Amazon EC2) instance profile role.
  • Amazon Virtual Private Cloud (Amazon VPC) endpoints for Amazon S3.
  • Amazon S3 source and destination bucket policies.

To resolve 403 errors, make sure that the relevant AWS Identity and Access Management (IAM) role or policy allows access to Amazon S3. For more information, see Why does my Amazon EMR application fail with an HTTP 403 "Access Denied" AmazonS3Exception?

HTTP 404 "Not Found"

You receive the "HTTP 404 Not Found" error when the application expects to find an object in Amazon S3, but the object isn't found at the time of the request.

The error might have the following causes:

  • Incorrect Amazon S3 paths.
  • A process outside of the application moved or deleted the file.
  • An operation caused eventual consistency problems, such as an overwrite.

For more information, see Why does my Amazon EMR application fail with an HTTP 404 "Not Found" AmazonS3Exception?

AWS OFFICIAL. Updated 7 months ago.