
How do I resolve the error "Container killed by YARN for exceeding memory limits" in Spark on Amazon EMR?


I want to troubleshoot the error "Container killed by YARN for exceeding memory limits" in Spark on Amazon EMR.

Resolution

The root cause of this error and the appropriate solution depend on your workload. To troubleshoot the error, use the following methods in the order listed.

Increase memory overhead

Memory overhead is the amount of off-heap memory allocated to each executor. By default, Spark sets the memory overhead to either 10% of executor memory or 384 MB, whichever is higher. Java NIO direct buffers, thread stacks, shared native libraries, and memory-mapped files use this memory overhead.

Gradually increase the memory overhead, up to 25% of executor memory. The sum of the driver or executor memory and the corresponding memory overhead must be less than yarn.nodemanager.resource.memory-mb for your instance type:

"spark.driver/executor.memory + spark.driver/executor.memoryOverhead < yarn.nodemanager.resource.memory-mb"

If the error occurs in the driver container or executor container, then increase memory overhead only for the container where the error happened. You can increase memory overhead on a running cluster, a new cluster, or when you submit a job.
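As a sanity check, the constraint above can be computed directly. The following is a minimal Python sketch, assuming Spark's documented default overhead rule (the larger of 10% of executor memory or 384 MB); the 12288 MB YARN limit is a hypothetical example, not the value for any particular instance type:

```python
# Sanity-check Spark memory settings against the YARN container limit.

def default_overhead_mb(memory_mb):
    """Spark's default memoryOverhead: 10% of memory, with a 384 MB floor."""
    return max(int(memory_mb * 0.10), 384)

def fits_in_yarn(memory_mb, overhead_mb, yarn_limit_mb):
    """True if memory + overhead stays under yarn.nodemanager.resource.memory-mb."""
    return memory_mb + overhead_mb < yarn_limit_mb

# Hypothetical example: 10 GB executors on a node with a 12288 MB YARN limit.
executor_memory_mb = 10240
overhead_mb = default_overhead_mb(executor_memory_mb)
print(overhead_mb)                                          # 1024
print(fits_in_yarn(executor_memory_mb, overhead_mb, 12288)) # True
```

If the check fails, either lower the executor memory or move to an instance type with a larger YARN limit before raising the overhead further.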

Running cluster

Modify spark-defaults.conf on the primary node.

For example:

sudo vim /etc/spark/conf/spark-defaults.conf
spark.driver.memoryOverhead 512
spark.executor.memoryOverhead 512

New cluster

Add a configuration object similar to the following example when you launch the cluster:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.memoryOverhead": "512",
      "spark.executor.memoryOverhead": "512"
    }
  }
]

Single job

To increase memory overhead when you run spark-submit, use the --conf option.

Example:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --conf spark.driver.memoryOverhead=512 --conf spark.executor.memoryOverhead=512 /usr/lib/spark/examples/jars/spark-examples.jar 100

If you continue to receive the error after you increase the memory overhead, then reduce the number of executor cores.

Reduce the number of executor cores

Note: Reverse any changes that you made to spark-defaults.conf in the preceding section.

When you reduce the number of executor cores, you reduce the maximum number of tasks that the executor can run concurrently, which reduces the amount of memory required. If the driver container throws the error, decrease the number of driver cores; if an executor container throws the error, decrease the number of executor cores.
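The effect can be illustrated with rough arithmetic: each core can run one concurrent task, so concurrent tasks share the executor heap roughly evenly. This is an illustrative estimate, not Spark's exact allocation model, and the 6 GB heap is a made-up example:

```python
def memory_per_task_mb(executor_memory_mb, executor_cores):
    # Rough upper bound: one concurrent task per core, tasks
    # sharing the executor heap roughly evenly.
    return executor_memory_mb // executor_cores

# Illustrative: a 6144 MB executor heap.
print(memory_per_task_mb(6144, 5))  # 1228 MB per task with 5 cores
print(memory_per_task_mb(6144, 3))  # 2048 MB per task with 3 cores
```

Fewer cores per executor means each task gets a larger share of the same heap, which is why this change can resolve the error without adding memory.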

Running cluster

Modify spark-defaults.conf on the primary node.

Example:

sudo vim /etc/spark/conf/spark-defaults.conf
spark.driver.cores  3
spark.executor.cores  3

New cluster

Add a configuration object similar to the following example when you launch the cluster:

[
  {
    "Classification": "spark-defaults",
    "Properties": {"spark.driver.cores" : "3",
      "spark.executor.cores": "3"
    }
  }
]

Single job

To reduce the number of executor cores when you run spark-submit, use the --executor-cores option.

Example:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --executor-cores 3 --driver-cores 3 /usr/lib/spark/examples/jars/spark-examples.jar 100

If you continue to receive the error message, then increase the number of partitions.

Increase the number of partitions

Note: Reverse any changes that you made to spark-defaults.conf in the preceding section.

To increase the number of partitions, increase the value of spark.default.parallelism for raw Resilient Distributed Datasets (RDDs), or run a .repartition() operation.

When you increase the number of partitions, you reduce the amount of memory that each partition requires. Spark makes heavy use of cluster RAM to maximize speed, so you must monitor memory usage with Ganglia (Amazon EMR versions 6.15 and earlier) or the Amazon CloudWatch agent (Amazon EMR 7.0 and later). Then verify that your cluster settings and partitioning strategy meet your growing data needs. If you continue to get the "Container killed by YARN for exceeding memory limits" error message, then increase the driver and executor memory.
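To pick a partition count, a common rule of thumb is to target a fixed per-partition size. The sketch below is a heuristic under stated assumptions (the 128 MB target and 100 GB dataset are made-up examples), not an official Spark recommendation:

```python
import math

def target_partitions(dataset_mb, target_partition_mb=128):
    """Partitions needed so each partition is at most about the target size."""
    return math.ceil(dataset_mb / target_partition_mb)

# Illustrative: a 100 GB (102400 MB) dataset with a 128 MB target.
print(target_partitions(102400))  # 800 partitions
```

You can then pass the result to .repartition() or use it as the value of spark.default.parallelism.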

Increase driver and executor memory

Note: Reverse any changes that you made to spark-defaults.conf in the preceding section.

If the error occurs in either a driver container or an executor container, then increase memory for either the driver or the executor, but not both. Make sure that the sum of driver or executor memory plus driver or executor memory overhead is always less than the value of yarn.nodemanager.resource.memory-mb for your EC2 instance type:

"spark.driver/executor.memory + spark.driver/executor.memoryOverhead < yarn.nodemanager.resource.memory-mb"

Running cluster

Modify spark-defaults.conf on the primary node.

Example:

sudo vim /etc/spark/conf/spark-defaults.conf
spark.executor.memory  1g
spark.driver.memory  1g

New cluster

Add a configuration object similar to the following when you launch the cluster:


[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "1g",
      "spark.driver.memory": "1g"
    }
  }
]

Single job

Use the --executor-memory and --driver-memory options to increase memory when you run spark-submit.

Example:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --executor-memory 1g --driver-memory 1g /usr/lib/spark/examples/jars/spark-examples.jar 100

Other troubleshooting steps

If you continue to receive the error message, take the following actions:

  • Run your application against a sample dataset. Benchmarking is a best practice and it can help you spot slowdowns and skewed partitions that lead to memory problems.
  • Process the minimum amount of required data. If you don't filter your data, or if you filter late in the application run, then excess data might slow down the application. This can increase the chance of a memory exception.
  • Partition your data to ingest only the required data.
  • Use a different partitioning strategy. For example, partition on an alternate key to avoid large partitions and skewed partitions.
  • Your EC2 instance might not have the memory resources required for your workload. Switch to a larger memory-optimized instance type. If you continue to get memory exceptions after changing instance types, try the troubleshooting methods on the new instance.
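One way to spot skewed partitions before they cause container kills is to compare partition sizes. This sketch operates on a plain list of per-partition record counts (in PySpark, you could collect such a list with df.rdd.glom().map(len).collect()); the threshold you treat as "skewed" is a judgment call, not a fixed rule:

```python
def skew_ratio(partition_sizes):
    """Ratio of the largest partition to the mean size.

    Values close to 1 indicate balanced partitions; values well
    above 1 suggest skew that concentrates memory on one task.
    """
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean

balanced = [100, 110, 95, 105]
skewed = [100, 100, 100, 1700]
print(round(skew_ratio(balanced), 2))  # 1.07
print(round(skew_ratio(skewed), 2))    # 3.4
```

A high ratio is a signal to repartition on a different key, as suggested above, before adding memory.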

Related information

Spark configuration on the Apache Spark website

How do I resolve the "java.lang.ClassNotFoundException" error in Spark on Amazon EMR?

AWS OFFICIAL · Updated a month ago