My AWS Glue extract, transform, and load (ETL) job fails with the error "Container killed by YARN for exceeding memory limits".
Short description
The following are common causes of this error:
- Memory-intensive operations that exceed the memory threshold of the underlying Apache Spark cluster, such as joins of large tables or processing of datasets with skewed distributions.
- Large data partitions that consume more memory than is assigned to their executor.
- Large files that can't be split and therefore result in large in-memory partitions.
Resolution
To resolve this error, complete one or more of the following solutions.
Upgrade the worker type
Because larger worker types have higher memory configurations, upgrade the worker type. You can upgrade the worker type from G.1X to any of the following worker types:
- G.2X
- G.4X
- G.8X
- G.12X
- G.16X
- R.1X
- R.2X
- R.4X
- R.8X
For more information on specifications of worker types, see Defining job properties for Spark jobs and AWS Glue versions.
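You can change the worker type programmatically through the Glue UpdateJob API. The following is a minimal boto3 sketch; the job name, role ARN, and script location are placeholders, and the actual API call is shown commented out:

```python
# Sketch: build the JobUpdate payload that raises a job's worker type.
# The "WorkerType" and "NumberOfWorkers" keys match the Glue UpdateJob API;
# all concrete values below are placeholders.
def build_worker_upgrade(role_arn, script_location,
                         worker_type="G.2X", num_workers=10):
    """Return a JobUpdate payload that upgrades the worker type."""
    return {
        "Role": role_arn,
        "Command": {"Name": "glueetl", "ScriptLocation": script_location},
        "WorkerType": worker_type,        # for example, G.1X -> G.2X for more memory
        "NumberOfWorkers": num_workers,
    }

# Apply the update (requires AWS credentials):
# import boto3
# glue = boto3.client("glue")
# glue.update_job(JobName="my-etl-job",
#                 JobUpdate=build_worker_upgrade(
#                     "arn:aws:iam::123456789012:role/GlueRole",
#                     "s3://my-bucket/scripts/job.py"))
```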
Increase the executors for the job
If the error persists after you upgrade the worker type, then increase the number of executors for the job. Each executor has a fixed number of cores, and this number determines how many partitions the executor can process in parallel. The worker type defines the Spark configuration for the data processing units (DPUs).
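The relationship between workers, executors, and cores is simple arithmetic. The helper below is illustrative only; the per-worker figures are assumptions for the sketch, not authoritative Glue specifications, so check the worker-type documentation for your Glue version:

```python
# Illustrative arithmetic: total executor cores available to a job.
# Assumptions (not authoritative): one worker is reserved for the Spark
# driver, and each remaining worker runs a fixed number of executors with
# a fixed number of cores each.
def total_executor_cores(num_workers, cores_per_executor,
                         executors_per_worker=1):
    """Return the total number of cores across all executors."""
    executor_workers = num_workers - 1  # one worker hosts the driver
    return executor_workers * executors_per_worker * cores_per_executor
```

For example, a hypothetical job with 10 workers and 4 cores per executor would have 36 executor cores available for tasks.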
Update your data
To make sure that AWS Glue uses all executors evenly before a shuffle operation, such as a join, repartition the data across all executors. To repartition, include one of the following commands in your ETL job.
For DynamicFrame, include the following command:
dynamicFrame.repartition(totalNumberOfExecutorCores)
For DataFrame, include the following command:
dataframe.repartition(totalNumberOfExecutorCores)
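One way to derive totalNumberOfExecutorCores at run time is to read it from the Spark context, as in the following sketch. The multiplier helper reflects the common Spark tuning guideline of two to three tasks per CPU core; the multiplier value is an assumption, not a Glue requirement:

```python
# Sketch (PySpark): repartition before a join so every executor core gets
# work. defaultParallelism is one common proxy for the total core count:
#
#   total_cores = spark.sparkContext.defaultParallelism
#   dataframe = dataframe.repartition(total_cores)
#   dynamicFrame = dynamicFrame.repartition(total_cores)

def partitions_for(total_executor_cores, multiplier=2):
    """Rule of thumb (assumption): 2-3 partitions per executor core."""
    return total_executor_cores * multiplier
```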
Use job bookmarks
When you use job bookmarks, the AWS Glue job processes only the newly written files. This configuration reduces the number of files that the AWS Glue job processes and reduces memory issues. Bookmarks store metadata about the files that previous runs processed. In the runs that follow, the job compares the timestamps and then decides whether to process these files again. For more information, see Tracking processed data using job bookmarks.
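You turn on bookmarks with the --job-bookmark-option job argument. The sketch below builds the argument map; the job name in the commented call is a placeholder, and note that the ETL script itself must call job.init() and job.commit() for the bookmark to advance:

```python
# Sketch: enable job bookmarks through default job arguments.
# --job-bookmark-option and the job-bookmark-enable value are the
# documented Glue settings; everything else here is a placeholder.
def bookmark_arguments():
    """Return the job arguments that turn on job bookmarks."""
    return {"--job-bookmark-option": "job-bookmark-enable"}

# Pass the arguments when you start the job (requires AWS credentials):
# glue.start_job_run(JobName="my-etl-job", Arguments=bookmark_arguments())
#
# Inside the ETL script, the bookmark advances only when the script calls
# job.init(args["JOB_NAME"], args) at the start and job.commit() at the end.
```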
Use DynamicFrame to read data in parallel
When you connect to a JDBC table, Spark opens only one concurrent connection by default. The driver tries to download the whole table at once into a single Spark executor. This download can take a long time and can cause out-of-memory (OOM) errors on the executor. Instead, configure specific properties of your JDBC table to instruct AWS Glue to use DynamicFrame to read data in parallel. Or, you can use Spark DataFrame to achieve parallel reads from JDBC. For more information, see JDBC to other databases on the Spark website.
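For the Spark DataFrame route, the partitionColumn, lowerBound, upperBound, and numPartitions options split the read across executors. The option names come from Spark's JDBC data source; the column name and bounds in this sketch are placeholders:

```python
# Sketch: JDBC read options that split the table across executors.
# partitionColumn must be a numeric, date, or timestamp column; the
# bounds only shape the split points, they don't filter rows.
def jdbc_partition_options(column, lower, upper, num_partitions):
    """Return Spark JDBC options for a parallel read."""
    return {
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }

# Usage with a Spark session (placeholders for url and table):
# spark.read.format("jdbc").option("url", url).option("dbtable", table) \
#      .options(**jdbc_partition_options("id", 1, 1_000_000, 16)).load()
#
# The DynamicFrame equivalent passes additional_options such as
# "hashfield" and "hashpartitions" to create_dynamic_frame.from_catalog.
```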
Use performant functions in your ETL job
For your ETL job, avoid user-defined functions, especially when you combine Python or Scala code with Spark's functions and methods. For example, don't use the Spark df.count() to check for empty DataFrames within if/else statements or for loops. Instead, use more performant functions, such as df.schema or df.rdd.isEmpty().
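The point of df.rdd.isEmpty() is that it short-circuits as soon as one row is found, whereas df.count() scans the entire dataset. The plain-Python helper below mirrors the same idea on any iterable, with the Spark form shown in a comment:

```python
# Sketch: a short-circuit emptiness check. Unlike len() or a full count,
# this looks at only the first element.
_SENTINEL = object()

def is_empty(iterable):
    """Return True if the iterable yields no elements."""
    return next(iter(iterable), _SENTINEL) is _SENTINEL

# Spark equivalent -- prefer this over `if df.count() == 0:`:
# if df.rdd.isEmpty():
#     ...
```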
Test and optimize the AWS Glue job
Before you run the AWS Glue job in production, test the AWS Glue job on an interactive session and optimize the ETL code.
If none of the preceding solutions work, then split the input data into chunks or partitions. Then, run multiple AWS Glue ETL jobs instead of a single large job. For more information, see Workload partitioning with bounded execution.
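Bounded execution is configured through the boundedFiles or boundedSize keys in additional_options when you read the source. Those key names are the documented Glue options; the database, table, and limit values in this sketch are placeholders:

```python
# Sketch: build the additional_options map for bounded execution.
# boundedFiles caps the number of input files per run; boundedSize caps
# the input bytes per run. Values are passed as strings.
def bounded_options(max_files=None, max_bytes=None):
    """Return Glue additional_options that bound one run's input."""
    opts = {}
    if max_files is not None:
        opts["boundedFiles"] = str(max_files)
    if max_bytes is not None:
        opts["boundedSize"] = str(max_bytes)
    return opts

# Usage inside a Glue script (placeholders for database and table):
# glueContext.create_dynamic_frame.from_catalog(
#     database="my_db", table_name="my_table",
#     additional_options=bounded_options(max_files=50000))
```

Bounded execution pairs well with job bookmarks: each run picks up a bounded slice of the unprocessed files, so a backlog is worked off across several smaller runs.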
Related information
Debugging OOM exceptions and job abnormalities
Best practices to scale Spark jobs and partition data with AWS Glue
Optimize memory management in AWS Glue