
Why does my AWS Glue job fail with lost nodes when I migrate a large dataset from Amazon RDS to Amazon S3?


I'm using AWS Glue to migrate a large dataset from Amazon Relational Database Service (Amazon RDS) or an on-premises JDBC database to Amazon Simple Storage Service (Amazon S3). My ETL job runs for a long time, and then fails with lost nodes.

Short description

AWS Glue uses a single connection to read the entire dataset. If you migrate a large JDBC table, then the ETL job might run for a long time without progress on the AWS Glue side. Then, the job might eventually fail because of disk space issues (lost nodes). To resolve this issue, read the JDBC table in parallel. If the job still fails with lost nodes, then use a SQL expression as a pushdown predicate.

Resolution

To resolve lost node errors for JDBC datasets, use one or more of the following methods.

Read the JDBC table in parallel

If the table doesn't have numeric columns, such as INT or BIGINT, then use the hashfield option to partition the data. Set hashfield to the name of a column in the JDBC table. For best results, choose a column that has an even distribution of values.
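As a sketch of this option, the following shows how hashfield might be set through additional_options. The database, table, and column names are placeholders, not values from this article, and the from_catalog call is shown commented out because it requires a running AWS Glue job:

```python
# Hypothetical sketch: partition JDBC reads on a non-numeric column.
# "order_date" is a placeholder for a column with evenly distributed values.
additional_options = {"hashfield": "order_date"}

# In a Glue job, the options would be passed when you create the DynamicFrame:
# datasource0 = glueContext.create_dynamic_frame.from_catalog(
#     database="your_database",
#     table_name="your_table",
#     additional_options=additional_options,
#     transformation_ctx="datasource0",
# )
print(additional_options["hashfield"])
```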

If the table has numeric columns, then set the hashpartitions and hashexpression options in the table or while you create the DynamicFrame. For more information, see Reading from JDBC tables in parallel.

The following is an example of how to set hashpartitions and hashexpression when you create a DynamicFrame with a JDBC connection:

connection_option = {
    "url": "jdbc:mysql://mysql-instance1.123456789012.us-east-1.rds.amazonaws.com:3306/database",
    "user": "your_user_name",
    "password": "your_password",
    "dbtable": "your_table",
    "hashexpression": "column_name",
    "hashpartitions": "10"
}
datasource0 = glueContext.create_dynamic_frame.from_options(
    "mysql",
    connection_options=connection_option,
    transformation_ctx="datasource0"
)

Note: Replace the JDBC URL, your_user_name, your_password, your_table, and column_name with your information.

The following is an example of how to set hashpartitions and hashexpression when you create a DynamicFrame from the AWS Glue Data Catalog:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="your_database",
    table_name="your_table",
    additional_options={"hashexpression": "column_name", "hashpartitions": "10"},
    transformation_ctx="datasource0"
)

Note: Larger values for hashpartitions can reduce your table's read performance. That's because each task reads the entire table and returns only its own set of rows to the executor.

Use a SQL expression as a pushdown predicate

Note: The following SQL expression doesn't work as a pushdown predicate for Oracle databases. This expression works as a pushdown predicate for all other databases that AWS Glue natively supports. These databases include Amazon Aurora, MariaDB, Microsoft SQL Server, MySQL, and PostgreSQL.

If the table contains billions of records and tebibytes (TiB) of data, then the job might take a long time to complete or fail with lost nodes. This delay or failure can occur even after you set hashpartitions and hashexpression. To resolve these issues, use a SQL expression similar to the following with the hashexpression option:

column_name > 1000 AND column_name < 2000 AND column_name

The SQL expression acts as a pushdown predicate. The expression forces the job to read one set of rows per job run, rather than read all the data at once. The full statement looks similar to the following:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="sampledb",
    table_name="test_table",
    additional_options={"hashexpression": "column_name > 1000 AND column_name < 2000 AND column_name", "hashpartitions": "10"},
    transformation_ctx="datasource0"
)

Note: Turn off job bookmarks for initial job runs with this configuration. When you run a job with a job bookmark, AWS Glue records the maximum value of the column. When you run the job again, AWS Glue processes only the rows that have values greater than the previous bookmark value. Turn on job bookmarks during the last job run, as needed.
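One way to control bookmarks per run is the --job-bookmark-option job argument. The following is a minimal sketch that assumes you start the job with boto3; the job name is a placeholder, and the API call is shown commented out because it requires AWS credentials and an existing job:

```python
# Hypothetical sketch: turn off job bookmarks for the initial run.
# "my-migration-job" is a placeholder job name.
run_args = {"--job-bookmark-option": "job-bookmark-disable"}

# Example call (not run here):
# import boto3
# boto3.client("glue").start_job_run(
#     JobName="my-migration-job",
#     Arguments=run_args,
# )
```

For later runs, the same argument accepts job-bookmark-enable to turn bookmarks back on.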

Related information

Why is my AWS Glue job failing with the error "Exit status: -100. Diagnostics: Container released on a *lost* node"?

Connecting to data

AWS OFFICIAL
Updated 8 months ago