Glue ETL script, Java keeps running out of memory


Hi all, I'm relatively new to Glue, but I've built a Python ETL script that works pretty well. It reads two CSV files into dataframes, unions them into one normalized dataframe, and then uses that to create other dataframes that I write to a MySQL database over JDBC.

This works well when the DB is empty (in my sandbox), but if I run it a second time with a different set of data files, it runs for an hour and then CloudWatch reports that Java is out of memory. The input files aren't particularly large, roughly 55K lines of CSV data. When I run the script with the second set of input data and watch the job's metrics in the AWS Console, I see only one executor running, which eventually uses all the resources and is removed for lack of memory; I see no other workers running at all. At that point the system just seems to hang for an hour until the job ends, presumably on a timeout. Can anyone help me understand what's going on and how to address this problem? Thanks! Doug
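For reference, here is a minimal sketch of the kind of pipeline described above (the bucket, table, and connection details are hypothetical; the actual script was not posted):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the two input CSVs (paths are hypothetical)
df_a = spark.read.csv("s3://my-bucket/input/file_a.csv", header=True)
df_b = spark.read.csv("s3://my-bucket/input/file_b.csv", header=True)

# Union by column name so matching columns line up even if order differs
normalized = df_a.unionByName(df_b)

# Write to MySQL over JDBC; batchsize bounds the size of each insert batch
(normalized.write
    .format("jdbc")
    .option("url", "jdbc:mysql://my-host:3306/my_db")  # hypothetical endpoint
    .option("dbtable", "normalized_table")             # hypothetical table
    .option("user", "my_user")
    .option("password", "my_password")
    .option("batchsize", "1000")
    .mode("append")
    .save())
```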

asked 4 months ago · 206 views
1 Answer

I think the issue is likely due to the way AWS Glue handles repeated runs of the same job: when you run the same job multiple times with different input data, Glue can end up reusing the same executors and resources from the previous run, which can lead to memory issues. You may consider:

1) Increasing the worker memory.
2) Using Glue's built-in DynamicFrame capabilities.
3) Using Glue Data Catalog partitioning.
4) Using Glue's bounded execution.

A sketch of options 2 and 4 appears below. For troubleshooting, see this reference: https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html
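As a rough illustration of options 2 and 4, here is a minimal sketch assuming the input data is registered in the Glue Data Catalog (the database and table names are hypothetical):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Option 2: read through Glue's DynamicFrame API so Glue manages the scan
# and schema resolution, instead of a plain Spark CSV read.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sandbox_db",         # hypothetical Data Catalog database
    table_name="input_csv_table",  # hypothetical Data Catalog table
    # Option 4: bounded execution caps how much data a single run processes.
    additional_options={
        "boundedSize": "1073741824"   # at most ~1 GiB of input per run
        # or "boundedFiles": "200"    # at most 200 input files per run
    },
)

# Spreading the data across more partitions before the JDBC write can keep
# the work from collapsing onto a single executor.
df = dyf.toDF().repartition(8)
```

Increasing worker memory (option 1) is a job-level setting rather than a code change: in the job's configuration you can select a larger worker type such as G.2X, which provides more memory per worker.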

AWS
answered 4 months ago
