StackOverflowError on joins in AWS Glue


Hello,

We are trying to join some dataframes in Glue using Spark and Python.

The dataframes are created from the same source table, but because we apply roughly 1000 withColumn operations to rename, divide and add column values, we need to split the table into several dataframes using select; otherwise the runtime blows up.

Writing all those single dataframes to the AWS Glue Catalog (using Iceberg as the table format) works for a dataframe with 2(!) rows. For dataframes with 30+ rows, our job fails with a StackOverflowError when the join method is called six times (as shown below). Joining only two dataframes works. All dataframes have exactly the same number of rows, but vary in the number of columns (between 20 and 100 columns).

result_table = (
    kpi_results[0]["data"]
    .join(kpi_results[1]["data"], on=join_columns, how="inner")
    .join(kpi_results[2]["data"], on=join_columns, how="inner")
    .join(kpi_results[3]["data"], on=join_columns, how="inner")
    .join(kpi_results[4]["data"], on=join_columns, how="inner")
    .join(kpi_results[5]["data"], on=join_columns, how="inner")
    .join(kpi_results[6]["data"], on=join_columns, how="inner")
    .na.fill(0)
)

kpi_results is a list of dictionaries in which the data key holds a dataframe and another key names which columns, grouped by business logic, are in that dataframe. Since this code works for a dataframe containing 2 rows, this should not be an issue.

We are using 12 DPUs on a G.2X worker type with the following configs set: --conf spark.driver.maxResultSize=28g --conf spark.driver.memory=28g --conf spark.executor.pyspark.memory=28g --conf spark.executor.memory=28g --conf spark.executor.extraJavaOptions=-Xss512m --conf spark.driver.extraJavaOptions=-Xss512m

The Spark UI indicates that there isn't any memory issue, since every executor has enough memory and the input data is only 26 MiB.

Julian
Asked 7 months ago, 358 views
1 answer

The stack trace will give you key information about what is overflowing and whether it happens during planning (more likely) or execution.
Often you can work around it just by saving the joined DF in a variable and then joining that, instead of chaining many joins with many columns.
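
The workaround above could be sketched roughly as follows. This is only an illustration, not a tested fix for this job: join_stepwise is a hypothetical helper name, and the truncate_lineage option is an extra assumption that goes beyond the advice above. It calls DataFrame.localCheckpoint() after each join, which Spark documents as a way to truncate the logical plan. The helper only relies on the objects having a join method, so in a Glue job you would pass it the PySpark dataframes from kpi_results.

```python
def join_stepwise(frames, on, how="inner", truncate_lineage=False):
    """Join dataframes one at a time, keeping each intermediate result
    in a variable instead of writing one long chained expression.

    frames: list of objects with a .join(other, on=..., how=...) method
            (PySpark DataFrames in a Glue job).
    truncate_lineage: assumption, not part of the original advice. If True,
            call localCheckpoint() after every join to cut the query plan.
    """
    if not frames:
        raise ValueError("frames must not be empty")

    result = frames[0]
    for frame in frames[1:]:
        # Store the joined dataframe, then join the next one onto it.
        result = result.join(frame, on=on, how=how)
        if truncate_lineage:
            result = result.localCheckpoint()
    return result


# Hypothetical usage with the names from the question:
# frames = [entry["data"] for entry in kpi_results[:7]]
# result_table = join_stepwise(frames, on=join_columns).na.fill(0)
```

If the stack trace shows the overflow happening during planning, trying truncate_lineage=True is one thing to experiment with, since it keeps the plan from growing with every join.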

AWS
Expert
Answered 7 months ago
