Configure AWS Glue Spark shuffle plugin with Amazon S3 in the code

Can I specify the S3 bucket where shuffle files are written by the "AWS Glue Spark shuffle plugin with Amazon S3" (https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-shuffle-manager.html) in the (PySpark) code of my Glue job as well, or only via the job parameter --conf spark.shuffle.glue.s3ShuffleBucket=s3://<shuffle-bucket>? It looks like

import pyspark
import awsglue.context

# Set the shuffle bucket on the SparkConf before the SparkContext is created.
spark_config = pyspark.conf.SparkConf()
spark_config.set("spark.shuffle.glue.s3ShuffleBucket", f"s3://{shuffle_data_bucket}/")
...
spark_context = pyspark.context.SparkContext(conf=spark_config)
glue_context = awsglue.context.GlueContext(spark_context)

does not take effect, although the same approach works for other Glue/Spark settings.

Asked 5 months ago, viewed 305 times
1 Answer
Accepted Answer

I found that it works with the settings explained here: https://docs.aws.amazon.com/glue/latest/dg/cloud-shuffle-storage-plugin.html.

spark_config.set("spark.shuffle.storage.path", f"s3://{shuffle_data_bucket}/")
spark_config.set("spark.shuffle.sort.io.plugin.class", "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin")
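
For reuse across jobs, the two settings above could be wrapped in a small helper. This is only a sketch: the function names `build_shuffle_settings` and `apply_settings` are hypothetical and not part of Glue or PySpark, and it assumes `shuffle_data_bucket` holds a bare bucket name without the `s3://` prefix, as in the snippets above.

```python
from typing import Dict

# Plugin class name exactly as given in the answer above.
CLOUD_SHUFFLE_PLUGIN = "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin"


def build_shuffle_settings(shuffle_data_bucket: str) -> Dict[str, str]:
    """Return the two Spark settings from the answer for a bare bucket name.

    Hypothetical helper: assumes the argument has no "s3://" prefix.
    """
    bucket = shuffle_data_bucket.strip().strip("/")
    if not bucket or bucket.startswith("s3:"):
        raise ValueError("expected a bare bucket name such as 'my-shuffle-bucket'")
    return {
        "spark.shuffle.storage.path": f"s3://{bucket}/",
        "spark.shuffle.sort.io.plugin.class": CLOUD_SHUFFLE_PLUGIN,
    }


def apply_settings(spark_config, settings: Dict[str, str]):
    """Copy each setting onto a SparkConf-like object exposing set(key, value)."""
    for key, value in settings.items():
        spark_config.set(key, value)
    return spark_config
```

In a Glue job this would presumably be called as `apply_settings(pyspark.conf.SparkConf(), build_shuffle_settings(shuffle_data_bucket))` before the SparkContext is created, since the configuration is read when the context starts.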
Answered 5 months ago
Expert
Reviewed 2 months ago
