Configure AWS Glue Spark shuffle plugin with Amazon S3 in the code


Can I also specify the S3 bucket where shuffle files are written by the "AWS Glue Spark shuffle plugin with Amazon S3" (https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-shuffle-manager.html) in the (PySpark) code of my Glue job, or only via the job parameter --conf spark.shuffle.glue.s3ShuffleBucket=s3://<shuffle-bucket>? It looks like

spark_config = pyspark.conf.SparkConf()
spark_config.set("spark.shuffle.glue.s3ShuffleBucket", f"s3://{shuffle_data_bucket}/")
...
spark_context = pyspark.context.SparkContext(conf=spark_config)
glue_context = awsglue.context.GlueContext(spark_context)

does not work, even though setting other Glue/Spark options this way does.

Asked 5 months ago, 304 views
1 Answer
Accepted Answer

I found that it works as explained here: https://docs.aws.amazon.com/glue/latest/dg/cloud-shuffle-storage-plugin.html.

spark_config.set("spark.shuffle.storage.path", f"s3://{shuffle_data_bucket}/")
spark_config.set("spark.shuffle.sort.io.plugin.class", "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin")
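
For completeness, here is a minimal sketch that combines the setup from the question with the two settings above. The helper names `cloud_shuffle_settings` and `build_glue_context` are hypothetical, and the bucket name is a placeholder. It assumes, as in the question's code, that the settings are applied to the SparkConf before the SparkContext is created; I have not verified other orderings.

```python
# Plugin class name taken from the answer above.
CHOPPER_PLUGIN = "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin"


def cloud_shuffle_settings(shuffle_data_bucket):
    """Return the two Spark settings from the answer as a dict (hypothetical helper)."""
    return {
        "spark.shuffle.storage.path": f"s3://{shuffle_data_bucket}/",
        "spark.shuffle.sort.io.plugin.class": CHOPPER_PLUGIN,
    }


def build_glue_context(shuffle_data_bucket):
    """Create a GlueContext with the shuffle settings applied (hypothetical helper)."""
    # Imported here so the settings helper can be used without pyspark/awsglue.
    import pyspark.conf
    import pyspark.context
    import awsglue.context

    spark_config = pyspark.conf.SparkConf()
    for key, value in cloud_shuffle_settings(shuffle_data_bucket).items():
        spark_config.set(key, value)

    # Assumption: the settings must be on the conf before the context exists.
    spark_context = pyspark.context.SparkContext(conf=spark_config)
    return awsglue.context.GlueContext(spark_context)
```

In a Glue job you would then call something like `glue_context = build_glue_context("my-shuffle-bucket")`, where the bucket name is whatever bucket your job role can write to.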
Answered 5 months ago
Reviewed by an expert 2 months ago
