Configure AWS Glue Spark shuffle plugin with Amazon S3 in the code

Can I also specify the S3 bucket to which the "AWS Glue Spark shuffle plugin with Amazon S3" (https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-shuffle-manager.html) writes shuffle files in the (PySpark) code of my Glue job, or only via the job parameter --conf spark.shuffle.glue.s3ShuffleBucket=s3://<shuffle-bucket>? It looks like

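import awsglue.context
import pyspark.conf
import pyspark.context
# shuffle_data_bucket holds the bucket name and is defined elsewhere in the job
spark_config = pyspark.conf.SparkConf()
spark_config.set("spark.shuffle.glue.s3ShuffleBucket", f"s3://{shuffle_data_bucket}/")
...
spark_context = pyspark.context.SparkContext(conf=spark_config)
glue_context = awsglue.context.GlueContext(spark_context)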

does not take effect, although other Glue/Spark settings can be set this way.

asked 5 months ago · 304 views
1 Answer
Accepted Answer

I found that it works as explained here: https://docs.aws.amazon.com/glue/latest/dg/cloud-shuffle-storage-plugin.html.

spark_config.set("spark.shuffle.storage.path", f"s3://{shuffle_data_bucket}/")
spark_config.set("spark.shuffle.sort.io.plugin.class", "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin")
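
For anyone who wants the whole thing in one place, here is a minimal sketch of how the two settings above could be wired into a job. The helper names shuffle_settings and build_glue_context and the bucket name "my-shuffle-bucket" are made up for illustration; only the two property keys and the plugin class come from the answer. I am assuming the properties have to be on the SparkConf before the SparkContext is created, as in the question's code.

```python
from typing import Dict


def shuffle_settings(bucket: str) -> Dict[str, str]:
    """Return the two Spark properties from the answer for a bare bucket name.

    `bucket` is a hypothetical parameter: the bucket name without the
    "s3://" scheme and without slashes, e.g. "my-shuffle-bucket".
    """
    if not bucket or "/" in bucket or bucket.startswith("s3:"):
        raise ValueError("pass the bare bucket name, e.g. 'my-shuffle-bucket'")
    return {
        "spark.shuffle.storage.path": f"s3://{bucket}/",
        "spark.shuffle.sort.io.plugin.class":
            "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin",
    }


def build_glue_context(bucket: str):
    """Create a GlueContext whose SparkConf carries the shuffle settings.

    The imports are local so that shuffle_settings() can be checked on a
    machine without the Glue/PySpark runtime installed.
    """
    from awsglue.context import GlueContext
    from pyspark.conf import SparkConf
    from pyspark.context import SparkContext

    conf = SparkConf()
    # setAll takes a list of (key, value) pairs.
    conf.setAll(list(shuffle_settings(bucket).items()))
    # Assumption: the properties must be set before the context exists.
    return GlueContext(SparkContext(conf=conf))
```

Inside the job script this would be called as glue_context = build_glue_context("my-shuffle-bucket"), with your own bucket name.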
answered 5 months ago
EXPERT reviewed 2 months ago
