1 Answer
In PySpark, writing the same DataFrame to multiple locations requires two separate write statements, and each write repeats the costly distribution of data into partitions, hence the slowness. A more efficient approach is to write once to OUTPUT_LOCATION_1 and then copy that output to OUTPUT_LOCATION_2 outside of PySpark with cp. If you keep both write statements in Spark, you can try repartitioning to a fixed number of partitions (for example, 5) before writing to see whether it helps performance:
result.repartition(5).write.partitionBy("col2").mode("append").parquet(f"{OUTPUT_LOCATION_1}/end_date={event_end_date}")
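For the copy-outside-of-PySpark option, here is a minimal sketch. It assumes both locations are S3 prefixes and that the AWS CLI is installed and configured, neither of which is stated above. The helper names `build_copy_command` and `copy_output` and the bucket names are hypothetical.

```python
import subprocess
from typing import List


def build_copy_command(source: str, destination: str) -> List[str]:
    """Build an `aws s3 cp` command that copies everything under `source`.

    Assumption: both arguments are S3 URIs such as s3://bucket/prefix.
    """
    # --recursive copies every object under the source prefix.
    return ["aws", "s3", "cp", source, destination, "--recursive"]


def copy_output(source: str, destination: str) -> None:
    """Run the copy; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(build_copy_command(source, destination), check=True)


# Example call after the single Spark write has finished
# (hypothetical bucket names and date):
# copy_output(
#     "s3://example-bucket-1/output/end_date=2024-01-31",
#     "s3://example-bucket-2/output/end_date=2024-01-31",
# )
```

Run the copy only after the Spark write to the first location has completed, so that the partition directories under it are final.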