1 Resposta
- Mais recentes
- Mais votos
- Mais comentários
1
In pyspark, to write the same dataframe to multiple locations, you need to have two write statements but the distribution to partitions is the costly action hence the slowness. Efficient way is to copy the output from OUTPUT_LOCATION_1 to OUTPUT_LOCATION_2 outside of pyspark through cp. In spark, you can try to repartition with a specified number(example:5) before writing to see if helps the performance with two write statements.
result.repartition(5).write.partitionBy("col2").mode("append").parquet(f"{OUTPUT_LOCATION_1}/end_date={event_end_date}")
Conteúdo relevante
- AWS OFICIALAtualizada há 2 meses
- AWS OFICIALAtualizada há 2 anos