Hi,
Small disclaimer: I have not tested this, so my theory is not proven.
My understanding is that you are repartitioning the data down to a single partition (to get a single output file) using `repartition(1)` or `coalesce(1)`.
Now you have to consider that Spark runs on a distributed cluster, and each partition is handled by a different executor. So in a normal run, even if the data is sorted when it is read from Oracle, it may be split across executors and merged back afterwards without preserving the sort order. This is why the output is not sorted when Auto Scaling is unchecked.
Now, when Auto Scaling is enabled, you are telling Glue to start only the number of executors actually needed. Combined with Spark's lazy evaluation and your `repartition(1)`, this could lead Glue to start just one executor, which then reads and writes the data in your sorted order.
To validate this, you could look at the Spark UI for the two jobs and check how many executors are running at any time during the job.
Hope this helps,
Thank you for your reply. Your explanation was a great help.