Spark shuffles a huge amount of data even though the input data is not huge

I am reading a few GB (about 15 GB) of skewed Parquet data. After a few transformations, such as changing the data type of some columns, I repartition the data (dataframe.repartition(120)) before writing it to S3 in gzip-compressed CSV format. This results in a huge amount of shuffle writes, as can be seen in the Spark UI: the input size is 15 GB, but the shuffle write is 600 GB.

I would like to know why this is happening.

Topics

Analytics Database Storage

Tags

AWS Glue Extract Transform & Load Data Amazon GameSparks S3 Select

Bibhu

asked 23 days ago, 263 views

1 Answer

The shuffle size is normally larger than the input size (e.g. 2x) because the shuffle compresses data row by row, while Parquet's columnar compression is much more efficient.
A difference this large must mean that your data has many columns with repeated values.
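
To illustrate the point about repeated values, here is a minimal sketch of run-length encoding, one of the encodings Parquet can apply to a column. The column contents and the `run_length_encode` helper are hypothetical, and this is a simplification rather than how Parquet or Spark is actually implemented.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


# Hypothetical column: 1,000,000 rows, but only three runs of repeated values.
# Nulls (None) form a run just like any other repeated value.
status = ["active"] * 600_000 + [None] * 300_000 + ["closed"] * 100_000

encoded = run_length_encode(status)
print(encoded)      # [('active', 600000), (None, 300000), ('closed', 100000)]
print(len(status))  # 1000000 values when every row carries its own copy
```

In a columnar file the whole column can be described by three pairs. Once the data is turned into rows for a shuffle, each row carries its own copy of the value, so much of this saving can be lost even though shuffle blocks are also compressed.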

EXPERT

Gonzalo Herreros

answered 22 days ago

Bibhu
21 days ago
Can you please explain a bit more? I don't have repeated values as such, but a few values are null.

How can we optimise this? When I tried writing to S3 in CSV format without the repartition, the output was 500 GB of data.
Gonzalo Herreros EXPERT
21 days ago
Avoid the shuffle if you can. Otherwise, don't worry too much about the amount; the transfer is quite fast.
Bibhu
21 days ago
The data is skewed, so I am using repartition to distribute it evenly, which results in the huge shuffle writes. Even without the repartition, the job takes around 1 hour to complete with G.2X workers and 60 DPUs.
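
As a rough illustration of why evening out skew this way moves so much data, here is a plain-Python sketch of round-robin redistribution, which is similar in spirit to what dataframe.repartition(n) does when no columns are given. The `round_robin_repartition` helper and the partition sizes are made up for illustration, and Spark's actual implementation differs in detail.

```python
def round_robin_repartition(partitions, num_partitions):
    """Deal the rows of every input partition out to the new partitions in turn.

    Returns the new partitions and the number of rows that were written out
    for redistribution (every row, in this simplified model).
    """
    new_partitions = [[] for _ in range(num_partitions)]
    shuffled_rows = 0
    for part in partitions:
        for i, row in enumerate(part):
            new_partitions[i % num_partitions].append(row)
            shuffled_rows += 1
    return new_partitions, shuffled_rows


# Hypothetical skewed input: one partition holds 90% of the rows.
skewed = [list(range(900)), list(range(50)), list(range(50))]

balanced, shuffled_rows = round_robin_repartition(skewed, 4)
print([len(p) for p in balanced])  # [251, 251, 249, 249]
print(shuffled_rows)               # 1000
```

The output partitions end up nearly equal in size, but only because every row was written out and redistributed. That is why the shuffle write reflects the full dataset in row form rather than just the skewed part.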
Bibhu
19 days ago
This Parquet data is read directly from Glue Catalog tables.
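
Putting the figures reported in this thread side by side may help. The short calculation below uses only the numbers quoted above (15 GB Parquet input, 600 GB shuffle write, 500 GB CSV output without the repartition); the variable names are just for illustration.

```python
# Figures reported in this thread, in GB.
parquet_input_gb = 15
shuffle_write_gb = 600
csv_output_gb = 500  # CSV written without the repartition

shuffle_vs_input = shuffle_write_gb / parquet_input_gb
csv_vs_input = csv_output_gb / parquet_input_gb
shuffle_vs_csv = shuffle_write_gb / csv_output_gb

print(f"shuffle / Parquet input: {shuffle_vs_input:.0f}x")  # 40x
print(f"CSV / Parquet input:     {csv_vs_input:.1f}x")      # 33.3x
print(f"shuffle / CSV output:    {shuffle_vs_csv:.1f}x")    # 1.2x
```

The shuffle write is about 40 times the Parquet input but only about 1.2 times the CSV output. This suggests that the shuffle size tracks the row-oriented size of the data rather than the size of the compressed Parquet files.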
