1 Answer
Here are some tips that might help you enhance the efficiency of your ETL jobs:
- Partition your data on keys that are frequently queried or filtered on.
- Bucketing can also be useful for distributing data across multiple files in a more organized manner, especially for joins on large tables.
- Tune Spark configurations (executor memory and cores, shuffle partitions, adaptive execution) to fit your workload's memory and CPU profile.
- Use efficient columnar storage formats like Parquet or ORC. These formats are highly optimized for read performance and compression.
- When fetching data over JDBC, increase the fetch size to reduce the number of round trips to the database.
- Ensure that data is evenly distributed across partitions to avoid data skew, which can lead to performance bottlenecks.
- Cache intermediate datasets that are reused during the computation.
- Depending on your workload, consider scaling up to larger worker instances or scaling out to more of them.
- Ensure that your database is optimized for read performance.
- If possible, split your data extraction process into multiple parallel reads.
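On the configuration point, a few settings are common starting places. This is a sketch only; the values below are assumptions to tune for your own cluster and data volume, not universal recommendations:

```python
# Sketch: common Spark tuning knobs with assumed starting values.
# Every number here should be adjusted to your cluster size and data volume.
spark_confs = {
    "spark.sql.shuffle.partitions": "200",             # partitions produced by shuffles/joins
    "spark.sql.adaptive.enabled": "true",              # let AQE coalesce small or skewed partitions
    "spark.sql.files.maxPartitionBytes": "134217728",  # ~128 MB per file input split
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",  # faster serialization
}

# Usage inside a job (builder pattern):
# builder = SparkSession.builder.appName("etl")
# for key, value in spark_confs.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```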
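The fetch-size and parallel-read points can be combined in a single JDBC extract. A minimal sketch, where the URL, table name, partition column and bounds are all hypothetical placeholders for your own source:

```python
# Sketch: options for a parallel JDBC extract with a larger fetch size.
# url, dbtable, partitionColumn and the bounds are hypothetical values.
jdbc_options = {
    "url": "jdbc:postgresql://db-host:5432/mydb",  # hypothetical connection string
    "dbtable": "public.orders",                    # hypothetical source table
    "fetchsize": "10000",           # rows per round trip; JDBC driver defaults are small
    "partitionColumn": "order_id",  # numeric or date column to split the read on
    "lowerBound": "1",              # min value of partitionColumn
    "upperBound": "50000000",       # max value of partitionColumn
    "numPartitions": "16",          # 16 concurrent reads against the database
}

# Usage inside a job:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```

Spark splits the range between `lowerBound` and `upperBound` into `numPartitions` queries, so each executor reads its own slice in parallel.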
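The partitioning, bucketing and columnar-format points can all be applied at write time. A minimal sketch, assuming hypothetical column names `event_date` and `customer_id`:

```python
def write_optimized(df, table_name):
    """Sketch: partition by a frequently filtered column and bucket on a
    join key. 'event_date' and 'customer_id' are hypothetical names."""
    (df.write
       .mode("overwrite")
       .partitionBy("event_date")    # filters on event_date prune whole directories
       .bucketBy(32, "customer_id")  # co-locates the join key into 32 buckets
       .sortBy("customer_id")        # sorted buckets help sort-merge joins
       .format("parquet")            # columnar, compressed storage
       .saveAsTable(table_name))     # bucketing requires saveAsTable, not a plain path write
```

Note that `bucketBy` only works with `saveAsTable` (a catalog-backed table), not with path-based writes like `.parquet(path)`.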
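For even distribution, the partition count itself matters. A simple sketch for deriving it from a source row count; the one-million-row default is an assumption, and in practice sizing by bytes (roughly 128 MB per partition) is often a better guide than a fixed row count, since row width varies:

```python
import math

def num_partitions(row_count, target_rows_per_partition=1_000_000):
    """Sketch: derive a partition count from the source row count.
    The 1M-row target is an assumption; wide rows warrant fewer rows
    per partition, narrow rows more."""
    return max(1, math.ceil(row_count / target_rows_per_partition))

# e.g. num_partitions(25_500_000) -> 26 partitions
# Usage inside a job (repartition key is hypothetical):
# df = df.repartition(num_partitions(source_row_count), "order_id")
```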
If this has answered your question or was helpful, accepting the answer would be greatly appreciated. Thank you!
Thank you for your answer, it is helpful indeed. I am already doing some of your points. Could you elaborate on the Spark configuration? Also, I have difficulty finding how many rows per partition I should have. I'm building a dynamic system, meaning I have a function that counts the number of source rows and creates the partitions accordingly. Should each partition have 1 million rows for a huge table, for example?