1 Answer
Here are some tips that might help you enhance the efficiency of your ETL jobs:
- Partition your data on keys that are frequently queried or filtered on.
- Bucketing can also be useful for distributing data across multiple files in a more organized manner, especially for joins on large tables.
- Tune Spark configurations (executor memory and cores, shuffle partitions, adaptive execution) to fit your workload's memory and CPU profile.
- Use efficient columnar storage formats like Parquet or ORC. These formats are highly optimized for read performance and compression.
- When fetching data over JDBC, increase the fetch size to reduce the number of round trips to the database.
- Ensure that data is evenly distributed across partitions to avoid data skew, which can lead to performance bottlenecks.
- Cache intermediate datasets that are reused during the computation.
- Depending on your workload, consider scaling up to larger worker instances or scaling out to more of them.
- Ensure that your database is optimized for read performance.
- If possible, split your data extraction process into multiple parallel reads.
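On the configuration point, a few settings are common starting places. This is a sketch only; the values below are assumptions to tune for your own cluster and data volume, not universal recommendations:

```python
# Sketch: common Spark tuning knobs with assumed starting values.
# Every number here should be adjusted to your cluster size and data volume.
spark_confs = {
    "spark.sql.shuffle.partitions": "200",             # partitions produced by shuffles/joins
    "spark.sql.adaptive.enabled": "true",              # let AQE coalesce small or skewed partitions
    "spark.sql.files.maxPartitionBytes": "134217728",  # ~128 MB per file input split
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",  # faster serialization
}

# Usage inside a job (builder pattern):
# builder = SparkSession.builder.appName("etl")
# for key, value in spark_confs.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```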
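The fetch-size and parallel-read points can be combined in a single JDBC extract. A minimal sketch, where the URL, table name, partition column and bounds are all hypothetical placeholders for your own source:

```python
# Sketch: options for a parallel JDBC extract with a larger fetch size.
# url, dbtable, partitionColumn and the bounds are hypothetical values.
jdbc_options = {
    "url": "jdbc:postgresql://db-host:5432/mydb",  # hypothetical connection string
    "dbtable": "public.orders",                    # hypothetical source table
    "fetchsize": "10000",           # rows per round trip; JDBC driver defaults are small
    "partitionColumn": "order_id",  # numeric or date column to split the read on
    "lowerBound": "1",              # min value of partitionColumn
    "upperBound": "50000000",       # max value of partitionColumn
    "numPartitions": "16",          # 16 concurrent reads against the database
}

# Usage inside a job:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```

Spark splits the range between `lowerBound` and `upperBound` into `numPartitions` queries, so each executor reads its own slice in parallel.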
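The partitioning, bucketing and columnar-format points can all be applied at write time. A minimal sketch, assuming hypothetical column names `event_date` and `customer_id`:

```python
def write_optimized(df, table_name):
    """Sketch: partition by a frequently filtered column and bucket on a
    join key. 'event_date' and 'customer_id' are hypothetical names."""
    (df.write
       .mode("overwrite")
       .partitionBy("event_date")    # filters on event_date prune whole directories
       .bucketBy(32, "customer_id")  # co-locates the join key into 32 buckets
       .sortBy("customer_id")        # sorted buckets help sort-merge joins
       .format("parquet")            # columnar, compressed storage
       .saveAsTable(table_name))     # bucketing requires saveAsTable, not a plain path write
```

Note that `bucketBy` only works with `saveAsTable` (a catalog-backed table), not with path-based writes like `.parquet(path)`.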
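For even distribution, the partition count itself matters. A simple sketch for deriving it from a source row count; the one-million-row default is an assumption, and in practice sizing by bytes (roughly 128 MB per partition) is often a better guide than a fixed row count, since row width varies:

```python
import math

def num_partitions(row_count, target_rows_per_partition=1_000_000):
    """Sketch: derive a partition count from the source row count.
    The 1M-row target is an assumption; wide rows warrant fewer rows
    per partition, narrow rows more."""
    return max(1, math.ceil(row_count / target_rows_per_partition))

# e.g. num_partitions(25_500_000) -> 26 partitions
# Usage inside a job (repartition key is hypothetical):
# df = df.repartition(num_partitions(source_row_count), "order_id")
```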
If this has answered your question or was helpful, accepting the answer would be greatly appreciated. Thank you!
Thank you for your answer, it is helpful indeed. I am already doing some of your points. Could you elaborate on the Spark configuration? Also, I have difficulty finding how many rows per partition I should have. I'm building a dynamic system, meaning I have a function that counts the number of source rows and creates the partitions accordingly. Should each partition have 1 million rows for a huge table, for example?