How should I configure my EMR cluster to handle large data?

I have an EMR cluster, and I use the Treasure Data connector to read data from tables into DataFrames with PySpark. The tables I am trying to read have approximately 100 million to 500 million rows. The read itself is fast and completes in a few seconds, but a count or show operation takes about an hour, which is far too long. I have configured my cluster as shown in the attached images.

Can you help me debug the issue and configure the cluster correctly?

Asked a month ago, 310 views
1 Answer

Hello,

Thank you for writing on re:Post.

I understand that you want to improve the performance of your current EMR cluster when working with large datasets.

First, I suggest tuning your Spark memory parameters using the AWS best practices guide below. It will help you tune the driver and executor memory parameters according to the instance type being used, since the default executor memory of 8g looks low. [+] https://aws.github.io/aws-emr-best-practices/docs/bestpractices/Applications/Spark/troubleshooting_and_tuning/
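As a rough illustration of the kind of sizing the guide walks through, here is a minimal Python sketch. The heuristic used (about 5 cores per executor, one vCPU left for the OS and daemons, roughly 10% of each container reserved as memory overhead with a 384 MB floor) is a common rule of thumb assumed for this example, not something read from your cluster. The function name and the example node size are hypothetical; substitute the vCPU count and the `yarn.nodemanager.resource.memory-mb` value for your actual instance type.

```python
def suggest_executor_settings(node_vcpus, node_yarn_memory_mb,
                              cores_per_executor=5, overhead_fraction=0.10):
    """Return (spark_conf, executors_per_node) for one worker node.

    Rule-of-thumb sizing only; all defaults here are assumptions.
    """
    if node_vcpus < 2 or node_yarn_memory_mb <= 0:
        raise ValueError("need at least 2 vCPUs and a positive YARN memory size")

    # Leave one vCPU for the OS and Hadoop daemons.
    usable_vcpus = node_vcpus - 1
    cores = min(cores_per_executor, usable_vcpus)
    executors_per_node = max(1, usable_vcpus // cores)

    # Each YARN container must hold the executor heap plus its overhead.
    container_mb = node_yarn_memory_mb // executors_per_node
    heap_candidate_mb = int(container_mb / (1 + overhead_fraction))
    overhead_mb = max(384, container_mb - heap_candidate_mb)
    heap_mb = container_mb - overhead_mb
    if heap_mb <= 0:
        raise ValueError("node memory is too small for even one executor")

    spark_conf = {
        "spark.executor.cores": str(cores),
        "spark.executor.memory": f"{heap_mb}m",
        "spark.executor.memoryOverhead": f"{overhead_mb}m",
    }
    return spark_conf, executors_per_node


# Hypothetical node: 16 vCPUs with 57344 MB available to YARN.
conf, per_node = suggest_executor_settings(16, 57344)
print(f"executors per node: {per_node}")
for key, value in conf.items():
    print(f"{key}={value}")
```

The resulting `spark.*` entries could then be passed with `spark-submit --conf key=value` or `SparkSession.builder.config(key, value)`. Treat the numbers as a starting point and validate them against what the Spark UI shows for your job.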

Second, I recommend reading about maximizeResourceAllocation. This option lets the executors use the maximum resources possible on each node in the cluster. [+] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
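For reference, this option is enabled through the `spark` configuration classification when the cluster is created. The Python sketch below only builds and prints that JSON. How you supply it (the console, `aws emr create-cluster --configurations`, or another tool) depends on your setup, and the variable names are just illustrative.

```python
import json

# EMR configuration classification that turns on maximizeResourceAllocation.
emr_configurations = [
    {
        "Classification": "spark",
        "Properties": {"maximizeResourceAllocation": "true"},
    }
]

# Serialize so it can be saved to a file or pasted into the cluster settings.
configurations_json = json.dumps(emr_configurations, indent=2)
print(configurations_json)
```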

Next, please check the cluster's CloudWatch metrics while the count or show is running to see whether a shortage of resources is causing the delay, especially ContainerPending and YARNMemoryAvailablePercentage. If they show high load (many pending containers, little available YARN memory), you may need to increase the maximum cluster size in the Managed Scaling settings. Also note that the limits in Managed Scaling are set in units, which are not the same as the number of nodes. In an instance fleet, every node has a weight that can be specified during cluster creation. Please check that weight and set the scaling limits accordingly.
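To make these two checks concrete, here is a small Python sketch. It assumes you have already copied the `ContainerPending` and `YARNMemoryAvailablePercentage` datapoints for the slow run (for example from the CloudWatch console) into plain lists. The function names and thresholds (an average of at least one pending container and at most 10% YARN memory available) are hypothetical starting points, not AWS recommendations. The second helper shows the unit-versus-node arithmetic for an instance fleet, assuming every node carries the same weight.

```python
from statistics import mean


def looks_resource_starved(container_pending, yarn_memory_available_pct,
                           pending_threshold=1.0, memory_pct_threshold=10.0):
    """Return True if the averaged datapoints suggest YARN is short of capacity.

    Thresholds are illustrative assumptions; tune them for your workload.
    """
    if not container_pending or not yarn_memory_available_pct:
        raise ValueError("need at least one datapoint per metric")
    many_pending = mean(container_pending) >= pending_threshold
    little_memory = mean(yarn_memory_available_pct) <= memory_pct_threshold
    return many_pending and little_memory


def nodes_within_unit_limit(max_capacity_units, weight_per_node):
    """Number of equally weighted nodes that fit within a unit-based limit."""
    if weight_per_node <= 0:
        raise ValueError("weight_per_node must be positive")
    return max_capacity_units // weight_per_node


# Made-up datapoints sampled while a slow count() was running.
print(looks_resource_starved([12, 30, 25], [4.0, 2.5, 3.0]))

# A limit of 64 units with a weight of 8 per node allows 8 nodes, not 64.
print(nodes_within_unit_limit(64, 8))
```

If the first check comes back true during the slow run, raising the maximum in Managed Scaling is worth trying; the second helper is a quick way to confirm that the limit you set really allows the number of nodes you expect.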

I hope I was able to address your query. Thanks and have a great day ahead!

Answered a month ago by an AWS Support Engineer
Reviewed a month ago by an Expert
Reviewed a month ago by an AWS Support Engineer
