Unable to load data into Apache Spark in an EMR cluster notebook


I am running an EMR cluster with an attached notebook and using Apache Spark to load and process data, but I have not been able to load data into Spark. Whenever I run spark.read.csv('s3://bucket/pathtofile'), the Spark job just runs and never loads the data. I am using a small CSV file for testing, so it should not take long to load. Testing with boto3, I was able to download the test CSV file to the /tmp folder of the EMR cluster, so S3 access itself appears to work. I have also been unable to load local files into Spark on the EMR cluster; they hang the same way, indefinitely. Any guidance on what I could be doing wrong or where to look would be greatly appreciated.
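For reference, what I am running looks roughly like this (the bucket and key names are placeholders, not my real ones):

```python
import boto3
from pyspark.sql import SparkSession

# The Spark session is normally provided by the EMR notebook kernel;
# shown here only for completeness.
spark = SparkSession.builder.appName("load-test").getOrCreate()

# This call hangs indefinitely, even for a small test file.
df = spark.read.csv("s3://my-test-bucket/path/to/test.csv", header=True)

# By contrast, downloading the same object with boto3 works fine.
s3 = boto3.client("s3")
s3.download_file("my-test-bucket", "path/to/test.csv", "/tmp/test.csv")
```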

Asked 25 days ago · 325 views
1 Answer

Hello,

Could you share whether any issue is reported after running the sample word-count step below, which uploads its output to an S3 bucket?

  1. Please check whether the Spark application was created successfully.
  2. Check whether the service role/execution role has permission to read and write data in S3.
  3. Please share the steps you followed to reproduce the outcome. (P.S. Please make sure not to include any sensitive information such as the cluster ID, bucket name, or anything else specific to your account.)

https://github.com/aws-samples/emr-studio-notebook-examples/blob/main/examples/word-count-with-pyspark.ipynb
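For reference, the core of that notebook boils down to roughly the sketch below; the output bucket path is a placeholder that you would replace with a bucket your execution role can write to:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("word-count").getOrCreate()

# Build a tiny in-memory DataFrame; if even this step hangs, the problem
# is the Spark application itself, not S3 access.
df = spark.createDataFrame([("hello world",), ("hello emr",)], ["line"])

counts = (
    df.select(explode(split(col("line"), " ")).alias("word"))
      .groupBy("word")
      .count()
)

# Writing the result requires the cluster's execution role to have
# s3:PutObject on the target bucket.
counts.write.mode("overwrite").csv("s3://your-output-bucket/word-count/")
```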

AWS
Support Engineer
Answered 25 days ago
  • Hello, thank you for your response. I tried the sample word-count notebook you suggested, but it could not complete even the first step of creating the DataFrame. I was using m4.xlarge instances for my cluster and figured I would try scaling up to m5.xlarge just to see whether that would fix anything, and it appears that it did. I am not sure why Spark was not working properly on the m4.xlarge; it seemed to hang on every Spark step and would not complete even given half an hour for a simple task like creating that small DataFrame.

  • Thanks for confirming that it worked on the larger instance type. Two things come to mind: 1. The Spark job might not have been using the full resources available on the cluster. 2. The Spark job may need more resources than the smaller instance type provides. To confirm, please use CloudWatch metrics [1] to see whether the cluster's resources were fully utilized, and also analyze the instance-state logs, which show each instance's resource consumption at the time of execution [2]. A sample CloudWatch query is sketched after the references below.

    References:

    [1] - https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html

    [2] - https://repost.aws/articles/AR77wVn54aSQSjLzJGTQsKEQ/decoding-instance-state-log-in-emr
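As an illustration, a short boto3 snippet like the sketch below can pull the cluster's YARN memory headroom from CloudWatch; the cluster ID and region are placeholders you would replace with your own:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# "j-XXXXXXXXXXXXX" is a placeholder EMR cluster ID.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ElasticMapReduce",
    MetricName="YARNMemoryAvailablePercentage",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,  # EMR publishes metrics at 5-minute granularity
    Statistics=["Average"],
)

# Values near 0% mean YARN memory was exhausted during the job.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average']:.1f}% YARN memory free")
```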
