Unable to load data into Apache Spark in EMR cluster notebook


I am running an EMR cluster with an attached notebook and using Apache Spark to load and process data, but I have not been able to load any data into Spark. Whenever I run spark.read.csv('s3://bucket/pathtofile'), the Spark job just keeps running and never loads the data. I am using a small CSV file for testing, so it should not take long. Using boto3, I have been able to download the test CSV file to the /tmp folder on the EMR cluster, so access to the bucket itself seems to work. I have also been unable to load local files into Spark on the EMR cluster, with the same indefinite-loading behavior. Any guidance on what I could be doing wrong, or where to look, would be greatly appreciated.
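
For context, here is a minimal version of what I am running (the bucket name and key below are placeholders for my real path, and the session builder is included only to make the snippet self-contained; the EMR notebook already provides a `spark` session):

    from pyspark.sql import SparkSession

    # EMR notebooks normally provide a ready-made `spark` session;
    # it is built explicitly here only so the snippet is self-contained.
    spark = SparkSession.builder.appName("csv-load-test").getOrCreate()

    # Placeholder path; my real call points at a small test CSV in my bucket.
    df = spark.read.csv("s3://my-test-bucket/path/to/test.csv", header=True)

    # Either of these actions hangs indefinitely instead of returning.
    df.show(5)
    print(df.count())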

asked 12 days ago · 128 views
1 Answer

Hello,

Could you share whether any issue is reported after running the sample word-count step below, which uploads its output to an S3 bucket?

  1. Check whether the Spark application was created successfully.
  2. Check whether the service role/execution role has permission to read and write data in S3 (see the sketch after the notebook link below).
  3. Please share the steps you followed to produce the outcome. (Please make sure not to include any sensitive information such as the cluster ID, bucket name, or anything else specific to your account.)

https://github.com/aws-samples/emr-studio-notebook-examples/blob/main/examples/word-count-with-pyspark.ipynb
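
As a quick sanity check for point 2, something like the following can be run from the notebook to confirm that the role attached to the cluster can read from and write to the bucket (a rough sketch; the bucket and key names are placeholders, and the test object written by the last call can be deleted afterwards):

    import boto3

    # Placeholder bucket/key values; substitute your own test object.
    bucket = "my-test-bucket"
    key = "path/to/test.csv"

    s3 = boto3.client("s3")

    # Read check: HeadObject raises a ClientError (403) if the role
    # attached to the cluster lacks s3:GetObject on this object.
    s3.head_object(Bucket=bucket, Key=key)

    # Write check: PutObject raises AccessDenied if s3:PutObject is missing.
    s3.put_object(Bucket=bucket, Key="emr-permission-check/ok.txt", Body=b"ok")

    print("read and write checks passed")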

AWS Support Engineer · answered 12 days ago
  • Hello, thank you for your response. I tried the sample word-count notebook you suggested, but it could not complete the first step of creating the dataframe. I was using m4.xlarge instances for my cluster, and figured I would try scaling up to m5.xlarge just to see if that fixed anything, and it appears that it did. I am not sure why Spark was not working properly on m4.xlarge; it seemed to hang on every Spark job step and would not complete even after half an hour on a simple task like creating that small dataframe.

  • Thanks for confirming that it worked on the larger instance type. There are two things this could relate to: 1. The Spark job might not have used the full resources available on the cluster. 2. The Spark job may need more resources than the smaller instance type provides. To confirm which, please leverage the CloudWatch metrics [1] to see whether the cluster resources were fully utilized, and also analyze the instance-state logs, which give an idea of each instance's resource consumption at the time of execution [2]. For the first point, see the sketch after the references below.

    References:

    [1] - https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html

    [2] - https://repost.aws/articles/AR77wVn54aSQSjLzJGTQsKEQ/decoding-instance-state-log-in-emr
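
    For the first point, a quick way to see what resources the Spark application actually requested is to print its effective configuration from the notebook (a minimal sketch, assuming the notebook's pre-attached `spark` session; the property names are standard Spark settings, and "not set" means Spark fell back to its defaults):

        # Print the resource settings the Spark application actually picked up.
        # These are standard Spark properties; the values depend on the cluster.
        for name in ("spark.executor.memory",
                     "spark.executor.cores",
                     "spark.executor.instances",
                     "spark.dynamicAllocation.enabled"):
            print(name, "=", spark.conf.get(name, "not set"))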
