Unable to load data into Apache Spark in an EMR cluster notebook


I am running an EMR cluster with an attached notebook and using Apache Spark to load and process data, but I have not been able to load data into Spark. Whenever I run spark.read.csv('s3://bucket/pathtofile'), the Spark job just runs and never loads the data. I am using a small CSV file for testing, so it should not take long to load. Testing with boto3, I was able to download the test CSV file to the /tmp folder of the EMR cluster, so S3 access itself appears to work. I have also been unable to load local files into Spark on the EMR cluster; they hang the same way, indefinitely. Any guidance on what I could be doing wrong or where to look would be greatly appreciated.
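For reference, what I am running looks roughly like this (the bucket and key names are placeholders, not my real ones):

```python
import boto3
from pyspark.sql import SparkSession

# The Spark session is normally provided by the EMR notebook kernel;
# shown here only for completeness.
spark = SparkSession.builder.appName("load-test").getOrCreate()

# This call hangs indefinitely, even for a small test file.
df = spark.read.csv("s3://my-test-bucket/path/to/test.csv", header=True)

# By contrast, downloading the same object with boto3 works fine.
s3 = boto3.client("s3")
s3.download_file("my-test-bucket", "path/to/test.csv", "/tmp/test.csv")
```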

Asked 25 days ago · 325 views
1 Answer

Hello,

Could you share whether any issue is reported after running the sample word-count step below, which uploads its output to an S3 bucket?

  1. Please check whether the Spark application was created successfully.
  2. Check whether the service role/execution role has permission to read and write data in S3.
  3. Please share the steps you followed to reproduce the outcome. (P.S. Please make sure not to include any sensitive information such as the cluster ID, bucket name, or anything else specific to your account.)

https://github.com/aws-samples/emr-studio-notebook-examples/blob/main/examples/word-count-with-pyspark.ipynb
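For reference, the core of that notebook boils down to roughly the sketch below; the output bucket path is a placeholder that you would replace with a bucket your execution role can write to:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("word-count").getOrCreate()

# Build a tiny in-memory DataFrame; if even this step hangs, the problem
# is the Spark application itself, not S3 access.
df = spark.createDataFrame([("hello world",), ("hello emr",)], ["line"])

counts = (
    df.select(explode(split(col("line"), " ")).alias("word"))
      .groupBy("word")
      .count()
)

# Writing the result requires the cluster's execution role to have
# s3:PutObject on the target bucket.
counts.write.mode("overwrite").csv("s3://your-output-bucket/word-count/")
```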

AWS
Support Engineer
Answered 25 days ago
  • Hello, thank you for your response. I tried the sample word-count notebook you suggested, but it could not complete even the first step of creating the DataFrame. I was using m4.xlarge instances for my cluster and figured I would try scaling up to m5.xlarge just to see whether that would fix anything, and it appears that it did. I am not sure why Spark was not working properly on the m4.xlarge; it seemed to hang on every Spark step and would not complete even given half an hour for a simple task like creating that small DataFrame.

  • Thanks for confirming that it worked on the larger instance type. Two things come to mind: 1. The Spark job might not have been using the full resources available on the cluster. 2. The Spark job may need more resources than the smaller instance type provides. To confirm, please use CloudWatch metrics [1] to see whether the cluster's resources were fully utilized, and also analyze the instance-state logs, which show each instance's resource consumption at the time of execution [2]. A sample CloudWatch query is sketched after the references below.

    References:

    [1] - https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html

    [2] - https://repost.aws/articles/AR77wVn54aSQSjLzJGTQsKEQ/decoding-instance-state-log-in-emr
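As an illustration, a short boto3 snippet like the sketch below can pull the cluster's YARN memory headroom from CloudWatch; the cluster ID and region are placeholders you would replace with your own:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# "j-XXXXXXXXXXXXX" is a placeholder EMR cluster ID.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ElasticMapReduce",
    MetricName="YARNMemoryAvailablePercentage",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,  # EMR publishes metrics at 5-minute granularity
    Statistics=["Average"],
)

# Values near 0% mean YARN memory was exhausted during the job.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average']:.1f}% YARN memory free")
```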
