Questions tagged with Amazon EMR

Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto.


317 results
I am trying to use the AWS Glue Data Catalog as the Hive metastore. I stood up an EMR cluster (emr-6.15.0) with the following node classification config per AWS, and it always initializes a default Glue catalog databas...
Accepted Answer. Tags: Amazon EMR, AWS Glue
1 answer, 0 votes, 840 views, asked a year ago
I set up manual termination using the RunJobFlow operator (https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html) with `"KeepJobFlowAliveWhenNoSteps": True`. However, the cluster clust...
1 answer, 0 votes, 259 views, asked a year ago
I would like to know the log4j configuration needed to get container logs into a more structured format such as JSON, so I can use other automation to parse the files and train some customization to filter...
2 answers, 0 votes, 837 views, asked a year ago
Hello, I have upgraded EMR from 6.14 to 6.15 and started seeing errors on the existing core node: `org.apache.hadoop.fs.s3a.auth.NoAwsCredentialsException: IAMInstanceCredentialsProvider: Faile...
1 answer, 0 votes, 426 views, asked a year ago
I am trying to connect to my DocumentDB through the spark-mongodb connector, but it looks like DocumentDB does not support collStats. How do I disable the collStats command so I can do my transformations w...
1 answer, 0 votes, 774 views, asked a year ago
How do I add an additional library, such as Databricks spark-xml, to a running EMR cluster and access it in a notebook?
1 answer, 0 votes, 367 views, asked a year ago
I am using emr-6.12.0 and trying to set environment variables, which are stored in Secrets Manager, in a bootstrap.sh file. ``` SECRET_NAME="/myapp/dev/secrets" SECRETS_JSON=$(aws secretsmanager get-secret...
1 answer, 0 votes, 621 views, asked a year ago
I want my EMR cluster to be terminated automatically after an idle period. I have configured 'Automatically terminate cluster after idle time' and set the idle time to '5 minutes'. In my cluster I have ...
1 answer, 0 votes, 481 views, asked a year ago
My environment relies heavily on Apache Hudi integrated with EMR and Lake Formation, and I have found that a Hudi environment is not very friendly to use from Redshift or Athena. There are many advanced feat...
2 answers, 0 votes, 701 views, asked a year ago
My customer is using Amazon EMR and is storing all the Hive metadata on an external RDS instance running MySQL 5.7.*. Since MySQL 5.7 is reaching the end of its lifecycle, we are pushing them to upgrade t...
Accepted Answer. Tags: Amazon EMR, MySQL
1 answer, 1 vote, 492 views, AWS, asked a year ago
Every day a new EMR cluster is spun up and then terminated after completing the step job. Checking CloudTrail, it seems a Data Pipeline created it. I am not sure how to get more details, such as who created it, what...
2 answers, 1 vote, 401 views, asked a year ago
I want to save my PySpark DataFrame in RecordIO protobuf format. I am using Amazon EMR to run my PySpark scripts, and I want to use Amazon SageMaker to train a machine learning model. SageMaker pipe mode...
1 answer, 0 votes, 342 views, asked a year ago