Questions tagged with Amazon EMR
Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto.
I am trying to use the Glue Data Catalog as the Hive metastore. I stood up an EMR cluster (emr-6.15.0) with the following node classification config per AWS, and it always initializes a default Glue catalog databas...
So I define manually finishing using the RunJobFlow operator (https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html) `"KeepJobFlowAliveWhenNoSteps": True`. However, the cluster clust...
I would like to know the log4j configuration to get container logs into a more structured format like JSON, so I can leverage another automation to parse the files and train some customization to filter...
Hello,
I have upgraded the EMR from 6.14 to 6.15, and started seeing errors on the existing core node:
`org.apache.hadoop.fs.s3a.auth.NoAwsCredentialsException: IAMInstanceCredentialsProvider: Faile...
I am trying to connect to my DocumentDB through the spark-mongodb connector, but it looks like DocumentDB does not support collStats. How do I disable the collStats command so I can do my transformations w...
How can I add an additional library, e.g. Databricks spark-xml, to a running EMR cluster and access it in a notebook?
I am using emr-6.12.0 and trying to set environment variables that are stored in Secrets Manager from the bootstrap.sh file.
```
SECRET_NAME="/myapp/dev/secrets"
SECRETS_JSON=$(aws secretsmanager get-secret...
I want my EMR cluster to be terminated automatically after an idle period.
I have configured 'Automatically terminate cluster after idle time' and set the idle time to '5 minutes'.
In my cluster I have ...
My environment relies heavily on Apache Hudi integrated with EMR and Lake Formation, and I have found that a Hudi environment is not very friendly to use from Redshift or Athena. There are many advanced feat...
My customer is using AWS EMR and is storing all the Hive metadata on an external RDS instance running MySQL 5.7.*. Since MySQL 5.7 is reaching the end of its lifecycle, we are pushing them to upgrade t...
Every day a new EMR cluster is spun up and terminated after completing the step job. Checking CloudTrail, it seems a Data Pipeline created it. I am not sure how to get more details like who created, what...
I want to save my pyspark dataframe in RecordIO protobuf format. I am using Amazon EMR to run my pyspark scripts, and I want to use AWS SageMaker to train a machine learning model. SageMaker pipe mode...