Questions tagged with Amazon Elastic MapReduce


Is there a way to utilize EMR Serverless to run S3DistCp? Looking at the base Docker images, I can see that the `s3-dist-cp` command is included in the Hive image. How can I submit a job run that runs it? Is this even supported - or planned to be supported in the future? Thanks
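For reference, this is the general shape of an EMR Serverless job submission via the boto3 `emr-serverless` client, using the Hive job driver the question mentions. Whether `s3-dist-cp` can be invoked through any of these drivers is exactly the open question; the application ID, role ARN, and S3 paths below are placeholders.

```python
# Shape of an EMR Serverless StartJobRun request (boto3 "emr-serverless"
# client), shown for reference only: there is no documented job driver
# that invokes s3-dist-cp directly. All IDs, ARNs, and S3 paths are
# placeholders.
request = {
    "applicationId": "00abc123example",
    "executionRoleArn": "arn:aws:iam::111122223333:role/EMRServerlessJobRole",
    "jobDriver": {
        "hive": {
            "query": "s3://my-bucket/queries/example.sql",
            "parameters": "--hiveconf hive.exec.scratchdir=s3://my-bucket/scratch/",
        }
    },
}

# Actual submission (requires AWS credentials):
#   import boto3
#   client = boto3.client("emr-serverless")
#   response = client.start_job_run(**request)
```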
1 answer · 0 votes · 21 views · nikos64 · asked 2 days ago
Hi, one point I don't understand about EMR Notebooks: this tool is mainly aimed at developers whom we don't want to allow to connect to the AWS console. How can we provide EMR Notebooks without AWS console access? This is easy with a Jupyter instance managed inside EMR, but that offers fewer possibilities. Alternatively, how can we configure a very restricted AWS console that only allows starting an EMR notebook and opening the JupyterLab it provides?
1 answer · 0 votes · 27 views · asked 8 days ago
When creating an EMR cluster, whether from Airflow or manually from the EMR console, it stays in the Starting state, and after approximately an hour the cluster terminates with errors. The only detail it shows is an internal error; no further information is given. The instances are never created either, remaining in the Provisioning state.
1 answer · 0 votes · 27 views · asked 17 days ago
Trying to use Hue, hosted on the EMR master node as a web interface, to issue Hive QL. The file browser works fine: I can explore S3 files with no problem (which probably doesn't involve the core nodes). But any attempt to use Hive QL to create tables (which probably does involve the core nodes) results in a remote procedure call error: "java.net.NoRouteToHostException No Route to Host from ip-xxx-xx-xx-xxx.us-west-1.compute.internal/172.31.29.217 to ip-yyy-yy-yy-yyy.us-west-1.compute.internal:8020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host". But according to the EMR service port listing (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-service-ports.html), 8020 is used by the NameNode for RPC and is started automatically; users should not open the port for access, as that violates security. So what can I do to fix this error?
2 answers · 0 votes · 24 views · asked 19 days ago
Hi all, I am creating an EMR cluster programmatically, calling start_notebook_execution, and attaching that cluster to the notebook. This works fine when I do it manually, but programmatically start_notebook_execution fails with the response below. I am unable to locate the logs for this to debug further. Please help.

```
'StartTime': datetime.datetime(2023, 1, 8, 21, 19, 21, 610000, tzinfo=tzlocal()), 'Arn': 'arn:aws:elasticmapreduce:us-east-1:account-number:notebook-execution/ex-J05JCMWBMEEUD1AE2C3MIB3CBSG0E', 'LastStateChangeReason': 'Execution has failed for cluster j-1ZYESBPM8DZMZ. Internal error',
```
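For comparison, a minimal sketch of the boto3 `start_notebook_execution` parameters, assuming an existing notebook (editor) and an already-running cluster. The editor ID and role name are placeholders; the cluster ID is taken from the error above. A frequent cause of this kind of "Internal error" is a cluster missing the JupyterEnterpriseGateway application or a service role lacking permissions, though that is an assumption worth checking rather than a confirmed diagnosis.

```python
# Sketch of StartNotebookExecution parameters (boto3 "emr" client).
# EditorId and ServiceRole are placeholders; the cluster ID is the one
# from the failing response above.
params = {
    "EditorId": "e-EXAMPLE1234567890",
    "RelativePath": "my_notebook.ipynb",
    "ExecutionEngine": {
        "Id": "j-1ZYESBPM8DZMZ",
        "Type": "EMR",
    },
    "ServiceRole": "EMR_Notebooks_DefaultRole",
}

# Actual call (requires AWS credentials):
#   import boto3
#   client = boto3.client("emr")
#   response = client.start_notebook_execution(**params)
```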
0 answers · 0 votes · 9 views · asked 20 days ago
Hello, by default Glue runs one executor per worker, and I want to run more executors per worker. I set the following Spark configuration in the Glue job parameters, but it didn't work: `--conf : spark.executor.instances=10`. Let's say I have 5 G.2X workers. In that case it starts 4 executors, because 1 worker is reserved for the driver, and I can see all 4 executors in the Spark UI. But the configuration above does not increase the executor count at all. I'm getting the following warning in the driver logs; it seems glue.ExecutorTaskManagement is controlling the number of executors: `WARN [allocator] glue.ExecutorTaskManagement (Logging.scala:logWarning(69)): executor task creation failed for executor 5, restarting within 15 secs. restart reason: Executor task resource limit has been temporarily hit` Any help would be appreciated. Thanks!
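A sketch of the alternative lever, under the assumption (consistent with the warning above) that Glue's executor task manager derives the executor count from the worker type and worker count and overrides `spark.executor.instances`. The job name, role ARN, and script path are placeholders.

```python
# In Glue, executor count follows from WorkerType and NumberOfWorkers;
# the glue.ExecutorTaskManagement warning suggests the service overrides
# spark.executor.instances. Sketch of scaling via worker count instead.
# Role, script location, and job name are placeholders.
job_update = {
    "Role": "arn:aws:iam::111122223333:role/GlueJobRole",
    "Command": {"Name": "glueetl", "ScriptLocation": "s3://my-bucket/script.py"},
    "WorkerType": "G.2X",
    "NumberOfWorkers": 11,   # roughly 10 executors: one worker hosts the driver
}

# Actual call (requires AWS credentials):
#   import boto3
#   client = boto3.client("glue")
#   client.update_job(JobName="my-glue-job", JobUpdate=job_update)
```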
1 answer · 0 votes · 18 views · asked a month ago
I would like to get data from an Iceberg table using AWS Lambda. I was able to create all the code and containers, only to discover that AWS Lambda doesn't allow the process substitution that Spark uses here: https://github.com/apache/spark/blob/121f9338cefbb1c800fabfea5152899a58176b00/bin/spark-class#L92 The error is: `/usr/local/lib/python3.10/dist-packages/pyspark/bin/spark-class: line 92: /dev/fd/63: No such file or directory` Do you have any idea how this can be solved?
2 answers · 0 votes · 35 views · asked a month ago
Folks: I am running some code that uses a mix of PySpark (for data manipulation) and Python (for visualization), very similar to this blog: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/. The cluster I am using has all of the defaults:

```
Release label: emr-5.36.0
Hadoop distribution: Amazon 2.10.1
Applications: Spark 2.4.8, Livy 0.7.1, Hive 2.3.9, JupyterEnterpriseGateway 2.1.0
```

The command `sc.install_pypi_package("pandas")` seems to work successfully. However, the command `sc.install_pypi_package("matplotlib")` fails with an error on the Pillow dependency. The specific error is:

```
Building wheels for collected packages: unknown, unknown
Running setup.py bdist_wheel for unknown: started
Running setup.py bdist_wheel for unknown: finished with status 'error'
Complete output from command /tmp/1669916958616-0/bin/python -u -c "import setuptools, tokenize;__file__='/mnt/tmp/pip-build-e28wksxd/pillow/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmprh77de26pip-wheel- --python-tag cp37:
running bdist_wheel
running build
running build_ext
The headers or library files could not be found for jpeg, a required dependency when compiling Pillow from source.
```

I logged into the master node on the EMR cluster and attempted to install some of the libraries and Python compiler support using:

```
sudo yum install python3-devel redhat-rpm-config libtiff-devel libjpeg-devel openjpeg2-devel zlib-devel \
    freetype-devel lcms2-devel libwebp-devel tcl-devel tk-devel \
    harfbuzz-devel fribidi-devel libraqm-devel libimagequant-devel libxcb-devel
```

While the installation of Pillow gets a bit further, there are still errors such as:

```
Collecting pillow
Using cached https://files.pythonhosted.org/packages/16/11/da8d395299ca166aa56d9436e26fe8440e5443471de16ccd9a1d06f5993a/Pillow-9.3.0.tar.gz
Building wheels for collected packages: unknown, unknown
Running setup.py bdist_wheel for unknown: started
Running setup.py bdist_wheel for unknown: finished with status 'done'
Stored in directory: /var/lib/livy/.cache/pip/wheels/55/5a/ad/9f708fd6d1500e9ff680e17b1c2f436e8439477a5a226611c6
Running setup.py bdist_wheel for unknown: started
Running setup.py bdist_wheel for unknown: finished with status 'error'
Complete output from command /tmp/1669918064446-0/bin/python -u -c "import setuptools, tokenize;__file__='/mnt/tmp/pip-build-jnpkka_0/unknown/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/tmpn9k3ctzopip-wheel- --python-tag cp37:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/1669918064446-0/lib64/python3.7/tokenize.py", line 447, in open
    buffer = _builtin_open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/tmp/pip-build-jnpkka_0/unknown/setup.py'
```

I feel as if I am missing something very obvious. It cannot possibly be this difficult to get a commonly used package like matplotlib to work. Any suggestions? Thanks, Rich H.
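One commonly suggested direction (an assumption, not a confirmed fix): `sc.install_pypi_package` runs on the cluster nodes, so installing the build headers only on the master node may not be enough; a bootstrap action installs them on every node before any notebook session starts. The sketch below just writes such a script to a local file; the package names assume Amazon Linux, and the script would be uploaded to S3 and attached as a bootstrap action when the cluster is created.

```shell
# Write a bootstrap-action sketch to a local file. Package names assume
# Amazon Linux; adjust for your EMR release. Upload to S3 and attach as
# a bootstrap action so every node (not just the master) gets the
# headers Pillow needs when compiling from source.
cat > install_pillow_deps.sh <<'EOF'
#!/bin/bash
sudo yum install -y python3-devel libjpeg-devel zlib-devel freetype-devel
EOF
chmod +x install_pillow_deps.sh
```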
0 answers · 0 votes · 40 views · asked 2 months ago
HIVE_UNSUPPORTED_FORMAT: Output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat with SerDe org.openx.data.jsonserde.JsonSerDe is not supported. If a data manifest file was generated at 's3://athena-one-output-bucket/Unsaved/2022/11/24/0a5467bf-8b9a-4119-bc89-c891d1e26744-manifest.csv', you may need to manually clean the data from locations specified in the manifest. Athena will not delete data in your account. This query ran against the "covid_dataset" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 0a5467bf-8b9a-4119-bc89-c891d1e26744
0 answers · 0 votes · 40 views · asked 2 months ago
I am working on enabling automatic patching for one of our EMR clusters. I understand that with Amazon EMR release 6.6 and later, when you launch new EMR clusters with the default Amazon Linux (AL) AMI option, EMR automatically uses the latest Amazon Linux AMI. So when I create a cluster with EMR release 6.8, it will use the latest Amazon Linux AMI at boot time, but what is the solution if we have a long-lived cluster?
1 answer · 0 votes · 33 views · AWS · asked 2 months ago
Hi, I need to install Go packages that interact with my Spark script. Is it possible to do such a thing?
2 answers · 0 votes · 131 views · mgtdi · asked 2 months ago
Trying to share data between two Spark jobs in an EMR Serverless application using temp or global temp views, without having to write to S3 and then read it back. It doesn't seem to work. What is the recommended approach?
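This matches how Spark scopes views: temp and global temp views live inside a single Spark session/application, so two separate job runs cannot see each other's views. The usual hand-off is through storage; a sketch with a placeholder bucket path:

```python
# Temp/global temp views are scoped to one Spark application, so
# separate EMR Serverless job runs cannot share them. The common
# hand-off is external storage. The bucket path is a placeholder.
shared_path = "s3://my-bucket/shared/dataset/"

# Producer job would end with:
#   df.write.mode("overwrite").parquet(shared_path)
# Consumer job would start with:
#   df = spark.read.parquet(shared_path)
```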
0 answers · 0 votes · 30 views · syd · asked 3 months ago