I'm trying to create an isolated Python virtual environment to package the Python libraries needed for a PySpark job.
I was able to make it work by simply following these steps: https://github.com/aws-samples/emr-serverless-samples/tree/main/examples/pyspark/dependencies
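For reference, that guide attaches the packed virtual environment to the job as a Spark archive and points the driver and executors at its interpreter. The submit parameters look roughly like this (the bucket and archive names are placeholders):

```
--conf spark.archives=s3://my-bucket/artifacts/pyspark_ge.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
```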
However, there is one Python library dependency (a SWIG-based extension) that fails to install because it requires additional packages to be present, such as gcc, gcc-c++, and python3-devel.
LIB: https://github.com/51Degrees/Device-Detection/tree/master/python
So I added RUN yum install -y gcc gcc-c++ python3-devel to the Dockerfile image (https://github.com/aws-samples/emr-serverless-samples/blob/main/examples/pyspark/dependencies/Dockerfile); the library then installed successfully, and I packaged the virtual env.
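For context, a minimal sketch of the modified Dockerfile, assuming the sample's Amazon Linux 2 base image and venv-pack packaging step (the pip package names and output archive name are illustrative, not confirmed from my actual file):

```dockerfile
FROM --platform=linux/amd64 amazonlinux:2 AS base

# Added line: compilers and Python headers the SWIG extension needs at build time
RUN yum install -y python3 gcc gcc-c++ python3-devel

# Build an isolated virtual environment (Amazon Linux 2 ships Python 3.7,
# matching the interpreter paths in the stderr below)
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Install the job dependencies plus venv-pack, then archive the environment
RUN python3 -m pip install --upgrade pip venv-pack \
    51degrees-mobile-detector-v3-wrapper
RUN mkdir /output && venv-pack -o /output/pyspark_ge.tar.gz

# Export just the packed archive from the build
FROM scratch AS export
COPY --from=base /output/pyspark_ge.tar.gz /
```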
However, the EMR job fails with that library's Python modules not being found, which makes me think python3-devel is not present in EMR Serverless 6.6.0.
Since I don't have control over the serverless environment, is there any way around this? Or am I missing something?
stderr
An error occurred while calling o198.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in stage 0.0 failed 4 times, most recent failure: Lost task 19.3 in stage 0.0 (TID 89) ([2600:1f18:153d:6601:bfcc:6ff:50bc:240e] executor 7): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 15, in swig_import_helper
return importlib.import_module(mname)
File "/usr/lib64/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'FiftyOneDegrees._fiftyone_degrees_mobile_detector_v3_wrapper'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 619, in main
process()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 611, in process
serializer.dump_stream(out_iter, outfile)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "./jobs.zip/jobs/parsed_events_orc_processor/etl.py", line 360, in enrich_events
event['device'] = calculate_device_data(event)
File "./jobs.zip/jobs/parsed_events_orc_processor/etl.py", line 152, in calculate_device_data
device_data = mobile_detector.match(user_agent)
File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 225, in match
else settings.DETECTION_METHOD)
File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 63, in instance
cls._INSTANCES[method] = cls._METHODS[method]()
File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 98, in __init__
from FiftyOneDegrees import fiftyone_degrees_mobile_detector_v3_wrapper
File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 18, in <module>
_fiftyone_degrees_mobile_detector_v3_wrapper = swig_import_helper()
File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 17, in swig_import_helper
return importlib.import_module('_fiftyone_degrees_mobile_detector_v3_wrapper')
File "/usr/lib64/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named '_fiftyone_degrees_mobile_detector_v3_wrapper'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:545)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:703)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:685)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:498)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:954)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:142)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:133)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2559)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2508)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2507)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2507)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1149)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1149)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1149)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2747)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2689)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2678)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.checkNoFailures(AdaptiveExecutor.scala:154)
at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.doRun(AdaptiveExecutor.scala:88)
at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.tryRunningAndGetFuture(AdaptiveExecutor.scala:66)
at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.execute(AdaptiveExecutor.scala:57)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:241)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:240)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:509)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:471)
at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3053)
at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3052)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3770)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3768)
at org.apache.spark.sql.Dataset.count(Dataset.scala:3052)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 15, in swig_import_helper
return importlib.import_module(mname)
File "/usr/lib64/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'FiftyOneDegrees._fiftyone_degrees_mobile_detector_v3_wrapper'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 619, in main
process()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 611, in process
serializer.dump_stream(out_iter, outfile)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "./jobs.zip/jobs/parsed_events_orc_processor/etl.py", line 360, in enrich_events
event['device'] = calculate_device_data(event)
File "./jobs.zip/jobs/parsed_events_orc_processor/etl.py", line 152, in calculate_device_data
device_data = mobile_detector.match(user_agent)
File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 225, in match
else settings.DETECTION_METHOD)
File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 63, in instance
cls._INSTANCES[method] = cls._METHODS[method]()
File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 98, in __init__
from FiftyOneDegrees import fiftyone_degrees_mobile_detector_v3_wrapper
File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 18, in <module>
_fiftyone_degrees_mobile_detector_v3_wrapper = swig_import_helper()
File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 17, in swig_import_helper
return importlib.import_module('_fiftyone_degrees_mobile_detector_v3_wrapper')
File "/usr/lib64/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named '_fiftyone_degrees_mobile_detector_v3_wrapper'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:545)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:703)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:685)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:498)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution
Could you post the full error message and the script that you're using as your job?
I tried this but got an error message about being unable to load the database file.
Just added the stderr output; let me know if that helps. Thanks!
Can you confirm your virtualenv tar file has the module in it?
This command (change pyspark_ge.tar.gz if you renamed it in the Dockerfile) should show the module in its output.
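The exact command and sample output weren't preserved in this thread, but a check along these lines would confirm whether the compiled SWIG extension actually made it into the archive:

```sh
# List the packed environment's contents and look for the 51Degrees wrapper;
# the native module should appear as something like
# lib64/python3.7/site-packages/FiftyOneDegrees/_fiftyone_degrees_mobile_detector_v3_wrapper.so
tar -tzf pyspark_ge.tar.gz | grep -i fiftyone
```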
I was able to get it to work with a simple script, which I'll post below.