I'm trying to create an isolated Python virtual environment to package the Python libraries needed for a PySpark job.
I was able to make it work by simply following these steps: https://github.com/aws-samples/emr-serverless-samples/tree/main/examples/pyspark/dependencies
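For reference, that guide attaches the packed virtual environment to the job as a Spark archive and points the driver and executors at its interpreter. The submit parameters look roughly like this (the bucket and archive names are placeholders):

```
--conf spark.archives=s3://my-bucket/artifacts/pyspark_ge.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
```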
However, there is one Python library dependency (a SWIG-based extension) that fails to install because it requires additional packages to be present, such as gcc, gcc-c++, and python3-devel.
LIB: https://github.com/51Degrees/Device-Detection/tree/master/python
So I added RUN yum install -y gcc gcc-c++ python3-devel to the Dockerfile image (https://github.com/aws-samples/emr-serverless-samples/blob/main/examples/pyspark/dependencies/Dockerfile); the library then installed successfully, and I packaged the virtual env.
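For context, a minimal sketch of the modified Dockerfile, assuming the sample's Amazon Linux 2 base image and venv-pack packaging step (the pip package names and output archive name are illustrative, not confirmed from my actual file):

```dockerfile
FROM --platform=linux/amd64 amazonlinux:2 AS base

# Added line: compilers and Python headers the SWIG extension needs at build time
RUN yum install -y python3 gcc gcc-c++ python3-devel

# Build an isolated virtual environment (Amazon Linux 2 ships Python 3.7,
# matching the interpreter paths in the stderr below)
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Install the job dependencies plus venv-pack, then archive the environment
RUN python3 -m pip install --upgrade pip venv-pack \
    51degrees-mobile-detector-v3-wrapper
RUN mkdir /output && venv-pack -o /output/pyspark_ge.tar.gz

# Export just the packed archive from the build
FROM scratch AS export
COPY --from=base /output/pyspark_ge.tar.gz /
```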
However, the EMR job fails with that library's Python modules not being found, which makes me think python3-devel is not present in EMR Serverless 6.6.0.
Since I don't have control over the serverless environment, is there any way around this? Or am I missing something?
stderr
An error occurred while calling o198.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in stage 0.0 failed 4 times, most recent failure: Lost task 19.3 in stage 0.0 (TID 89) ([2600:1f18:153d:6601:bfcc:6ff:50bc:240e] executor 7): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 15, in swig_import_helper
return importlib.import_module(mname)
File "/usr/lib64/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'FiftyOneDegrees._fiftyone_degrees_mobile_detector_v3_wrapper'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 619, in main
process()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 611, in process
serializer.dump_stream(out_iter, outfile)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "./jobs.zip/jobs/parsed_events_orc_processor/etl.py", line 360, in enrich_events
event['device'] = calculate_device_data(event)
File "./jobs.zip/jobs/parsed_events_orc_processor/etl.py", line 152, in calculate_device_data
device_data = mobile_detector.match(user_agent)
File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 225, in match
else settings.DETECTION_METHOD)
File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 63, in instance
cls._INSTANCES[method] = cls._METHODS[method]()
File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 98, in __init__
from FiftyOneDegrees import fiftyone_degrees_mobile_detector_v3_wrapper
File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 18, in <module>
_fiftyone_degrees_mobile_detector_v3_wrapper = swig_import_helper()
File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 17, in swig_import_helper
return importlib.import_module('_fiftyone_degrees_mobile_detector_v3_wrapper')
File "/usr/lib64/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named '_fiftyone_degrees_mobile_detector_v3_wrapper'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:545)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:703)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:685)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:498)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:954)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:142)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:133)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2559)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2508)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2507)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2507)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1149)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1149)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1149)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2747)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2689)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2678)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.checkNoFailures(AdaptiveExecutor.scala:154)
at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.doRun(AdaptiveExecutor.scala:88)
at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.tryRunningAndGetFuture(AdaptiveExecutor.scala:66)
at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.execute(AdaptiveExecutor.scala:57)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:241)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:240)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:509)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:471)
at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3053)
at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3052)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3770)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3768)
at org.apache.spark.sql.Dataset.count(Dataset.scala:3052)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 15, in swig_import_helper
return importlib.import_module(mname)
File "/usr/lib64/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'FiftyOneDegrees._fiftyone_degrees_mobile_detector_v3_wrapper'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 619, in main
process()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 611, in process
serializer.dump_stream(out_iter, outfile)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "./jobs.zip/jobs/parsed_events_orc_processor/etl.py", line 360, in enrich_events
event['device'] = calculate_device_data(event)
File "./jobs.zip/jobs/parsed_events_orc_processor/etl.py", line 152, in calculate_device_data
device_data = mobile_detector.match(user_agent)
File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 225, in match
else settings.DETECTION_METHOD)
File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 63, in instance
cls._INSTANCES[method] = cls._METHODS[method]()
File "/home/hadoop/environment/lib64/python3.7/site-packages/fiftyone_degrees/mobile_detector/__init__.py", line 98, in __init__
from FiftyOneDegrees import fiftyone_degrees_mobile_detector_v3_wrapper
File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 18, in <module>
_fiftyone_degrees_mobile_detector_v3_wrapper = swig_import_helper()
File "/home/hadoop/environment/lib64/python3.7/site-packages/FiftyOneDegrees/fiftyone_degrees_mobile_detector_v3_wrapper.py", line 17, in swig_import_helper
return importlib.import_module('_fiftyone_degrees_mobile_detector_v3_wrapper')
File "/usr/lib64/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named '_fiftyone_degrees_mobile_detector_v3_wrapper'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:545)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:703)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:685)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:498)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution
Could you post the full error message and the script that you're using as your job?
I tried this but got an error message about being unable to load the database file.
Just added the stderr output; let me know if that helps. Thanks!
Can you confirm your virtualenv tar file has the module in it?
This command (change pyspark_ge.tar.gz if you renamed it in the Dockerfile) should show the module in its output.
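The exact command and sample output weren't preserved in this thread, but a check along these lines would confirm whether the compiled SWIG extension actually made it into the archive:

```sh
# List the packed environment's contents and look for the 51Degrees wrapper;
# the native module should appear as something like
# lib64/python3.7/site-packages/FiftyOneDegrees/_fiftyone_degrees_mobile_detector_v3_wrapper.so
tar -tzf pyspark_ge.tar.gz | grep -i fiftyone
```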
I was able to get it to work with a simple script, which I'll post below.