Running a Spark job from AWS Lambda


I would like to read data from an Iceberg table using AWS Lambda. I was able to create all the code and containers, only to discover that AWS Lambda doesn't allow the process substitution that Spark uses here: https://github.com/apache/spark/blob/121f9338cefbb1c800fabfea5152899a58176b00/bin/spark-class#L92

The error is: /usr/local/lib/python3.10/dist-packages/pyspark/bin/spark-class: line 92: /dev/fd/63: No such file or directory

Does anyone have an idea how this can be solved?

asked a year ago · 994 views
2 Answers

Hi,

You can certainly run it. The issue is that you can't initialise the environment that way.

See this post for inspiration: https://plainenglish.io/blog/spark-on-aws-lambda-c65877c0ac96
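
For example, here is a minimal sketch of what the handler could look like once the environment is prepared along the lines of that post (a container image with pyspark, a bundled JRE, the Iceberg runtime JAR, and a writable SPARK_HOME); the catalog name, warehouse bucket, and table below are placeholders for your own setup:

    # Minimal sketch: build the SparkSession directly in local mode inside
    # the Lambda handler. Catalog name, warehouse bucket, and table are
    # placeholders; the image must already have pyspark, a JRE, and the
    # Iceberg runtime JAR installed.
    from pyspark.sql import SparkSession

    def handler(event, context):
        spark = (
            SparkSession.builder
            .master("local[*]")  # single-node Spark inside the Lambda sandbox
            .appName("lambda-iceberg-read")
            .config("spark.sql.extensions",
                    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
            .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.glue.catalog-impl",
                    "org.apache.iceberg.aws.glue.GlueCatalog")
            .config("spark.sql.catalog.glue.warehouse", "s3://my-bucket/warehouse")
            .getOrCreate()
        )
        df = spark.sql("SELECT * FROM glue.my_db.my_table LIMIT 10")
        rows = [row.asDict() for row in df.collect()]
        spark.stop()
        return {"rows": rows}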

If that doesn't work for you, my suggestion is to switch to a Java-based Lambda (Spark and Iceberg both run on the Java Virtual Machine).

Best,

AWS
answered a year ago

Hi,

I'm not sure I understand your use case. If you just need to read the Iceberg table from the Lambda function, a better option might be to invoke a query in Athena using PyAthena or aws-sdk-pandas, which is also provided as a Lambda Layer.
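
As a rough sketch (assuming the aws-sdk-pandas Lambda Layer is attached; the database and table names are placeholders):

    # Rough sketch: query the Iceberg table through Athena with
    # aws-sdk-pandas (awswrangler). Database and table names are placeholders.
    import awswrangler as wr

    def handler(event, context):
        df = wr.athena.read_sql_query(
            sql="SELECT * FROM my_iceberg_table LIMIT 10",
            database="my_database",
        )
        return df.to_dict(orient="records")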

If you are trying to run specific Spark jobs and would like the simplicity of a serverless function, you might want to have a look at AWS Glue.
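
If the Lambda only needs to trigger the Spark work, a sketch like the following can start a Glue job with boto3 (the job name and arguments are placeholders for your own Glue job definition):

    # Sketch: start an existing Glue job from Lambda. "my-spark-job" and
    # the job arguments are placeholders.
    import boto3

    glue = boto3.client("glue")

    def handler(event, context):
        run = glue.start_job_run(
            JobName="my-spark-job",
            Arguments={"--table": "my_db.my_iceberg_table"},
        )
        return {"JobRunId": run["JobRunId"]}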

Hope this helps.

AWS Expert
answered a year ago
