How to handle a long-running process (such as processing a 40 GB video file) with a Lambda function?

0

Hi There :-),

Context:

  1. Two files will be uploaded to S3:
     1.1. an AVI video file of around 40 GB
     1.2. a text file of around 100 MB
  2. Whenever an upload happens, I want to pick up both files and apply some transformation logic (Python code that takes these files as inputs, which I already have).
  3. So I created a Lambda function with the Python 3.10 runtime.

Explanation: Suppose a 10 KB AVI file and a 7 MB text file are uploaded. This triggers the Lambda function, which follows these steps:

  1. Download the AVI file from S3 into the Lambda environment.
  2. Apply the transformation logic.
  3. Store the result in another path in the Lambda execution environment.
  4. The video stored in step 3 is picked up by a second transformation step, which produces a CSV file.

Now, when I upload a 1 GB AVI file, the first transformation step takes more than 15 minutes, so the Lambda execution times out.

FYI: to store the large files (downloaded from S3 as described in step 1 above), I connected the Lambda function to EFS. That EFS file system is also mounted on an EC2 instance.

Questions:

  1. So, how do I run a long-running process using Lambda?
  2. I would like to increase the Lambda function timeout to 5 hours. Can the AWS team increase the timeout to 5 hours for me?
  3. Or are there other ways to achieve this?

Thanks.

2 Answers
1

So, how do I run a long-running process using Lambda?

Lambda cannot run a process for more than 15 minutes,
so you will need to consider other AWS services.
Examples include AWS Batch and Amazon ECS.
AWS Glue, which is designed for ETL-style processing, is another option.
https://docs.aws.amazon.com/batch/latest/userguide/what-is-batch.html
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/Welcome.html
https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html
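
One way to keep the S3 upload trigger while moving the heavy work off Lambda is to have the Lambda function only submit an AWS Batch job and return immediately. The following is a minimal sketch, not a definitive implementation. The queue name, job definition name, and environment variable names are hypothetical placeholders, and it assumes a Batch job definition already exists whose container runs the existing Python transformation code. Pairing the AVI file with its text file is left out.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical names: replace with your own Batch job queue and job definition.
JOB_QUEUE = "video-processing-queue"
JOB_DEFINITION = "video-transform-job"


def build_job_request(bucket: str, key: str) -> dict:
    """Build the keyword arguments for batch.submit_job for one uploaded object."""
    # Batch job names allow only letters, digits, hyphens, and underscores
    # (up to 128 characters), so other characters in the key are replaced.
    safe_name = re.sub(r"[^A-Za-z0-9_-]", "-", key)[:100]
    return {
        "jobName": f"transform-{safe_name}",
        "jobQueue": JOB_QUEUE,
        "jobDefinition": JOB_DEFINITION,
        "containerOverrides": {
            # Hypothetical variable names read by the container's script.
            "environment": [
                {"name": "INPUT_BUCKET", "value": bucket},
                {"name": "INPUT_KEY", "value": key},
            ]
        },
    }


def lambda_handler(event, context):
    """Triggered by the S3 upload; hands the heavy work to AWS Batch and returns."""
    import boto3  # imported here so the helper above can be tried without AWS access

    batch = boto3.client("batch")
    job_ids = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = batch.submit_job(**build_job_request(bucket, key))
        job_ids.append(response["jobId"])
    return {"submitted": job_ids}
```

With this split, the Lambda function finishes in seconds, and the Batch job is not bound by the 15-minute limit.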

I would like to increase the Lambda function timeout to 5 hours. Can the AWS team increase the timeout to 5 hours for me?

Unfortunately, the 15-minute maximum is a hard limit; contacting AWS Support will not get you a Lambda function that runs longer.

Or are there other ways to achieve this?

Please consider services other than Lambda, as explained above.
Since you already have the Python code, AWS Glue may be a good fit in this case.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python.html

EXPERT
answered a year ago
  • Hi Riku Kobayashi, Thank you for your input.

    I need two clarifications regarding custom libraries in Glue.

    Context: I am currently exploring Glue and how well it fits my requirement. The data transformation logic I mentioned earlier needs four Python packages:

    1. ffmpeg-python
    2. pandas
    3. opencv-python
    4. numpy

    Issue with the pandas library:

    1. In the Glue job details -> Advanced properties -> Python library path, there is an info section (click "Info" next to the title).
    2. That info section clearly states: "Only pure Python libraries can be used. Libraries that rely on C extensions, such as pandas (Python data analysis) library, are not yet supported."
    3. However, I need pandas in the data transformation process.

    Question: Could you please advise how to handle this?

    Issue with the ffmpeg-python library: I am unable to import ffmpeg-python in Glue code, even after linking the package from S3.

    Steps I took to link the package to Glue:

    1. As suggested in online articles, I installed ffmpeg-python into a local directory using the "pip3 install --target . ffmpeg-python" command.
    2. Then I zipped this directory and uploaded it to S3.
    3. I pasted the S3 URI of the zipped library into the Python library path section of the Glue configuration.
    4. Then I added "import ffmpeg" to the Python code.

    But I get "ModuleNotFoundError: No module named 'ffmpeg'" when I run the Glue job.

    Question: Are there any steps I missed?

  • I believe AWS Glue 4.0 can use pandas. https://aws.amazon.com/about-aws/whats-new/2022/11/introducing-aws-glue-4-0/?nc1=h_ls
    The import error is puzzling, since you appear to have followed the procedure in this document. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html#aws-glue-programming-python-libraries-zipping
    By the way, what command did you use to zip it?
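
A common cause of this kind of ModuleNotFoundError is the zip layout: if the directory was zipped from its parent, the archive contains an extra top-level folder, and the "ffmpeg" package is no longer at the zip root where Python looks for it. Whether that is what happened here is an assumption. The sketch below is one way to check a zip locally before uploading it; the function name is made up for illustration.

```python
import zipfile


def importable_top_level_packages(zip_path: str) -> set:
    """Return the package names Python could import directly from this zip."""
    packages = set()
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            parts = name.split("/")
            # A package is importable only if <package>/__init__.py sits at the zip root.
            if len(parts) == 2 and parts[1] == "__init__.py":
                packages.add(parts[0])
    return packages
```

If "ffmpeg" is missing from the returned set, re-create the archive from inside the install directory so that the "ffmpeg" folder is at the root. Note also that ffmpeg-python is only a wrapper around the ffmpeg command-line program, so the ffmpeg binary itself must be available in the environment where the code runs.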

0

As stated, Lambda functions can only run for up to 15 minutes. There is no way to extend that.

Your options are:

  1. Use a different service such as ECS Fargate.
  2. Break the processing into multiple smaller chunks (not always possible). For example, split the large file into a few smaller files, so that a Lambda function can process each chunk. You would use Step Functions to orchestrate the entire process.
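
As a rough sketch of option 2, the helper below plans fixed-length time segments for a video. It assumes the video can be processed independently per time segment and that its duration is already known; the function name and the output shape are hypothetical. Each entry could serve as the input for one Lambda invocation, for example fanned out by a Step Functions Map state.

```python
def plan_segments(duration_s: float, segment_s: float) -> list:
    """Split [0, duration_s) into consecutive segments, each described in seconds."""
    if duration_s <= 0 or segment_s <= 0:
        raise ValueError("duration_s and segment_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        # The last segment may be shorter than segment_s.
        length = min(segment_s, duration_s - start)
        segments.append({"start": start, "length": length})
        start += segment_s
    return segments
```

The segment length has to be chosen so that each chunk comfortably finishes within the 15-minute Lambda limit.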
AWS
EXPERT
Uri
answered a year ago
