How to use NLTK files and AWS Lambda

Hi,

I am deploying a Lambda function that uses NLTK to preprocess text. For the application to work, I need to download the stopwords, punkt, and wordnet resources. I deployed the function as a Docker image using the SAM CLI. When the function runs on AWS, I get a series of errors when it tries to access the NLTK data.

The first error said that '/home/sbx_user1051/' could not be written to. After reading answers on Stack Overflow, I learned that I need to store the NLTK data in the /tmp/ directory, because that is the only directory that can be modified.

Now, after redeploying the image with the code changes, the files are stored in /tmp, but the Lambda function does not search that directory when it tries to access the stop words. It still searches only these directories:

  • '/home/sbx_user1051/nltk_data'
  • '/var/lang/nltk_data'
  • '/var/lang/share/nltk_data'
  • '/var/lang/lib/nltk_data'
  • '/usr/share/nltk_data'
  • '/usr/local/share/nltk_data'
  • '/usr/lib/nltk_data'
  • '/usr/local/lib/nltk_data'

What should I do to make the required NLTK data available when running this function on AWS Lambda?

  • Alright, I figured it out. I needed to add the new /tmp/nltk_data directory to nltk.data.path.

    import nltk

    # /tmp is the only writable directory in the Lambda environment
    nltk_data_path = '/tmp/nltk_data'
    
    nltk.download('stopwords', download_dir=nltk_data_path)
    nltk.download('punkt', download_dir=nltk_data_path)
    nltk.download('wordnet', download_dir=nltk_data_path)
    
    # Make NLTK search the new directory when loading resources
    nltk.data.path.append(nltk_data_path)
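The snippet above downloads all three resources every time it runs. A minimal sketch of one way to skip the download when the data is already present in /tmp (for example, on a warm invocation) might look like the following. The function name `ensure_nltk_data` and the `RESOURCES` mapping are hypothetical names introduced for illustration, and the relative paths assume the usual layout of an nltk_data directory.

```python
NLTK_DATA_DIR = "/tmp/nltk_data"  # /tmp is writable in the Lambda environment

# Hypothetical mapping: download name -> path looked up by nltk.data.find.
# The relative paths are an assumption about the standard nltk_data layout.
RESOURCES = {
    "stopwords": "corpora/stopwords",
    "punkt": "tokenizers/punkt",
    "wordnet": "corpora/wordnet",
}


def ensure_nltk_data(data_dir=NLTK_DATA_DIR, resources=RESOURCES):
    """Register data_dir with NLTK and download only the missing resources."""
    import nltk  # imported here so this module loads even where nltk is absent

    # Tell NLTK to search data_dir in addition to its default locations.
    if data_dir not in nltk.data.path:
        nltk.data.path.append(data_dir)

    for name, rel_path in resources.items():
        try:
            # Raises LookupError when the resource is not on nltk.data.path.
            nltk.data.find(rel_path)
        except LookupError:
            nltk.download(name, download_dir=data_dir, quiet=True)
```

Calling `ensure_nltk_data()` once at module level, outside the handler function, would run it during a cold start only. Whether that is preferable to bundling the data with the deployment depends on how much cold-start latency the application can tolerate.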
    
    
1 Answer
Hi,

I encountered the same problem with the same error messages, and I was able to fix the issue by following these steps:

  1. In the directory where your Python application code is located, create a new directory:

    $ mkdir package

  2. Install the libraries your Python code requires into that directory:

    $ pip install --target ./package <library_name>

Example: $ pip install --target ./package nltk

  3. Start the Python interpreter and import nltk:

    $ python3
    >>> import nltk

  4. Download the NLTK resources your application needs:

Example:

    >>> nltk.download("punkt")
    >>> nltk.download("stopwords")

  5. Quit the Python interpreter:

    >>> quit()

  6. Create a directory named "nltk_data" under ./package:

    $ mkdir package/nltk_data

  7. Copy the downloaded NLTK data into ./package/nltk_data:

    $ cp -R /home/user/nltk_data/* ./package/nltk_data

Note: Replace "/home/user" with your home directory path.

  8. Create a .zip file with the installed libraries and nltk_data:

    $ cd package
    $ zip -r ../my_deployment_package.zip .

Note: The dot at the end is part of the command.

  9. Add your lambda_function.py (your application) file to the .zip file:

    $ cd ..
    $ zip my_deployment_package.zip lambda_function.py

  10. Create an AWS Lambda function and upload the file "my_deployment_package.zip". Pay attention to the following configuration options:

a) Configuration -> General configuration -> Timeout (choose a value large enough for the application to run).

b) Configuration -> General configuration -> Memory (choose a value large enough for the application to run).

c) Configuration -> General configuration -> Ephemeral storage (choose a value large enough for the application to run).

d) Configuration -> Environment variables -> Key = NLTK_DATA, Value = ./nltk_data
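The relative value ./nltk_data in option d) is resolved against the function's working directory. As a possible alternative, and only as a sketch, the handler module could compute the absolute path of the bundled nltk_data directory from its own location and register it with NLTK directly. The helper names `bundled_nltk_data_dir` and `register_bundled_data` are hypothetical, and the sketch assumes nltk_data sits next to lambda_function.py in the deployment package, as in the steps above.

```python
import os


def bundled_nltk_data_dir(handler_file):
    """Return the nltk_data directory located next to the given handler file."""
    return os.path.join(os.path.dirname(os.path.abspath(handler_file)), "nltk_data")


def register_bundled_data(handler_file):
    """Put the bundled nltk_data directory at the front of NLTK's search path."""
    import nltk  # imported here so this module loads even where nltk is absent

    data_dir = bundled_nltk_data_dir(handler_file)
    if data_dir not in nltk.data.path:
        # Inserting at index 0 makes NLTK check the bundled data first.
        nltk.data.path.insert(0, data_dir)
    return data_dir
```

In lambda_function.py this would be called as `register_bundled_data(__file__)` before any NLTK resource is loaded. With this approach the NLTK_DATA environment variable should not be needed, although setting it does no harm.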

I hope it helps!

Marcos
answered 9 months ago
