Textract Async processing of larger files

0

I am using Textract (detect text) asynchronously to process documents. Say I have a 3,000-page document: I send the whole document to Textract from a Lambda function, meaning all 3,000 pages are processed in a single Textract API call made from Lambda. My queries are:

  1. Is this doable, given that Lambda times out after 15 minutes and these 3,000 pages would probably not be processed in time?
  2. Does the limit of 600 maximum simultaneous asynchronous jobs mean 600 documents, or 600 pages in a document? If a document is 3,000 pages, does it count as 3,000 concurrent jobs or 1 job?
  3. On average, how many pages in a document can be processed in 10 minutes? (Just asking for an average, as I know it depends heavily on file quality etc.)

Thank you.

asked a year ago · 1,612 views
3 Answers
1

One way to use Textract inside a Lambda is to use an SNS topic to receive the result:

    import os

    import boto3

    client = boto3.client("textract")

    response = client.start_document_analysis(
        DocumentLocation=document_location,  # e.g. {"S3Object": {"Bucket": ..., "Name": ...}}
        FeatureTypes=["TABLES", "FORMS"],
        NotificationChannel={
            "SNSTopicArn": os.environ["SNS_TOPIC_ARN"],
            "RoleArn": os.environ["SNS_ROLE_ARN"],
        },
    )

And use a second Lambda, subscribed to the SNS topic, to parse the results if needed.
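As a sketch of that second Lambda: the SNS message fields shown below are the ones Textract publishes on job completion, but the handler name and error handling are just illustrative.

```python
import json


def parse_textract_notification(sns_event):
    """Extract job metadata from the SNS event Textract publishes on completion.

    The Textract message is a JSON string nested inside Records[0].Sns.Message.
    """
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    return {
        "job_id": message["JobId"],
        "status": message["Status"],  # "SUCCEEDED" or "FAILED"
        "s3_object": message["DocumentLocation"]["S3ObjectName"],
    }


def lambda_handler(event, context):
    job = parse_textract_notification(event)
    if job["status"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job['job_id']} did not succeed")
    # From here, fetch the results with GetDocumentAnalysis and write the
    # parsed output wherever your pipeline needs it (e.g. back to S3).
    return job
```
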

answered a year ago
  • But again in this case LAMBDA could not run more than 15mins, so the job will end for 3000 pages if its not extracted. Am I right?

  • Sure, a Lambda can't run for more than 15 minutes. But this Lambda will run for just a few milliseconds, just long enough to make the asynchronous start_document_analysis call.

0

As correctly noted in the other answer (+1), the best way to handle this is with an SNS callback.

You don't want your Lambda function to stay active waiting for the Amazon Textract job to complete because:

  • You're consuming (i.e. paying for!) Lambda compute/memory that isn't really doing anything except waiting
  • As you called out, it wouldn't be a good solution if a document could take longer than 15 minutes to process
  • If you take a polling approach and try to run a large number of concurrent jobs (e.g. approaching 600), you'll find that the quota limit on GetDocumentTextDetection/GetDocumentAnalysis can also become a significant limiting factor, reducing the frequency with which you can poll each active job and therefore unnecessarily increasing your end-to-end process latency.

Instead, I would suggest:

  • Have your initial Lambda function responsible only for starting the job, and exit as soon as this is done.
  • Create an SNS topic and use the NotificationChannel parameter in your StartDocumentAnalysis (or similar) API call to request that Amazon Textract send a notification to the topic on job completion.
  • Configure a second Lambda to receive events from the SNS topic, resuming processing when the job is complete in an event-driven way rather than by status polling.
  • Consider whether other components would help meet your end-to-end process needs: For example using an SQS Queue to buffer and retry SNS events in case of sustained failures in the callback Lambda; or an AWS Step Functions State Machine to orchestrate a longer sequence of document processing steps beyond the OCR in an event-driven, serverless way.
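Once the callback Lambda fires, note that a long document's results also come back in pages: GetDocumentAnalysis returns up to 1,000 blocks per call plus a NextToken. A minimal collection loop might look like the sketch below; the injectable client parameter is just for testability, and by default a boto3 Textract client is created.

```python
def get_all_blocks(job_id, client=None):
    """Collect every Block from a (possibly multi-thousand-page) async job.

    Results are paginated: keep calling get_document_analysis with the
    returned NextToken until no token comes back.
    """
    if client is None:
        import boto3  # imported lazily so a stub client can be passed in tests
        client = boto3.client("textract")
    blocks, next_token = [], None
    while True:
        kwargs = {"JobId": job_id, "MaxResults": 1000}
        if next_token:
            kwargs["NextToken"] = next_token
        resp = client.get_document_analysis(**kwargs)
        if resp.get("JobStatus") == "FAILED":
            raise RuntimeError(f"Textract job {job_id} failed")
        blocks.extend(resp.get("Blocks", []))
        next_token = resp.get("NextToken")
        if not next_token:
            return blocks
```

For a 3,000-page document this loop may run many times, which is another reason to do it in the event-driven callback rather than while a submitter Lambda waits.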

Some useful examples to refer to:

On your other questions:

(2) The async concurrent job limit quotas refer to jobs, not pages: each submitted document is just one job, regardless of length.

(3) Sorry, in my experience there really are too many factors at play to give a useful estimate! For what it's worth, I tentatively think you could expect even a 3,000-page document to complete within 15 minutes (assuming it meets the other quotas), but if you try to encapsulate the async job in a synchronous Lambda function, it's not just the processing time you'd need to worry about! What if you get throttled when trying to create the job and need to retry for an extended period? What if you get throttled when trying to fetch the result? If your workload is very spiky/batchy, you could definitely run into scenarios where the overall process of getting a long-document job submitted successfully and a result back takes longer than 15 minutes.
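One common mitigation for those throttling scenarios is jittered exponential backoff around the start call. A rough sketch follows; the set of retryable error codes and the delay limits are assumptions to tune for your own account.

```python
import random
import time

# Service error codes that indicate you should back off and retry (assumed set)
RETRYABLE_CODES = {
    "ProvisionedThroughputExceededException",
    "ThrottlingException",
    "LimitExceededException",
}


def call_with_backoff(start_fn, max_attempts=8, base_delay=1.0, max_delay=30.0):
    """Invoke start_fn (e.g. a bound client.start_document_analysis call) and
    retry with jittered exponential backoff when the service throttles us."""
    for attempt in range(max_attempts):
        try:
            return start_fn()
        except Exception as err:
            # botocore's ClientError exposes the service error code here;
            # anything without a matching code is re-raised immediately
            code = getattr(err, "response", {}).get("Error", {}).get("Code", "")
            if code not in RETRYABLE_CODES or attempt == max_attempts - 1:
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay * random.uniform(0.5, 1.0))
```
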

AWS
EXPERT
Alex_T
answered a year ago
  • Thanks Alex, that is a great help. Let me implement it and I will share my findings, but here I have one more problem. Textract takes files in PDF/image formats, so what should I do with .docx or Excel files? They need to be converted: say I have a 3,000-page Word document (.docx), how do I convert it into PDF so it can be ingested by Textract?

  • Hmm, unfortunately Office file conversion is separate and probably a big enough topic to raise as a separate question, in case others have more insight! In general I would say high-fidelity .docx conversion will be hard without an actual licensed Word installation, but it might be possible with OSS using e.g. LibreOffice (as discussed here: https://stackoverflow.com/a/56067358)... For Excel, I would challenge whether converting these files to PDF is really the best idea: if you're trying to extract tables of data from spreadsheets, that's already possible directly with Python libraries.

  • Hi Alex, I tried implementing this and your idea worked great.

    1. I'm free from the Lambda 15-minute timeout, which is good.
    2. But the Textract concurrent jobs limit is 600 and I need to run at least 3,000 docs. What I am doing: a. 3,000 docs are uploaded to S3; b. Lambda starts the job and exits; c. an SNS notification is generated once the job is completed; d. on that SNS notification another Lambda starts; e. this does some parsing etc.; f. it uploads the final files to S3. Now if there are more than 600 files we will get a throttling error, since I am uploading 3,000 docs at a time; it will also give a TPS error, as I can only submit 10 files per second. Can you please share how to handle this?
  • Hey, so the 3 samples linked in the answer should help here, but they all use somewhat different approaches to limit concurrency. The 'Large-scale processing' example is probably the most generally applicable one to try first; it uses SQS to store requests (documents) until capacity is available to process them. The 'Textract Transformer Pipeline' sample uses a pattern like this one to limit concurrency in the Step Functions workflow itself.
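To make the SQS-buffering idea from the comments above concrete, here is a rough sketch of a consumer Lambda. The message body format ({"bucket": ..., "key": ...}), the environment variable names, and the idea of capping parallel starts via the function's reserved concurrency are all assumptions to adapt.

```python
import json
import os


def jobs_from_sqs_event(event):
    """Each SQS record body is assumed to be JSON like {"bucket": ..., "key": ...}."""
    return [json.loads(record["body"]) for record in event["Records"]]


def lambda_handler(event, context):
    # Imported lazily so jobs_from_sqs_event stays unit-testable without boto3
    import boto3

    client = boto3.client("textract")
    for job in jobs_from_sqs_event(event):
        client.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": job["bucket"], "Name": job["key"]}},
            NotificationChannel={
                "SNSTopicArn": os.environ["SNS_TOPIC_ARN"],
                "RoleArn": os.environ["SNS_ROLE_ARN"],
            },
        )
```

The idea: queue all 3,000 documents into SQS up front, then set the consumer Lambda's reserved concurrency and SQS batch size low enough that job starts stay under the per-second start quota and the 600 concurrent-job ceiling, with a dead-letter queue for messages that keep failing.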

0

If possible, base your solution on the CDK constructs and samples, which already implement the best practices mentioned in the other posts (using SNS, e.g.):

Here is a workshop that walks you through the setup in 1.5 h: https://s12d.com/aws-idp-scale-workshop

And here is the code you can use to start: https://github.com/aws-samples/amazon-textract-idp-cdk-stack-samples https://github.com/aws-samples/amazon-textract-idp-cdk-constructs

The construct TextractGenericAsyncSfnTask implements the Step Functions Task interface and includes a Lambda function with error handling and retries for calling Textract asynchronously; it sets up the SNS topic and a Lambda subscribed to that topic to handle further processing. The output is stored in an S3 bucket using OutputConfig, which eliminates the need to call the Get* API and potentially get throttled when running many concurrent jobs.

AWS
answered a year ago
