Options for processing large documents using Amazon Textract?

Question

New to Textract.  I am using the PHP SDK to access Amazon Textract for DocumentTextDetection (OCR) processing.

This works so far: I download the extracted text as a local text file.  However, since I plan to process documents of 750 pages or more, I suspect there is a better method.  Currently, I process documents from my S3 bucket.  Is there a way to call startDocumentTextDetection() through the API, have the completed output uploaded to an S3 bucket, and get an email notification when the job finishes?

Answer

Hi,

I would strongly recommend reading about the limits: https://docs.aws.amazon.com/textract/latest/dg/limits-document.html

Based on the following limits and the size of your project, you will have to work asynchronously:

```
File Size and Page Count Limits

For synchronous operations, JPEG, PNG, PDF, and TIFF files have a limit of 10 MB in memory. PDF and TIFF files also have a limit of 1 page. For asynchronous operations, JPEG and PNG files have a limit of 10 MB in memory. PDF and TIFF files have a limit of 500 MB in memory. PDF and TIFF files have a limit of 3,000 pages.
```
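
To make the limits above concrete, here is a minimal sketch that picks an operation mode from the file type, size, and page count. The `Document` class and `choose_mode` function are hypothetical names for illustration, and it assumes 1 MB = 1024 * 1024 bytes and that the on-disk size is a fair proxy for the "in memory" size the documentation mentions.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: binary megabytes


@dataclass(frozen=True)
class Document:
    file_type: str  # "jpeg", "png", "pdf", or "tiff"
    size_bytes: int
    pages: int = 1


def choose_mode(doc: Document) -> str:
    """Return "sync", "async", or "too_large" per the limits quoted above."""
    if doc.file_type not in ("jpeg", "png", "pdf", "tiff"):
        raise ValueError(f"unsupported file type: {doc.file_type}")
    # Synchronous: 10 MB for every type, and PDF/TIFF limited to 1 page.
    if doc.size_bytes <= 10 * MB and doc.pages <= 1:
        return "sync"
    # Asynchronous: PDF/TIFF up to 500 MB and 3,000 pages.
    multi_page = doc.file_type in ("pdf", "tiff")
    if multi_page and doc.size_bytes <= 500 * MB and doc.pages <= 3000:
        return "async"
    return "too_large"
```

For a 750-page PDF of around 50 MB, `choose_mode(Document("pdf", 50 * MB, 750))` returns `"async"`, which matches the recommendation here.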

So, you should call Textract asynchronously. You can then be notified via a Lambda function when the results are written back to your S3 bucket: https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
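
As a rough sketch of what the asynchronous call can look like, the snippet below starts a text detection job, asks Textract to write its output to an S3 prefix, and registers an SNS topic for the completion notification. It is written in Python for brevity; the PHP SDK's startDocumentTextDetection() takes the same parameters. The function name `start_text_detection` is hypothetical, and `textract` is assumed to be a client such as `boto3.client("textract")`. The bucket names, prefix, and ARNs are placeholders you would supply.

```python
def start_text_detection(textract, *, bucket, key, output_bucket,
                         output_prefix, sns_topic_arn, role_arn):
    """Start an asynchronous Textract job and return its JobId.

    `textract` is assumed to be a boto3 Textract client (or any object
    exposing the same start_document_text_detection method).
    """
    response = textract.start_document_text_detection(
        # The source document must already be in S3.
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        # Where Textract should store the JSON results.
        OutputConfig={"S3Bucket": output_bucket, "S3Prefix": output_prefix},
        # Topic that receives the completion message; role_arn is an IAM
        # role that allows Textract to publish to that topic.
        NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]
```

Keep the returned JobId: it is what you later pass to GetDocumentTextDetection, and it also appears in the completion message sent to the SNS topic.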

Best,

Didier

Answer

Hi, as shown here, you can use a Lambda function to process the document extraction:
https://aws.amazon.com/it/blogs/machine-learning/store-output-in-custom-amazon-s3-bucket-and-encrypt-using-aws-kms-for-multi-page-document-processing-with-amazon-textract/

Please also take note of this wrapper for Textract, which will make working with the APIs easier: https://github.com/aws-samples/amazon-textract-textractor

Also, you can trigger the Lambda using EventBridge when a new file is uploaded to the specific S3 bucket, and send the email notification through an SNS topic when processing finishes.
https://aws.amazon.com/it/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/
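
For the "when finished" part, a Lambda subscribed to the Textract completion topic could look roughly like the sketch below. It assumes the function is triggered by SNS and that `textract` is a boto3 Textract client; the helper names `job_from_sns_event` and `collect_lines` are hypothetical. Results for large documents come back in pages, so the second helper follows `NextToken` until it runs out.

```python
import json


def job_from_sns_event(event):
    """Extract (job_id, status) from an SNS-triggered Lambda event."""
    # SNS delivers the Textract completion message as a JSON string.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["JobId"], message["Status"]


def collect_lines(textract, job_id):
    """Return every detected LINE of text for a finished job."""
    lines = []
    kwargs = {"JobId": job_id}
    while True:
        page = textract.get_document_text_detection(**kwargs)
        lines.extend(
            block["Text"]
            for block in page.get("Blocks", [])
            if block["BlockType"] == "LINE"
        )
        token = page.get("NextToken")
        if not token:
            return lines
        # More results remain; request the next page.
        kwargs["NextToken"] = token
```

In a handler you would call `job_from_sns_event(event)` first and only collect lines when the status is `"SUCCEEDED"`. For the email itself, adding an email subscription to the same SNS topic is likely the simplest route, since the subscriber then receives the completion message directly.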

I hope this helps

Answer

Also check out the [scale samples](https://github.com/aws-solutions-library-samples/guidance-for-low-code-intelligent-document-processing-on-aws), which include a [DocumentSplitter](https://github.com/aws-solutions-library-samples/guidance-for-low-code-intelligent-document-processing-on-aws#document-splitter-workflow) that can be configured to split documents longer than 3,000 pages.

I also tested it with the [OpenSearchWorkflow](https://github.com/aws-solutions-library-samples/guidance-for-low-code-intelligent-document-processing-on-aws#opensearchworkflow) on documents of 10k pages. There is a blog post that covers the setup in detail, which I tested with 100k documents and 1.6 million pages (fully processed in 4.5h in us-east-1).
