Options for processing large documents using Amazon Textract?


New to Textract. I am using the PHP SDK to access Amazon Textract for DocumentTextDetection (OCR) processing.

This is working so far by downloading the extracted text as a local text file. However, I plan to process documents that could be 750 pages or more, so I'm wondering whether there is a better method. Currently, I process documents from my S3 bucket. Is there a way to initiate startDocumentTextDetection() through the API, have the completed output uploaded to an S3 bucket, and get an email notification when the job is done?

asked 7 months ago · 574 views
3 Answers

Hi, as this blog post shows, you can use a Lambda function to process the document extraction: https://aws.amazon.com/it/blogs/machine-learning/store-output-in-custom-amazon-s3-bucket-and-encrypt-using-aws-kms-for-multi-page-document-processing-with-amazon-textract/

Please also take note of this wrapper for Textract, which will make the APIs easier to work with: https://github.com/aws-samples/amazon-textract-textractor

Also, you can trigger the Lambda using EventBridge when a new file is uploaded to the specific S3 bucket, and send the email notification through an SNS topic when the job finishes. https://aws.amazon.com/it/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/
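
As a rough illustration of the request itself, here is a minimal Python sketch (the same parameters should map onto the PHP SDK's startDocumentTextDetection()). It builds a StartDocumentTextDetection request that writes results to your own bucket via OutputConfig and publishes a completion message to an SNS topic via NotificationChannel. The bucket names, prefix, and ARNs are placeholders, and the helper names build_start_request and start_job are made up for this example. An email subscription on the SNS topic would then deliver the notification.

```python
def build_start_request(bucket, key, output_bucket, output_prefix,
                        sns_topic_arn, role_arn):
    """Build keyword arguments for Textract's StartDocumentTextDetection.

    All values are supplied by the caller; nothing here is specific to
    any real account.
    """
    return {
        # Source document in S3
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Where Textract should write the paginated JSON results
        "OutputConfig": {"S3Bucket": output_bucket, "S3Prefix": output_prefix},
        # SNS topic that receives the completion message, and the IAM role
        # Textract assumes to publish to it
        "NotificationChannel": {"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    }


def start_job(request):
    """Submit the request and return the JobId (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    client = boto3.client("textract")
    response = client.start_document_text_detection(**request)
    return response["JobId"]
```

Calling start_job(build_start_request(...)) returns as soon as the job is queued; the text itself arrives later under the output prefix.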

I hope this helps

AWS
answered 7 months ago
  • It helps! I want to output to S3 and don't need encryption. If I could just find the code to do it in PHP, I'd be good!

    I've modified my code to upload to my S3 bucket. But, this is what I get:

    output/
    output//f968fe264ba627742badd701c2a7aed5cf02ed64424c6d98d4ea74963438bbda/.s3_access_check
    output//f968fe264ba627742badd701c2a7aed5cf02ed64424c6d98d4ea74963438bbda/1
    output//f968fe264ba627742badd701c2a7aed5cf02ed64424c6d98d4ea74963438bbda/2

    This is a bunch of stuff, but not the text I'm looking for. How do I get it to upload the scanned text as a text file? If I can't do that, how do I decode this stuff? Thanks!

  • You are looking at the paginated output from the asynchronous Textract processing, and you need to concatenate the files. Check out this implementation for a way to do it in Python, which you can convert to PHP: https://github.com/aws-samples/amazon-textract-textractor/blob/6e7125c51a351900089102bee1ef2c679c635df2/caller/textractcaller/t_call.py#L195

  • Got it. Thanks! Able to concatenate and download as text. Just concerned about what happens when it's 800 pages or more.
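
For anyone landing on this thread, the concatenation step discussed above can be sketched roughly as follows. This is not the linked textractcaller code, just a minimal stand-in. It assumes each numbered object (1, 2, ...) holds one JSON response with a "Blocks" list, that LINE blocks carry the text, and that the parts have already been downloaded and parsed with json.loads. The function names are made up for this example.

```python
def ordered_part_keys(keys):
    """Keep only the numbered part files and sort them numerically.

    Skips the folder placeholder and the .s3_access_check marker, and
    sorts 1, 2, ..., 10 rather than the lexical order 1, 10, 2.
    """
    parts = [k for k in keys if k.rsplit("/", 1)[-1].isdigit()]
    return sorted(parts, key=lambda k: int(k.rsplit("/", 1)[-1]))


def lines_from_part(part):
    """Return the text of every LINE block in one parsed Textract response."""
    return [
        block["Text"]
        for block in part.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def concatenate_text(parts):
    """Join the LINE text of all parts, in the order given, one line per row."""
    lines = []
    for part in parts:
        lines.extend(lines_from_part(part))
    return "\n".join(lines)
```

Because each part is processed on its own, a very large document only changes how many parts get looped over; the text could also be appended to a file part by part instead of being held in memory.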


Hi,

I'd strongly recommend reading about the limits: https://docs.aws.amazon.com/textract/latest/dg/limits-document.html

Based on the following and the size of your project, you will have to work asynchronously:

File Size and Page Count Limits

For synchronous operations, JPEG, PNG, PDF, and TIFF files have a limit of 10 MB in memory. PDF and TIFF files also have a limit of 1 page. For asynchronous operations, JPEG and PNG files have a limit of 10 MB in memory. PDF and TIFF files have a limit of 500 MB in memory. PDF and TIFF files have a limit of 3,000 pages.
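
To make the quoted limits concrete, here is a small Python sketch that picks a processing mode from file type, size, and page count. The thresholds come straight from the text above; treating 1 MB as 1024 * 1024 bytes and the function name choose_mode are assumptions of this example.

```python
MB = 1024 * 1024  # assumption: binary megabytes


def choose_mode(file_type, size_bytes, pages=1):
    """Return 'sync', 'async', or 'too_large' according to the quoted limits.

    file_type is 'jpeg', 'png', 'pdf', or 'tiff' (case-insensitive).
    """
    multi_page = file_type.lower() in ("pdf", "tiff")

    # Synchronous: 10 MB for every type, and a single page for PDF/TIFF
    if size_bytes <= 10 * MB and (not multi_page or pages <= 1):
        return "sync"

    # Asynchronous: 10 MB for JPEG/PNG, 500 MB and 3,000 pages for PDF/TIFF
    async_size_limit = 500 * MB if multi_page else 10 * MB
    if size_bytes <= async_size_limit and pages <= 3000:
        return "async"

    return "too_large"
```

A 750-page PDF of, say, 50 MB comes out as "async", which is why the asynchronous API is the route for this project.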

So, you should call Textract asynchronously, and you can be notified via a Lambda when the results are written back to your S3 bucket: https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
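
A minimal sketch of such a Lambda, in Python, assuming it is subscribed to object-created events on the output bucket. It only extracts the bucket and key of each new object and ignores the .s3_access_check marker that Textract writes; what you do next (concatenate, send an email) is up to you. The helper name result_objects is made up for this example.

```python
from urllib.parse import unquote_plus


def result_objects(event):
    """Extract (bucket, key) pairs from an S3 event notification."""
    found = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        # Object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(s3.get("object", {}).get("key", ""))
        if bucket and key and not key.endswith(".s3_access_check"):
            found.append((bucket, key))
    return found


def handler(event, context):
    """Lambda entry point: log each Textract output object that arrived."""
    objects = result_objects(event)
    for bucket, key in objects:
        print(f"Textract output written: s3://{bucket}/{key}")
    return {"count": len(objects)}
```

Note that one event fires per output object, so a long document triggers this handler many times; the SNS completion message is the simpler signal if you only need to know that the whole job is done.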

Best,

Didier

AWS
EXPERT
answered 7 months ago

Also check out the scale samples, which include a DocumentSplitter that can be configured to split documents when they exceed 3,000 pages.

I tested it with the OpenSearchWorkflow as well, using documents of 10k pages. There is a blog post that goes into detail on the setup, which I tested with 100k documents and 1.6 million pages (fully processed in 4.5h in us-east-1).
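
The idea behind splitting can be sketched in a few lines of Python. This is not the DocumentSplitter from the scale samples, only the page-range arithmetic under the 3,000-page asynchronous limit; the function name split_page_ranges is made up for this example, and cutting the actual PDF into those ranges is left to a PDF library of your choice.

```python
def split_page_ranges(total_pages, max_pages=3000):
    """Return inclusive, 1-based (first_page, last_page) chunks.

    Each chunk holds at most max_pages pages, so every chunk can be
    submitted as its own asynchronous job.
    """
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]
```

A 10,000-page document yields four chunks, the last one covering pages 9001 to 10000, while a 750-page document stays in one piece.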

AWS
answered 7 months ago
