Options for processing large documents using Amazon Textract?


New to Textract. I am using the PHP SDK to access Amazon Textract for DocumentTextDetection (OCR) processing.

This works so far by downloading the extracted text as a local text file. However, since I plan to process documents that could be 750 pages or more, I suspect there is some other method available. Currently, I process documents from my S3 bucket. Is there a way to initiate startDocumentTextDetection() through the API, have the completed output uploaded to an S3 bucket, and get an email notification when it completes?

Asked 7 months ago · 627 views
3 Answers

Hi, as shown here, you can use a Lambda function to process the document extraction: https://aws.amazon.com/it/blogs/machine-learning/store-output-in-custom-amazon-s3-bucket-and-encrypt-using-aws-kms-for-multi-page-document-processing-with-amazon-textract/

Please also take note of this wrapper for Textract, which will ease your work with the APIs: https://github.com/aws-samples/amazon-textract-textractor

You can also trigger the Lambda using EventBridge when a new file is uploaded to the specific S3 bucket, and publish the email notification to an SNS topic when processing finishes: https://aws.amazon.com/it/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/
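To sketch what the asynchronous call looks like: StartDocumentTextDetection takes an OutputConfig (where Textract writes the results) and a NotificationChannel (an SNS topic it notifies on completion). A minimal boto3 sketch, with placeholder bucket, topic, and role names — the PHP SDK's startDocumentTextDetection() accepts the same parameter structure:

```python
# Build the parameters for an asynchronous Textract text-detection job.
# All bucket, topic, and role names below are placeholders.
def build_start_request(bucket, key, sns_topic_arn, role_arn,
                        output_bucket, output_prefix="output"):
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Textract writes the paginated JSON results under this prefix.
        "OutputConfig": {"S3Bucket": output_bucket, "S3Prefix": output_prefix},
        # Textract publishes a completion message to this SNS topic;
        # the role must allow Textract to publish to it.
        "NotificationChannel": {"SNSTopicArn": sns_topic_arn,
                                "RoleArn": role_arn},
    }

params = build_start_request(
    "my-input-bucket", "big-document.pdf",
    "arn:aws:sns:us-east-1:123456789012:textract-done",
    "arn:aws:iam::123456789012:role/TextractSNSRole",
    "my-output-bucket",
)
# With real resources in place you would then call:
# import boto3
# job_id = boto3.client("textract").start_document_text_detection(**params)["JobId"]
```

The job runs in the background; the JobId returned by the call is what you later pass to GetDocumentTextDetection (or match against the SNS completion message).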

I hope this helps

AWS
answered 7 months ago
  • It helps! I want to output to S3; I don't need to encrypt. If I could just find the code to do it in PHP, I'll be good!

    I've modified my code to upload to my S3 bucket. But, this is what I get:

    output/ output//f968fe264ba627742badd701c2a7aed5cf02ed64424c6d98d4ea74963438bbda/.s3_access_check output//f968fe264ba627742badd701c2a7aed5cf02ed64424c6d98d4ea74963438bbda/1 output//f968fe264ba627742badd701c2a7aed5cf02ed64424c6d98d4ea74963438bbda/2

    This is a bunch of stuff, but not the text I'm looking for. How do I get it to upload the extracted text as a text file? And if I can't do that, how do I decode this output? Thanks!

  • You are looking at the paginated output from the asynchronous Textract processing and need to concatenate the files. Check out this implementation for a way to achieve this in Python, which you can convert to PHP: https://github.com/aws-samples/amazon-textract-textractor/blob/6e7125c51a351900089102bee1ef2c679c635df2/caller/textractcaller/t_call.py#L195

  • Got it, thanks! I was able to concatenate and download as text. Just concerned about what happens when it's 800 pages or more.
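For reference, each of the numbered files Textract writes under the output prefix is a JSON payload containing a Blocks array; joining the text of the LINE blocks across the files, in order, yields the full document text. A hedged Python sketch of just that concatenation step (the sample data stands in for two loaded result files):

```python
def concat_lines(page_payloads):
    """Join the text of all LINE blocks across Textract result pages.

    page_payloads: list of dicts as loaded (e.g. via json.load) from the
    numbered files (1, 2, ...) Textract writes under the output prefix.
    """
    lines = []
    for payload in page_payloads:
        for block in payload.get("Blocks", []):
            # Only LINE blocks carry the readable line text; PAGE and
            # WORD blocks are skipped here.
            if block["BlockType"] == "LINE":
                lines.append(block["Text"])
    return "\n".join(lines)

# Tiny stand-in for two result files:
pages = [
    {"Blocks": [{"BlockType": "PAGE"},
                {"BlockType": "LINE", "Text": "Hello"}]},
    {"Blocks": [{"BlockType": "LINE", "Text": "world"}]},
]
print(concat_lines(pages))  # prints "Hello" and "world" on two lines
```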


Hi,

I would strongly recommend reading about the limits: https://docs.aws.amazon.com/textract/latest/dg/limits-document.html

Based on the following and the size of your project, you will have to work asynchronously:

File Size and Page Count Limits

For synchronous operations, JPEG, PNG, PDF, and TIFF files have a limit of 10 MB in memory. PDF and TIFF files also have a limit of 1 page. For asynchronous operations, JPEG and PNG files have a limit of 10 MB in memory. PDF and TIFF files have a limit of 500 MB in memory. PDF and TIFF files have a limit of 3,000 pages.

So you should call Textract asynchronously, and you can be notified via a Lambda when the results are written back to your S3 bucket: https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
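As a sketch of the notification side: when the job finishes, Textract publishes a JSON message (containing JobId, Status, and the API name) to the configured SNS topic, and a Lambda subscribed to that topic receives it in the standard SNS event envelope. Assuming that message shape:

```python
import json

def handler(event, context=None):
    """Sketch of a Lambda handler subscribed to the Textract SNS topic.

    On SUCCEEDED you would fetch and concatenate the results, or send
    the email notification (via SES/SNS) -- both omitted here.
    """
    succeeded_jobs = []
    for record in event["Records"]:
        # The SNS message body is a JSON string published by Textract.
        message = json.loads(record["Sns"]["Message"])
        if message["Status"] == "SUCCEEDED":
            succeeded_jobs.append(message["JobId"])
    return succeeded_jobs

# Sample event in the shape SNS hands to Lambda:
event = {"Records": [{"Sns": {"Message": json.dumps(
    {"JobId": "abc123", "Status": "SUCCEEDED",
     "API": "StartDocumentTextDetection"})}}]}
print(handler(event))  # ['abc123']
```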

Best,

Didier

AWS
EXPERT
answered 7 months ago

Also check out the scale samples, which include a DocumentSplitter that can be configured to split documents larger than 3,000 pages.

I tested it with the OpenSearchWorkflow as well, on documents that are 10k pages. There is a blog post going into the details of the setup, which I tested with 100k documents and 1.6 million pages (fully processed in 4.5 h in us-east-1).
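To illustrate the splitting idea (independent of the samples' DocumentSplitter, whose exact API I won't assume here): breaking an N-page document into chunks that each respect the 3,000-page asynchronous limit is just a page-range computation, and each resulting range becomes one Textract job:

```python
def page_chunks(total_pages, max_pages=3000):
    """Return 1-based inclusive (start, end) page ranges, each at most
    max_pages long, covering a document of total_pages pages."""
    return [(start, min(start + max_pages - 1, total_pages))
            for start in range(1, total_pages + 1, max_pages)]

print(page_chunks(10_000))
# [(1, 3000), (3001, 6000), (6001, 9000), (9001, 10000)]
```

A splitter then extracts each page range into its own PDF, submits the parts as separate asynchronous jobs, and the per-part outputs are concatenated in range order at the end.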

AWS
answered 7 months ago
