AWS Textract not accepting remote/non-AWS document URL?


I am creating a component that extracts the content of documents. Since some of them are multipage PDF documents, my understanding is that I need to use the asynchronous StartDocumentTextDetection method. This method requires the document to be in an S3 bucket in AWS. Is this assumption correct? All my documents are in an external, non-AWS location, which is a Dell EMC object storage system. The documents can also be accessed via HTTP.

Can we pass a byte array of an external document, or a URL, to the AWS Textract async operations? I can see that bytes are accepted for the synchronous Textract operations, but not for the async ones. Please let me know.

asked 2 months ago, 54 views
1 Answer

Hello,

For asynchronous operations like StartDocumentTextDetection, Amazon Textract requires the input document to be stored in an Amazon S3 bucket. The asynchronous API does not support passing byte arrays or URLs directly as input. With that in mind, you have the following options:

  • Upload the document to an S3 bucket before processing (the simplest option, if possible).
  • If your documents are relatively small (under 5 MB) and you don't need the scalability of asynchronous processing, you can use the synchronous API (DetectDocumentText). The synchronous API accepts byte arrays, allowing you to process documents without storing them in S3.
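
To make the two options concrete, here is a minimal Python sketch, not a definitive implementation. It assumes boto3-style S3 and Textract clients are passed in, and that the document can be downloaded over HTTP from the external store. The helper names (fetch_document, start_async_detection, detect_sync) and the MAX_SYNC_BYTES constant are hypothetical and exist only for illustration.

```python
import urllib.request

# Size limit for the byte-array path, as quoted in the option above.
MAX_SYNC_BYTES = 5 * 1024 * 1024


def fetch_document(url):
    """Download the document bytes from the external object store over HTTP."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def start_async_detection(s3, textract, data, bucket, key):
    """Option 1: stage the bytes in S3, then start an asynchronous Textract job."""
    s3.put_object(Bucket=bucket, Key=key, Body=data)
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]


def detect_sync(textract, data):
    """Option 2: send the bytes straight to the synchronous API, no S3 involved."""
    if len(data) > MAX_SYNC_BYTES:
        raise ValueError("document too large for the synchronous byte-array path")
    response = textract.detect_document_text(Document={"Bytes": data})
    # Keep only LINE blocks; PAGE and WORD blocks are also returned.
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
```

With boto3 installed, the clients would typically be created with boto3.client("s3") and boto3.client("textract"). The bucket name and object key are whatever staging location you choose. The job ID returned by start_async_detection is what you later pass to GetDocumentTextDetection to read the results.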

Hope that helps.

Cheers

AWS
answered 2 months ago
AWS
EXPERT
reviewed 2 months ago
  • Does the synchronous API (DetectDocumentText) support multipage documents?

  • Synchronous APIs support single-page documents only, as mentioned here: https://docs.aws.amazon.com/textract/latest/dg/sync.html. You could also configure an S3 event notification that triggers a Lambda function, so that your document automatically starts processing as soon as your client uploads it to S3. Alternatively, or in addition, you could set up a more complex workflow orchestrated by something like AWS Step Functions, which could take other steps, including deleting the document from S3 once the processing is done.
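
As a rough sketch of the S3-notification idea, assuming a Lambda function subscribed to the bucket's object-created events and a boto3-style Textract client: the handler below starts one asynchronous job per uploaded object, and a second helper pages through a finished job's results and then deletes the staged object. The names make_handler and collect_lines_and_cleanup are hypothetical, and the sketch leaves out waiting for job completion (for example via polling or an SNS notification).

```python
from urllib.parse import unquote_plus


def make_handler(textract):
    """Build a Lambda-style handler around a Textract client."""

    def handler(event, context=None):
        job_ids = []
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 event notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            response = textract.start_document_text_detection(
                DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
            )
            job_ids.append(response["JobId"])
        return {"jobIds": job_ids}

    return handler


def collect_lines_and_cleanup(textract, s3, job_id, bucket, key):
    """Page through a finished job's results, then delete the staged object."""
    lines, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = textract.get_document_text_detection(**kwargs)
        if page["JobStatus"] != "SUCCEEDED":
            raise RuntimeError(f"job {job_id} is {page['JobStatus']}")
        lines += [b["Text"] for b in page.get("Blocks", []) if b["BlockType"] == "LINE"]
        token = page.get("NextToken")
        if not token:
            break
    # Only reached once every result page has been read successfully.
    s3.delete_object(Bucket=bucket, Key=key)
    return lines
```

In a real deployment the handler would be created once at module load, for example handler = make_handler(boto3.client("textract")). Deleting the object only after all pages are read means a failed or still-running job leaves the document in place for a retry.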
