Textract Throttling

Question

I just had my `GetDocumentTextDetection` quota increased to 35, but everything else was kept the same.
Here is the error I still get: An error occurred (ProvisionedThroughputExceededException) when calling the GetDocumentTextDetection operation (reached max retries: 4): Provisioned rate exceeded

My documents are 1-3 pages long and I'm only using the detect text functionality. I don't understand why it takes 5 minutes to OCR 15 documents that are each only 1-3 pages long.

Answer

Hi,

If you're processing just 15 documents but hitting 35+ TPS on `GetDocumentTextDetection`, my guess is that you're polling for job completion and the polling retry configuration is a bit off.
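To illustrate what a gentler polling loop could look like, here is a minimal sketch that backs off exponentially with jitter between calls, so each job consumes fewer `GetDocumentTextDetection` requests while it is still running. The function name `wait_for_job` and its parameters (`base_delay`, `max_delay`, `timeout`, `sleep`) are my own illustrative choices, not part of any AWS library; `client` is assumed to be a boto3 Textract client.

```python
import random
import time


def wait_for_job(client, job_id, base_delay=2.0, max_delay=30.0,
                 timeout=600.0, sleep=time.sleep):
    """Poll GetDocumentTextDetection until the job leaves IN_PROGRESS.

    Hypothetical helper: `client` is assumed to be a boto3 Textract client
    (anything exposing get_document_text_detection works).
    """
    delay = base_delay
    waited = 0.0
    while True:
        # Every poll counts against the GetDocumentTextDetection TPS quota.
        response = client.get_document_text_detection(JobId=job_id)
        if response["JobStatus"] != "IN_PROGRESS":
            # SUCCEEDED, FAILED or PARTIAL_SUCCESS: hand back the first page.
            return response
        if waited >= timeout:
            raise TimeoutError(f"Textract job {job_id} still running after {waited:.0f}s")
        # Jitter spreads out pollers that were started at the same moment.
        pause = random.uniform(delay / 2, delay)
        sleep(pause)
        waited += pause
        # Double the delay after each unfinished poll, up to max_delay.
        delay = min(delay * 2, max_delay)
```

With 15 jobs polled in parallel, a loop like this makes only a handful of calls per job instead of hammering the API in a tight loop. Note that the returned response is only the first page of results; larger outputs need follow-up calls with `NextToken`.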

In my experience, yes, you should be able to achieve higher throughput than the 15 documents of 1-3 pages in 5 minutes that you mention. But it's worth mentioning that response time through the async APIs can vary, and high throughput for many documents is a different topic from end-to-end response time for one small doc.

The more docs you process in parallel (the default quota limit for that is in the hundreds), the less appropriate polling becomes as a strategy for fetching the results. Using [SNS callbacks via Lambda](https://docs.aws.amazon.com/textract/latest/dg/async-notification-payload.html) ensures you only try to fetch each result once it becomes available. In that Lambda function you should still check your AWS SDK retry configuration (e.g. [docs here for Python boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html)), to avoid issues if, for example, many jobs happen to finish at once.
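As a rough sketch of that callback approach, the Lambda handler below reads the Textract completion message from the SNS event and fetches results only for jobs reported as `SUCCEEDED`, using a client with a more patient retry configuration. The helper names (`parse_textract_notifications`, `make_textract_client`, `handler`) and the `max_attempts` value of 10 are assumptions for illustration; tune them to your own workload.

```python
import json


def parse_textract_notifications(event):
    """Return (job_id, status) pairs from an SNS-triggered Lambda event."""
    jobs = []
    for record in event.get("Records", []):
        # SNS delivers the Textract completion message as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        jobs.append((message["JobId"], message["Status"]))
    return jobs


def make_textract_client():
    # Imported lazily so the parsing helper above stays stdlib-only.
    import boto3
    from botocore.config import Config

    # "adaptive" mode adds client-side rate limiting on top of retries;
    # max_attempts=10 is an illustrative value, not a recommendation.
    retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
    return boto3.client("textract", config=retry_config)


def handler(event, context, client=None):
    """Fetch all result pages for each job that finished successfully."""
    client = client or make_textract_client()
    results = {}
    for job_id, status in parse_textract_notifications(event):
        if status != "SUCCEEDED":
            continue  # failed jobs have no text to fetch
        blocks = []
        kwargs = {"JobId": job_id}
        while True:
            page = client.get_document_text_detection(**kwargs)
            blocks.extend(page.get("Blocks", []))
            token = page.get("NextToken")
            if not token:
                break
            kwargs["NextToken"] = token
        results[job_id] = blocks
    return results
```

In a real function you would probably create the client once at module level so that warm invocations reuse it, and write the blocks somewhere durable rather than returning them.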

Code samples like the [Amazon-Textract-Caller](https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller) (not sure if you're using this?) try to offer a helpful utility for small-scale projects, but have to make a trade-off between keeping to a simple client-only solution, versus being scalable for bigger workloads.

Stacks like [Large scale document processing with Amazon Textract](https://github.com/aws-samples/amazon-textract-serverless-large-scale-document-processing) optimize more for scalability, but it means you need to deploy some cloud components too (SNS, Lambda, SQS, etc).

For CDK infrastructure-as-code, you could check out the patterns in [amazon-textract-idp-cdk-stack-samples](https://github.com/aws-samples/amazon-textract-idp-cdk-stack-samples) or maybe (a bit more complex because it combines with post-processing models) [amazon-textract-transformer-pipeline](https://github.com/aws-samples/amazon-textract-transformer-pipeline).
