Textract Throttling

I just had my GetDocumentTextDetection limit increased to 35, but everything else was kept the same. Here is the error I still get: An error occurred (ProvisionedThroughputExceededException) when calling the GetDocumentTextDetection operation (reached max retries: 4): Provisioned rate exceeded

My documents are 1-3 pages long and I'm only using the detect text functionality. I don't understand why it takes 5 minutes to get 15 documents, each 1-3 pages long, individually OCR'd.

Bobby
asked 2 years ago, 1712 views
1 Answer

Hi,

If you're processing just 15 documents, but hitting 35+ TPS on GetDocumentTextDetection, I would think maybe you're polling for job completion and the polling retry configuration is a bit off?
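As a rough illustration of what a gentler polling loop could look like, here is a minimal sketch using exponential backoff with jitter. The helper names (`backoff_delays`, `wait_for_job`) and the default timings are assumptions made up for this example, not part of any AWS library. `get_status` is expected to be a small wrapper you write yourself, for instance one that returns `textract.get_document_text_detection(JobId=job_id)["JobStatus"]`.

```python
import random
import time


def backoff_delays(base=1.0, cap=30.0, attempts=10, rng=random.random):
    """Yield exponentially growing delays with full jitter, capped at `cap` seconds."""
    for attempt in range(attempts):
        yield rng() * min(cap, base * (2 ** attempt))


def wait_for_job(get_status, job_id, sleep=time.sleep, **backoff_kwargs):
    """Poll get_status(job_id) until the job is no longer IN_PROGRESS.

    get_status is a hypothetical callable returning the Textract JobStatus
    string (IN_PROGRESS, SUCCEEDED, FAILED or PARTIAL_SUCCESS).
    """
    for delay in backoff_delays(**backoff_kwargs):
        status = get_status(job_id)
        if status != "IN_PROGRESS":
            return status
        # Wait longer after each unsuccessful check so that many parallel
        # pollers do not add up to a burst of GetDocumentTextDetection calls.
        sleep(delay)
    raise TimeoutError(f"Job {job_id} still in progress after the polling budget")
```

With 15 jobs each polled this way, the number of calls per second should stay well below the quota, whereas a tight loop without sleeps can exceed it with only a handful of documents.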

In my experience, yes: you should be able to achieve higher throughput than the 15 documents of 1-3 pages in 5 minutes that you mention. But it's worth mentioning that response time through the async APIs can vary, and high throughput for many documents is a different topic from end-to-end response time for one small doc.

The more docs you process in parallel (the default quota limit for that is in the hundreds), the less appropriate polling will be as a strategy for fetching the results. Using SNS callbacks via Lambda will ensure you only try to fetch each result once it becomes available. In that Lambda function you should still check your AWS SDK retry configuration (e.g. the retries documentation for Python boto3), to try and avoid issues if, for example, many jobs happen to finish at once.
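A minimal sketch of that callback pattern might look like the following, assuming the Lambda is subscribed to the SNS topic passed as the notification channel when the job was started. The function names, the `max_attempts` value and the choice to return line counts are illustrative assumptions; the `JobId` and `Status` fields are the ones Textract puts in its completion message.

```python
import json


def completed_job_ids(event):
    """Extract JobIds of successfully finished jobs from an SNS-triggered Lambda event."""
    job_ids = []
    for record in event.get("Records", []):
        # The SNS message body arrives as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        if message.get("Status") == "SUCCEEDED":
            job_ids.append(message["JobId"])
    return job_ids


def make_textract_client():
    """Build a Textract client with more patient retries (values are assumptions)."""
    import boto3
    from botocore.config import Config

    retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
    return boto3.client("textract", config=retry_config)


def fetch_all_blocks(client, job_id):
    """Page through GetDocumentTextDetection results by following NextToken."""
    blocks, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = client.get_document_text_detection(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks


def handler(event, context, client=None):
    """Lambda entry point: fetch each finished job once and count its text lines."""
    client = client or make_textract_client()
    return {
        job_id: sum(1 for b in fetch_all_blocks(client, job_id) if b["BlockType"] == "LINE")
        for job_id in completed_job_ids(event)
    }
```

Because each result is fetched only after its completion message arrives, the only GetDocumentTextDetection calls made are the ones that return data, and the retry settings cover the case where several jobs finish in the same second.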

Code samples like the Amazon-Textract-Caller (not sure if you're using this?) try to offer a helpful utility for small-scale projects, but have to make a trade-off between keeping to a simple client-only solution, versus being scalable for bigger workloads.

Stacks like Large scale document processing with Amazon Textract optimize more for scalability, but it means you need to deploy some cloud components too (SNS, Lambda, SQS, etc).

For CDK infrastructure-as-code, you could check out the patterns in amazon-textract-idp-stack-samples or maybe (a bit more complex because it combines with post-processing models) amazon-textract-transformer-pipeline.

AWS
EXPERT
Alex_T
answered 2 years ago
