Why does Textract miss some data in PDFs?

When running this file through Textract, some of the text is missing from the JSON output. The missing text is: No. 2367088 Invoice

Why is this missing and how can we capture this text?

Thank you!

  • Any additional thoughts or feedback on this? It is still an open issue.

Asked 2 years ago, 761 views
1 Answer
By default, Textract detects attributes and entities from tables, forms, or your queries. In your case, it seems the invoice content was not properly extracted by the default model. As alternatives, you could try the following:

  1. Build an Amazon Comprehend custom model to extract the context, e.g. https://aws.amazon.com/blogs/machine-learning/part-2-intelligent-document-processing-with-aws-ai-services/

  2. If your invoice format and size are consistent, you may be able to extract the invoice information based on the bounding box position. Code example: https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/09-forms-redaction.py

  3. Last, and not the best suggestion: have the invoice formatted as a table or form.
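To illustrate option 2, here is a minimal sketch of filtering Textract output by bounding box position. It assumes a response dict shaped like the output of `detect_document_text` or `analyze_document` (a `Blocks` list whose `LINE` blocks carry `Text` and `Geometry.BoundingBox`). The helper name `lines_in_region`, the region coordinates, and the sample data are hypothetical and only for illustration. This approach can only return text that Textract actually detected; it will not recover text that is absent from the `Blocks` list.

```python
def lines_in_region(response, left, top, right, bottom):
    """Return the text of LINE blocks that lie fully inside a page region.

    Coordinates are ratios of the page width/height (0.0 to 1.0), the same
    convention Textract uses in Geometry.BoundingBox.
    """
    matches = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, and other block types
        box = block["Geometry"]["BoundingBox"]
        inside = (
            box["Left"] >= left
            and box["Top"] >= top
            and box["Left"] + box["Width"] <= right
            and box["Top"] + box["Height"] <= bottom
        )
        if inside:
            matches.append(block["Text"])
    return matches


# Hand-written sample shaped like a Textract response (not real output).
sample_response = {
    "Blocks": [
        {
            "BlockType": "LINE",
            "Text": "Invoice No. 12345",
            "Geometry": {"BoundingBox": {"Left": 0.6, "Top": 0.05, "Width": 0.3, "Height": 0.03}},
        },
        {
            "BlockType": "WORD",
            "Text": "Invoice",
            "Geometry": {"BoundingBox": {"Left": 0.6, "Top": 0.05, "Width": 0.1, "Height": 0.03}},
        },
        {
            "BlockType": "LINE",
            "Text": "Total due: 100.00",
            "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.5, "Width": 0.4, "Height": 0.03}},
        },
    ]
}

# Assumed region: the top-right corner of the page, where a header might sit.
header_text = lines_in_region(sample_response, left=0.5, top=0.0, right=1.0, bottom=0.15)
print(header_text)
```

With a real document, the region would have to be tuned to the invoice layout, and calling the function with the full page (0.0, 0.0, 1.0, 1.0) is a quick way to list every line Textract returned.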

AWS
Jady
Answered 2 years ago
  • Hi, thank you for your reply. It's my understanding that the options you are proposing assume that the data I'm looking for appears somewhere in the JSON output from Textract. In my case, the data does not appear. Even if you run the file through the Textract front-end GUI, you will see that the data in question isn't extracted at all, not even in the raw text.

    Having the invoice sent in another format isn't an option for us as it comes from an external source.
