Why does Textract miss some data in PDFs?


When I run this file through Textract, some of the text is missing from the JSON output. The missing text is: No. 2367088 Invoice

Why is this missing and how can we capture this text?

Thank you!

  • Any additional thoughts or feedback on this? It is still an open issue.

Asked 2 years ago · Viewed 761 times
1 Answer

By default, Textract detects attributes and entities from tables, forms, or your queries. In your case, it seems the default model did not extract the invoice text properly. As alternatives, you could try the following:

  1. Build an Amazon Comprehend custom model to extract the context, e.g. https://aws.amazon.com/blogs/machine-learning/part-2-intelligent-document-processing-with-aws-ai-services/

  2. If your invoice format and size are consistent, you may be able to extract the invoice information based on its bounding box position. Code example: https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/09-forms-redaction.py

  3. Last, and not the best suggestion: format the invoice as a table or form.
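The bounding-box idea in option 2 could look roughly like the sketch below. It only helps if Textract returns the text as a LINE block in the first place. The helper name `lines_in_region`, the region coordinates, and the sample response are all made up for illustration; only the `Blocks`, `BlockType`, `Text`, and `Geometry.BoundingBox` fields follow the Textract response format, where box coordinates are ratios of the page width and height.

```python
from typing import List


def lines_in_region(response: dict, left: float, top: float,
                    right: float, bottom: float) -> List[str]:
    """Return the text of LINE blocks whose box centre lies inside the region.

    All coordinates are page ratios in the range 0..1, as in Textract output.
    """
    found = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        centre_x = box["Left"] + box["Width"] / 2
        centre_y = box["Top"] + box["Height"] / 2
        if left <= centre_x <= right and top <= centre_y <= bottom:
            found.append(block["Text"])
    return found


# Hypothetical, hand-written response fragment (not real Textract output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Geometry": {"BoundingBox": {
            "Left": 0.0, "Top": 0.0, "Width": 1.0, "Height": 1.0}}},
        {"BlockType": "LINE", "Text": "No. 2367088 Invoice",
         "Geometry": {"BoundingBox": {
             "Left": 0.6, "Top": 0.05, "Width": 0.3, "Height": 0.03}}},
        {"BlockType": "LINE", "Text": "Total",
         "Geometry": {"BoundingBox": {
             "Left": 0.1, "Top": 0.8, "Width": 0.1, "Height": 0.03}}},
    ]
}

# Assumed region: the top-right corner of the page.
header_lines = lines_in_region(sample_response,
                               left=0.5, top=0.0, right=1.0, bottom=0.15)
print(header_lines)
```

Matching on the box centre rather than the full box tolerates small shifts between scans, at the cost of occasionally picking up a line that only partly overlaps the region.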

AWS
Jady
Answered 2 years ago
  • Hi, thank you for your reply. As I understand it, the options you are proposing assume that the data I'm looking for appears somewhere in the JSON output from Textract. In my case, it does not appear at all. Even if you run the file through the Textract console, you will see that the data in question isn't extracted, not even in the raw text.

    Having the invoice sent in another format isn't an option for us as it comes from an external source.
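One way to confirm that the text is absent from the whole output, and not just from the forms and tables results, is to search the `Text` field of every block. This is a minimal sketch; the helper name `blocks_containing` and the sample response are hypothetical, and in practice the dictionary would be loaded from the saved Textract JSON (for example with `json.load`).

```python
from typing import List


def blocks_containing(response: dict, needle: str) -> List[dict]:
    """Return every block whose Text contains the needle, ignoring case."""
    needle = needle.lower()
    return [
        block for block in response.get("Blocks", [])
        if needle in block.get("Text", "").lower()
    ]


# Hypothetical, hand-written response fragment (not real Textract output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice date"},
        {"BlockType": "WORD", "Text": "Invoice"},
        {"BlockType": "WORD", "Text": "date"},
    ]
}

matches = blocks_containing(sample_response, "2367088")
if not matches:
    print("Not found in any block: the text was not detected at all")
else:
    for block in matches:
        print(block["BlockType"], block["Text"])
```

An empty result means the text was never detected, so post-processing the JSON cannot recover it.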
