Why does Textract miss some data in PDFs?

When running this file through Textract, some of the text is not in the JSON output. The missing text is: No. 2367088 Invoice

Why is this missing and how can we capture this text?

Thank you!

  • Any additional thoughts or feedback on this? It is still an open issue.

asked 2 years ago · 725 views
1 Answer

The default Textract features detect attributes and entities from tables, forms, or your queries. In your case, it seems the invoice content was not properly extracted by the default model. Alternatively, you could try one of the following:

  1. Build an Amazon Comprehend custom model to extract the content, e.g. https://aws.amazon.com/blogs/machine-learning/part-2-intelligent-document-processing-with-aws-ai-services/

  2. If your invoice format and size are consistent, you may be able to extract the invoice information based on the bounding box position. Code example: https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/09-forms-redaction.py

  3. Last, and not the best suggestion: format the invoice as a table or form.
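
As a rough illustration of option 2, here is a minimal sketch that picks LINE blocks out of a Textract response by bounding box position. It assumes the text was detected at all, since it only filters what is already in the JSON. The function name `lines_in_region` and the sample response are hypothetical; the `Blocks`, `BlockType`, `Text`, and `Geometry.BoundingBox` fields follow the shape of a Textract response, where coordinates are ratios of the page size.

```python
from typing import Any, Dict, List


def lines_in_region(response: Dict[str, Any], left: float, top: float,
                    right: float, bottom: float) -> List[str]:
    """Return the text of LINE blocks whose box centre lies inside the region.

    Region limits are ratios of page width/height (0.0 to 1.0).
    """
    found = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, and other block types
        box = block["Geometry"]["BoundingBox"]
        centre_x = box["Left"] + box["Width"] / 2
        centre_y = box["Top"] + box["Height"] / 2
        if left <= centre_x <= right and top <= centre_y <= bottom:
            found.append(block["Text"])
    return found


# Hypothetical, hand-written response fragment (not real Textract output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE",
         "Geometry": {"BoundingBox": {"Left": 0.0, "Top": 0.0,
                                      "Width": 1.0, "Height": 1.0}}},
        {"BlockType": "LINE", "Text": "Invoice header text",
         "Geometry": {"BoundingBox": {"Left": 0.7, "Top": 0.05,
                                      "Width": 0.2, "Height": 0.04}}},
        {"BlockType": "LINE", "Text": "Total 100.00",
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.5,
                                      "Width": 0.3, "Height": 0.05}}},
    ]
}

# Look only at the top-right corner of the page.
header_lines = lines_in_region(sample_response, 0.5, 0.0, 1.0, 0.2)
print(header_lines)
```

The region limits would need to be tuned to your own invoice layout, so this approach is only practical when every invoice places the field in roughly the same spot.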

AWS
Jady
answered 2 years ago
  • Hi, thank you for your reply. My understanding is that the options you propose assume the data I'm looking for appears somewhere in the JSON output from Textract. In my case, the data does not appear at all. Even if you run the file through the Textract front end GUI, you will see that the data in question isn't extracted, not even in the raw text.

    Having the invoice sent in another format isn't an option for us as it comes from an external source.
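
To confirm which text is absent from the raw output, a small check along these lines may help. It is only a sketch: the helper name `find_missing_phrases` and the sample response are hypothetical, and it assumes the saved JSON has the usual `Blocks` list with `LINE` entries carrying a `Text` field.

```python
from typing import Any, Dict, List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so spacing differences do not matter."""
    return " ".join(text.lower().split())


def find_missing_phrases(response: Dict[str, Any],
                         expected_phrases: List[str]) -> List[str]:
    """Return the expected phrases that do not occur in any detected LINE text."""
    detected = " ".join(
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    )
    haystack = _normalize(detected)
    return [p for p in expected_phrases if _normalize(p) not in haystack]


# Hypothetical, hand-written response fragment (not real Textract output).
sample_output = {
    "Blocks": [
        {"BlockType": "LINE", "Text": "Bill To: Example Ltd"},
        {"BlockType": "LINE", "Text": "Total 100.00"},
    ]
}

missing = find_missing_phrases(
    sample_output, ["Total 100.00", "No. 2367088 Invoice"]
)
print(missing)
```

A phrase reported here was never detected as text, so filtering by bounding box or post-processing the JSON cannot recover it.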
