Why does Textract miss some data in PDFs?


When I run this file through Textract, some of the text is missing from the JSON output. The missing text is: No. 2367088 Invoice

Why is this missing and how can we capture this text?

Thank you!

  • Any additional thoughts or feedback on this? It is still an open issue.

asked 2 years ago, 760 views
1 Answer

By default, Textract detects attributes and entities from tables, forms, or your queries. In your case, it seems the invoice content was not properly extracted by the default model. Alternatively, you could try one of the following:

  1. Build an Amazon Comprehend custom model to extract the content, e.g. https://aws.amazon.com/blogs/machine-learning/part-2-intelligent-document-processing-with-aws-ai-services/

  2. If your invoice format and size are consistent, you may be able to extract the invoice information based on its bounding box position. Code example: https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/09-forms-redaction.py

  3. Last, and not the best suggestion: have the invoice formatted as a table or form.
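To illustrate the bounding-box idea in option 2, here is a minimal sketch that picks out the LINE blocks whose centre falls inside a fixed region of the page. It assumes you already have a Textract response as a Python dict (for example from boto3's `detect_document_text` or `analyze_document`). The function name `lines_in_region`, the region coordinates, and the sample response are hypothetical and hand-written for illustration; they are not taken from the linked sample. Note that this only helps if Textract returned the text in the first place.

```python
def lines_in_region(response, left, top, right, bottom):
    """Return the text of LINE blocks whose bounding-box centre lies in a region.

    All coordinates are ratios of the page width/height (0.0 to 1.0),
    matching the Geometry.BoundingBox values in a Textract response.
    """
    hits = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        centre_x = box["Left"] + box["Width"] / 2
        centre_y = box["Top"] + box["Height"] / 2
        if left <= centre_x <= right and top <= centre_y <= bottom:
            hits.append(block["Text"])
    return hits


# Hand-written fragment shaped like a Textract response (illustrative only).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE",
         "Geometry": {"BoundingBox": {"Left": 0.0, "Top": 0.0,
                                      "Width": 1.0, "Height": 1.0}}},
        {"BlockType": "LINE", "Text": "Invoice No. 0000",
         "Geometry": {"BoundingBox": {"Left": 0.70, "Top": 0.05,
                                      "Width": 0.20, "Height": 0.03}}},
        {"BlockType": "LINE", "Text": "Total: 100.00",
         "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.80,
                                      "Width": 0.30, "Height": 0.03}}},
    ]
}

# Assumed region: the top-right corner of the page, where the header sits.
header_lines = lines_in_region(sample_response, 0.5, 0.0, 1.0, 0.2)
print(header_lines)
```

The region would need to be tuned to the actual invoice layout, and a centre-point test is only one possible choice; an overlap test may suit layouts where lines span the region boundary.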

AWS
Jady
answered 2 years ago
  • Hi, thank you for your reply. As I understand it, the options you propose assume that the data I'm looking for appears somewhere in the JSON output from Textract. In my case, the data does not appear at all. Even if you run the file through the Textract console GUI, you will see that the data in question isn't extracted, not even in the raw text.

    Having the invoice sent in another format isn't an option for us as it comes from an external source.
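One way to confirm that the text is absent from the output, rather than present but not classified as a form or table field, is to search the LINE blocks of the response directly. The sketch below assumes the response is available as a Python dict (for example from boto3's `detect_document_text`). The helper names and the sample data are hypothetical.

```python
import re


def normalize(text):
    """Collapse runs of whitespace and lowercase, so spacing differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_in_raw_text(response, needle):
    """Return every LINE block text that contains `needle` (case-insensitive)."""
    wanted = normalize(needle)
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]
    return [line for line in lines if wanted in normalize(line)]


# Hand-written fragment shaped like a Textract response (illustrative only).
demo_response = {
    "Blocks": [
        {"BlockType": "LINE", "Text": "Invoice  Date"},
        {"BlockType": "LINE", "Text": "Total: 100.00"},
        {"BlockType": "WORD", "Text": "Total:"},
    ]
}

print(find_in_raw_text(demo_response, "invoice date"))
print(find_in_raw_text(demo_response, "No. 2367088"))
```

An empty result for the second search would indicate that the text was never detected, in which case post-processing of the JSON cannot recover it.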
