Why does Textract miss some data in PDFs?

When running this file through Textract, some of the text is missing from the JSON output. The missing text is: No. 2367088 Invoice

Why is this missing and how can we capture this text?

Thank you!

  • Any additional thoughts or feedback on this? It is still an open issue.

asked 2 years ago, 761 views
1 Answer

By default, Textract detects attributes and entities from tables, forms, or your queries. In your case, it seems the invoice context was not properly extracted by the default model. As alternatives, you could try the following:

  1. Build an Amazon Comprehend custom model to extract the context, e.g. https://aws.amazon.com/blogs/machine-learning/part-2-intelligent-document-processing-with-aws-ai-services/

  2. If your invoice format and size are consistent, you may be able to extract the invoice information based on the bounding box position. Code example: https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/09-forms-redaction.py

  3. As a last resort, and not the best suggestion: format the invoice as a table or form.
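
Option 2 could be sketched roughly as follows. This is only a minimal illustration, assuming the Textract response has already been loaded as a Python dict with the usual `Blocks` list; the helper name `lines_in_region`, the region coordinates, and the sample data are all hypothetical and would need tuning to the real invoice layout. It can only return text that Textract actually detected.

```python
def lines_in_region(response, left, top, right, bottom):
    """Return the text of LINE blocks whose bounding box lies inside a region.

    Coordinates are ratios of the page width/height (0.0 to 1.0), matching
    the Geometry.BoundingBox values in a Textract response.
    """
    matches = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        box_right = box["Left"] + box["Width"]
        box_bottom = box["Top"] + box["Height"]
        # Keep the line only if it sits fully inside the requested region.
        if (box["Left"] >= left and box["Top"] >= top
                and box_right <= right and box_bottom <= bottom):
            matches.append(block["Text"])
    return matches


# Made-up data shaped like a Textract response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE",
         "Geometry": {"BoundingBox": {"Left": 0.0, "Top": 0.0, "Width": 1.0, "Height": 1.0}}},
        {"BlockType": "LINE", "Text": "Invoice",
         "Geometry": {"BoundingBox": {"Left": 0.70, "Top": 0.05, "Width": 0.20, "Height": 0.03}}},
        {"BlockType": "LINE", "Text": "Total due",
         "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.80, "Width": 0.20, "Height": 0.03}}},
    ]
}

# Assumed region: the top-right corner of the page, where an invoice number often sits.
top_right = lines_in_region(sample_response, left=0.5, top=0.0, right=1.0, bottom=0.2)
print(top_right)
```

In practice the dict would come from a Textract call or a saved JSON file rather than from hand-written sample data.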

AWS
Jady
answered 2 years ago
  • Hi, thank you for your reply. It's my understanding that the options you are proposing assume that the data I'm looking for appears somewhere in the JSON output from Textract. In my case, the data does not appear at all. Even if you run the file through the Textract console GUI, you will see that the data in question isn't extracted, not even in the raw text.

    Having the invoice sent in another format isn't an option for us as it comes from an external source.
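
One way to double-check this outside the console is to search every text-bearing block in the saved response. The following is a minimal sketch, assuming the JSON has been loaded into a Python dict; `find_text` is a hypothetical helper and the sample data is made up for illustration. An empty result across both LINE and WORD blocks would suggest the text was never detected at all, rather than detected and then left out of the forms, tables, or queries results.

```python
import re


def _normalize(text):
    """Collapse whitespace and lowercase so minor spacing differences don't hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_text(response, needle, block_types=("LINE", "WORD")):
    """Return (BlockType, Text, Confidence) for each block whose text contains `needle`."""
    target = _normalize(needle)
    hits = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") not in block_types:
            continue
        text = block.get("Text", "")
        if target in _normalize(text):
            hits.append((block["BlockType"], text, block.get("Confidence")))
    return hits


# Made-up data shaped like a Textract response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Bill To: ACME Corp", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "Bill", "Confidence": 99.3},
        {"BlockType": "WORD", "Text": "To:", "Confidence": 98.9},
    ]
}

missing = find_text(sample_response, "No. 2367088 Invoice")
present = find_text(sample_response, "bill  to")
print(missing)
print(present)
```

Searching for a shorter fragment, such as only the number, may also be worth trying in case the text was detected but split across several blocks.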
