Textract/Textractor - Separating table and non-table data


I am extracting data from documents that include tables and other text that is not in table format (the documents do not include figures). I would like to separate table data from non-table data because my postprocessing is different for table and non-table data. I am working in Python and using AnalyzeDocument with the TABLE and LAYOUT FeatureTypes to extract the data. However, the LAYOUT data includes the text from the TABLE, which makes it difficult to separate out the non-table data. Can you suggest a way to separate the table data/text from the non-table data/text? Can it be done using FeatureTypes, or does it need to be done at the BLOCK level? Can you point me to any sample code?

asked a month ago, 151 views
1 Answer

Hello, good afternoon,

Thank you for your question. AWS Samples publishes a library that can help: Amazon Textract Textractor. Link: https://github.com/aws-samples/amazon-textract-textractor?tab=readme-ov-file

It has the following submodules:

  • amazon-textract-caller (to simplify calling Amazon Textract without additional dependencies)
  • amazon-textract-response-parser (to parse the JSON response returned by Textract APIs)
  • amazon-textract-overlayer (to draw bounding boxes around the document entities on the document image)
  • amazon-textract-prettyprinter (to convert the Amazon Textract response to CSV, text, markdown, ...)
  • amazon-textract-geofinder (to extract specific information from a document with methods that help navigate the document using geometry and relations, e.g. hierarchical key/value pairs)

You can probably use amazon-textract-response-parser to separate out the non-table data. See: https://pypi.org/project/amazon-textract-response-parser/
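If the parser does not expose this split directly, one possible approach is to do it at the Block level on the raw AnalyzeDocument response. The sketch below is not taken from the Textractor samples. It assumes the usual Textract relationships (TABLE blocks list their CELL blocks as CHILD, and CELL and LINE blocks list their WORD blocks as CHILD), and the function name split_table_and_non_table_lines is a hypothetical helper introduced here for illustration. A LINE is treated as table text if any of its words also belongs to a table cell.

```python
def split_table_and_non_table_lines(blocks):
    """Split LINE text into (table_lines, other_lines).

    `blocks` is the "Blocks" list from an AnalyzeDocument response.
    Hypothetical helper: assumes TABLE -> CELL -> WORD and LINE -> WORD
    are all linked through CHILD relationships.
    """
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Relationships is absent on blocks that have no children.
        for rel in block.get("Relationships") or []:
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    # Collect the Id of every WORD that sits inside a table cell.
    table_word_ids = set()
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        for cell_id in child_ids(block):
            cell = by_id.get(cell_id)
            if cell is not None:
                table_word_ids.update(child_ids(cell))

    # Classify each LINE by whether it shares a WORD with a table cell.
    table_lines, other_lines = [], []
    for block in blocks:
        if block["BlockType"] != "LINE":
            continue
        if any(word_id in table_word_ids for word_id in child_ids(block)):
            table_lines.append(block["Text"])
        else:
            other_lines.append(block["Text"])
    return table_lines, other_lines
```

Usage would be along the lines of table_lines, other_lines = split_table_and_non_table_lines(response["Blocks"]), where response is the dict returned by AnalyzeDocument. Table titles and footers are not part of any cell, so under this sketch they would land in other_lines unless handled separately.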

Let me know if it helps.

Thank you.

AWS
answered a month ago
EXPERT
reviewed a month ago
  • Yes, thank you. I appreciate the help. I am familiar with the documentation and code samples. I have not come across anything yet that I recognized as a possible solution.
