You may like to try the Amazon Textract Response Parser for this, and note in particular that the JavaScript/TypeScript library's getLineClustersInReadingOrder() implementation is very different from the Python library's getLinesInReadingOrder().
From my very biased perspective as the author, I would argue that the JS library's current heuristic is better. You can see a couple of the example images it's tested against in the code repository, and I'd suggest it's well worth trying out if you're able to consume components in JS or TS as well as Python.
But ultimately, all these methods are rule-based heuristics and none is perfect: what you gain in performance on some use cases, you often lose in code maintainability and in weird, counter-intuitive errors on others. At the extreme, many complex layouts, like posters or advertisements with highly variable text, challenge or even break the idea that there's "one correct reading order" for content on a page.
I'd suggest going with the simplest method that works well enough for your actual documents, and also revisiting why you're trying to extract this columnar structure in the first place, in case there are better options:
- If you're just looking for a 1D text sequence to feed into a downstream NLP model, maybe revisit whether Textract Queries, Comprehend Native Document NER or other custom 2D layout-aware language models on SageMaker could solve that end task better?
- If you're just looking for a transcript of the document, maybe you could present it to end users in 2D format (using the word positions & boxes returned by Textract) instead of trying to reduce to a single sequence?
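As a rough illustration of the second option, here is a minimal sketch that places each line on a character grid according to its bounding box, instead of flattening the page to one sequence. It assumes the raw Textract response shape (blocks with BlockType, Text, and Geometry.BoundingBox values given as fractions of the page); the function name render_text_grid and the grid size are hypothetical choices for illustration, and overlapping lines will simply overwrite each other.

```python
def render_text_grid(blocks, cols=100, rows=60):
    """Place each LINE block's text on a cols x rows character grid.

    Assumes Textract-style blocks whose BoundingBox Left/Top values are
    fractions of the page width/height (0.0 to 1.0).
    """
    grid = [[" "] * cols for _ in range(rows)]
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        # Map the fractional position to a grid cell, clamped to the grid.
        row = min(int(box["Top"] * rows), rows - 1)
        col = min(int(box["Left"] * cols), cols - 1)
        for offset, char in enumerate(block["Text"]):
            if col + offset >= cols:
                break  # truncate text that would run off the right edge
            grid[row][col + offset] = char
    return "\n".join("".join(r).rstrip() for r in grid)
```

A real viewer would more likely draw the boxes over the page image, but even a plain-text grid like this keeps columns side by side without needing to decide on a reading order.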
Hi, thanks for your reply! Basically, I need to extract all the text from several PDF files and save it in a structured way. Within these pages the layout varies: sometimes there are anywhere from 1 to 5 columns and sometimes none, but the average is 2 columns.
In this code, my big problem is that the number of columns varies, so the division by 2 that it does would need to be /2, /3, /4 or /5 depending on the page.
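One possible way to avoid hard-coding the divisor is to infer the columns from the line geometry itself. The sketch below is only a simple heuristic under stated assumptions, not the Response Parser's method: it merges the horizontal extents of all LINE blocks into bands separated by at least min_gap of empty space, and treats each band as a column. The function name group_lines_into_columns and the min_gap default are hypothetical, and a heading that spans the full page width would merge everything into one column, so such lines would need to be filtered out first.

```python
def group_lines_into_columns(blocks, min_gap=0.02):
    """Group LINE blocks into columns without knowing the column count.

    Assumes Textract-style blocks with fractional BoundingBox coordinates.
    Returns a list of columns (left to right), each a list of line texts
    ordered top to bottom.
    """
    lines = [b for b in blocks if b.get("BlockType") == "LINE"]

    # Horizontal extent (left, right) of every line, sorted left to right.
    spans = sorted(
        (b["Geometry"]["BoundingBox"]["Left"],
         b["Geometry"]["BoundingBox"]["Left"] + b["Geometry"]["BoundingBox"]["Width"])
        for b in lines
    )

    # Merge extents that overlap or nearly touch; each merged band is a column.
    columns = []
    for left, right in spans:
        if columns and left <= columns[-1][1] + min_gap:
            columns[-1][1] = max(columns[-1][1], right)
        else:
            columns.append([left, right])

    # Assign each line to the band containing its horizontal centre.
    grouped = [[] for _ in columns]
    for b in lines:
        box = b["Geometry"]["BoundingBox"]
        center = box["Left"] + box["Width"] / 2
        for i, (left, right) in enumerate(columns):
            if left <= center <= right:
                grouped[i].append(b)
                break

    # Read each column top to bottom.
    for col in grouped:
        col.sort(key=lambda b: b["Geometry"]["BoundingBox"]["Top"])
    return [[b["Text"] for b in col] for col in grouped]
```

Called once per page, this returns one list per detected column, so a page with three columns yields three lists and a single-column page yields one. The min_gap value would need tuning against the actual documents.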