Amazon Textract charges per page processed, regardless of whether tables are present.
To minimize costs, you could implement a pre-processing step to identify which pages contain tables before sending them to Textract.
A library like PyMuPDF can be used to scan through the pages and flag likely tables based on patterns such as the presence of multiple rows and columns.
Pass each page's text through a loop using a function like this:
def detect_table(text):
    lines = text.split('\n')
    # Simplistic example
    table_lines = [line for line in lines if line.count(' ') > 2]
    return len(table_lines) > 3
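To show how the loop might fit together, here is a minimal sketch that reuses the heuristic above. The helper names `pages_with_tables` and `extract_page_texts` are hypothetical, and it assumes PyMuPDF is installed and that the PDF has a text layer (scanned pages return little or no text, so they would need a different check).

```python
from typing import Iterable, List


def detect_table(text: str) -> bool:
    """Crude heuristic from the answer: more than 3 lines with more than 2 spaces each."""
    lines = text.split('\n')
    table_lines = [line for line in lines if line.count(' ') > 2]
    return len(table_lines) > 3


def pages_with_tables(page_texts: Iterable[str]) -> List[int]:
    """Return 0-based indexes of pages whose text passes detect_table (hypothetical helper)."""
    return [i for i, text in enumerate(page_texts) if detect_table(text)]


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read the plain text of every page with PyMuPDF (hypothetical helper)."""
    import fitz  # PyMuPDF, imported lazily; install with `pip install pymupdf`
    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


# Example usage (the file name is a placeholder):
# candidates = pages_with_tables(extract_page_texts("report.pdf"))
# Only the pages listed in `candidates` would then be sent to Textract.
```

Because the heuristic only counts spaces, it will also flag ordinary paragraphs, so treat the result as a rough first filter and tune the thresholds on your own documents.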
answered 4 months ago