How can I send multiple pages to Textract using Python?


I want to send a complete PDF document to Textract. Some pages have tables and some don't. How can I send the entire document so that only the tables are extracted and CSVs are generated? (I already know how to generate the CSVs.)

Also, is the price charged for every page sent to the Textract API, or only for pages that contain tables? If I send a 10-page document and only pages 6 and 8 have tables, it doesn't make sense to pay for the remaining pages. Is there an alternative way to identify whether a page contains a table?

asked 4 months ago, 184 views
1 Answer

Amazon Textract charges per page processed, regardless of whether tables are present.
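For context, here is a minimal sketch of how a whole multi-page PDF might be submitted through boto3, assuming the file has already been uploaded to S3. It uses the asynchronous `start_document_analysis` / `get_document_analysis` calls with `FeatureTypes=["TABLES"]`. The function names, the default region, and the polling interval are illustrative choices of mine, not something from the question. Every page in the submitted PDF is analyzed, which is why every page is billed.

```python
import time


def start_table_analysis(bucket, key, region="us-east-1"):
    """Start an asynchronous Textract job for a PDF stored in S3 (names are placeholders)."""
    import boto3  # imported lazily so the pure helper below works without boto3 installed

    client = boto3.client("textract", region_name=region)
    response = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES"],  # request table analysis only
    )
    return client, response["JobId"]


def wait_for_blocks(client, job_id, poll_seconds=5):
    """Poll until the job leaves IN_PROGRESS, then gather blocks from all result pages."""
    while True:
        result = client.get_document_analysis(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job ended with status {result['JobStatus']}")

    blocks = list(result.get("Blocks", []))
    # Large results are paginated; follow NextToken until it is absent.
    while result.get("NextToken"):
        result = client.get_document_analysis(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result.get("Blocks", []))
    return blocks


def pages_with_tables(blocks):
    """Return the sorted page numbers that contain at least one TABLE block."""
    return sorted({b.get("Page", 1) for b in blocks if b["BlockType"] == "TABLE"})
```

With a 10-page document like the one in the question, `pages_with_tables(blocks)` would tell you after the fact which pages produced tables, but all 10 pages would already have been processed.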

To minimize costs, you could implement a pre-processing step to identify which pages contain tables before sending them to Textract.

A library like PyMuPDF can be used to scan through the pages and flag likely tables based on patterns such as the presence of multiple rows and columns.

Pass the text of each page through a function like this in a loop:

def detect_table(text):
    lines = text.split('\n')
    # Simplistic example: treat lines with more than two spaces as table rows
    table_lines = [line for line in lines if line.count(' ') > 2]
    return len(table_lines) > 3
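As a rough sketch of what that loop could look like with PyMuPDF: the code below reads the text of each page, keeps the pages the heuristic flags, and writes them to a smaller PDF that could then be sent to Textract. `candidate_pages` and `write_candidate_pdf` are hypothetical helper names, and the file paths are placeholders.

```python
def detect_table(text):
    """Same simplistic heuristic as above: more than 3 lines with more than 2 spaces each."""
    lines = text.split("\n")
    table_lines = [line for line in lines if line.count(" ") > 2]
    return len(table_lines) > 3


def candidate_pages(page_texts):
    """Return the 0-based indices of pages whose text looks table-like."""
    return [i for i, text in enumerate(page_texts) if detect_table(text)]


def write_candidate_pdf(src_path, dst_path):
    """Copy only the flagged pages of src_path into dst_path (paths are placeholders)."""
    import fitz  # PyMuPDF; imported lazily so the helpers above need no extra install

    with fitz.open(src_path) as doc:
        keep = candidate_pages([page.get_text() for page in doc])
        if keep:
            doc.select(keep)    # keep only the flagged pages in the in-memory document
            doc.save(dst_path)  # write the reduced PDF to a new file
    return keep
```

Keep in mind that this heuristic is very loose: an ordinary paragraph with several lines of four or more words also passes it, so expect false positives. Recent PyMuPDF releases also offer `page.find_tables()`, which may be a better detector to plug in, though it is worth checking against your own documents. Since the page numbers in the reduced PDF no longer match the original, keep the returned `keep` list if you need to map results back.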

answered 4 months ago
