Using Textract on multi-column PDF files


I am using the code below, which I took from the example at https://aws.amazon.com/pt/blogs/machine-learning/automatically-extract-text-and-structured-data-from-documents-with-amazon-textract/. The example only covers a two-column page. If my file has four columns, for example, I change the division by 2 in the code and it works. How can I detect the number of columns automatically, or otherwise avoid this manual input? In short, I want to use this code for PDF files that have more than two columns. How can I do that?

import boto3
# Document
s3BucketName = "amazon-textract-public-content"
documentName = "blogs/two-column-image.jpg"

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

#print(response)

# Detect columns and print lines
columns = []
lines = []
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        column_found=False
        for index, column in enumerate(columns):
            bbox_left = item["Geometry"]["BoundingBox"]["Left"]
            bbox_right = item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]
            bbox_centre = item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]/2
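            # Midpoint of the column; the original left + right/2 lands outside any column not at the left edge
            column_centre = (column['left'] + column['right']) / 2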

            if (bbox_centre > column['left'] and bbox_centre < column['right']) or (column_centre > bbox_left and column_centre < bbox_right):
                #Bbox appears inside the column
                lines.append([index, item["Text"]])
                column_found=True
                break
        if not column_found:
            columns.append({'left':item["Geometry"]["BoundingBox"]["Left"], 'right':item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]})
            lines.append([len(columns)-1, item["Text"]])

lines.sort(key=lambda x: x[0])
for line in lines:
    print(line[1])
1 Answer

You may want to try the Amazon Textract Response Parser for this. Note in particular that the JavaScript/TypeScript library's getLineClustersInReadingOrder() implementation is very different from the Python library's getLinesInReadingOrder().

From a very biased (author's) perspective I would argue that the JS library's current heuristic is better. You can see a couple of example images it's tested against in the code repository - and I'd suggest it's well worth trying out if you're able to consume components in JS or TS as well as Python.

But ultimately, all these methods are rule-based heuristics and none is perfect: often, what you gain in performance on some use cases you lose in code maintainability and in weird, counter-intuitive errors on others. At the extreme, many complex layouts, such as posters or advertisements with highly variable text, challenge the very idea that there's "one correct reading order" for content on a page.
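
To make "rule-based heuristic" concrete, here is a minimal sketch of one such rule that does not need the column count as input. It is not the blog post's code and not what the Response Parser does: it merges the horizontal extents of all LINE blocks into non-overlapping intervals, treats each interval as a column, and reads the columns left to right, top to bottom. The function names detect_columns, lines_in_column_order and make_line, and the sample coordinates, are made up for illustration. It assumes a single page with clear gutters between columns; a full-width heading or footer would bridge the gutters and collapse everything into one column, which is exactly the kind of counter-intuitive failure mentioned above.

```python
from typing import Dict, List


def detect_columns(blocks: List[Dict]) -> List[List[float]]:
    """Merge the horizontal extents of LINE blocks into column intervals."""
    spans = sorted(
        (b["Geometry"]["BoundingBox"]["Left"],
         b["Geometry"]["BoundingBox"]["Left"] + b["Geometry"]["BoundingBox"]["Width"])
        for b in blocks
        if b.get("BlockType") == "LINE"
    )
    columns: List[List[float]] = []
    for left, right in spans:
        if columns and left <= columns[-1][1]:
            # Overlaps the current column: widen it if needed.
            columns[-1][1] = max(columns[-1][1], right)
        else:
            # A gap (gutter) before this span: start a new column.
            columns.append([left, right])
    return columns


def lines_in_column_order(blocks: List[Dict]) -> List[str]:
    """Return line texts column by column (left to right), top to bottom."""
    columns = detect_columns(blocks)

    def sort_key(block: Dict):
        box = block["Geometry"]["BoundingBox"]
        centre = box["Left"] + box["Width"] / 2
        # Every line lies inside exactly one merged interval.
        index = next(i for i, (l, r) in enumerate(columns) if l <= centre <= r)
        return (index, box["Top"])

    lines = [b for b in blocks if b.get("BlockType") == "LINE"]
    return [b["Text"] for b in sorted(lines, key=sort_key)]


def make_line(text: str, left: float, top: float, width: float) -> Dict:
    """Build a hypothetical block shaped like a Textract LINE block."""
    return {
        "BlockType": "LINE",
        "Text": text,
        "Geometry": {"BoundingBox": {"Left": left, "Top": top,
                                     "Width": width, "Height": 0.02}},
    }


# Synthetic three-column page, listed row by row.
sample_blocks = [
    {"BlockType": "PAGE"},
    make_line("col1 row1", 0.05, 0.10, 0.25),
    make_line("col2 row1", 0.37, 0.10, 0.25),
    make_line("col3 row1", 0.70, 0.10, 0.25),
    make_line("col1 row2", 0.05, 0.15, 0.20),
    make_line("col2 row2", 0.37, 0.15, 0.23),
    make_line("col3 row2", 0.70, 0.15, 0.20),
]

for text in lines_in_column_order(sample_blocks):
    print(text)
```

With a real response, you would pass response["Blocks"] in place of sample_blocks. For multi-page results you would probably want to group blocks by page first, since bounding boxes are relative to each page.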

I'd suggest going with the simplest method that works well enough for your actual documents, and revisiting why you're trying to extract this columnar structure in the first place, in case there are better options.

AWS
EXPERT
answered 2 years ago
  • Hi, thanks for your reply! Basically, I need to extract all the text from several PDF files and save it in a structured way. The pages in these files vary from 1 to 5 columns, though the average is 2 columns.

  • In this code, my big problem is that the number of columns varies, so the division by 2 would need to be /2, /3, /4 or /5 depending on the page.
