Textract can't read wrapped (multi-line) cell values in invoice tables

I have a business requirement to read the values of a table in an invoice, but Textract only reads the text line by line and does not return the complete content of a cell. How can I solve this? For example, the cell contains COMPANY, but Textract returns COMPAN and Y.

Ken lu
asked 3 months ago, 139 views
1 Answer
Textract may not always interpret line breaks correctly, which can leave words incomplete or split in two. Here are some suggestions to address this issue.

Before sending the document to Textract, consider preprocessing the image or document to improve the OCR (Optical Character Recognition) results: enhance the image quality, remove unnecessary elements, and make sure the text is well aligned.

Here's a basic example of post-processing in Python to address incomplete words caused by line breaks:

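import re

def post_process_textract_result(result):
    """Rejoin words split across a line break in each LINE block, in place."""
    for block in result['Blocks']:
        if block['BlockType'] == 'LINE':
            text = block['Text']
            # Drop an optional hyphen plus the line break inside a word,
            # so "COMPAN\nY" becomes "COMPANY" (no space is inserted)
            text = re.sub(r'(\S)-?\n(\S)', r'\1\2', text)
            block['Text'] = text
    return result

# Example usage: textract_result is assumed to be the response dict
# returned by a Textract call such as analyze_document
post_process_textract_result(textract_result)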

This is just a simple example, and you may need to adapt it to your specific use case and the structure of your documents. If the problem persists and is critical for your business, it's advisable to reach out to AWS Support. They can provide specific guidance and assistance based on your use case.
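As one possible adaptation for table cells, here is a minimal sketch that rebuilds each cell's text from its WORD children instead of relying on LINE blocks. It assumes the blocks come from an `analyze_document` call made with `FeatureTypes=["TABLES"]`, so that CELL blocks with CHILD relationships are present. The function name `table_cell_texts`, the `wrap_joiner` parameter, and the "more than half a word height lower" rule for detecting a wrapped line are illustrative assumptions, not part of the Textract API, and the sample blocks are made up for the demonstration.

```python
def table_cell_texts(blocks, wrap_joiner=""):
    """Return {(row, column): text} for every CELL block.

    blocks      -- the 'Blocks' list of a Textract response (assumed to
                   include CELL and WORD blocks from a TABLES analysis)
    wrap_joiner -- hypothetical knob: string placed between words that sit
                   on different visual lines of the same cell
    """
    by_id = {b["Id"]: b for b in blocks}
    cells = {}
    for block in blocks:
        if block["BlockType"] != "CELL":
            continue
        # WORD children of this cell, in the order Textract lists them.
        words = [
            by_id[child_id]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
            if by_id.get(child_id, {}).get("BlockType") == "WORD"
        ]
        text = ""
        prev_box = None
        for word in words:
            box = word["Geometry"]["BoundingBox"]
            if prev_box is None:
                sep = ""
            elif box["Top"] - prev_box["Top"] > prev_box["Height"] / 2:
                sep = wrap_joiner  # heuristic: word starts a new line in the cell
            else:
                sep = " "          # same visual line: ordinary word gap
            text += sep + word["Text"]
            prev_box = box
        cells[(block["RowIndex"], block["ColumnIndex"])] = text
    return cells


def _word(block_id, text, top, left):
    """Build a made-up WORD block for the demonstration below."""
    box = {"Top": top, "Left": left, "Width": 0.05, "Height": 0.02}
    return {"Id": block_id, "BlockType": "WORD", "Text": text,
            "Geometry": {"BoundingBox": box}}


# Made-up sample: one cell whose content wraps in the middle of "COMPANY".
sample_blocks = [
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2", "w3"]}]},
    _word("w1", "ACME", 0.10, 0.10),
    _word("w2", "COMPAN", 0.10, 0.16),
    _word("w3", "Y", 0.13, 0.10),
]

print(table_cell_texts(sample_blocks))  # {(1, 1): 'ACME COMPANY'}
```

Note that `wrap_joiner=""` also glues together two separate words when the cell happens to wrap at a word boundary, so pass `wrap_joiner=" "` for columns where mid-word wrapping does not occur.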

I hope this clarifies things. If it does, I would appreciate it if you accepted the answer so that the community can benefit. Thanks ;)

EXPERT
answered 3 months ago
