Textract can't read values that wrap across lines in invoice tables


I have a business requirement to read table values from invoices, but Textract returns text line by line and does not return the complete content of a cell. How can I solve this? For example, a cell that should read COMPANY is returned as COMPAN and Y.

Ken lu
asked 4 months ago, 145 views
1 Answer

Textract may not always interpret line breaks correctly, which can leave words incomplete or split apart. Here are some suggestions to address this. Before sending the document to Textract, consider preprocessing the image or document to improve the OCR (optical character recognition) results: enhance the image quality, remove unnecessary elements, and make sure the text is well aligned.

Here's a basic example of post-processing in Python that addresses words left incomplete by line breaks:

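import re

def post_process_textract_result(result):
    """Rejoin words that were hyphenated across a line break (modifies result in place)."""
    for block in result['Blocks']:
        if block['BlockType'] == 'LINE':
            text = block['Text']
            # Drop a hyphen + line break inside a word so the two halves rejoin.
            # Only has an effect if the text actually contains such a break.
            text = re.sub(r'(\S)-\n(\S)', r'\1\2', text)
            block['Text'] = text

# Example usage (textract_result is the response returned by a Textract API call)
post_process_textract_result(textract_result)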

This is just a simple example, and you may need to adapt it to your specific use case and the structure of your documents. If the problem persists and is critical for your business, it's advisable to reach out to AWS Support. They can provide guidance specific to your use case.
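One possible adaptation, if the invoice is analyzed with the TABLES feature of AnalyzeDocument: the response then also contains CELL blocks whose CHILD relationships list the WORD blocks inside each cell, so the full cell content can be rebuilt from those rather than from LINE blocks. The following is only a minimal sketch under that assumption. The names `cell_text`, `table_cells`, and `line_joiner` are hypothetical helpers, not part of the Textract API, and the "half a word height" rule for detecting a wrapped line is a heuristic you would likely need to tune.

```python
def cell_text(cell, blocks_by_id, line_joiner=" "):
    """Return the text of one CELL block, merging its wrapped lines.

    Words on the same visual line are joined with a space; separate
    lines are joined with `line_joiner` (hypothetical parameter).
    """
    lines = []  # each entry is the list of words on one visual line
    last_top = None
    last_height = None
    for rel in cell.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            word = blocks_by_id[child_id]
            if word["BlockType"] != "WORD":
                continue
            box = word["Geometry"]["BoundingBox"]
            # Heuristic: a word that sits more than half a word height
            # lower than the previous word starts a new line.
            if last_top is None or box["Top"] - last_top > 0.5 * last_height:
                lines.append([])
            lines[-1].append(word["Text"])
            last_top, last_height = box["Top"], box["Height"]
    return line_joiner.join(" ".join(words) for words in lines)


def table_cells(response, line_joiner=" "):
    """Map (RowIndex, ColumnIndex) to the merged text of every CELL block."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    return {
        (b["RowIndex"], b["ColumnIndex"]): cell_text(b, blocks_by_id, line_joiner)
        for b in response["Blocks"]
        if b["BlockType"] == "CELL"
    }
```

Calling `table_cells(response, line_joiner="")` glues wrapped lines together without a space, which suits a word split mid-way such as COMPAN / Y. The default of a single space is safer when a cell wraps between whole words, so the right choice depends on how your invoices break text.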

I hope this clarifies things. If it does, I would appreciate it if you accepted the answer so that the community can benefit. Thanks ;)

Expert
answered 4 months ago
