How to accurately extract values from PDF

I was trying to extract the invoice number from a PDF file using Amazon Textract (Analyze Expense). When I uploaded the PDF and analyzed it, Textract returned an UnsupportedDocumentException error. I then converted the PDF to an image and analyzed that instead. That worked, but it returned 1424041161 instead of I424041161.

It returned #1 instead of #I.

Sorry, I cannot share the PDF file because it is private.

Update: I got the UnsupportedDocumentException error because the PDF has more than one page. So how can I analyze a PDF that has more than one page, and how can I get #I returned instead of #1?
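For the multi-page part, one route is Textract's asynchronous expense API, which, as far as I know, accepts multi-page PDFs stored in S3, whereas the synchronous Analyze Expense call is limited to single-page documents. Below is a minimal Python sketch of that flow, not a definitive implementation. The function name, the polling parameters, and the bucket/key arguments are assumptions for illustration. `client` is expected to behave like `boto3.client("textract")`, and the same operations (StartExpenseAnalysis, GetExpenseAnalysis) should be available from the .NET SDK.

```python
import time


def analyze_expense_multipage(client, bucket, key, poll_seconds=5.0, max_polls=60):
    """Run asynchronous expense analysis on a PDF stored in S3.

    `client` is assumed to behave like boto3.client("textract").
    Returns the ExpenseDocuments collected across all result pages.
    """
    job = client.start_expense_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state (or we give up).
    for _ in range(max_polls):
        result = client.get_expense_analysis(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if result["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended as {result['JobStatus']}")

    # Results can be paginated; follow NextToken until it is absent.
    documents = list(result.get("ExpenseDocuments", []))
    while result.get("NextToken"):
        result = client.get_expense_analysis(
            JobId=job_id, NextToken=result["NextToken"]
        )
        documents.extend(result.get("ExpenseDocuments", []))
    return documents
```

Polling keeps the sketch short; for production use, a completion notification through SNS is probably a better fit than a sleep loop.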

  • Hi Thanh - I'm assuming you were using Amazon Textract to do this? You might need to implement some text corrections in your C# code if, for example, the invoice number always starts with "I" (the letter I). In that case, you could just check if (invoiceNumber.StartsWith("1")) { // replace the leading "1" with "I" here }

  • Hi Kirk, first, thank you for your support. Yes, I'm using Amazon Textract (Analyze Expense). The invoice numbers sometimes start with "1" and sometimes with "I", not always "I", so we cannot use a replacement. I want to get the exact value from the PDF file.
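Since a blanket replacement is not safe here, one alternative is to flag suspicious invoice numbers for manual review instead of rewriting them. The Python sketch below reads the invoice ID field from an Analyze Expense response and marks it for review when the reported confidence is low or the first character is one that OCR commonly confuses. The function name, the 95.0 threshold, and the look-alike character set are assumptions for illustration. It also assumes the response layout `ExpenseDocuments[].SummaryFields[]` with the normalized type `INVOICE_RECEIPT_ID`. Note that a misread character can still come back with high confidence, so this reduces the risk rather than removing it.

```python
LOOKALIKE_CHARS = set("1IlO0")  # assumption: glyphs that OCR often mixes up


def invoice_id_candidates(expense_documents, min_confidence=95.0):
    """Return (text, confidence, needs_review) for each invoice-ID summary field."""
    found = []
    for doc in expense_documents:
        for field in doc.get("SummaryFields", []):
            # Skip everything except the normalized invoice/receipt ID field.
            if field.get("Type", {}).get("Text") != "INVOICE_RECEIPT_ID":
                continue
            value = field.get("ValueDetection", {})
            text = value.get("Text", "")
            confidence = value.get("Confidence", 0.0)
            # Flag low-confidence values and values whose first character
            # is an easily confused glyph (the case described in this thread).
            needs_review = confidence < min_confidence or text[:1] in LOOKALIKE_CHARS
            found.append((text, confidence, needs_review))
    return found
```

Flagged values can then be checked by a person or validated against a known list of invoice numbers, if one exists.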

2 Answers
How to extract only specific text from PDF files using Python and store the output in particular columns of an Excel file.

Here is the sample input PDF file (File.pdf)

Link to the full PDF file File.pdf

We need to extract the values of Invoice Number, Due Date and Total Due from the whole PDF file.

Script I have used so far:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# Collect the extracted text of every page in an in-memory buffer.
output_string = StringIO()
with open('file.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process the pages one by one; the converter writes into output_string.
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

But I am not getting the specific output values from the PDF file.
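One way to pull the three values out of the extracted text is a small set of regular expressions, one per label. This is only a sketch under assumptions: the sample PDF is not shown, so the label wording ("Invoice Number", "Due Date", "Total Due") and the idea that each value follows its label are guesses, and the helper names are made up for illustration. It writes CSV using the standard library, which Excel opens with one field per column. A native .xlsx file would need a third-party library.

```python
import csv
import io
import re

# Hypothetical label patterns; adjust them to the wording in the real invoice.
PATTERNS = {
    "Invoice Number": re.compile(r"Invoice\s+Number\s*:?\s*(\S+)", re.IGNORECASE),
    "Due Date": re.compile(r"Due\s+Date\s*:?\s*([^\n]+)", re.IGNORECASE),
    "Total Due": re.compile(r"Total\s+Due\s*:?\s*([^\n]+)", re.IGNORECASE),
}


def extract_fields(text):
    """Return one value per label; labels that are not found map to ''."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1).strip() if match else ""
    return row


def rows_to_csv(rows):
    """Serialize rows to CSV text with one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(PATTERNS))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

With the script above, this could be called as `extract_fields(output_string.getvalue())`. If the extracted text places labels and values far apart, which can happen with table-like layouts, the patterns will need adjusting.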

EXPERT
answered 7 months ago
  • In this case, the person asking the question is asking about Amazon Textract, a managed OCR and text-extraction service. Additionally, he posted this question in the .NET section, so answers using Python are probably not very useful.

AWS
answered 7 months ago
  • The user who posted the question is already using Textract to extract the text from the document; their question is about Textract confusing the capital letter "I" with the digit "1".
