
How to accurately extract values from PDF

0

I was trying to extract the invoice number from a PDF file using Amazon Textract (Analyze Expense). When I uploaded the PDF and analyzed it, the call failed with UnsupportedDocumentException. I then converted the PDF to an image and analyzed that instead. This worked, but it returned 1424041161 instead of I424041161.

It returned #1 instead of #I.

Sorry, I cannot share the PDF file because it is private.

Update: I got the UnsupportedDocumentException error because the PDF has more than one page. So how do I analyze a PDF that has more than one page, and how do I get #I returned instead of #1?
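On the multi-page part of the question: as far as I know, the synchronous Textract operations accept only single-page PDFs, while the asynchronous pair StartExpenseAnalysis and GetExpenseAnalysis handles multi-page documents stored in S3. Below is a minimal Python sketch of the polling side, assuming a boto3 Textract client; the AWS SDK for .NET exposes the same two operations, so the flow should carry over to C#. The helper name `collect_expense_documents`, the polling interval, and the bucket and object names in the usage comment are illustrative assumptions, not taken from the thread.

```python
import time


def collect_expense_documents(client, job_id, poll_seconds=5.0, max_polls=60):
    """Poll an asynchronous expense-analysis job and gather all result pages.

    `client` is assumed to behave like boto3.client("textract").
    """
    # Wait until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        response = client.get_expense_analysis(JobId=job_id)
        if response["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Job {job_id} did not finish in time")

    if response["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Job {job_id} ended with status {response['JobStatus']}")

    # Results can be paginated; follow NextToken until it is absent.
    documents = list(response.get("ExpenseDocuments", []))
    while response.get("NextToken"):
        response = client.get_expense_analysis(
            JobId=job_id, NextToken=response["NextToken"]
        )
        documents.extend(response.get("ExpenseDocuments", []))
    return documents


# Usage sketch (bucket and object names are hypothetical):
#   import boto3
#   client = boto3.client("textract")
#   job = client.start_expense_analysis(
#       DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "invoice.pdf"}}
#   )
#   documents = collect_expense_documents(client, job["JobId"])
```

The client is passed in as a parameter so the polling logic can be exercised with a stub before it is pointed at a real account.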

  • Hi Thanh - I'm assuming you were using Amazon Textract to do this? You might need to implement some text corrections in your C# code if, for example, the invoice number always starts with "I" (the letter I). In that case, you could just check if (invoiceNumber.StartsWith("1")) { /* replace the leading "1" with "I" here */ }

  • Hi Kirk, first, thank you for your support. Yes, I'm using Amazon Textract (Analyze Expense). The invoice numbers sometimes start with "1" and sometimes with "I", not always "I", so we cannot use a fixed replacement. I want to get the exact value from the PDF file.
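Since a fixed replacement is ruled out, one option is to look at the confidence score that Analyze Expense returns alongside each detected value and route doubtful invoice numbers to a manual check. The sketch below is in Python rather than C# and works on the parsed response as a plain dictionary. It assumes the usual response layout (`ExpenseDocuments`, `SummaryFields`, `Type.Text`, `ValueDetection.Text`, `ValueDetection.Confidence`) and the `INVOICE_RECEIPT_ID` field type. The function name, the 95.0 threshold, and the set of look-alike characters are assumptions chosen for illustration. It does not make Textract read the character correctly; it only points out which values deserve a second look.

```python
# Characters that OCR commonly confuses with one another (an assumed, non-exhaustive set).
LOOKALIKES = set("1IlO0")


def check_invoice_ids(response, min_confidence=95.0):
    """Return one entry per invoice ID found in an Analyze Expense response dict."""
    results = []
    for document in response.get("ExpenseDocuments", []):
        for field in document.get("SummaryFields", []):
            # Skip every summary field except the invoice/receipt identifier.
            if field.get("Type", {}).get("Text") != "INVOICE_RECEIPT_ID":
                continue
            value = field.get("ValueDetection", {})
            text = value.get("Text", "")
            confidence = value.get("Confidence", 0.0)
            results.append(
                {
                    "text": text,
                    "confidence": confidence,
                    # Below the threshold: worth sending to a human reviewer.
                    "low_confidence": confidence < min_confidence,
                    # First character is one that is easily misread (e.g. "1" vs "I").
                    "ambiguous_start": text[:1] in LOOKALIKES,
                }
            )
    return results
```

A caller could, for example, accept a value automatically only when `low_confidence` is false, and queue everything else for review against the original page image.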

2 Answers
6

How to extract only specific text from PDF files using Python and store the output in particular columns of an Excel file.

Here is the sample input PDF file (File.pdf)

Link to the full PDF file File.pdf


We need to extract the values of Invoice Number, Due Date and Total Due from the whole PDF file.

Script I have used so far:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# Collect the extracted text in an in-memory buffer.
output_string = StringIO()
with open('file.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    # TextConverter writes the text of each processed page into output_string.
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process every page of the document, not just the first one.
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

But I am not getting the specific values from the PDF file.
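One way to pull the three fields out of the extracted text is a set of regular expressions keyed on the field labels. This is only a sketch: the label wording and the assumption that each value follows its label are guesses about the PDF layout, and pdfminer may place labels and values on separate lines depending on the document. The names `PATTERNS`, `extract_fields`, and `write_rows` are made up for this example, and it writes a CSV file (which Excel opens as columns) using the standard library instead of a true .xlsx file.

```python
import csv
import re

# Hypothetical label patterns; adjust them to the wording used in the actual PDF.
PATTERNS = {
    "Invoice Number": re.compile(r"Invoice\s+Number\s*:?\s*(\S+)", re.IGNORECASE),
    "Due Date": re.compile(r"Due\s+Date\s*:?\s*([^\n]+)", re.IGNORECASE),
    "Total Due": re.compile(r"Total\s+Due\s*:?\s*([^\n]+)", re.IGNORECASE),
}


def extract_fields(text):
    """Return a dict with one entry per field; missing fields map to ''."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else ""
    return fields


def write_rows(rows, path):
    """Write a list of field dicts to a CSV file with one column per field."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(PATTERNS))
        writer.writeheader()
        writer.writerows(rows)
```

With the script above, this could be called as `write_rows([extract_fields(output_string.getvalue())], "invoices.csv")`, where the output file name is again just an example.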

Expert
answered 7 months ago
  • In this case, the person asking the question is asking about Amazon Textract, a managed OCR and text-extraction service. Additionally, he posted this question in the .NET section, so answers using Python are probably not very useful.

0
AWS
answered 7 months ago
  • The user who posted the question is already using Textract to extract the text from the document. Their question is about Textract confusing the capital letter "I" with the digit "1".
