How to accurately extract values from PDF

0

I was trying to extract the invoice number from a PDF file using Amazon Textract (Analyze Expense). When I uploaded the PDF file and analyzed it, it returned an UnsupportedDocumentException error. I then converted the PDF to an image and analyzed that. This worked, but it returned 1424041161 instead of I424041161.

It returned #1 instead of #I.

Sorry, I cannot share the PDF file because it is private.

Update: I got the UnsupportedDocumentException error because the PDF has more than one page. So how can I analyze a PDF with more than one page, and how can I get #I returned instead of #1?

  • Hi Thanh - I'm assuming you were using Amazon Textract to do this? You might need to implement some text corrections in your C# code if, for example, the invoice number always starts with "I" (the letter I). In that case, you could just check if (invoiceNumber.StartsWith("1")) { //do replacement here }

  • Hi Kirk, first of all, thank you for your support. Yes, I'm using Amazon Textract (Analyze Expense). The invoice numbers sometimes start with "1" and sometimes with "I", not always "I", so we cannot use a replacement. I want to get the exact value from the PDF file.
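On the multi-page part of the question: as far as I know, the synchronous Analyze Expense call only accepts single-page documents, and multi-page PDFs are meant to go through the asynchronous StartExpenseAnalysis / GetExpenseAnalysis pair with the file stored in S3. The thread is about C#, but the only code posted here is Python, so this is a minimal Python sketch of that flow. The function name, the polling interval, and the bucket and key are all hypothetical, and the client is passed in rather than created here.

```python
import time


def start_and_collect_expenses(textract, bucket, key, poll_seconds=5, sleep=time.sleep):
    """Run asynchronous expense analysis on an S3 document and return all ExpenseDocuments.

    `textract` is assumed to be a boto3 Textract client, e.g. boto3.client("textract").
    """
    job_id = textract.start_expense_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )["JobId"]

    # Poll until the job is no longer in progress.
    while True:
        response = textract.get_expense_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            break
        sleep(poll_seconds)

    if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    # Results can be paginated, so keep following NextToken until it is absent.
    documents = list(response.get("ExpenseDocuments", []))
    while response.get("NextToken"):
        response = textract.get_expense_analysis(
            JobId=job_id, NextToken=response["NextToken"]
        )
        documents.extend(response.get("ExpenseDocuments", []))
    return documents
```

With real credentials this would be called as start_and_collect_expenses(boto3.client("textract"), "my-bucket", "invoice.pdf"), where the bucket and key are placeholders. A production version would more likely use the SNS completion notification instead of polling.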

Asked 7 months ago, 402 views
2 answers
6

How can I extract only specific text from PDF files using Python and store the output in particular columns of an Excel file?

Here is the sample input PDF file (File.pdf)

Link to the full PDF file File.pdf


We need to extract the value of Invoice Number, Due Date and Total Due from the whole PDF file.

The script I have used so far:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('file.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

However, this prints all of the text in the PDF, and I am not getting just the specific values I need.
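One way to narrow the full text down to the three fields is a regular expression per label. This is only a sketch: the labels come from the post above, but the layout (each label followed by its value on the same line) is an assumption, since pdfminer often puts labels and values on separate lines depending on the PDF. The names FIELD_PATTERNS and extract_fields are made up for this example.

```python
import re

# Assumed layout: "<label>[:] <value>" on a single line. Adjust the patterns
# if the extracted text places the value on the following line instead.
FIELD_PATTERNS = {
    "Invoice Number": re.compile(r"Invoice\s+Number[ \t]*[:#]?[ \t]*(\S+)", re.IGNORECASE),
    "Due Date": re.compile(r"Due\s+Date[ \t]*:?[ \t]*([^\n]+)", re.IGNORECASE),
    "Total Due": re.compile(r"Total\s+Due[ \t]*:?[ \t]*([^\n]+)", re.IGNORECASE),
}


def extract_fields(text):
    """Return a dict mapping each field name to its first match, or None if absent."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result
```

The function would be called on output_string.getvalue() from the script above. The resulting dict maps directly onto columns, so it can be written out with csv.DictWriter and opened in Excel.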

Expert
Answered 7 months ago
  • In this case, the person asking the question is asking about Amazon Textract, a managed OCR and text-extraction service. Additionally, he posted this question in the .NET section, so answers using Python are probably not very useful.

0
AWS
Answered 7 months ago
  • The user who posted the question is already using Textract to extract the text from the document. Their question is about Textract confusing a capital letter "I" with the number one "1".
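Since the invoice numbers can legitimately start with either "1" or "I", a blind replacement is not safe. A more cautious option is to flag suspicious values for manual review rather than correct them. The sketch below reads the Analyze Expense response and marks invoice IDs whose confidence is low or whose first character is a known lookalike. The 90.0 threshold, the LOOKALIKES set, and the function name are assumptions for illustration. Note also that the confidence score applies to the whole field, so it may stay high even when one character is misread.

```python
# Characters that OCR commonly confuses with one another (assumed list).
LOOKALIKES = {"1", "I", "l", "|"}


def invoice_id_candidates(expense_response, min_confidence=90.0):
    """Return (text, confidence, needs_review) for each INVOICE_RECEIPT_ID summary field."""
    results = []
    for doc in expense_response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            if field.get("Type", {}).get("Text") != "INVOICE_RECEIPT_ID":
                continue
            value = field.get("ValueDetection", {})
            text = value.get("Text", "")
            confidence = value.get("Confidence", 0.0)
            # Flag for a human check when the score is low or the first
            # character is one that is easily misread.
            needs_review = confidence < min_confidence or text[:1] in LOOKALIKES
            results.append((text, confidence, needs_review))
    return results
```

For the value in the question, "1424041161" would be flagged because it starts with a lookalike character, even if Textract reports high confidence. If the converted image was low resolution, re-rendering the PDF page at a higher DPI before analysis may also help, though that is not guaranteed to fix this particular confusion.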
