How can I extract only specific pieces of text from PDF files using Python and store the output in particular columns of an Excel sheet?
Here is the sample input PDF file (File.pdf)
Link to the full PDF file File.pdf
We need to extract the values of Invoice Number, Due Date and Total Due from the whole PDF file.
The script I have used so far:
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
# Collect the text of every page into one in-memory buffer
output_string = StringIO()
with open('file.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
# Print the full extracted text
print(output_string.getvalue())
But I am not getting the specific values I need from the PDF file; the script only prints the full text.
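One possible way to get from the full extracted text to the three values is to search it with regular expressions and write one row per document. The following is only a minimal sketch: the label patterns, the helper names `extract_fields` and `write_rows`, and the sample text are assumptions made for illustration, since the real layout of File.pdf is not shown here. It also writes CSV with the standard library (which Excel can open) rather than a native .xlsx file.

```python
import csv
import io
import re

# Hypothetical label patterns; adjust them to the wording in the real invoice.
FIELD_PATTERNS = {
    "Invoice Number": re.compile(r"Invoice\s+Number\s*:?\s*(\S+)", re.IGNORECASE),
    "Due Date": re.compile(r"Due\s+Date\s*:?\s*([^\n]+)", re.IGNORECASE),
    "Total Due": re.compile(r"Total\s+Due\s*:?\s*([^\n]+)", re.IGNORECASE),
}


def extract_fields(text):
    """Return a dict with one entry per field; empty string if a label is not found."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1).strip() if match else ""
    return row


def write_rows(rows, out_file):
    """Write the rows as CSV, one column per field, with a header line."""
    writer = csv.DictWriter(out_file, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample text standing in for output_string.getvalue()
sample_text = "Invoice Number: INV-3337\nDue Date: January 31, 2016\nTotal Due: $93.50\n"
fields = extract_fields(sample_text)

buffer = io.StringIO()
write_rows([fields], buffer)
print(buffer.getvalue())
```

In the script above, `sample_text` would be replaced by `output_string.getvalue()`, and `buffer` by a file opened with `open('output.csv', 'w', newline='')`. If the labels and values end up on separate lines in the extracted text, the patterns will likely need adjusting.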
In this case, the question is about Amazon Textract, a managed OCR and text-extraction service. Additionally, it was posted in the .NET section, so answers using Python are probably not very useful.
You can use a combination of services; the blog post below explains how.
The user who posted the question is already using Textract to extract the text from the document. Their question is about Textract confusing the capital letter "I" with the digit "1".
Hi Thanh - I'm assuming you were using Amazon Textract to do this? You might need to implement some text corrections in your C# code if, for example, the invoice number always starts with "I" (the letter I). In that case, you could just check if (invoiceNumber.StartsWith("I")) { //do replacement here }
Hi Kirk, first I want to say thank you for your support. Yes, I'm using Amazon Textract (with Analyze Expense). The invoice numbers sometimes start with "1" and sometimes with "I", not always "I", so we cannot use a replacement. I want to get the exact value from the PDF file.