How can I extract specific text from PDF files using Python and store the output in particular columns of an Excel file?
Here is the sample input PDF file (File.pdf)
Link to the full PDF file File.pdf
We need to extract the value of Invoice Number, Due Date and Total Due from the whole PDF file.
Script I have used so far:
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# Collect the text of every page into one in-memory buffer
output_string = StringIO()
with open('file.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

# Prints the full text of the PDF, not just the wanted fields
print(output_string.getvalue())
But I am not getting the specific output values from the PDF file.
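One way to narrow the full text down to the three fields is a regular expression per label. The following is a minimal sketch, not a tested solution for this invoice: the `FIELD_PATTERNS` table and the `extract_fields` helper are hypothetical names, and the patterns assume that each label ("Invoice Number", "Due Date", "Total Due") is followed by its value in the extracted text. The real layout of File.pdf may order the text differently, so the patterns would likely need adjusting.

```python
import re

# Hypothetical patterns: they assume each label is directly followed by its
# value in the extracted text. Adjust them to the real invoice layout.
FIELD_PATTERNS = {
    "Invoice Number": re.compile(r"Invoice\s+Number\s*:?\s*(\S+)", re.IGNORECASE),
    "Due Date": re.compile(r"Due\s+Date\s*:?\s*([^\n]+)", re.IGNORECASE),
    "Total Due": re.compile(r"Total\s+Due\s*:?\s*([^\n]+)", re.IGNORECASE),
}


def extract_fields(text):
    """Return a dict with one entry per field; missing fields map to None."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1).strip() if match else None
    return row
```

Calling `extract_fields(output_string.getvalue())` would give one dictionary per PDF. A list of such dictionaries maps naturally onto spreadsheet columns, for example through pandas `DataFrame.to_excel`, assuming pandas and an Excel writer such as openpyxl are installed.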
In this case, the person asking the question is asking about Amazon Textract, a managed OCR and text-extraction service. Additionally, he posted this question in the .NET section, so answers using Python are probably not very useful.
You can use a combination of services; the blog post below explains how.
The user who posted the question is already using Textract to extract the text from the document, their question is about Textract confusing a capital letter "I" with the number one "1".
Hi Thanh - I'm assuming you were using Amazon Textract to do this? You might need to implement some text corrections in your C# code if, for example, the invoice number always starts with "I" (letter I). In that case, you could just check if (invoiceNumber.StartsWith("I")) { //do replacement here }
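The kind of correction described above could look roughly like this. It is a sketch in Python (to match the script in the question, although the suggestion itself is for C#), and `fix_leading_one` is a hypothetical helper. It only makes sense under the stated assumption that every valid invoice number starts with the letter "I", so a leading digit "1" can be treated as a misread.

```python
def fix_leading_one(invoice_number):
    """Replace a leading digit "1" with the letter "I".

    Assumption: every valid invoice number starts with the letter "I",
    so a leading "1" can only be an OCR misread.
    """
    if invoice_number.startswith("1"):
        return "I" + invoice_number[1:]
    return invoice_number
```

If some invoice numbers legitimately start with "1", this rule would corrupt them and should not be applied.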
Hi Kirk, first I want to say thank you for your support. Yes, I'm using Amazon Textract (using Analyze Expense). Sometimes the invoice numbers start with "1" and sometimes with "I", not always "I", so we cannot use "replace". I want to get the exact value from the PDF file.