Extracting data from PDF that contains strikeout text using Amazon Textract in Python

0

I am trying to follow the guidance from this AWS Textract article:

https://aws.amazon.com/blogs/machine-learning/amazon-textracts-new-layout-feature-introduces-efficiencies-in-general-purpose-and-generative-ai-document-processing-tasks/

This is my input document: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09.pdf

This is the code I'm running: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/textract02.py https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/textract01.py

And THIS is the output:

https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09_output.txt

It's missing large amounts of text that should be included.

What I really want to do is exclude the strikeout text and included everything else: content, footers, table.

Any suggestions?

asked 4 months ago331 views
1 Answer
0

Hi,

From your description, it seems that parts of of your PDF text is in images (maybe some scanned paper pages ?) that are not processed by Textract.

In that case, I would suggest to extract those images from your PDF to process them via Claude Anthropic Sonnet (on Amazon Bedrock) vision features: I demonstrate those features in this article: https://repost.aws/articles/AReXoGO615SFSqDIVtcLaAGw/anthropic-claude-3-sonnet-vision-capabilities

You will see that Sonnet is quite smart at "reading"

Best,

Didier

profile pictureAWS
EXPERT
answered 4 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions