1 Answer
- Newest
- Most votes
- Most comments
0
Hi,
From your description, it seems that parts of of your PDF text is in images (maybe some scanned paper pages ?) that are not processed by Textract.
In that case, I would suggest to extract those images from your PDF to process them via Claude Anthropic Sonnet (on Amazon Bedrock) vision features: I demonstrate those features in this article: https://repost.aws/articles/AReXoGO615SFSqDIVtcLaAGw/anthropic-claude-3-sonnet-vision-capabilities
You will see that Sonnet is quite smart at "reading"
Best,
Didier
Relevant content
- AWS OFFICIALUpdated 2 years ago
- AWS OFFICIALUpdated 4 months ago
- AWS OFFICIALUpdated 2 years ago
The images are just the headers for the table, which, by the way, is cut off.
Here is the Marker markdown of the PDF: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09.md This is the original PDF for comparison: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09.pdf
So, my question is: Is there a way using Textract to identify the strikeout text and remove it from final output?
Here is the Textract output of the same PDF: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09_output.txt
Note how much is missing, including the entire table! Not to mention that, for some reason, it repeats itself.
Here is the Textract code used: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/textract02.py
I'm just trying to figure out if there is a way to exclude the strikeout text but include everything else.