How does Textract process PDFs with searchable and selectable text, compared to "scanned" PDFs?

I couldn't find any information on whether Textract works differently with these PDFs. I wonder whether Textract is even needed if the PDF already contains text (which is typically the case for machine-generated invoices and other documents). Textract still works very well with searchable PDFs.

My question is whether it makes sense to evaluate other services for extracting the text. We're going to embed it with an LLM, so we don't care much about form and shape, exact text locations, overlays, and so on.

Thank you!

Roman
asked a year ago · 272 views
1 Answer

Assuming the text is always searchable/selectable: if you only plan on extracting the raw text and a standard library does the job, then I'd agree with your assessment that Textract might be overkill. Where Textract really shines is when you do care about the format, structure, and location of information, and the relationships between blocks or sections of the document.
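
To illustrate the "standard library" route, here is a minimal sketch, not a definitive implementation. It assumes the third-party `pypdf` package is installed; the helper names `extract_embedded_text` and `pages_needing_ocr` and the 20-character threshold are hypothetical choices for this example. The idea is to read the text layer directly and flag only the pages that look empty, which are the ones a scanned PDF would produce and where an OCR service such as Textract would still be needed.

```python
from typing import Iterable, List


def extract_embedded_text(pdf_path: str) -> List[str]:
    """Return the embedded text of each page, one string per page.

    Assumes the third-party pypdf package is installed (pip install pypdf).
    This reads the PDF's existing text layer only; it performs no OCR.
    """
    from pypdf import PdfReader  # imported lazily so the helper below has no dependency

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def pages_needing_ocr(page_texts: Iterable[str], min_chars: int = 20) -> List[int]:
    """Return 0-based indexes of pages whose text layer looks empty.

    A page counts as empty when it has fewer than min_chars
    non-whitespace characters. The threshold is an arbitrary
    assumption; tune it for your own documents.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len("".join(text.split())) < min_chars
    ]
```

Typical usage would be `texts = extract_embedded_text("invoice.pdf")` (a placeholder file name) followed by `pages_needing_ocr(texts)`. If the second call returns an empty list, the text layer alone is probably enough for embedding; otherwise only the listed pages would need to go through OCR.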

AWS
NZ
answered a year ago