How does Textract process PDFs with searchable and selectable text, compared to "scanned" PDFs?

I couldn't find any information on whether Textract works differently with these PDFs. I wonder whether Textract is even needed when the PDF already contains text (which is typically the case for machine-generated invoices and other documents). Textract still works very well with searchable PDFs.

My question is whether it makes sense to assess other services for extracting text. We're going to embed it with an LLM, so we do not care much about form and shape, exact text locations, overlays, and so on.

Thank you!

Roman
asked a year ago, 272 views
1 Answer
Assuming the text is always searchable/selectable, if you only plan on extracting the raw text and using a standard library does the job, then I'd agree with your assessment that Textract might be overkill. Where Textract really shines is when you do care about the format, structure, location of information, and relationship between blocks / sections of the document.
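
As a rough illustration of the "standard library does the job" route, here is a minimal sketch that reads the embedded text layer and only falls back to OCR when a page looks scanned. It assumes the third-party `pypdf` package is installed; the function names, the `ocr_fallback` callable (which could wrap Textract or anything else), and the 20-character threshold are all hypothetical choices for illustration, not something Textract or pypdf prescribes.

```python
from typing import Callable, List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the embedded text of each page using pypdf (assumed installed)."""
    # Imported lazily so the routing heuristic below has no third-party dependency.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Pages without a text layer yield an empty string.
    return [(page.extract_text() or "") for page in reader.pages]


def needs_ocr(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic (an assumption, tune it for your documents): treat the PDF as
    scanned if it has no pages, or if any page has fewer than
    min_chars_per_page non-whitespace characters."""
    if not page_texts:
        return True
    return any(len("".join(t.split())) < min_chars_per_page for t in page_texts)


def pdf_to_text(pdf_path: str, ocr_fallback: Callable[[str], str]) -> str:
    """Use the text layer when it looks usable; otherwise call ocr_fallback,
    a hypothetical caller-supplied function that takes the path and returns text."""
    page_texts = extract_page_texts(pdf_path)
    if needs_ocr(page_texts):
        return ocr_fallback(pdf_path)
    return "\n\n".join(page_texts)
```

With something like this, machine-generated invoices never reach an OCR service, and only documents whose text layer is missing or nearly empty are routed to the fallback. Note that text-layer extraction gives no guarantees about reading order in multi-column layouts, so it is worth spot-checking a few documents before embedding.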

AWS
NZ
answered a year ago
