How does Textract process PDFs with searchable and selectable text, compared to "scanned" PDFs?


I couldn't find any information on whether Textract works differently with these PDFs. I wonder if there is even a need for Textract when the PDF already contains text (which is typically the case for machine-generated invoices and other documents). Textract still works very well with searchable PDFs.

My question is whether it makes sense to assess other services for extracting the text. We're going to embed it with an LLM, so we do not care much about form and shape, exact text locations, overlays, and so on.

Thank you!

Roman
Asked a year ago · 272 views
1 Answer

Assuming the text is always searchable/selectable, if you only plan on extracting the raw text and using a standard library does the job, then I'd agree with your assessment that Textract might be overkill. Where Textract really shines is when you do care about the format, structure, location of information, and relationship between blocks / sections of the document.
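
As a rough illustration of the "standard library does the job" route, here is a minimal sketch that reads the embedded text layer and flags documents that would still need OCR. It assumes the third-party `pypdf` package is installed; the function names and the `min_chars_per_page` threshold are hypothetical choices for this example, not anything Textract or AWS prescribes.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the embedded text of each page, using pypdf (assumed installed)."""
    # Imported lazily so needs_ocr() below can be used without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can yield an empty result for image-only pages.
    return [(page.extract_text() or "") for page in reader.pages]


def needs_ocr(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic (an assumption for this sketch): treat the document as
    needing OCR if it has no pages or any page has almost no embedded text."""
    if not page_texts:
        return True
    return any(len(text.strip()) < min_chars_per_page for text in page_texts)
```

A pipeline could call `extract_page_texts` first and send only the documents where `needs_ocr` returns `True` to an OCR service such as Textract. The threshold is a guess and would need tuning against real documents.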

AWS
NZ
Answered a year ago
