Use Textract like traditional OCR software to recognize scanned pages of long texts while retaining the formatting?

I'm completely new to Textract, and before taking the plunge of learning the API, I wanted to ask whether it is possible to use Textract to recognize scanned pages, such as books or scholarly articles, while retaining the character and paragraph formatting, and have it output an RTF or .DOC text file. Many thanks!

asked a year ago, 388 views
2 Answers
Accepted Answer

By formatting, I assume you mean font size and style (e.g. bold, italic)? Currently Textract does not extract information on this type of formatting.

The DetectDocumentText API currently provides the following information, according to the Textract documentation:

  • The lines and words of detected text
  • The relationships between the lines and words of detected text
  • The page that the detected text appears on
  • The location of the lines and words of text on the document page
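To make that list concrete, here is a minimal sketch of reading those fields from a response with boto3. The helper names `lines_by_page` and `detect_lines` are my own, not part of the API, and it assumes AWS credentials and a region are already configured. Note that nothing in the blocks describes font, bold, or italic; you only get text, relationships, page, and geometry.

```python
from typing import Any


def lines_by_page(response: dict[str, Any]) -> dict[int, list[dict[str, Any]]]:
    """Group LINE blocks by page and sort them in rough reading order."""
    pages: dict[int, list[dict[str, Any]]] = {}
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        # CHILD relationships point from a line to the WORD blocks it contains.
        word_ids = [
            word_id
            for rel in block.get("Relationships", [])
            if rel.get("Type") == "CHILD"
            for word_id in rel.get("Ids", [])
        ]
        pages.setdefault(block.get("Page", 1), []).append(
            {
                "text": block.get("Text", ""),
                # Coordinates are ratios of the page size (0.0 to 1.0).
                "top": box["Top"],
                "left": box["Left"],
                "width": box["Width"],
                "height": box["Height"],
                "word_ids": word_ids,
            }
        )
    for lines in pages.values():
        # Simple top-to-bottom, left-to-right ordering; multi-column
        # layouts would need something smarter (assumption: single column).
        lines.sort(key=lambda line: (line["top"], line["left"]))
    return pages


def detect_lines(image_bytes: bytes) -> dict[int, list[dict[str, Any]]]:
    """Send one scanned page image to Textract and return its lines."""
    import boto3  # imported here so the parsing helper works without AWS access

    client = boto3.client("textract")
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return lines_by_page(response)
```

For a scanned page you would call something like `detect_lines(open("page.png", "rb").read())`. Any paragraph or layout reconstruction would have to be inferred from the bounding boxes yourself.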

It can also extract tables, forms, and specific information through queries. This page provides a good overview of the output you can expect.
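As a rough illustration of the queries feature, the sketch below builds an AnalyzeDocument request and pulls the answers out of the response. The helper names (`build_request`, `query_answers`, `analyze`) and any question text you pass in are hypothetical; the sketch assumes boto3 with configured credentials and a single page image passed as bytes.

```python
from typing import Any


def build_request(image_bytes: bytes, questions: list[str]) -> dict[str, Any]:
    """Build keyword arguments for analyze_document with tables, forms, and queries."""
    return {
        "Document": {"Bytes": image_bytes},
        "FeatureTypes": ["TABLES", "FORMS", "QUERIES"],
        "QueriesConfig": {"Queries": [{"Text": q} for q in questions]},
    }


def query_answers(response: dict[str, Any]) -> dict[str, str]:
    """Map each query's text to the text of its QUERY_RESULT block, if any."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    answers: dict[str, str] = {}
    for block in blocks.values():
        if block.get("BlockType") != "QUERY":
            continue
        for rel in block.get("Relationships", []):
            if rel.get("Type") != "ANSWER":
                continue
            for answer_id in rel.get("Ids", []):
                result = blocks.get(answer_id)
                if result is not None:
                    answers[block["Query"]["Text"]] = result.get("Text", "")
    return answers


def analyze(image_bytes: bytes, questions: list[str]) -> dict[str, str]:
    """Call Textract and return the answers to the given questions."""
    import boto3  # imported here so the helpers above work without AWS access

    client = boto3.client("textract")
    response = client.analyze_document(**build_request(image_bytes, questions))
    return query_answers(response)
```

Queries that Textract cannot answer simply have no ANSWER relationship, so they are absent from the returned dictionary.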

AWS
S_Moose
answered a year ago

Thank you very much for your explanation! Given that Textract is highly accurate at recognizing characters, this would be a great feature to add.

answered a year ago
