Textract errors


I ran a test of Textract on a Dutch-language document and found several problems. I've boiled the issues down to one short excerpt of the document. In the Textract demo, I submitted a 300 dpi JPG image of the following text:

Naast studenten en meer aselecte groepen zijn gevolgen ook onderzocht 
bij specifieke populaties zoals druggebruikers en psychiatrische patiënten. 
Onder 200 straatprostituees, die buiten officiële instanties om zijn bena-
derd, is onderzoek gedaan door Silbert & Pines (1981).

In the "Layout" tab of the demo window, the result is:

Naast studenten en meer aselecte groepen zijn gevolgen ook onderzocht bij specifieke populaties zoals druggebruikers en psychiatrische patienten. Onder 200 straatprostituées, die buiten officièle instanties om zijn bena- derd, is onderzoek gedaan door Silbert & Pines (1981).

The test results have the following problems:

  1. The "ë" in "patiënten" is converted to "e", without the diaeresis (the two dots above).
  2. The "ë" in "officiële" is converted to "è", with the wrong accent.
  3. The word "benaderd", which is hyphenated across a line break, is not rejoined. It is returned as "bena- derd", split by the hyphen and a space.
  4. Not evident in the sample above, but the test also showed that italics are not detected.

Are these problems to be expected in Textract, or is there a way to overcome them? If this is the best Textract can do, is there a better OCR engine I should use instead?

asked 3 months ago, 155 views
1 Answer
Accepted Answer

Textract can misread certain special characters, such as accented letters, and does not rejoin words that are hyphenated across line breaks. It also does not detect italics.

These issues are inherent to Textract and there’s no direct way to overcome them within Textract itself. Post-processing the Textract output with custom code could help mitigate some of these issues.
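As a rough illustration of what such post-processing could look like, here is a minimal Python sketch. It is not part of Textract or any AWS SDK. The function names (`join_hyphenated`, `restore_diacritics`) and the tiny `LEXICON` are hypothetical, and the approach assumes you have a Dutch word list to check against. It rejoins words split by a hyphen and a space, then replaces any word whose accent-free form matches a lexicon entry with the lexicon spelling.

```python
import re
import unicodedata

# Hypothetical word list for illustration only; a real setup would load a
# full Dutch dictionary instead of these three words.
LEXICON = ["patiënten", "officiële", "straatprostituees"]


def strip_marks(word: str) -> str:
    """Lowercase the word and remove all combining diacritical marks."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Map each accent-free form to the correctly spelled lexicon entry.
CANONICAL = {strip_marks(word): word for word in LEXICON}


def join_hyphenated(text: str) -> str:
    """Merge 'bena- derd' into 'benaderd'.

    Heuristic: a hyphen followed by whitespace and a lowercase letter is
    treated as a line-break hyphen. This also merges legitimate suspended
    hyphens (e.g. 'in- en uitvoer'), so the results need checking.
    """
    return re.sub(r"(\w)-\s+(?=[a-z])", r"\1", text)


def restore_diacritics(text: str) -> str:
    """Replace words whose accent-free form is in the lexicon."""
    text = unicodedata.normalize("NFC", text)

    def fix(match: re.Match) -> str:
        word = match.group(0)
        canonical = CANONICAL.get(strip_marks(word))
        if canonical is None:
            return word  # unknown word: leave it untouched
        # Keep a leading capital if the OCR output had one.
        return canonical.capitalize() if word[0].isupper() else canonical

    # [^\W\d_]+ matches runs of Unicode letters only.
    return re.sub(r"[^\W\d_]+", fix, text)


raw = ("psychiatrische patienten. Onder 200 straatprostituées, die buiten "
       "officièle instanties om zijn bena- derd, is onderzoek gedaan")
cleaned = restore_diacritics(join_hyphenated(raw))
print(cleaned)
```

A lexicon lookup like this can only repair words it already knows, and it cannot recover italics, since that information is not present in the Textract output at all.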

Alternatives to Amazon Textract include Docparser, Nanonets, and Rossum. These services might handle the issues you’re seeing better than Textract does, but you should test them on your own documents before committing to one.

answered 3 months ago