Is Textract aware of the document language and the character set used by that language?


I've run into problems when extracting text from PDFs in Czech: some characters, especially those with carons, are recognized incorrectly or not returned at all. If Textract is aware of the source language or region, then it should also know the character set that language uses (the source text inside the PDF is UTF-8, but the national characters actually used are usually limited to the Central European subset, as specified in the older ISO-8859-2 (Latin-2) or cp1250 (Windows-1250) character tables).
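
To make the character-set point concrete, here is a minimal Python sketch (standard library only) that checks which characters of a string fall outside a given legacy code page. The helper name `unencodable` and the sample strings are illustrative assumptions, not part of any Textract API. The text is normalized to NFC first, because text copied from PDFs sometimes carries accents as separate combining marks.

```python
import unicodedata

# Lowercase Czech letters that carry diacritics; uppercase forms are added below.
CZECH_LETTERS = "áčďéěíňóřšťúůýž"


def unencodable(text: str, codec: str) -> list[str]:
    """Return the distinct characters of `text` that `codec` cannot encode."""
    # Compose base letter + combining mark into a single code point first.
    text = unicodedata.normalize("NFC", text)
    bad: list[str] = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad


alphabet = CZECH_LETTERS + CZECH_LETTERS.upper()
for codec in ("iso-8859-2", "cp1250"):
    # Every Czech letter fits into both code pages, so both lists are empty.
    print(codec, unencodable(alphabet, codec))

# A character outside the Central European repertoire is reported.
print(unencodable("kůñ", "cp1250"))
```

A check like this only tells you whether returned characters belong to the expected repertoire; it cannot detect a caron that was dropped and replaced by the plain base letter.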

The PDF documents were uploaded to a bucket in a Central European region (eu-central-1) and processed in batch with client::StartDocumentTextDetection(options) (the client's region is also set to eu-central-1).

Is there an option to specify the source document language or a preferred character set, or some other way to improve the detection results?

1 Answer

Hi, as noted in the developer guide, Czech is not on the list of languages currently officially supported by the service for text extraction. Also, as of now, there is no option in the APIs (for example StartDocumentTextDetection or StartDocumentAnalysis) to explicitly specify which language(s) your content is in.
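
For reference, a rough sketch of the asynchronous call in Python with boto3 (the question appears to use a different SDK, so treat this as an illustration only). The function names `build_request` and `detect_lines`, the polling interval, and the bucket and key values are assumptions made for the example. Note that the request carries only the document location: there is no language or character-set field to set.

```python
import time


def build_request(bucket: str, key: str) -> dict:
    """Request parameters for StartDocumentTextDetection: only the S3 location."""
    return {"DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}}}


def detect_lines(bucket: str, key: str, region: str = "eu-central-1",
                 poll_seconds: int = 5) -> list[tuple[str, float]]:
    """Run an async text-detection job and return (text, confidence) per LINE block."""
    import boto3  # third-party AWS SDK, imported lazily; needs AWS credentials

    client = boto3.client("textract", region_name=region)
    job_id = client.start_document_text_detection(**build_request(bucket, key))["JobId"]

    # Simple polling loop; a production setup would rather use an SNS notification.
    while True:
        page = client.get_document_text_detection(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    if page["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job ended with status {page['JobStatus']}")

    lines: list[tuple[str, float]] = []
    while True:
        for block in page.get("Blocks", []):
            if block["BlockType"] == "LINE":
                lines.append((block["Text"], block["Confidence"]))
        token = page.get("NextToken")
        if not token:
            return lines
        # Results are paginated; follow NextToken until it is absent.
        page = client.get_document_text_detection(JobId=job_id, NextToken=token)
```

Calling `detect_lines("my-bucket", "invoice.pdf")` (hypothetical names) would return the detected lines with their confidence scores. Low-confidence lines may be a useful place to start looking for dropped diacritics, although that is a heuristic rather than documented behaviour.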

In my experience, Amazon Textract can still work well for other Latin-character languages outside the list (for example Indonesian / Malay), but other locales with almost-but-not-quite supported character sets (such as Vietnamese) can be a challenge.

One option you could explore for some locales is to run a spell-checker (for example an open-source one) on the output to try to reconstruct the missing characters and accents. How well this works depends on the semantic importance of the unsupported characters: if the substitution is usually unambiguous, then great, but if not, a simple dictionary- and rule-based spell-checker may not be sufficient. Apologies, I don't have experience with Czech in particular.
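
As a very simplified illustration of the dictionary-based idea, the Python sketch below indexes a word list by its accent-stripped form and restores diacritics only when exactly one dictionary word matches. The function names and the tiny vocabulary are made up for the example; a real setup would load a full Czech word list, and the sketch ignores capitalization by returning lowercase forms.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word: str) -> str:
    """Remove combining marks, e.g. 'kůň' becomes 'kun'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(vocabulary: list[str]) -> dict[str, set[str]]:
    """Map each accent-stripped form to the dictionary words that produce it."""
    index: dict[str, set[str]] = defaultdict(set)
    for word in vocabulary:
        index[strip_diacritics(word.lower())].add(word.lower())
    return index


def restore(token: str, index: dict[str, set[str]]) -> str:
    """Return the unique dictionary match for `token`, or `token` unchanged."""
    candidates = index.get(strip_diacritics(token.lower()), set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return token  # unknown word, or ambiguous (several accented variants)


# Hypothetical mini-vocabulary; 'byt' (flat) and 'být' (to be) collide on purpose.
index = build_index(["příliš", "žluťoučký", "kůň", "byt", "být"])

print(restore("zlutoucky", index))  # žluťoučký
print(restore("kůn", index))        # kůň (a partly accented token is repaired too)
print(restore("byt", index))        # byt (ambiguous, so left unchanged)
```

The `byt` / `být` case is exactly where a plain dictionary lookup stops being enough: picking the right variant needs context, such as a language model or grammar rules.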

If post-processing the Amazon Textract output isn't viable in your particular case, you could perhaps explore:

  • Other 3rd-party OCR offerings available on the AWS Marketplace
  • Open-source tools with existing AWS deployment patterns. For example:
    • This document processing pipeline sample for layout-aware entity recognition can tackle some advanced structure extraction use cases similar to Textract and Comprehend, and uses Textract for OCR by default - but has integration options for multi-lingual models and open-source Tesseract OCR
    • A range of 3rd-party authors have released samples and blogs about deploying Tesseract OCR serverlessly on AWS Lambda
Alex_T (AWS, Expert), answered a year ago
