Hi, as noted in the developer guide, Czech is not in the list of languages currently officially supported by the service for text extraction. Also, as of now there's no option in the APIs (for example StartDocumentTextDetection or StartDocumentAnalysis) to explicitly specify which language(s) your content is in.
In my experience, Amazon Textract can still work well for other Latin-character languages outside the list (for example Indonesian / Malay), but locales with almost-but-not-quite supported character sets (such as Vietnamese) can be a challenge.
One option you could explore for some locales is to run a spell-checker (e.g. an open-source one) on the output to try to reconstruct the missing characters / accents. How well this works depends on the semantic importance of the unsupported characters: if the correct substitution is usually unambiguous, then great, but if not, a simple dictionary- and rule-based spell-checker may not be sufficient. Apologies, I don't have experience with Czech in particular.
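To make the post-processing idea concrete, here is a minimal sketch in Python of dictionary-based accent restoration. It is an illustration only, not a tested Czech pipeline: the function names (`strip_accents`, `build_index`, `restore_accents`) and the tiny word list are hypothetical, and it assumes the OCR output has dropped diacritics while otherwise getting the base letters right. A real setup would load a full word list, or use a proper spell-checking library.

```python
import re
import unicodedata
from collections import defaultdict


def strip_accents(text: str) -> str:
    """Remove combining marks, e.g. 'kůň' -> 'kun'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(wordlist):
    """Map each accent-stripped form to the set of dictionary words behind it."""
    index = defaultdict(set)
    for word in wordlist:
        index[strip_accents(word.lower())].add(word.lower())
    return index


def restore_accents(text: str, index) -> str:
    """Replace a token only when exactly one dictionary word matches it."""

    def fix(match):
        token = match.group(0)
        candidates = index.get(strip_accents(token.lower()), set())
        if len(candidates) != 1:
            return token  # unknown or ambiguous: leave the OCR output alone
        restored = next(iter(candidates))
        if token.isupper():
            return restored.upper()
        if token[0].isupper():
            return restored[0].upper() + restored[1:]
        return restored

    # [^\W\d_]+ matches runs of letters, including accented ones.
    return re.sub(r"[^\W\d_]+", fix, text)


# Hypothetical mini word list; 'byt' (flat) and 'být' (to be) collide on purpose.
WORDS = ["příliš", "žluťoučký", "kůň", "byt", "být"]
INDEX = build_index(WORDS)
print(restore_accents("Prilis zlutoucky kun, byt", INDEX))
```

The last token shows the limitation mentioned above: "byt" and "být" are both valid words, so a plain dictionary lookup can't choose between them and the token is left unchanged. Resolving cases like that needs context (for example a language model) rather than a word list.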
If post-processing the Amazon Textract output isn't viable in your particular case, you could perhaps explore:
- Other 3rd-party OCR offerings available on the AWS Marketplace
- Open-source tools with existing AWS deployment patterns. For example:
  - This document processing pipeline sample for layout-aware entity recognition can tackle some advanced structure extraction use cases similar to Textract and Comprehend, and uses Textract for OCR by default, but has integration options for multi-lingual models and open-source Tesseract OCR
  - A range of 3rd-party authors have released samples and blogs about deploying Tesseract OCR serverlessly on AWS Lambda
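If you go the open-source Tesseract route, the language can be specified explicitly, which is the option missing from the Textract APIs. Below is a rough sketch that shells out to the Tesseract command-line tool from Python. It assumes the `tesseract` binary is on the PATH and that the Czech language data (`ces`) is installed; the helper names `tesseract_command` and `ocr_image` are made up for this example, and how you package the binary (for instance in a Lambda layer or container image) is left out.

```python
import subprocess


def tesseract_command(image_path, languages=("ces",)):
    """Build the CLI call: tesseract <image> stdout -l <lang[+lang...]>."""
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def ocr_image(image_path, languages=("ces",)):
    """Run Tesseract and return the recognized text.

    Assumes the tesseract binary and the requested language data are installed.
    """
    result = subprocess.run(
        tesseract_command(image_path, languages),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

For mixed-language documents, something like `ocr_image("page.png", languages=("ces", "eng"))` would pass `-l ces+eng` to Tesseract.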