Preprocess the Document:
Separate the Text Blocks: If possible, preprocess the document to separate vernacular and English text into different images or sections before sending them to Textract. You could then run the OCR on each section individually and combine the results.
Manual Orientation Correction: Ensure that the document is correctly oriented before sending it to Textract. You can use image processing tools to detect and correct any misalignment or rotation.
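If you do split the page into sections, the per-section results need to be mapped back onto the full page before they can be combined. Below is a minimal sketch of that merge step, assuming you already know each crop's pixel rectangle and have one Textract response dict per crop. The helper names `to_page_box` and `merge_sections` are made up for illustration; only the `Blocks`, `BlockType`, `Text` and `Geometry.BoundingBox` fields come from the Textract response format. Keep in mind that every crop is a separate Textract request.

```python
def to_page_box(box, crop, page_w, page_h):
    """Map a BoundingBox normalized to a crop onto page-normalized coordinates.

    box:  dict with Left, Top, Width, Height in 0..1, relative to the crop
    crop: (left, top, right, bottom) of the crop in page pixels (assumed known)
    """
    left, top, right, bottom = crop
    crop_w, crop_h = right - left, bottom - top
    return {
        "Left": (left + box["Left"] * crop_w) / page_w,
        "Top": (top + box["Top"] * crop_h) / page_h,
        "Width": box["Width"] * crop_w / page_w,
        "Height": box["Height"] * crop_h / page_h,
    }


def merge_sections(sections, page_w, page_h):
    """Combine WORD blocks from several per-crop responses into one list.

    sections: list of (crop, response) pairs, one per section sent to Textract.
    """
    words = []
    for crop, response in sections:
        for block in response.get("Blocks", []):
            if block.get("BlockType") != "WORD":
                continue
            page_box = to_page_box(
                block["Geometry"]["BoundingBox"], crop, page_w, page_h
            )
            words.append({"Text": block["Text"], "Box": page_box})
    # Rough reading order: top to bottom, then left to right.
    words.sort(key=lambda w: (round(w["Box"]["Top"], 2), w["Box"]["Left"]))
    return words
```

The sort is only a rough reading order; multi-column layouts would need something smarter.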
Use Language-Specific OCR Models:
While Textract doesn't allow direct selection of language models, you can run the vernacular and English portions through other OCR tools that are specifically tuned for each language, and then merge the results manually.
Custom OCR Models:
If this is a recurring issue, you might want to consider training a custom OCR model that is specifically tuned for your use case, handling mixed languages and handwritten text better than the general-purpose model in Textract.
Post-processing:
Implement a post-processing step that checks the OCR output for common errors, especially when mixing languages, and corrects them based on the expected language or context.
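As a rough sketch of such a post-processing step: the function below takes a Textract response dict, looks only at WORD blocks whose `Confidence` is below a threshold, and tries to snap them to the closest entry in a list of words you expect to see in the document. The function name `correct_words`, the vocabulary, and the threshold values are assumptions for illustration; the matching uses `difflib` from the Python standard library, not anything Textract provides.

```python
import difflib


def correct_words(response, vocabulary, min_confidence=90.0, cutoff=0.8):
    """Return (original, corrected) pairs for every WORD block.

    vocabulary:     words you expect in the document (hypothetical, user-supplied)
    min_confidence: words at or above this Confidence are left untouched
    cutoff:         minimum similarity (0..1) for accepting a correction
    """
    lookup = {v.lower(): v for v in vocabulary}
    candidates = list(lookup)
    results = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue
        text = block["Text"]
        corrected = text
        if block.get("Confidence", 0.0) < min_confidence:
            match = difflib.get_close_matches(
                text.lower(), candidates, n=1, cutoff=cutoff
            )
            if match:
                corrected = lookup[match[0]]
        results.append((text, corrected))
    return results
```

Words that have no close match are passed through unchanged, so you can still log them for manual review.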
Isolate the Word: Test OCR on the word in isolation, which you mentioned works well. This further supports the hypothesis that the mixed language is causing the issue.
Test with Different Configurations: Experiment with different Textract features, such as setting the FeatureTypes parameter of AnalyzeDocument (for example "FORMS" or "TABLES"), to see if it affects the accuracy.
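One way to run that experiment is to send the same image through AnalyzeDocument with a few different FeatureTypes values and compare the words that come back. This is only a sketch: `analyze_with_features` and `compare_configurations` are illustrative names, and the client is passed in so the code assumes you have already created one with `boto3.client("textract")`.

```python
def analyze_with_features(client, image_bytes, feature_types):
    """Call AnalyzeDocument once and return the detected words in order."""
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=feature_types,
    )
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "WORD"
    ]


def compare_configurations(
    client,
    image_bytes,
    configurations=(("FORMS",), ("TABLES",), ("FORMS", "TABLES")),
):
    """Run the same image through several FeatureTypes combinations.

    Returns a dict such as {"FORMS": [...], "TABLES": [...], "FORMS+TABLES": [...]}.
    """
    return {
        "+".join(features): analyze_with_features(client, image_bytes, list(features))
        for features in configurations
    }
```

Typical usage would be something like `compare_configurations(boto3.client("textract"), open("page.png", "rb").read())`, where `page.png` is a placeholder file name. Each configuration is a separate AnalyzeDocument call.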
Unfortunately, Textract doesn't allow much fine-tuning of OCR settings directly through the API, but these workarounds might help improve accuracy for your use case.
I am only sending correctly oriented documents, but unfortunately Textract itself appears to rotate the document internally and returns incorrect text. Also, I could pass the words separately, but that would multiply the number of requests and the billing.
I am using the Textract service to detect the words in the first place, and for that I pass the entire document, so it's not possible to isolate the word. I also noticed that if I white out most of the vernacular words and keep only a couple of them, it still returns wrong results (such as flipped text) for the English handwriting.