[Thank Goodness It's Search] Implementing Multi-lingual Search, Part 1
In this article, you will learn how to build multi-lingual search for Scenario 2, which enables users to search in their preferred language based on either manual selection or automatic locale detection.
Welcome to the Thank Goodness It's Search series—your Friday fix of OpenSearch learnings, feature drops, and real-world solutions. I will keep it short, sharp, and search-focused—so you can end your week knowing a little more about search than when you started it.
In my previous post on multi-lingual search, we explored three different scenarios that vary based on how your data currently exists in the source of truth, how you build your OpenSearch index, and how your users search for content. In this post, you will learn to build Scenario 2, where your content exists in multiple languages and users search in specific languages as well. I'll walk you through implementing this scenario step by step, so have your OpenSearch cluster ready to follow along 🚶!
To implement this scenario, we will leverage the out-of-the-box OpenSearch language analyzers and tokenizers. OpenSearch supports 35+ built-in language analyzers, including analyzers for the following languages: Arabic, Armenian, Basque, Bengali, Brazilian, Bulgarian, Catalan, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish, and Thai. For this post, we will concentrate on three languages: English, Spanish, and Japanese (using the Kuromoji analyzer). OpenSearch added support for additional Asian languages, including Korean, Chinese, and Japanese, in 2023.
What are OpenSearch analyzers?
When you index a document, the text goes through the chosen analyzer, is converted into tokens, and then stored in the inverted index. When you run a query, the query string is passed through the same (or a compatible) analyzer, ensuring that matches happen against normalized tokens rather than raw text. So, for multi-lingual content, OpenSearch relies on language-specific analyzers to properly handle stemming, stop words, and grammatical rules in each language, which makes search results more accurate and relevant.
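You can watch this pipeline in action with OpenSearch's `_analyze` API and one of the built-in analyzers. For example, running a short English phrase through the `english` analyzer shows the normalized tokens that would actually be stored (the exact tokens can vary slightly between versions):

```json
GET /_analyze
{
  "analyzer": "english",
  "text": "Running searches quickly"
}
```

The response lists lowercased, stemmed tokens (for example, "Running" reduced to "run") rather than the raw words, which is what queries are matched against.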
OpenSearch provides a rich array of analyzers that cater to various languages, including:
- Standard language analyzers for European languages like English, Spanish, French, etc.
- Specialized analyzers for Asian languages:
- Kuromoji analyzer for Japanese
- Nori analyzer for Korean
- Smart Chinese (smartcn) analyzer for Chinese
- Support for stemming, stop words, and language-specific tokenization
How do analyzers work?
When you index a document, language analyzers process text through multiple steps.
- Tokenization: text is broken into individual tokens based on language rules
  - For European languages: splits on spaces and punctuation
  - For Asian languages: uses specialized tokenizers (e.g., Kuromoji for Japanese)
- Stop word removal: common words that don't add meaning are filtered out
  - English: a, an, the, etc.
  - Spanish: el, la, los, etc.
  - Japanese: の, は, です, etc.
- Stemming: words are reduced to their root form
  - English: running -> run
  - Spanish: corriendo -> corr
  - Japanese: 食べます -> 食べ
- Language-specific normalization:
  - Case folding
  - Character width normalization for Asian scripts
  - Accent removal

This processing ensures that searches match relevant documents regardless of word variations and language-specific characteristics.
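To see the Japanese side of this pipeline, you can run one of the sample sentences used later in this post through the `kuromoji` analyzer (this assumes the `analysis-kuromoji` plugin is installed on your cluster; managed OpenSearch services typically include it):

```json
GET /_analyze
{
  "analyzer": "kuromoji",
  "text": "知識のような友はいない"
}
```

Unlike whitespace-based tokenization, Kuromoji segments the unspaced Japanese sentence into meaningful units such as 知識 (knowledge) and 友 (friend).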
Step 1: Create an Index with Multi-lingual Support
First, you need to create an index that can handle multiple languages. You will define custom analyzers for each language and set up the appropriate mappings.
```json
PUT /multi_lingual_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_analyzer": {
          "type": "english",
          "stopwords": "_english_"
        },
        "spanish_analyzer": {
          "type": "spanish",
          "stopwords": "_spanish_"
        },
        "japanese_analyzer": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer",
          "filter": ["kuromoji_baseform", "kuromoji_part_of_speech", "cjk_width", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content_en": { "type": "text", "analyzer": "english_analyzer" },
      "content_es": { "type": "text", "analyzer": "spanish_analyzer" },
      "content_ja": { "type": "text", "analyzer": "japanese_analyzer" }
    }
  }
}
```
Step 2: Index Multi-lingual Documents
Next, let's index some documents in different languages.
```json
POST /multi_lingual_index/_doc/1
{
  "content_en": "Knowledge gives humility, and from humility comes worthiness.",
  "content_es": "El conocimiento da humildad, y de la humildad viene la dignidad.",
  "content_ja": "知識は謙虚さを与え、謙虚さから価値が生まれる"
}

POST /multi_lingual_index/_doc/2
{
  "content_en": "There is no friend like knowledge",
  "content_es": "No hay amigo como el conocimiento",
  "content_ja": "知識のような友はいない"
}

POST /multi_lingual_index/_doc/3
{
  "content_en": "Ignorance is the greatest enemy.",
  "content_es": "La ignorancia es el mayor enemigo.",
  "content_ja": "無知は最大の敵である"
}

POST /multi_lingual_index/_doc/4
{
  "content_en": "Knowledge used for argument leads to pride, wealth used for show leads to arrogance",
  "content_es": "El conocimiento usado para discutir lleva al orgullo, la riqueza usada para presumir lleva a la arrogancia",
  "content_ja": "議論のために使われる知識は傲慢につながり、見せびらかすために使われる富は傲慢さにつながる"
}

POST /multi_lingual_index/_doc/5
{
  "content_en": "Wisdom is knowing what you don't know",
  "content_es": "La sabiduría es saber lo que no sabes",
  "content_ja": "知恵とは、自分が知らないことを知ることである"
}
```
Verify that your index has been created with the correct schema and analyzers before proceeding to the next step.
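One quick way to verify is to fetch the mapping and settings of the index you just created and confirm that each content field points at the analyzer you expect:

```json
GET /multi_lingual_index/_mapping

GET /multi_lingual_index/_settings
```

The mapping response should list `content_en`, `content_es`, and `content_ja` with their respective analyzers, and the settings response should show the three custom analyzers under `analysis`.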
Step 3: Search in Specific Languages
Now, let's perform language-specific searches. Users can search in their preferred language, which can be either manually selected or automatically detected based on their locale settings.
Search in English:

```json
GET /multi_lingual_index/_search
{
  "query": {
    "match": { "content_en": "knowledge" }
  }
}
```

Search in Spanish:

```json
GET /multi_lingual_index/_search
{
  "query": {
    "match": { "content_es": "conocimiento" }
  }
}
```

Search in Japanese:

```json
GET /multi_lingual_index/_search
{
  "query": {
    "match": { "content_ja": "知識" }
  }
}
```
Now, try the following query:

```json
GET /multi_lingual_index/_search
{
  "query": {
    "match": { "content_ja": "knowledge" }
  }
}
```
The above query returns zero results. The Japanese analyzer applies tokenization rules and character handling specific to Japanese text, so the tokens it produces for the English query string never appear in the inverted index for `content_ja`, which was built from Japanese text analyzed with Kuromoji. For matches to happen, the analyzers need to be language-appropriate, and the query language needs to match the language of the field, at both indexing and search time.
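You can see exactly what the field's analyzer does with the English query text by passing the `field` parameter to the `_analyze` API (the exact token output depends on your Kuromoji plugin version):

```json
GET /multi_lingual_index/_analyze
{
  "field": "content_ja",
  "text": "knowledge"
}
```

The response shows the tokens the match query would search for; since no Japanese document was indexed with a token like "knowledge", nothing matches.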
If you do need to support this cross-language search scenario, you have two options:
- Implement language translation to convert English queries to Japanese before searching
- Use a model-based language agnostic search approach (covered in upcoming posts)
Conclusion
In this post, we explored implementing multi-lingual search for Scenario 2, where both content and searches are in multiple languages. By following these steps, you've implemented a simple multi-lingual search scenario leveraging the language analyzers that OpenSearch provides out of the box. OpenSearch supports over 35 built-in language analyzers, making it a powerful choice for implementing multi-lingual search capabilities. The language analyzers handle stemming, stop words, and language-specific tokenization automatically, ensuring accurate and relevant search results across different languages. The key takeaway is that language analyzers need to match between indexing and searching for effective multi-lingual search.
Next Steps
To build on this foundation, you can:
- Add support for more languages using OpenSearch's 35+ built-in analyzers
- Implement automatic language detection for incoming content
- Add translation capabilities to enable cross-language searching
- Explore model-based approaches for language-agnostic search (covered in my upcoming posts)
- Add language-specific relevance tuning and scoring
References
- OpenSearch Language Analyzers Documentation
- New Language Support in OpenSearch
- Japanese (Kuromoji) Auto-complete
- Multi-language Search Best Practices
Call to Action
If you found this article helpful, please share it with your network. If you have any questions or want to discuss ways to improve your search experience, feel free to reach out. Want to learn more? Check out the OpenSearch Documentation.
See you next Friday with another search solution. Until then, happy searching! 🔍
