In this article, you will learn how to build multi-lingual search for Scenario #3, which enables users to perform model-driven, language-agnostic search.
Welcome to the Thank Goodness It's Search series, your Friday fix of OpenSearch learnings, feature drops, and real-world solutions. I will keep it short, sharp, and search-focused, so you can end your week knowing a little more about Search than you started.
In my earlier re:Post on strategies for multi-lingual search, I covered three different scenarios for handling multi-lingual search, based on your source data structure, OpenSearch index configuration, and user search patterns; my previous re:Post then covered Scenario #2. Today, let's explore Scenario #3 and learn how to build a multi-lingual, language-agnostic search solution using Amazon OpenSearch Service.
This scenario is ideal for multi-supplier product catalogs where you need to surface all relevant results regardless of language or supplier. The objective is straightforward: display matching products in any supported language, falling back to the closest semantic matches when exact matches aren't found. Your catalog data can exist in multiple languages, and users can search in their language of choice. Note that while this approach is language-agnostic, it may have limitations distinguishing between regional language variants (such as fr_FR vs. fr_CA). The accuracy of language identification depends heavily on the capabilities of the underlying model.
OpenSearch provides a pre-trained model list with numerous models that can be deployed either as a "local" model directly in OpenSearch or as a "remote" model in Amazon SageMaker. After deployment, the model is integrated with OpenSearch via a generated model ID that links it to your index and search schema.
In this post, you will learn how to run a quick proof of concept with a local model. For more detailed usage with remote models, please refer to this blog.
Hosting models in OpenSearch
Did you know that you can upload models from the pre-trained models list to your OpenSearch cluster and call them natively via a neural pipeline? Local models run machine learning models directly within the OpenSearch cluster, eliminating the need for external ML services; the model is copied to every data node in the cluster. In this post, we'll be using the huggingface/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model.
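Before registering a local model on a self-managed cluster without dedicated ML nodes, you typically need to relax a few ML Commons cluster settings first. Below is a minimal sketch using Python's requests library; the endpoint and credentials are illustrative assumptions, and the setting names come from the OpenSearch ML Commons documentation, so verify them against your OpenSearch version.

```python
import requests

HOST = "https://localhost:9200"  # assumed local dev endpoint; use your domain URL
AUTH = ("admin", "admin")        # example credentials; never hard-code in production

# Relax ML Commons settings so a local model can deploy onto data nodes
# and the native memory circuit breaker doesn't trip during the demo.
requests.put(
    f"{HOST}/_cluster/settings",
    json={"persistent": {
        "plugins.ml_commons.only_run_on_ml_node": "false",
        "plugins.ml_commons.model_access_control_enabled": "true",
        "plugins.ml_commons.native_memory_threshold": "99",
    }},
    auth=AUTH,
    verify=False,  # self-signed certs in local dev only
)
```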
How do local models work?
OpenSearch enables the use of local machine learning models (from the limited pre-trained model list here) within its cluster, primarily through the ML Commons plugin. This allows you to integrate and leverage AI capabilities directly within your search and analytics workflows. The process involves the five steps below, sketched in code after the list.
- Register the model: Upload and register the model in the cluster using the ML Commons plugin.
- Deploy the model: Deploy the registered model to specific nodes with ML capabilities.
- Set up the neural pipeline: Create the neural (ingest) pipeline and assign it the model ID.
- Integrate with the custom neural pipeline: Create the index schema and set its default pipeline to the newly created neural pipeline.
- Index & search: During ingestion, the schema configuration ensures the local model is called to generate the required embeddings. At search time, the configured model ID transforms queries into vector embeddings, enabling k-NN vector search.
I've added a set of screenshots below to demonstrate language-agnostic multi-lingual search capabilities across English, Spanish, Japanese, and Hindi. They highlight an interesting comparison: while BM25 results show zero-shot relevancy by bringing the best matches to position #1, the quality deteriorates as you scroll down. In contrast, semantic search delivers consistently better results by surfacing various semantically similar alternatives to the query. With our English-only dataset, lexical search fails to return results for non-English queries, something I also called out in my earlier re:Post, "Do you need vectors?". The model-driven vector search, however, performs language-agnostic matching, effectively treating queries as if they were in English.
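If you want to reproduce that comparison, the sketch below runs the same Hindi query as both a lexical match query and a neural query, reusing the illustrative index and field names from the previous snippet; MODEL_ID is a hypothetical placeholder for your deployed model's ID.

```python
import requests

HOST = "https://localhost:9200"  # assumed local dev endpoint
AUTH = ("admin", "admin")        # example credentials
MODEL_ID = "<model ID returned by the deployment step>"  # placeholder

QUERY = "पानी की बोतल"  # "water bottle" in Hindi

# Lexical BM25 search: returns nothing against an English-only catalog.
bm25 = requests.post(
    f"{HOST}/products/_search",
    json={"query": {"match": {"product_name": QUERY}}},
    auth=AUTH, verify=False,
).json()
print("BM25 hits:", bm25["hits"]["total"]["value"])

# Neural search: the multilingual model embeds the Hindi query into the
# same vector space as the English products, so close matches still surface.
neural = requests.post(
    f"{HOST}/products/_search",
    json={"query": {"neural": {"product_name_embedding": {
        "query_text": QUERY,
        "model_id": MODEL_ID,
        "k": 5,
    }}}},
    auth=AUTH, verify=False,
).json()
print("Neural hits:", [h["_source"]["product_name"] for h in neural["hits"]["hits"]])
```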

For the complete implementation details, check out the sample scripts here or another similar re:Post from a colleague. The paraphrase-multilingual-MiniLM-L12-v2 model we're using provides out-of-the-box support for over 50 languages, including double-byte character languages like Arabic, Hindi, and Japanese, and it handles searches both with and without diacritical marks. However, an important caveat: local models should only be used for quick proofs of concept. For production deployments, I strongly recommend following the guide here to implement models via Amazon SageMaker, or using pre-trained models through Amazon Bedrock.
Conclusion
In this post, we explored how to implement a multi-lingual search solution using OpenSearch's local model capabilities. The key takeaways are:
- The paraphrase-multilingual-MiniLM-L12-v2 model enables language-agnostic semantic search across 50+ languages
- Local model deployment provides a quick way to test multi-lingual search capabilities, but it should be used only for quick proofs of concept
- The solution handles diacritics and double-byte character languages effectively
- This approach eliminates the language-specific configurations and mappings we saw in Scenario #2
Next Steps
To take this solution further, consider:
- Deploy embedding models on Amazon SageMaker or leverage pre-trained models on Amazon Bedrock for production workloads
- Implement language detection to improve search relevance. I plan to address language detection in my next re:Post.
- Add query expansion to handle regional language variations
- Monitor model performance and relevance metrics
- Explore other pre-trained models for specific language pairs or domains
References
- Deploying remote model
- Local vs. Remote model performance
- OpenSearch Pre-trained Models Documentation
- ML Commons Plugin Documentation
- Semantic Search Reference Implementation
- Amazon SageMaker Integration with OpenSearch