Multilingual search with Amazon OpenSearch
Motivation
You may need to search a corpus in a language other than the one it is written in. For example, your corpus is in English, but your users search in other languages. In addition, some languages (e.g., Hebrew) have limited built-in support in Amazon OpenSearch.
Architecture
This short tutorial shows how to use vector search with Amazon OpenSearch. Each document in the corpus is transformed into a vector. When a user submits a search query, the query is also transformed into a vector, and Amazon OpenSearch's vector search capabilities find the vectors in the corpus that are most similar to the query vector.
The OpenSearch Neural Search plugin enables the integration of machine learning (ML) language models into your search workloads. During ingestion and search, the Neural Search plugin transforms text into vectors. Neural Search then uses the transformed vectors in vector-based search. The language model we use is multilingual, and this is what allows us to perform multilingual search.
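To make the idea of "similar vectors" concrete, here is a minimal Python sketch of nearest-neighbor search by cosine similarity. It is only an illustration: the three-dimensional vectors and the document names are made up, and the sketch compares the query against every document exactly, whereas the index built later in this tutorial uses an approximate HNSW search over 384-dimensional vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vector, corpus_vectors, k=2):
    """Return the ids of the k corpus vectors most similar to the query."""
    ranked = sorted(
        corpus_vectors,
        key=lambda doc_id: cosine_similarity(query_vector, corpus_vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings; the real model outputs 384 dimensions.
corpus = {
    "gold sunglasses": [0.9, 0.1, 0.0],
    "blue jeans": [0.0, 0.2, 0.9],
    "silver watch": [0.5, 0.8, 0.1],
}

# Stands in for the embedding of a query written in another language.
query = [0.8, 0.2, 0.1]

top_two = nearest(query, corpus, k=2)
print(top_two)
```

Because a multilingual model is expected to map texts with the same meaning to nearby vectors regardless of language, the ranking step itself never needs to know which language the query was written in.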
Step-by-step tutorial
This tutorial assumes you have an Amazon OpenSearch domain running version 2.7 or later. You can then connect to OpenSearch Dashboards and run the following commands from Dev Tools:
#Run the following command to change the OpenSearch domain settings so that machine learning code can run on data nodes.
PUT /_cluster/settings
{
"persistent":{
"plugins.ml_commons.only_run_on_ml_node": false
}
}
#Upload the model and note down the task_id returned by the following command.
#Note that we are using a paraphrase-multilingual model.
#This allows us to run queries in one language and get results from documents in another language.
POST /_plugins/_ml/models/_upload
{
"name": "paraphrase-multilingual-MiniLM-L12-v2",
"version": "1.0.1",
"description": "multilingual model",
"model_format": "TORCH_SCRIPT",
"model_config": {
"model_type": "bert",
"embedding_dimension": 384,
"framework_type": "sentence_transformers"
},
"url": "https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/1.0.1/torch_script/sentence-transformers_paraphrase-multilingual-MiniLM-L12-v2-1.0.1-torch_script.zip?raw=true"
}
#Use the task_id above to get the model_id. Replace {task_id} in the following command with the value from the previous output. Note down the model_id.
GET /_plugins/_ml/tasks/{task_id}
#Load the model for inference. Replace {model_id} in the following command with the value from the previous output. Note down the task_id.
POST /_plugins/_ml/models/{model_id}/_load
#Check the state of the model loading (wait for a "COMPLETED" state). Replace {task_id} in the following command with the value from the previous output.
GET /_plugins/_ml/tasks/{task_id}
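If you script these steps instead of running them by hand, the "check the task until it is COMPLETED" step becomes a polling loop. The Python sketch below is one possible shape for it, not part of the original tutorial. The get_task argument is a hypothetical caller-supplied function that performs GET /_plugins/_ml/tasks/{task_id} and returns the parsed JSON. Treating a "FAILED" state as an error and reading an "error" field are assumptions about the task response, so adjust them to what your domain returns.

```python
import time


def wait_for_task(get_task, task_id, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll an ML task until it is COMPLETED and return the final task document.

    get_task is a hypothetical callable: it takes a task id and returns the
    parsed JSON body of GET /_plugins/_ml/tasks/{task_id} as a dict.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        task = get_task(task_id)
        state = task.get("state")
        if state == "COMPLETED":
            return task
        if state == "FAILED":  # assumed failure state
            raise RuntimeError(f"Task {task_id} failed: {task.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Task {task_id} is still in state {state!r}")
        time.sleep(poll_seconds)


# Demonstration with canned responses instead of a real cluster.
canned = iter([
    {"state": "CREATED"},
    {"state": "RUNNING"},
    {"state": "COMPLETED", "model_id": "example-model-id"},
])
finished = wait_for_task(lambda _task_id: next(canned), "example-task-id", poll_seconds=0)
print(finished["model_id"])
```

Passing the HTTP call in as a function keeps the sketch independent of any particular client library or authentication method.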
#Here, we create a pipeline that converts the "description" field into a vector and stores it in the "description_vector" field.
#The pipeline uses the model we just uploaded. Replace {model_id} in the following command.
PUT _ingest/pipeline/neural_pipeline
{
"description": "Multilingual search pipeline",
"processors" : [
{
"text_embedding": {
"model_id": "{model_id}",
"field_map": {
"description": "description_vector"
}
}
}
]
}
#Create the index.
#We create an index named "english_catalog" and enable k-NN for it.
#The index uses the pipeline created above to convert the "description" field into a vector, which is stored in the "description_vector" field.
#"description_vector" is a k-NN vector with 384 dimensions, the same as the model's output dimension.
#Note: the field-level "space_type" ("l2") under "method" is the one applied to "description_vector"; the index-level "index.knn.space_type" setting only acts as a default for fields that do not set their own.
PUT /english_catalog
{
"settings": {
"index": {
"number_of_shards": 1,
"number_of_replicas": 1
},
"index.knn": true,
"index.knn.space_type": "cosinesimil",
"default_pipeline": "neural_pipeline",
"analysis": {
"analyzer": {
"default": {
"type": "standard",
"stopwords": "_english_"
}
}
}
},
"mappings": {
"properties": {
"description_vector": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"space_type": "l2",
"engine": "faiss"
},
"store": true
},
"description": {
"type": "text",
"store": true
}
}
}
}
#We use bulk index to ingest data in English.
#This is just an example of several docs.
#In this stage you should index your entire corpus.
POST /english_catalog/_bulk
{"index": {"_index": "english_catalog", "_id": 0}}
{"gender": "Men", "masterCategory": "Apparel", "subCategory": "Topwear", "articleType": "Shirts", "baseColour": "Navy Blue", "season": "Fall", "year": "2011", "usage": "Casual", "productDisplayName": "Turtle Check Men Navy Blue Shirt", "description": "Men Apparel Topwear Shirts Navy Blue Fall 2011 Casual Turtle Check Men Navy Blue Shirt"}
{"index": {"_index": "english_catalog", "_id": 1}}
{"gender": "Men", "masterCategory": "Apparel", "subCategory": "Bottomwear", "articleType": "Jeans", "baseColour": "Blue", "season": "Summer", "year": "2012", "usage": "Casual", "productDisplayName": "Peter England Men Party Blue Jeans", "description": "Men Apparel Bottomwear Jeans Blue Summer 2012 Casual Peter England Men Party Blue Jeans"}
{"index": {"_index": "english_catalog", "_id": 2}}
{"gender": "Women", "masterCategory": "Accessories", "subCategory": "Watches", "articleType": "Watches", "baseColour": "Silver", "season": "Winter", "year": "2016", "usage": "Casual", "productDisplayName": "Titan Women Silver Watch", "description": "Women Accessories Watches Watches Silver Winter 2016 Casual Titan Women Silver Watch"}
{"index": {"_index": "english_catalog", "_id": 3}}
{"gender": "Men", "masterCategory": "Accessories", "subCategory": "Eyewear", "articleType": "Sunglasses", "baseColour": "Gold", "season": "Winter", "year": "2016", "usage": "Casual", "productDisplayName": "Pal Zileri Men Casual Gold Frame Sunglasses", "description": "Men Accessories Eyewear Sunglasses Gold Winter 2016 Casual Pal Zileri Men Casual Gold Frame Sunglasses"}
{"index": {"_index": "english_catalog", "_id": 4}}
{"gender": "Men", "masterCategory": "Accessories", "subCategory": "Eyewear", "articleType": "Sunglasses", "baseColour": "Silver", "season": "Winter", "year": "2016", "usage": "Casual", "productDisplayName": "Louis Philippe Men Grey Sunglasses", "description": "Men Accessories Eyewear Sunglasses Silver Winter 2016 Casual Louis Philippe Men Grey Sunglasses"}
{"index": {"_index": "english_catalog", "_id": 5}}
{"gender": "Women", "masterCategory": "Accessories", "subCategory": "Eyewear", "articleType": "Sunglasses", "baseColour": "Silver", "season": "Winter", "year": "2016", "usage": "Casual", "productDisplayName": "Polaroid Women Sunglasses", "description": "Women Accessories Eyewear Sunglasses Silver Winter 2016 Casual Polaroid Women Sunglasses"}
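For a full corpus you would generate the bulk request body rather than type it. In the sample documents, the "description" field is the other field values joined with spaces, so a Python sketch along the following lines could build both the description and the newline-delimited bulk body. The function names are hypothetical, and the sketch omits "_index" from each action line on the assumption that the body is sent to POST /english_catalog/_bulk, where the index comes from the URL path.

```python
import json

# Order of fields that make up "description" in the sample documents.
DESCRIPTION_FIELDS = [
    "gender", "masterCategory", "subCategory", "articleType", "baseColour",
    "season", "year", "usage", "productDisplayName",
]


def with_description(product):
    """Return a copy of the product with a space-joined "description" field."""
    doc = dict(product)
    doc["description"] = " ".join(str(product[name]) for name in DESCRIPTION_FIELDS)
    return doc


def build_bulk_body(docs, start_id=0):
    """Return NDJSON text: one action line followed by one source line per doc."""
    lines = []
    for offset, doc in enumerate(docs):
        lines.append(json.dumps({"index": {"_id": str(start_id + offset)}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


watch = {
    "gender": "Women", "masterCategory": "Accessories", "subCategory": "Watches",
    "articleType": "Watches", "baseColour": "Silver", "season": "Winter",
    "year": "2016", "usage": "Casual", "productDisplayName": "Titan Women Silver Watch",
}
bulk_body = build_bulk_body([with_description(watch)], start_id=2)
print(bulk_body, end="")
```

The ingest pipeline still computes "description_vector" on the server side, so the client only needs to send the text fields.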
#We can now search in multiple languages.
#Replace {model_id} with the model_id you noted down earlier.
#Replace the query_text with any of the following examples:
# משקפי שמש זהב נשים (Hebrew: women's gold sunglasses)
# משקפי שמש זהב גברים (Hebrew: men's gold sunglasses)
# gafas de sol hombre oro (Spanish: men's gold sunglasses)
GET /english_catalog/_search
{
"_source": [ "description", "subCategory"],
"size": 100,
"query": {
"neural": {
"description_vector": {
"query_text": "משקפי שמש זהב נשים",
"model_id": "{model_id}",
"k": 10000
}
}
}
}
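To try several query texts without editing the request by hand each time, the search body can be generated. The Python sketch below mirrors the structure of the request above. The function name and its default values are illustrative only, and "{model_id}" remains a placeholder for your own model ID.

```python
import json


def neural_query(query_text, model_id, k=10, size=10):
    """Build a neural search request body against the "description_vector" field."""
    return {
        "_source": ["description", "subCategory"],
        "size": size,
        "query": {
            "neural": {
                "description_vector": {
                    "query_text": query_text,
                    "model_id": model_id,
                    "k": k,
                }
            }
        },
    }


# The same example queries as in the tutorial, in Hebrew and Spanish.
example_queries = [
    "משקפי שמש זהב נשים",
    "משקפי שמש זהב גברים",
    "gafas de sol hombre oro",
]
bodies = [neural_query(text, "{model_id}") for text in example_queries]

# json.dumps escapes non-ASCII characters by default, so this prints safely anywhere.
print(json.dumps(bodies[2], indent=2))
```

The defaults here use a much smaller k than the request above, on the assumption that a handful of neighbors is enough for a six-document sample; pass k and size explicitly to match the original values.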
Troubleshooting
Make sure the data nodes in your cluster have enough memory to load the model. For me, it worked with 4 x r6g.4xlarge.search data nodes.
Resources
https://catalog.workshops.aws/semantic-search/en-US
https://opensearch.org/docs/latest/search-plugins/neural-search/
https://opensearch.org/docs/latest/ml-commons-plugin/ml-framework/