Search Relevance in Amazon OpenSearch Service, Part 1: Order-Unaware Metrics
Search relevance is the silent force behind great search experiences—and too often, it's ignored.
Introduction
When users cannot find products they know exist, or when searches return irrelevant results, they abandon your application. These common search relevance problems affect many organizations—from e-commerce platforms losing sales when products don't appear in results, to document repositories where critical information remains unfound. Measuring search relevance systematically is the first step toward identifying and solving these challenges.
Search engines are the central piece that connects users with their informational needs. OpenSearch, an open-source search engine built on Apache Lucene, handles the key tasks of indexing, query processing, matching, and ranking—providing the essential building blocks for developers to create efficient search systems. Amazon OpenSearch Service is a fully managed service that makes it simple to deploy, scale, and operate OpenSearch in the AWS Cloud.
This post series demonstrates how to measure search relevance through different relevance metrics, each addressing a specific search relevance challenge.
In part 1, we will focus on the challenges of finding all relevant items, filtering out irrelevant results, and balancing coverage and relevance by using three order-unaware relevance metrics: Recall, Precision and F1 Score.
Components of Search Relevance
Measuring search relevance requires understanding the components that work together in a search system. Let's walk through this process using a practical example from an e-commerce store.
When a shopper types ‘white t-shirt’ into the search bar, several components work together to deliver and evaluate the results (Figure 1). Think of it as a conversation between the shopper and the store. The shopper expresses what they want (the query), the store responds with suggestions (the response), and we have a way to measure how relevant these suggestions are to the shopper's needs (ground truth).
Figure 1. Components for measuring search relevance.
Let's examine how these components work in practice. When our shopper searches for a ‘white t-shirt’, this becomes a query—the starting point for measuring relevance. The search engine processes this query and returns a response—a ranked list of products. In OpenSearch, you can control the size of the response with the size parameter.
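For example, such a query can be issued with the Python client for OpenSearch. The sketch below is illustrative only: the domain endpoint, the products index, and the title field are placeholder assumptions, and the connection details depend on how your Amazon OpenSearch Service domain is configured.

```python
# Minimal sketch, assuming the opensearch-py client and a hypothetical
# "products" index with a "title" field; adjust host and auth for your domain.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain-endpoint.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

response = client.search(
    index="products",
    body={
        "query": {"match": {"title": "white t-shirt"}},  # the shopper's query
        "size": 7,  # controls how many results the response contains
    },
)

# Each hit in the ranked response carries the document and its relevance score.
for hit in response["hits"]["hits"]:
    print(hit["_source"]["title"], hit["_score"])
```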
But how do we know if these results are relevant? This is where ground truth comes in—our benchmark for what constitutes a relevant response. Think of ground truth as an expert buyer marking each product as either relevant or not relevant to ‘white t-shirt’. Sometimes we use a simple yes/no approach (binary relevance), like marking only pure white t-shirts as relevant. Other times, we need more nuance (graded relevance), perhaps rating products on a scale from 0 to 5, where white t-shirts get a 5, off-white shirts might get a 4, and unrelated items get a 0.
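In code, ground truth is often just a mapping from product ID to a judgment. The sketch below shows both styles using hypothetical product IDs, judged against the query ‘white t-shirt’.

```python
# Binary relevance: 1 = relevant, 0 = not relevant (hypothetical product IDs).
binary_judgments = {
    "tshirt-001": 1,  # pure white t-shirt: relevant
    "hoodie-014": 0,  # graphic hoodie: not relevant
}

# Graded relevance: a 0-5 scale that captures degrees of relevance.
graded_judgments = {
    "tshirt-001": 5,  # exact match: white t-shirt
    "shirt-004": 4,   # off-white shirt: close but not exact
    "mug-020": 0,     # unrelated item
}
```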
With these components in place, we can measure relevance – how well our search results match what the shopper wanted. Relevance metrics fall into two categories, depending on what they measure. Order-aware metrics take the order of results into account, rewarding systems that place the best matches first. Order-unaware metrics measure only whether the right items are included, regardless of their position.
We also need to decide how many results to evaluate. This evaluation stage, often written as K, helps us focus on what matters most. For example, if we set K=10, we're only looking at the first ten results, similar to how shoppers typically focus on the first page of search results.
To make this concrete, let's see how these components work together to measure search relevance in our e-commerce example.
Measuring Relevance in a Sample Search Scenario
Let's continue with our e-commerce store to see how these relevance metrics work in practice. Imagine you are responsible for improving the product search experience. Your team has received feedback that the search results are not meeting customer expectations, so you decide to measure the search relevance systematically.
Starting with a common search scenario, you measure how your system ranks the responses to the query ‘white t-shirt’. This everyday search request provides an excellent way to understand how different relevance metrics capture various aspects of search relevance.
Setting Up the Ground Truth
Our e-commerce store search system currently returns seven product images for the query ‘white t-shirt’. Using the binary relevance approach (relevant/not relevant), the merchandising team creates the ground truth, identifying three products as relevant to the query (Figure 2).
Figure 2. Binary ground truth for the query ‘white t-shirt’ indicating relevant and non-relevant products.
Simulating the Search Engine Results
When a customer enters ‘white t-shirt’ into your search bar, your system returns a ranked list of seven products (Figure 3). The order matters—customers typically focus on the first few results, so getting these right is crucial.
Figure 3. Ranked list of documents produced by the search engine for the query ‘white t-shirt’ with size=7.
Finally, we need to define the evaluation stage K for our relevance measurements. Let's set K=2 since many users never scroll past the first few items (Figure 4).
"Figure 4. Ranked results for the query ‘white t-shirt’ showing evaluation stage K=2.")
Figure 4. Ranked results for the query ‘white t-shirt’ showing evaluation stage K=2.
This scenario provides all components needed to measure search relevance:
- The query (‘white t-shirt’)
- The response (seven ranked products)
- The ground truth (relevance judgements)
- The evaluation stage (K=2)
This scenario demonstrates common search relevance challenges. Results may miss relevant t-shirts, include irrelevant items, or rank products suboptimally. Each challenge requires specific metrics to measure and improve search relevance. Let's examine how different metrics measure these distinct aspects of search relevance in OpenSearch.
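To keep the upcoming calculations concrete, the sketch below encodes this scenario in Python. Based on the measurements in this post (precision@2 of 0.5 and precision highest at K=1), the first result is relevant and the second is not; the positions of the other two relevant products within the remaining five results are assumed for illustration.

```python
# Scenario from this post: 7 ranked results, 3 relevant products in the ground truth.
# 1 = relevant, 0 = not relevant, listed in the order the search engine returned them.
# Positions of the 2nd and 3rd relevant products are assumed for illustration.
query = "white t-shirt"
ranked_relevance = [1, 0, 0, 1, 0, 0, 1]
total_relevant = 3  # relevant products identified in the ground truth
k = 2               # evaluation stage
```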
Challenge 1: Search Misses Available Products
When searching for ‘white t-shirt’, the system returns one relevant item in the first two results (K=2), while our ground truth identifies three relevant products. This demonstrates a relevance gap – the system fails to return all relevant products in the top results.
Recall (R@K) directly addresses this challenge by measuring the proportion of relevant documents retrieved among the top K results, compared to all relevant documents in the dataset's ground truth. In simpler terms: What percentage of all relevant documents appear in the top K results?
Let's calculate recall for our ‘white t-shirt’ query. With K=2, we measure recall using:

Recall@2 = (relevant products in the top 2 results) / (total relevant products in the ground truth) = 1 / 3 ≈ 0.33
A recall of 0.33 indicates the system returns one-third of all relevant products in the top two results. Recall typically increases with K as more relevant documents appear in the results. One way to maximize it is by broadening the search—for example, using more general keywords, expanding synonyms, or loosening filters. When K=7, our system finds all three relevant items, achieving perfect recall but introducing more irrelevant results.
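As a sketch, recall@K can be computed directly from the binary relevance labels of the ranked results, using the assumed ordering from the setup above:

```python
def recall_at_k(ranked_relevance, total_relevant, k):
    """Share of all relevant items that appear in the top k results."""
    return sum(ranked_relevance[:k]) / total_relevant

# Assumed ordering from the setup sketch: 1 = relevant, 0 = not relevant.
ranked_relevance = [1, 0, 0, 1, 0, 0, 1]

print(recall_at_k(ranked_relevance, total_relevant=3, k=2))  # ~0.33 -> one of three
print(recall_at_k(ranked_relevance, total_relevant=3, k=7))  # 1.0 -> perfect recall
```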
Challenge 2: Search Shows Irrelevant Products
Our example shows another critical issue: the search returns products unrelated to white t-shirts. This dilutes search quality and creates a poor user experience.
Precision (P@K) directly addresses this challenge by measuring the proportion of relevant documents within the top K retrieved results. In simpler terms: What fraction of the top K retrieved documents are relevant?
Let's calculate precision for our ‘white t-shirt’ query. With K=2, we measure precision using:

Precision@2 = (relevant products in the top 2 results) / 2 = 1 / 2 = 0.5
A precision of 0.5 indicates that half of the top two results are relevant. While we could achieve perfect precision by showing only exact matches—like limiting results to items explicitly labeled ‘white t-shirt’—this approach might be too restrictive. In OpenSearch, you can improve precision while maintaining result coverage through field boosting to prioritize exact matches, custom analyzers to handle variations in product descriptions, and query refinement to better interpret search intent.
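Precision@K follows the same pattern as the recall sketch, dividing by K instead of by the number of relevant items in the ground truth:

```python
def precision_at_k(ranked_relevance, k):
    """Share of the top k retrieved items that are relevant."""
    return sum(ranked_relevance[:k]) / k

ranked_relevance = [1, 0, 0, 1, 0, 0, 1]  # assumed ordering from the setup sketch

print(precision_at_k(ranked_relevance, k=2))  # 0.5 -> one of the top two is relevant
print(precision_at_k(ranked_relevance, k=7))  # ~0.43 -> more coverage, more noise
```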
Challenge 3: Balancing Coverage and Relevance
Our search faces a dilemma. Our recall of 0.33 shows we're missing two-thirds of relevant products, while our precision of 0.5 indicates half of the returned results are irrelevant. Optimizing for recall alone could flood results with irrelevant items, while focusing solely on precision might hide valid matches. We need a way to measure both aspects simultaneously.
F1 Score (F1@K) balances precision and recall, combining them into a single harmonic mean metric. This metric penalizes large differences between precision and recall, measuring how well a system balances both aspects of relevance.
Let's calculate the F1 score for our ‘white t-shirt’ query with K=2, using our previously measured precision@2 of 0.5 and recall@2 of 0.33:

F1@2 = 2 × (Precision@2 × Recall@2) / (Precision@2 + Recall@2) = 2 × (0.5 × 0.33) / (0.5 + 0.33) ≈ 0.4
An F1@2 score of 0.4 indicates there is room for improvement in both precision and recall – we are missing relevant white t-shirts and showing unrelated items. Optimizing F1 Score in OpenSearch prevents showing every product in the catalog just to include all white t-shirts, or showing only perfect matches while hiding good alternatives.
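Combining the two earlier sketches gives F1@K; the version below guards against the degenerate case where both precision and recall are zero:

```python
def f1_at_k(ranked_relevance, total_relevant, k):
    """Harmonic mean of precision@k and recall@k."""
    p = sum(ranked_relevance[:k]) / k
    r = sum(ranked_relevance[:k]) / total_relevant
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

ranked_relevance = [1, 0, 0, 1, 0, 0, 1]  # assumed ordering from the setup sketch

print(round(f1_at_k(ranked_relevance, total_relevant=3, k=2), 2))  # 0.4
```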
Measuring Relevance Across Different Evaluation Stages
We have measured precision, recall, and F1 score at K=2, but in an e-commerce store, you need to decide how many products to show on the first page. Should you display 5 products? 10? 20? Let's measure how these relevance metrics change with different K values to inform this decision (Figure 5).
Figure 5. Recall@K, precision@K and F1@K evaluated with K range [1-7].
Our analysis shows how these metrics vary with K. Precision is highest at K=1, recall reaches its maximum at K=7, and F1 score peaks at K=7 with 0.6. Depending on which search relevance challenge you want to address—missing products (recall), irrelevant results (precision), or balancing both (F1)—you would choose a different K value for your e-commerce store. In our case, if we prioritize overall balance, displaying seven products on the first page would maximize search relevance for ‘white t-shirt’ queries.
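A short sweep over K reproduces this comparison. Because the ordering of the last five results is assumed, intermediate values may differ slightly for your own data, but the end points at K=1, K=2, and K=7 match the measurements above:

```python
ranked_relevance = [1, 0, 0, 1, 0, 0, 1]  # assumed ordering from the setup sketch
total_relevant = 3

for k in range(1, 8):
    p = sum(ranked_relevance[:k]) / k
    r = sum(ranked_relevance[:k]) / total_relevant
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    print(f"K={k}: precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
```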
Summary
In this post, we first explained the components involved in search relevance. We then built an example to show how different search relevance metrics reveal different aspects of search relevance, helping you address specific business challenges:
- Use Recall when you need to maximize product discovery and prevent lost sales from unfound items
- Apply Precision to improve user satisfaction by delivering more accurate, relevant results
- Implement F1 Score when you need to optimize both result coverage and accuracy simultaneously
The next post in our series expands on the use of order-aware relevance metrics to better understand the search challenges of optimizing result ordering and handling different degrees of relevance.