Search Relevance in Amazon OpenSearch Service, Part 2: Order-aware metrics
Search relevance is the silent force behind great search experiences—and too often, it's ignored.
Introduction
In Part 1 of our series, we discussed how the binary relevance metrics of Recall, Precision, and F1 Score can help us better understand the search challenges of finding all relevant items, filtering out irrelevant results, and balancing coverage and relevance.
In this post, we discuss how the order-aware relevance metrics of Mean Average Precision (MAP@K) and Normalized Discounted Cumulative Gain (NDCG@K) can help us better understand the search challenges of optimizing result ordering and handling different degrees of relevance.
Sample Search Scenario Recap
To contextualize our discussion of order-aware relevance metrics, let's quickly revisit our 'white t-shirt' search scenario from Part 1. Here's how our search engine ranked the results, with K=2 representing our evaluation stage:
Figure 1. Ranked results for the query ‘white t-shirt’ showing evaluation stage K=2.
Challenge 4: Results in the Wrong Order
Our previous metrics measure which results appear in the top K, but not their order. For example, in our e-commerce store, when blue shoes appear before white t-shirts, precision and recall values stay the same—yet for shoppers, the search relevance is clearly worse. A shopper looking for white t-shirts might not scroll past irrelevant results to find what they need.
Mean Average Precision (MAP@K) is an order-aware relevance metric that measures how effectively a system retrieves relevant documents at the top of the result list. This is especially important in scenarios where users tend to focus on the first few results. MAP provides a single score by combining precision across multiple positions in the ranking, highlighting the system’s ability to prioritize relevant documents.
The term "mean average" in MAP reflects its two-step calculation process: first computing the average precision for each query, then calculating the mean of these averages across multiple queries. In our case, with a single query ‘white t-shirt’, we only need to calculate one average precision.
In our example, with the single query 'white t-shirt', we only need to calculate one average precision. That calculation has two parts:
- Divisor - The total number of relevant results in our ground truth, that is 3.
- Dividend (Figure 2) - Combines the precision@k value at each position with the relevance parameter rel_k.
The rel_k parameter indicates the relevance of the k-th ranked item, where:
- rel_k = 1 for relevant items
- rel_k = 0 for non-relevant items.
Figure 2. rel_k with K range [1-7]
Finally, we do the math.
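Assuming, as in the Part 1 ranking, that the first result is relevant, the second is not, and the ground truth contains three relevant items, the calculation works out as follows:

```python
rel = [1, 0]        # rel_k for the top K=2 results: relevant, then irrelevant (assumed ordering)
total_relevant = 3  # relevant items in the ground truth (the divisor)

# precision@1 = 1/1 and precision@2 = 1/2; only positions with rel_k = 1 contribute
ap_at_2 = (1.0 * rel[0] + 0.5 * rel[1]) / total_relevant
print(round(ap_at_2, 2))  # 0.33 -> with a single query, MAP@2 = 0.33
```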
MAP produces values between 0 and 1, where 1 indicates perfect ranking (all relevant products at the top). Our MAP score of 0.33 indicates our system ranks relevant products relatively poorly - there's room for improvement. Use MAP when you need to measure and optimize how well your system positions relevant results at the top of the search results.
Challenge 5: Not All Matches Are Equal
MAP treats relevance as binary—a result is either relevant or not. However, not all mismatches affect search relevance equally. For example, showing an off-white t-shirt to someone searching for a white t-shirt might be slightly off but still useful, while showing blue shoes is completely irrelevant and might cause shoppers to abandon their search entirely.
Normalized Discounted Cumulative Gain (NDCG@K) provides a nuanced measure that considers both the graded relevance of documents and their position in the ranking. The following three features make NDCG one of the most widely used relevance metrics in scientific research:
- Graded relevance - Measures ranking quality by prioritizing highly relevant documents in the top positions, where users are most likely to see them.
- Normalization - Enables fair comparison across different queries and datasets. This accounts for variations in the number of relevant documents per query and keeps the metric consistent across contexts.
- Interpretability - Provides scores between 0 and 1, where 1 represents perfect ranking with all relevant documents optimally ordered at the top.
NDCG@K expands beyond binary relevance (relevant/not relevant) to use graded relevance scores. The sample dataset ground truth changes from binary classification to a graded relevance scale of [0 → 2] (Figure 3). In this scale, 0 represents the least relevant results, while 2 indicates the most relevant matches.
Figure 3. Returned results from our search engine for the query ‘white t-shirt’, with ground truth tags on a graded relevance scale of [0 → 2].
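For illustration, the graded judgments might look like the following; these specific grade assignments are assumptions made for this walkthrough, not values prescribed by the metric:

```python
# Assumed graded relevance judgments on the [0 -> 2] scale
graded_relevance = {
    "white t-shirt": 2,      # exact match
    "off-white t-shirt": 1,  # slightly off, but still useful
    "blue shoes": 0,         # irrelevant
}
```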
The NDCG@2 calculation follows these four steps:
- Calculate the cumulative gain (CG@2). CG represents the sum of document relevance scores without position-based discounts.
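Assuming the top two results are the white t-shirt (grade 2) followed by the blue shoes (grade 0), a minimal sketch:

```python
grades = [2, 0]  # assumed grades of the top K=2 results, in ranked order

cg_at_2 = sum(grades)  # cumulative gain simply adds the grades, ignoring position
print(cg_at_2)  # 2
```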
While cumulative gain measures basic relevance, it has two limitations:
- Lacks order awareness - Swapping positions of results 1 and 2 produces the same score.
- Lacks normalization - Scores aren't standardized.
Discounted cumulative gain (DCG@K) addresses the first limitation by adding position-based penalties to the calculation.
- Calculate the discounted cumulative gain (DCG@2). Apply position-based penalties to create order awareness.
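A common discount divides each grade by log2(position + 1); with the same assumed grades as above:

```python
import math

grades = [2, 0]  # assumed grades of the top K=2 results, in ranked order

dcg_at_2 = sum(rel / math.log2(pos + 1) for pos, rel in enumerate(grades, start=1))
print(dcg_at_2)  # 2.0 -> the grade-2 item at position 1 takes no discount
```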
When reordering the results to place less relevant items before more relevant ones, the DCG@2 score decreases. Let's imagine that the irrelevant item had come before the relevant one:
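Continuing the assumed example, swapping the two results confirms the drop:

```python
import math

swapped = [0, 2]  # assumed grades with the irrelevant item ranked first

dcg_swapped = sum(rel / math.log2(pos + 1) for pos, rel in enumerate(swapped, start=1))
print(round(dcg_swapped, 2))  # 1.26, down from 2.0 for the original order
```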
DCG@K addresses order awareness but still lacks normalization. The values depend on the chosen relevance range—using [0 → 5] instead of [0 → 2] produces different results. Computing the Ideal Discounted Cumulative Gain (IDCG), which represents the optimal ranking order, enables normalization for NDCG calculation.
- Calculate the ideal discounted cumulative gain (IDCG@2). Arrange items by descending relevance to determine the optimal ranking order (Figure 4).
Figure 4. Ranked results that achieve an Ideal DCG@K
IDCG@2 is calculated using the two items with a relevance value of 2:
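With the two grade-2 items ranked first and the same log2 discount used above, a sketch of the calculation:

```python
import math

ideal = [2, 2]  # the two most relevant items, placed in the top two positions

idcg_at_2 = sum(rel / math.log2(pos + 1) for pos, rel in enumerate(ideal, start=1))
print(round(idcg_at_2, 2))  # 3.26
```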
- Normalize DCG@2 using IDCG@2. Divide DCG@2 by IDCG@2 to obtain the final standardized score.
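Putting the assumed numbers together:

```python
dcg_at_2 = 2.0    # from the assumed ranking (white t-shirt first, blue shoes second)
idcg_at_2 = 3.26  # ideal ordering of the two grade-2 items

ndcg_at_2 = dcg_at_2 / idcg_at_2
print(round(ndcg_at_2, 2))  # ~0.61, on the standardized 0-1 scale
```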
NDCG normalization enables comparison between systems using different relevance scales. For example, systems using ranges [0 → 3] and [0 → 10] produce comparable NDCG values because each score scales relative to its ideal ranking. This standardization ensures consistent evaluation across different graded relevance schemes.
Summary
Building on our search scenario from Part 1, we've explored two sophisticated, order-aware relevance metrics that provide deeper insights into search result quality. MAP enables us to account for result ordering using a binary relevance ground truth, while NDCG leverages a graded relevance ground truth to measure both the order of results and their degree of relevance.
These metrics offer powerful tools for evaluating and improving search performance, but their true value lies in consistent application. Implementing these relevance metrics in your search systems is not a one-time task, but rather an ongoing process of refinement. By regularly measuring and analyzing these metrics, you can continually optimize your search functionality to better meet your users' evolving informational needs.
As you move forward, remember that the goal is not just to improve numbers, but to enhance the overall search experience. Let these metrics guide you in creating more intuitive, efficient, and satisfying search results for your users.