Guide to Data Labeling for Search Relevance Evaluation

Machine Learning (ML) has a number of applications in modern commerce, with Information Retrieval (IR) being one of the most common. Many e-businesses use it to gauge search relevance on their platforms and provide better service to users.

As a rule of thumb, when dealing with IR, the larger the corpus, the larger the evaluation dataset has to be and the more rigorous the evaluation process. One of the most effective ways to evaluate search relevance is through human-in-the-loop data labeling, of which crowdsourcing is our methodology of choice.

Search quality evaluation

An IR system responds to a query typed into the search bar by returning results ordered from most to least relevant. Normally, the user looks at the first 4–5 results, chooses something fitting, and ignores the rest.

With that in mind, it’s crucial that the system consistently ranks web search results with their relevance, timeliness, and scalability in mind, and a number of offline metrics serve to check this. NDCG (Normalized Discounted Cumulative Gain) is a ranking quality measure that’s often used to gauge web search engine effectiveness. Under this measure, an ideal system lists the items with the highest grades first.

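With the standard logarithmic discount, NDCG converges to 1 as the number of ranked items approaches infinity, which makes it hard to tell ranking systems apart on large collections. Two remedies are available, each introducing a hyper-parameter (the discount function or the number of documents considered):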

• Changing the discount function from log to a feasible one, e.g., 𝑟^(−1/2).

• Limiting the number of top-𝑘 retrieved documents, i.e., using NDCG@k.

Whatever the chosen route, adequate sampling is key to obtaining reliable results.
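
To make the two options above concrete, here is a minimal sketch of NDCG@k with a pluggable discount function. It assumes linear gain (the grade itself) and computes the ideal ordering from the grades of the retrieved documents only; both are simplifying assumptions, and the function names and grades are hypothetical.

```python
import math

def log_discount(rank):
    """Standard logarithmic discount; rank starts at 1."""
    return 1.0 / math.log2(rank + 1)

def dcg_at_k(grades, k, discount=log_discount):
    """Discounted cumulative gain over the first k graded results."""
    return sum(g * discount(r) for r, g in enumerate(grades[:k], start=1))

def ndcg_at_k(grades, k, discount=log_discount):
    """DCG@k divided by the DCG@k of the same grades sorted best-first."""
    ideal_dcg = dcg_at_k(sorted(grades, reverse=True), k, discount)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(grades, k, discount) / ideal_dcg

# Hypothetical relevance grades for one query, in the order the system returned them
grades = [3, 2, 3, 0, 1, 2]
print(ndcg_at_k(grades, k=5))                                # log discount
print(ndcg_at_k(grades, k=5, discount=lambda r: r ** -0.5))  # r^(-1/2) discount
```

Some definitions use an exponential gain such as 2^grade − 1 instead of the raw grade; swapping that in only changes the `g` term in `dcg_at_k`.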

Query sampling

Which queries exactly should be used in offline evaluation? Some argue that it’s the most popular queries that web users make on a regular basis. But what about those queries that are made less frequently, possibly even once?

The truth is that these one-off queries are usually quite challenging to process as they’re often ambiguous. However, if our service in question cannot perform well on these queries, the user may very well switch to another company that’s capable of handling more complex tasks. And between popular and unique queries, there also exists a wide range of mid-range queries.

Since we want our service to perform equally well on all types of queries, we want to get a balanced sample from all of them. The simplest way to achieve this is to toss a coin on every query to determine its inclusion into the sample. The obvious shortcoming of this approach is that we cannot actually control which queries are taken and discarded, so in theory the most popular query may be tossed away in the process. In addition, we cannot control our sample size with this method.
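
As a minimal sketch of the coin-toss approach, each query below is kept independently with probability `p`. The function name, the probability, and the query strings are illustrative assumptions.

```python
import random

def coin_toss_sample(queries, p, seed=None):
    """Keep each query independently with probability p (Bernoulli sampling)."""
    rng = random.Random(seed)
    return [q for q in queries if rng.random() < p]

# Hypothetical query log
queries = [f"query {i}" for i in range(1000)]

# The sample size differs from run to run, which is the drawback described above
print(len(coin_toss_sample(queries, p=0.05, seed=1)))
print(len(coin_toss_sample(queries, p=0.05, seed=2)))
```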

Reservoir sampling is a more sophisticated sampling approach. The main idea is that we keep a list of objects (with its length being equal to the required sample size), go through the queries, toss a coin, and subsequently replace an object on our list with a new one or skip this object altogether. While this approach offers an improved technique pertaining to sample properties, it still cannot guarantee our sample’s representativeness.
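
One common way to implement this idea is the classic single-pass variant often called Algorithm R, sketched below. It is shown as an illustration rather than as the exact procedure used in any particular pipeline; the function name and the query stream are hypothetical.

```python
import random

def reservoir_sample(stream, sample_size, seed=None):
    """Uniformly sample `sample_size` items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < sample_size:
            # Fill the reservoir with the first `sample_size` items
            reservoir.append(item)
        else:
            # Keep the new item with probability sample_size / (i + 1),
            # overwriting a uniformly chosen slot
            j = rng.randint(0, i)  # inclusive on both ends
            if j < sample_size:
                reservoir[j] = item
    return reservoir

# Hypothetical query stream
stream = (f"query {i}" for i in range(10_000))
print(reservoir_sample(stream, sample_size=5, seed=0))
```

Unlike the coin toss, the sample size is fixed in advance, but every query still has the same chance of being kept regardless of how popular it is.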

To make sure that all query types are present in our sample, we can instead use stratified sampling. To do that, we split queries into different buckets that have roughly the same number of non-unique-user queries but a different number of unique-user queries. Buckets are ordered in a way that any query from a smaller bucket will have a lower frequency than that of a greater bucket. After splitting is done, we can sample our queries from each bucket. And if our sample is big enough, we’ll get the most popular queries from our largest bucket.
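
The bucketing described above can be sketched as follows. This is one possible reading, under the assumption that buckets should carry roughly equal total traffic while being ordered from rare to frequent queries; the function name, bucket count, and query log are hypothetical.

```python
import random
from collections import Counter

def stratified_sample(query_log, n_buckets, per_bucket, seed=None):
    """Split distinct queries into frequency-ordered buckets of roughly equal
    total traffic, then sample up to `per_bucket` queries from each bucket."""
    rng = random.Random(seed)
    counts = Counter(query_log)
    # Distinct queries from rarest to most frequent (ties broken alphabetically)
    ordered = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    target = sum(counts.values()) / n_buckets  # traffic per bucket
    buckets, current, running = [], [], 0
    for query, freq in ordered:
        current.append(query)
        running += freq
        # Close the bucket once cumulative traffic reaches the next threshold
        if len(buckets) < n_buckets - 1 and running >= target * (len(buckets) + 1):
            buckets.append(current)
            current = []
    if current:
        buckets.append(current)
    sample = []
    for bucket in buckets:
        sample.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
    return buckets, sample

# Hypothetical log: one head query, two mid-range queries, six one-off queries
query_log = ["weather"] * 6 + ["news"] * 3 + ["cats"] * 3 + [f"one-off {i}" for i in range(6)]
buckets, sample = stratified_sample(query_log, n_buckets=3, per_bucket=2, seed=0)
print([len(b) for b in buckets])
print(sample)
```

In this toy log each bucket holds six query occurrences, yet the rare bucket contains six distinct queries and the popular bucket only one, so the head query is always sampled.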

Having acquired our query samples, we can use them in 3 key areas of offline evaluation:

• KPI (Key Performance Indicator) measurement: an overall daily service evaluation used to measure how well our service is performing.

• Validation basket: frequent testing, including but not limited to A/B experiments and fine-tuning of our service.

• Test basket: a separate basket for pre-release checks that we can use to combat overfitting in the KPI basket.

Every basket is sampled independently using the same basic technique. One thing to take into account is that long-term evaluation requires that we frequently update our baskets.

Application of crowdsourcing

With crowdsourcing, independent judgements from human annotators are used to evaluate performance of ecommerce platforms and other web services. Naturally, we want to obtain as fair an estimation of the platform’s search relevance quality as possible. And since artificial signals are prone to overfitting — especially when used in production — we need to obtain an independent signal source.

Human judgements are much more robust when it comes to overfitting, allowing us to measure search relevance without being influenced by any factors built into the system. Moreover, we can use a mixture of public and managed/in-house crowds to acquire the best results, which will in turn let us scale our measurements by an order of magnitude.

Crowdsourcing tasks

Our offline quality evaluation pipeline allows us to obtain accurate results within days, in some cases even hours. As a rule, though, the more complex the evaluation process, the longer it takes to obtain the results.

As to the specifics, side-by-side (SbS), or pairwise comparison, is one of the most common crowdsourcing tasks. Contributors are asked to compare search results, presented as images or text, two at a time and choose the more suitable one.

It’s important to keep in mind that the crowd is constantly changing and evolving. As a result, the most straightforward approach to offset that and get a more balanced view is to select a set of validation verdicts and compare it to newer labels. While it will measure markup quality, the resulting value of quality estimation is subject to several factors, namely:

• Clarity of instructions and abundance of examples, i.e., whether annotators understand everything and there are no contradictions or gray areas.

• Quality of golden (pre-annotated) tasks, i.e., new annotators might be submitting correct answers, but their judgements won’t match the golden set if the latter is itself compromised.

• Structure of golden tasks which — when structured poorly (either too simple or too complex) — will result in an over- or underestimation of search quality.

• Annotator selection, i.e., how fit the labelers are to tackle a particular task, both in terms of their background/preparedness and reliability/track record.

• Annotator motivation, i.e., what incentives are offered to the contributors to boost and reward their diligence and speed.
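
As a simple illustration of comparing new labels against validation verdicts, the sketch below computes each annotator’s agreement with a golden set. The data layout, names, and labels are assumptions made for the example; as the factors above suggest, the resulting number is only as trustworthy as the golden tasks themselves.

```python
from collections import defaultdict

def golden_accuracy(labels, golden):
    """Share of golden tasks each annotator answered in line with the golden set.

    labels: iterable of (annotator, task, answer) tuples
    golden: dict mapping task -> pre-annotated answer
    """
    hits, seen = defaultdict(int), defaultdict(int)
    for annotator, task, answer in labels:
        if task in golden:  # ignore tasks without a golden verdict
            seen[annotator] += 1
            hits[annotator] += int(answer == golden[task])
    return {annotator: hits[annotator] / seen[annotator] for annotator in seen}

# Hypothetical side-by-side verdicts: which result is better, "left" or "right"
golden = {"t1": "left", "t2": "right"}
labels = [
    ("ann1", "t1", "left"), ("ann1", "t2", "right"), ("ann1", "t3", "left"),
    ("ann2", "t1", "left"), ("ann2", "t2", "left"),
]
print(golden_accuracy(labels, golden))
```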

Aggregation of the answers

Once we’ve obtained the answers from our crowd annotators, a number of techniques can be used to aggregate these responses, but not all of them are equally suitable in every situation. Whereas in many classification tasks, there’s an inbuilt assumption of only one correct response, pairwise comparison tasks required for search query ranking and evaluation are often based on subjective opinions. To take this important factor into account, we can use the Bradley–Terry model for aggregation.

In this model, each item has a latent quality score. We can infer the scores by computing preference probabilities and subsequently use those to rank our items. However, one of the pitfalls of this model is that all of the annotators are assumed to be equally diligent and truthful, which isn’t always the case in practice.
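
For intuition, here is a minimal pure-Python sketch of fitting Bradley–Terry scores with the classic iterative (minorization-maximization) update. It illustrates the model only; it is not Crowd-Kit’s implementation, and the function name and comparison data are hypothetical.

```python
from collections import defaultdict

def bradley_terry(comparisons, n_iter=100):
    """Estimate latent scores from (winner, loser) pairs; scores sum to 1."""
    wins = defaultdict(int)
    opponents = defaultdict(lambda: defaultdict(int))  # item -> opponent -> comparison count
    for winner, loser in comparisons:
        wins[winner] += 1
        opponents[winner][loser] += 1
        opponents[loser][winner] += 1
    items = list(opponents)
    scores = {item: 1.0 / len(items) for item in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            # Sum over opponents j of n_ij / (p_i + p_j)
            denom = sum(n / (scores[i] + scores[j]) for j, n in opponents[i].items())
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        scores = {item: s / total for item, s in updated.items()}
    return scores

# Hypothetical verdicts: A usually beats B and C, B usually beats C
comparisons = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
scores = bradley_terry(comparisons)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Under the model, the probability that item i is preferred over item j is p_i / (p_i + p_j). Note that an item that never wins gets a score of zero in this sketch; practical implementations usually add smoothing to avoid that.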

Among the many solutions to this problem is NoisyBT, an extension of the model that treats not only items but also annotators as having quality parameters, namely reliability and bias. The likelihood is modified accordingly and then optimized with gradient descent. For those looking for simplicity, Crowd-Kit can help with aggregation by offering all of the aforementioned algorithms in an open-source Python library.

Selecting Pairs

To conduct a pairwise comparison task, we need to select paired items. The question is how. We could simply enumerate all possible combinations (for n objects there are n(n−1)/2 pairs, which grows as n²); however, labeling that many pairs is too expensive.

Instead, we can select a smaller yet reasonable subset of pairs, similar to how the merge sort algorithm orders n items using O(n log n) pairwise comparisons. Our experience shows that it is sufficient to sample k · n log n pairs of these objects, where the hyper-parameter k has to be tuned (usually, it does not exceed 12).
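
A minimal sketch of such a pair budget is shown below. It draws distinct pairs uniformly at random, which is an assumption (the text does not specify the sampling scheme), and it uses a base-2 logarithm, also an assumption; the function name and items are hypothetical.

```python
import math
import random

def sample_pairs(items, k=1.0, seed=None):
    """Sample about k * n * log2(n) distinct unordered pairs from `items`."""
    rng = random.Random(seed)
    n = len(items)
    if n < 2:
        return []
    max_pairs = n * (n - 1) // 2
    # Never ask for more pairs than exist
    budget = min(max_pairs, math.ceil(k * n * math.log2(n)))
    chosen = set()
    while len(chosen) < budget:
        i, j = rng.sample(range(n), 2)  # two distinct indices
        chosen.add((min(i, j), max(i, j)))
    return [(items[i], items[j]) for i, j in sorted(chosen)]

# Hypothetical search results to compare
docs = [f"doc {i}" for i in range(100)]
pairs = sample_pairs(docs, k=1.0, seed=0)
print(len(pairs), "pairs instead of", 100 * 99 // 2)
```

Uniform sampling does not guarantee that every item appears in at least one pair, so in practice it may be worth checking coverage before sending the pairs out for labeling.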

Conclusion

The quality of ML products very much relies on the data that’s being used, namely its quality, quantity, and prompt updates. Crowdsourcing can offer all of the above in a way that’s affordable, fast, and manageable at any scale.

With crowdsourcing, there’s a trade-off between data accuracy and dataset size, so those attempting to utilize crowdsourcing need to consider their long- and short-term goals. Generally speaking, crowdsourcing allows for large dataset volumes and tends to provide deeper insights into IR and search query relevance than managed/in-house crowds. Having said that, crowdsourcing and managed crowds are not mutually exclusive — the two can complement each other within a single pipeline, which is, in fact, a recommended strategy.

Useful resources

Text Retrieval Conference Data: https://trec.nist.gov/data.html

Toloka Aggregation Relevance Datasets: https://toloka.ai/datasets

This blog has been republished by AIIA. To view the original article, please click HERE.
