Plenary Talk: Crowdsourcing Fine-Grained Relevance Judgments

Effectiveness evaluation by means of a test collection is a standard methodology in information retrieval, with a long history. To gather relevance labels, the classical approach used in TREC-like initiatives has been to rely on binary relevance judgments expressed by trained assessors. Two more recent trends are to rely on crowd workers as assessors and to adopt multi-level relevance judgments, together with gain-based metrics that leverage such multi-level judgment scales.
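As a concrete illustration of a gain-based metric over a multi-level scale, the sketch below computes nDCG from graded relevance judgments. The 4-level scale and the example gain values are illustrative assumptions, not the specific configuration discussed in the talk.

```python
# Minimal sketch of a gain-based metric (nDCG) over graded relevance
# judgments; the scale and gains are illustrative, not the exact
# settings used in the experiments reported in the talk.
import math

def dcg(gains):
    """Discounted cumulative gain of a ranked list of gain values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Graded judgments on a 4-level scale (0 = not relevant .. 3 = highly relevant)
# for the top documents returned by a hypothetical system.
judged_run = [3, 0, 2, 1, 0]
print(round(ndcg(judged_run), 3))
```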
After a brief introduction to test-collection-based evaluation, I will report on two experiments focusing on such fine-grained relevance scales. In recent work (ACM SIGIR 2015, ACM TOIS 2017) we proposed unbounded relevance scales based on magnitude estimation and compared them with multi-level scales.
While magnitude estimation brings advantages, such as the ability for assessors to always judge the next document as more or less relevant than any document they have judged so far, it also has some drawbacks. For example, it is not a natural approach for untrained assessors, who are used to judging items on the Web with, e.g., 5-star ratings. In more recent work (ACM SIGIR 2018) we proposed to collect relevance judgments on a 100-level relevance scale, a bounded yet fine-grained scale that retains many of the advantages of magnitude estimation while addressing some of its issues. Both approaches were evaluated through large-scale crowdsourcing experiments that compare them with traditional relevance scales (binary, 4-level). The results show the benefits of fine-grained scales over coarse-grained ones.
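To give a rough idea of how judgments collected on such a fine-grained scale could feed a gain-based metric, the hedged sketch below aggregates several 0-100 crowd judgments per document and rescales them to a gain in [0, 1]. The median aggregation and the rescaling are assumptions of this sketch, not the exact procedure used in the cited papers.

```python
# Hedged sketch: one plausible way to turn 100-level crowd judgments
# into gains for a metric such as nDCG. The median aggregation and the
# division by 100 are illustrative assumptions only.
from statistics import median

def gain_from_100_level(judgments):
    """Aggregate crowd judgments on a 0-100 scale into a single gain in [0, 1]."""
    return median(judgments) / 100.0

# Three crowd workers judging the same document:
print(gain_from_100_level([70, 85, 80]))  # -> 0.8
```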

Joint work with Shane Culpepper, Gianluca Demartini, Eddy Maddalena, Kevin Roitero, Mark Sanderson, Falk Scholer, and Andrew Turpin.

Stefano Mizzaro, Università degli Studi di Udine