Theoretical and empirical analysis of similarity measures

Advisor: Dr. E. Amigó Cabrera

In many information access tasks, such as document clustering, filtering, and text evaluation, measuring the similarity between texts is a core issue. I will describe my work on three aspects: how to combine similarity measures, what the basic axioms of similarity are and what empirical effects they have, and how to exploit similarity training data. Regarding the first aspect, I will briefly describe my collaboration on the formal and empirical analysis of unsupervised combining functions. This work is closely related to ranking fusion, voting, and averaging techniques. Regarding the second aspect, I will describe a proposed theory that explains the relations between probabilistic, set-theoretic, and information-theoretic models. The resulting axioms will help us analyze state-of-the-art measures. I will show some experiments and point out the way forward. Regarding the third aspect, in the context of semi-supervised clustering, I will describe a proposal that takes into account both the content of the texts (a direct measure) and their proximity to a set of previously grouped texts, and I will show the experiments performed.

Fernando Giner Martínez