Lecture Notes in Computer Science vol:6804 pages:501-512
International Symposium on Methodologies for Intelligent Systems edition:19 location:Warsaw, Poland date:28-30 June 2011
The notion of similarity is crucial to many tasks and methods in machine learning and data mining, including clustering and nearest neighbor classification. In many contexts, a natural (but not necessarily optimal) similarity measure is defined on the objects to be clustered or classified, and information is also available about which objects are linked together. This raises the question of to what extent the information contained in the links can be used to obtain a more relevant similarity measure. Earlier research has shown empirically that including such link information yields more accurate results, but did not analyze why this is the case. In this paper we provide such an analysis. We relate the extent to which improved results can be obtained to the notions of homophily in the network, transitivity of similarity, and content variability of objects. We explore this relationship using randomly generated datasets in which we vary the amount of homophily and content variability. The results show that, within a fairly wide range of values for these parameters, including link information in the similarity measure indeed yields better results than computing the similarity of objects directly from their content.
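
To make the idea concrete, the following is a minimal sketch of one way link information could be folded into a content-based similarity. It is an illustration only, not the measure defined in the paper: the function names, the neighborhood-averaging scheme, the mixing weight `alpha`, and the toy data are all assumptions introduced here. Each object's content vector is averaged with those of its linked neighbors, and the similarity of these averaged profiles is blended with the plain content similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def neighborhood_profile(i, content, links):
    """Mean content vector over object i and its linked neighbors (assumed scheme)."""
    members = [i] + sorted(links.get(i, ()))
    dim = len(content[i])
    return [sum(content[m][d] for m in members) / len(members) for d in range(dim)]

def hybrid_similarity(i, j, content, links, alpha=0.5):
    """Blend: alpha * content similarity + (1 - alpha) * neighborhood similarity."""
    s_content = cosine(content[i], content[j])
    s_links = cosine(neighborhood_profile(i, content, links),
                     neighborhood_profile(j, content, links))
    return alpha * s_content + (1 - alpha) * s_links

# Hypothetical toy data: objects 0, 1 and 2 form a linked group; object 3 is isolated.
# Object 1 has atypical content (high content variability) relative to its group.
content = {
    0: [1.0, 0.0],
    1: [0.0, 1.0],
    2: [1.0, 0.0],
    3: [0.0, 1.0],
}
links = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()}

content_only = cosine(content[0], content[1])
hybrid = hybrid_similarity(0, 1, content, links)
print(f"content-only: {content_only:.3f}, hybrid: {hybrid:.3f}")
```

In this toy setting, objects 0 and 1 share no content, so their content-only similarity is zero, while their shared neighborhood raises the blended score. How much such a blend helps in practice would depend on the homophily of the network and the content variability of the objects, which is the relationship the paper analyzes.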