Abstract
Distance-based similarity measures are sensitive, to varying degrees, to variations within a data distribution and to the dimensionality of the data space. The effects of the notorious "curse of dimensionality" have been studied in data generated by a single mechanism. In this paper, we study the effects of this phenomenon on different similarity measures in the presence of several data distributions, a setting relevant to many data mining, indexing, and similarity search applications. In particular, we assess the performance of shared-neighbor similarity measures, which are secondary similarity measures based on the rankings of data objects induced by some primary distance measure. We find that rank-based similarity measures can yield more stable performance than their associated primary distance measures.