NII Technical Report (NII-2006-008E)

Title	A Generic Query-Based Model for Scalable Clustering
Authors	Michael E. Houle
Abstract	This paper presents a generic model for clustering that requires no direct knowledge of the nature or representation of the data. In lieu of such knowledge, the relevant-set clustering (RSC) model relies solely on the existence of an oracle that accepts a query in the form of a data item, and returns a ranked set of items relevant to the query. In principle, the role of the oracle could be played by any similarity search structure, or even a commercial search engine whose ranking function and relevancy scores are kept secret. The quality of cluster candidates, the degree of association between pairs of cluster candidates, and the degree of association between clusters and data items are all assessed according to the statistical significance of a form of correlation among pairs of relevant sets and/or candidate cluster sets. A scalable clustering heuristic based on the RSC model is also presented, and demonstrated for very large, high-dimensional datasets using a fast approximate similarity search structure as the oracle.
Language	English
Published	May 19, 2006
Pages	21p
PDF File	06-008E.pdf