
NII SEMINAR SERIES ON DIMENSIONALITY AND SCALABILITY IV

Outline

Date:
Monday 23 March, 2015
Time:
10:30-12:30 AND 13:50-16:50
Place:
Room 2009-2010 (GRACE Center, 20th floor), National Institute of Informatics
Access

Talk 1

Time:
10:30-11:10
Title:
The Blind Men and the Elephant
Presenter:
PD Dr. Arthur Zimek
Database and Information Systems
Institut für Informatik
Ludwig-Maximilians-Universität München (LMU Munich), Germany
Abstract:
In this talk, we consider the fundamental problems of different sub-fields of clustering, especially focusing on subspace clustering, ensemble clustering, alternative clustering (a variant of constraint-based clustering), multiview clustering (in turn a variant of alternative clustering), and clustering of uncertain data. We discuss how these branches of clustering research, while rather different at first glance, in fact have much in common and can learn from each other's solutions and approaches.

Talk 2

Time:
11:10-11:50
Title:
Outlier Detection in Large and High-dimensional Data Sets and with Multiple Feature Spaces
Presenter:
Dr. Erich Schubert
Database and Information Systems
Institut für Informatik
Ludwig-Maximilians-Universität München (LMU Munich), Germany
Abstract:
High-dimensional data offers new challenges for outlier detection, but also allows for interesting new approaches. We will first look at a new approach to scale nearest-neighbor based outlier detection algorithms to larger data sets and high dimensionality using random projections, space-filling curves, and clever data redistribution.
High dimensionality can arise from having the data represented in many feature spaces. However, these feature spaces may have different meaning and in particular different scale. Simply concatenating the features may then not yield much of an improvement, unless we pay attention to carefully normalizing the data. To improve results, we discuss normalization concepts that take the intrinsic dimensionality of the feature spaces into account.
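As a rough illustration of the projection idea (a generic sketch, not the speaker's exact algorithm), the following code scores each point by its distance to its k-th nearest neighbour, optionally after a Gaussian random projection that roughly preserves distances while reducing dimensionality; all names and parameters here are illustrative:

```python
import numpy as np

def knn_outlier_scores(X, k=5, target_dim=None, seed=0):
    """Approximate kNN-distance outlier scores, optionally computed after
    a Gaussian random projection (an illustrative stand-in for the
    projection-based scaling discussed in the talk)."""
    rng = np.random.default_rng(seed)
    if target_dim is not None and target_dim < X.shape[1]:
        # Johnson-Lindenstrauss-style projection roughly preserves distances.
        R = rng.normal(size=(X.shape[1], target_dim)) / np.sqrt(target_dim)
        X = X @ R
    # Pairwise distances; each point's score is its distance to its k-th NN.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)   # column 0 is the distance to self (0)
    return d_sorted[:, k]           # k-th nearest-neighbour distance

# Usage: a point far from a Gaussian cluster gets the largest score.
X = np.vstack([np.random.default_rng(1).normal(size=(50, 20)),
               np.full((1, 20), 8.0)])
scores = knn_outlier_scores(X, k=5, target_dim=8)
```

The quadratic distance computation is only for readability; the point of the techniques in the talk is precisely to avoid it at scale.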

Talk 3

Time:
11:50-12:30
Title:
Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection
Presenter:
Milos Radovanovic
Assistant Professor
Department of Mathematics and Informatics, Faculty of Science
University of Novi Sad, Serbia
Abstract:
Outlier detection in high-dimensional data presents various challenges resulting from the "curse of dimensionality." A prevailing view is that distance concentration, i.e., the tendency of distances in high-dimensional data to become indiscernible, hinders the detection of outliers by making distance-based methods label all points as almost equally good outliers. We provide evidence supporting the opinion that such a view is too simple, by demonstrating that distance-based methods can produce more contrasting outlier scores in high-dimensional settings. Furthermore, we show that high dimensionality can have a different impact, by reexamining the notion of reverse nearest neighbors in the unsupervised outlier-detection context. Namely, it was recently observed that the distribution of points' reverse-neighbor counts becomes skewed in high dimensions, resulting in the phenomenon known as hubness. We provide insight into how some points (antihubs) appear very infrequently in k-NN lists of other points, and explain the connection between antihubs, outliers, and existing unsupervised outlier-detection methods. By evaluating the classic k-NN method, the angle-based technique (ABOD) designed for high-dimensional data, the density-based local outlier factor (LOF) and influenced outlierness (INFLO) methods, and antihub-based methods on various synthetic and real-world data sets, we offer novel insight into the usefulness of reverse neighbor counts in unsupervised outlier detection.
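The reverse-neighbor-count idea can be sketched in a few lines: count how often each point appears in other points' k-NN lists, and treat rare appearances (antihubs) as a sign of outlierness. This is a simplified illustration of the concept, not the exact method evaluated in the talk:

```python
import numpy as np

def antihub_scores(X, k=5):
    """Reverse-neighbor-count ('antihub') outlier scores: points that
    rarely appear in other points' k-NN lists score highest."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # exclude self from k-NN lists
    knn = np.argsort(d, axis=1)[:, :k]      # k nearest neighbors of each point
    counts = np.bincount(knn.ravel(), minlength=n)  # N_k(x) for every x
    return -counts                          # fewer occurrences -> higher score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(60, 10)), np.full((1, 10), 6.0)])
scores = antihub_scores(X, k=5)   # the isolated point is a strong antihub
```

The isolated point appears in nobody's k-NN list, so its reverse-neighbor count is zero and its score is maximal.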

:::::::::::::::  Lunch  :::::::::::::::

Talk 4

Time:
13:50-14:30
Title:
A Data Dependent Dissimilarity Measure Based on Mass
Presenter:
Takashi Washio
Professor
The Institute of Scientific and Industrial Research
Osaka University, Japan
Abstract:
Nearest neighbour search is a core process in many data mining algorithms. Finding reliable closest matches of a query in a high-dimensional space is still a challenging task. This is because the effectiveness of many dissimilarity measures that are based on a geometric model, such as the Lp-norm, decreases as the number of dimensions increases.
In this work, we examine how the data distribution can be exploited to measure dissimilarity between two instances and propose a new data-dependent dissimilarity measure called ‘mp-dissimilarity’. Rather than relying on geometric distance, it measures the dissimilarity between two instances in each dimension as a probability mass in a region that encloses the two instances. It deems two instances in a sparse region to be more similar than two instances in a dense region, even though the two pairs have the same geometric distance.
Our empirical results show that the proposed dissimilarity measure indeed provides a reliable nearest neighbour search in high-dimensional spaces, particularly in sparse data. mp-dissimilarity produced better task-specific performance than the Lp-norm and cosine distance in classification and information retrieval tasks.
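The core idea admits a compact sketch: per dimension, measure the fraction of data points lying between the two instances, then average with exponent p. The combination rule below follows the general shape described in the abstract; the details are a best-effort illustration rather than the paper's exact formulation:

```python
import numpy as np

def mp_dissimilarity(x, y, data, p=2):
    """Data-dependent dissimilarity: in each dimension, the 'distance'
    between x and y is the probability mass of the data falling in the
    interval they span, averaged across dimensions with exponent p."""
    lo, hi = np.minimum(x, y), np.maximum(x, y)
    # Share of the sample inside the per-dimension interval, |R_i(x,y)|/n.
    mass = ((data >= lo) & (data <= hi)).mean(axis=0)
    return (np.mean(mass ** p)) ** (1.0 / p)

# Two pairs with identical geometric distance: one in a dense region,
# one in a sparse region. The sparse pair comes out more similar.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))          # dense near the origin
dense_pair = mp_dissimilarity(np.array([-0.5, 0.0]), np.array([0.5, 0.0]), data)
sparse_pair = mp_dissimilarity(np.array([3.0, 0.0]), np.array([4.0, 0.0]), data)
```

Both pairs are unit distance apart under the L2-norm, yet the pair in the tail of the distribution receives the smaller dissimilarity, exactly the behaviour the abstract describes.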

Talk 5

Time:
14:30-15:10
Title:
An Extreme-Value-Theoretic Formulation of Inlierness and Outlierness
Presenter:
Michael E. Houle
Visiting Professor
National Institute of Informatics, Japan
Abstract:
This presentation is concerned with a theoretical characterization of the inlierness / outlierness of data, in terms of a recently proposed continuous model of the intrinsic dimensionality (ID) of the underlying data set. Continuous ID can be regarded as an extension of Karger and Ruhl's expansion dimension to a statistical setting in which the distribution of distances to a query point is modeled in terms of a continuous random variable. This form of intrinsic dimensionality is particularly useful in search, classification, clustering, and other contexts in machine learning, databases, and data mining, as it has been shown to be equivalent to a measure of the discriminative power of similarity functions. For distances restricted to the lower tail of their distribution, several estimators of continuous ID have recently been proposed and analyzed in light of extreme value theory (EVT).
Here, we present two theoretical results, and discuss their potential for practical impact in applications in such areas as databases, data mining, multimedia, and machine learning:
(1) Under reasonable assumptions, any continuous distance distribution admits a precise representation in terms of its ID. This representation allows the cumulative probability distribution to be expressed solely in terms of continuous ID, without explicit knowledge of the underlying probability densities.
(2) Using the ID representation of the distance distribution, a natural measure of the inlierness and outlierness of a data point can be derived, normalized with respect to the intrinsic dimensionality. This measure is shown to be related to the second-order parameter of recent study in the area of EVT, and can be regarded as a measure of the curvature of an underlying manifold with respect to the data point.
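To make the EVT connection concrete, one widely used estimator of local ID from the lower tail of a distance distribution is the maximum-likelihood (Hill-type) estimator over the k smallest distances to a query point. The sketch below illustrates that estimator in general; it is not specific to the two results above:

```python
import numpy as np

def local_id_mle(dists):
    """Maximum-likelihood (Hill-type) estimate of local intrinsic
    dimensionality from the k smallest distances to a query point."""
    r = np.sort(np.asarray(dists, dtype=float))
    w = r[-1]                              # the k-th (largest) tail distance
    return -1.0 / np.mean(np.log(r / w))   # log(r_k / w) = 0 contributes 0

# Sanity check: distances with cdf F(r) = r^m have intrinsic dimension m.
rng = np.random.default_rng(0)
m = 5.0
r = rng.random(2000) ** (1.0 / m)          # inverse-CDF sampling of F(r) = r^m
est = local_id_mle(np.sort(r)[:200])       # estimate from the lower tail only
```

For a distribution whose cumulative distance function behaves like r^m near zero, the estimate concentrates around m as the tail sample grows.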

:::::::::::::::  Break  :::::::::::::::

Talk 6

Time:
15:30-16:10
Title:
Exploiting High Dimensionality for Similarity Search
Presenter:
Dr. Teddy Furon
Research Scientist
Equipe LINKMEDIA
INRIA/IRISA Rennes, France
Abstract:
We design an indexing architecture to store and search in a set of high-dimensional vectors. This architecture is composed of several memory units, each of which summarizes a fraction of the database by a single representative vector. The membership of the query to the memory unit is tested by a simple correlation with its representative vector. This vector optimizes the membership hypothesis test when the query vectors are simple perturbations of the stored vectors.
Our approach finds the most similar database vectors within a fraction of the cost of an exhaustive search without noticeably reducing the search quality. Interestingly, the gain of complexity is provably better in high-dimensional spaces. We empirically demonstrate its practical interest in a large-scale image search scenario. Our evaluation is carried out on standard benchmarks for large-scale image search and uses off-the-shelf state-of-the-art descriptors.
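A toy version of the memory-unit idea can be written down directly: summarize each unit by the sum of its (normalized) members, correlate the query with each representative, and search only the highest-scoring units exhaustively. This is a simplified sketch; the talk also considers an optimized representative, which this code does not attempt to reproduce:

```python
import numpy as np

def build_memory_units(X, unit_size):
    """Group database vectors into memory units; each unit is summarized
    by the sum of its unit-normalized members (a simple representative)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    units = [X[i:i + unit_size] for i in range(0, len(X), unit_size)]
    reps = np.array([u.sum(axis=0) for u in units])
    return units, reps

def search(q, units, reps, n_probe=3):
    """Correlate the query with each representative, then rank only the
    members of the n_probe highest-scoring units exhaustively."""
    q = q / np.linalg.norm(q)
    best_units = np.argsort(reps @ q)[::-1][:n_probe]
    cands = np.vstack([units[i] for i in best_units])
    return cands[np.argmax(cands @ q)]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 256))
q = X[17] + 0.1 * rng.normal(size=256)     # query = perturbed stored vector
units, reps = build_memory_units(X, unit_size=20)
hit = search(q, units, reps, n_probe=3)    # recovers the stored vector
```

Only 3 of the 10 units are searched exhaustively, yet the perturbed query still finds its stored vector: in high dimension the cross-correlations with unrelated members are small, which is why the membership test works better as dimensionality grows.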

Talk 7

Time:
16:10-16:50
Title:
Efficient Algorithms for Similarity Search in Axis-Aligned Subspaces
Presenter:
Vincent Oria
Associate Professor
Department of Computer Science
New Jersey Institute of Technology, NJ, USA
Abstract:
Many applications - such as content-based image retrieval, subspace clustering, and feature selection - may benefit from efficient subspace similarity search. Given a query object, the goal of subspace similarity search is to retrieve the most similar objects from the database, where the similarity distance is defined over an arbitrary subset of dimensions (or features) - that is, an arbitrary axis-aligned projective subspace. Though much effort has been spent on similarity search in fixed subspaces, relatively little attention has been given to the problem of similarity search when the dimensions are specified at query time. In this presentation, we propose several new methods for the subspace similarity search problem. Extensive experiments are provided showing very competitive performance relative to state-of-the-art solutions.
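The problem statement becomes concrete with the naive query-time baseline that such methods aim to accelerate: compute distances only over the axis-aligned dimension subset chosen at query time. This sketch is the brute-force baseline, not any of the proposed methods:

```python
import numpy as np

def subspace_knn(X, q, dims, k=3):
    """Brute-force k-NN where distance is computed only over an
    axis-aligned subset of dimensions specified at query time."""
    diff = X[:, dims] - q[dims]            # restrict to the query subspace
    d = np.linalg.norm(diff, axis=1)       # Euclidean distance in the subspace
    return np.argsort(d)[:k]               # indices of the k closest objects

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
q = X[42].copy()
q[[0, 3, 5]] += 0.01                       # near-duplicate in the queried dims
idx = subspace_knn(X, q, dims=[0, 3, 5], k=1)
```

Because `dims` is only known at query time, no single fixed index over all 8 dimensions answers this query directly, which is exactly what makes the problem hard.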
All are welcome.
For more information, please contact:
Contact

Michael Houle, Visiting Professor, National Institute of Informatics
E-mail: meh[at]nii.ac.jp
*Please replace [at] with @.
