"Scalable Clustering of Categorical Data and Applications"
 
Abstract:
Clustering is a problem of great practical importance in numerous 
applications. The problem of clustering becomes more challenging when the 
data is categorical, that is, when there is no inherent distance measure 
between data values. In this talk, we introduce LIMBO, a scalable 
hierarchical categorical clustering algorithm that uses an intuitive 
information-theoretic distance measure for categorical tuples and values. 
When clustering values, LIMBO can give useful hints about potential 
duplication and errors that may exist in a data set. As a hierarchical 
algorithm, LIMBO has the advantage that it can produce clusterings of 
different sizes in a single execution and within a memory-bounded summary 
model for the data. We present results from our experimental evaluation of 
LIMBO, which show a gain in efficiency without significant loss in 
the quality of the produced clusterings. We then show how the 
algorithm can be used to produce valid and useful clusterings of large 
software systems. In this case, LIMBO is applied in the presence of both 
structural and non-structural information about the software systems and 
thus allows us to evaluate how useful each kind of information is for 
understanding these systems. 
Finally, we conclude the talk with a set of open research challenges.

 
Biographical note:

Periklis Andritsos received his B.Sc. degree in Electrical and Computer 
Engineering in 1998 from the National Technical University of Athens, Greece. 
In 2000, he received the M.Sc. degree in Computer Science and in 2004 the Ph.D. 
degree in Computer Science, both from the Department of Computer Science 
at the University of Toronto. In 2004/2005 he was a postdoctoral fellow at the 
University of Toronto. Currently, he is a faculty member at the University of 
Trento. His research interests include database systems, data mining, 
clustering and reverse engineering. He is a member of the IEEE Computer
Society and the ACM.