CS369C: Clustering Algorithms

Nina Mishra

Course Overview

One of the consequences of fast computers, the Internet and inexpensive storage is the widespread collection of data from a variety of sources and of a variety of types. Sources of data include web click streams, financial transactions, and observational science data. Data types include categorical vs. numerical, static vs. dynamic, points in a metric space vs. vertices in a graph. The nagging question often posed about these data sets is: Can we find something interesting that we did not already know? The first answer to this question is often: Let's try clustering the data! Indeed, clustering is one of the most widely used tools for analyzing data sets. Some modern applications of clustering include clustering the web, clustering search results, clustering click streams, customer segmentation, and community discovery in social networks.

Because clustering has recently become so widely applicable, the field has undergone a major revolution over the last few decades, characterized by advances in approximation and randomized algorithms, novel formulations of the clustering problem, algorithms for clustering massively large data sets, algorithms for clustering data streams, and dimension reduction techniques. This course will cover these major advances, particularly in the context of modern applications.

This course should be of interest to graduate students in computer science and related fields, especially those with an interest in seeing how the field of clustering has evolved and how much further it has to go. Familiarity with basic material in algorithms, databases, probability, etc., at the level of the core undergraduate courses on these topics is assumed.

Grading: Students taking the course for a letter grade must complete a class project and scribe one lecture. Students taking the pass/fail option must either scribe a lecture or complete a class mini-project.

Class Project: Deadlines:
4/12: Proposal: 1 page.
5/10: Progress Report: at most 5 pages.
5/24, 5/31: Project Presentations. 30 minutes.
6/7: Final Project Report.

Scribe: Each registered student will sign up as the official scribe for a specific lecture. This involves taking detailed notes, reading the background papers, and preparing a set of lecture notes that will be distributed on the web.

Time and Location:

Tuesday: 2:15-4:05pm
Room: 200-013

Office Hours:

Wednesday: 4-5:30pm, 484 Gates

Overview of Lectures (Tentative)

Date	Topic	Lecture Notes	Scribe
March 29, 2005	Introduction, Overview of the Class, Clustering Preliminaries, k-Center	Lecture 1	Scribe 1
April 5, 2005	k-Median Clustering	Lecture 2	Scribe 2
April 12, 2005	Squared Error Distortion/k-Median-squared	Lecture 3	Scribe 3
April 19, 2005	Hierarchical Clustering	Lecture 4	Scribe 4
April 26, 2005	Correlation Clustering	Lecture 5	Scribe 5
May 3, 2005	Spectral Clustering: Guest Lecture: Frank McSherry	Lecture 6	Scribe 6
May 10, 2005	Sublinear Clustering	Lecture 7	Scribe 7
May 17, 2005	Clustering Data Streams	Lecture 8	Scribe 8
May 24, 2005	Class Project Presentations Part I
May 31, 2005	Class Project Presentations Part II

Reading List

k-Center

Clustering to minimize the maximum intercluster distance, T. F. Gonzalez. Theoretical Computer Science, 38: 293-306, (1985).
A unified approach to approximation algorithms for bottleneck problems. D. Hochbaum and D. Shmoys. Journal of the ACM (JACM), 33(3):533-550, July 1986.
Optimal algorithms for approximate clustering. T. Feder, D. Greene. STOC'88. pages 434-444.
Clustering Motion. S. Har-Peled. FOCS, 2001.

k-Median/k-Median-squared/Facility Location

Local Search Heuristics for k-median and Facility Location Problems, V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala and V. Pandit. STOC 2001. Rohit Khandekar's slides
Improved Combinatorial Algorithms for the Facility Location and k-Median Problems, M. Charikar and S. Guha. FOCS 1999.
Clustering in large graphs and matrices, P. Drineas, R. Kannan, A. Frieze, S. Vempala, and V. Vinay. SODA 1999.
A generalized convergence theorem and characterization of local optimality. S. Selim and M. Ismail. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(1):81-86, 1984.
A simple linear time (1+eps)-approximation algorithm for k-Means clustering in any dimensions. A. Kumar, Y. Sabharwal, S. Sen. FOCS 2004.
On coresets for k-means and k-median clustering. S. Har-Peled and S. Mazumdar. STOC 2004.
How fast is the k-means Method? B. Sadri and S. Har-Peled. Algorithmica, 41(3):185-202, 2005. Also in SODA 2005.
A Local Search Approximation Algorithm for k-Means Clustering. T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu, Computational Geometry: Theory and Applications, 28 (2004), 89-112.

Hierarchical Clustering

Criteria for Polynomial-Time (Conceptual) Clustering, L. Pitt and R. Reinke. Machine Learning 2:371-396 (1988).
Performance guarantees for hierarchical clustering, S. Dasgupta, COLT, 2002.

Clustering Large Data Sets

Testing of Clustering. N. Alon, S. Dar, M. Parnas, D. Ron. FOCS 2000.
Sublinear Time Algorithms for Metric Space Problems. P. Indyk, STOC 1999.
CURE: An Efficient Clustering Algorithm for Large Databases, S. Guha, R. Rastogi, and K. Shim. SIGMOD 1998. Note: this PDF file requires a huge amount of temp space (over 200Mb).
Clustering Large Datasets in Arbitrary Metric Spaces, V. Ganti, R. Ramakrishnan, J. Gehrke, A. L. Powell, and J.C. French. ICDE 1999.
BIRCH: an efficient data clustering method for very large databases, T. Zhang, R. Ramakrishnan, and M. Livny. SIGMOD 1996.
Sublinear Time Approximate Clustering, N. Mishra, D. Oblinger, and L. Pitt. SODA 2001.
A Framework for Statistical Clustering with Constant Time Approximation Algorithms for K-Median Clustering. S. Ben-David. COLT 2004, pages 415-426.

Clustering Data Streams

Incremental Clustering and Dynamic Information Retrieval, M. Charikar, C. Chekuri, T. Feder and R. Motwani. STOC 1997.
Clustering Data Streams, S. Guha, N. Mishra, R. Motwani and L. O'Callaghan. FOCS 2000.
Clustering Data Streams: Theory and Practice. S. Guha, A. Meyerson, N. Mishra, R. Motwani and L. O'Callaghan. IEEE TKDE, 15(3):515-528, 2003.
Maintaining Variance and k-Medians over Data Stream Windows, B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, PODS 2003.
Better Streaming Algorithms for Clustering Problems, M. Charikar, L. O'Callaghan, and R. Panigrahy, STOC 2003.

Spectral Clustering

Spectral Partitioning Works: Planar graphs and finite element meshes, Daniel A. Spielman, Shang-Hua Teng. FOCS 1996.
On the Quality of Spectral Separators, Stephen Guattery, Gary L. Miller. SIAM J. Matrix Anal. Appl. 19(3):701-719.
Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations, A. Frieze, R. Kannan, and S. Vempala, FOCS 1998.
Clustering in large graphs and matrices, P. Drineas, R. Kannan, A. Frieze, S. Vempala, and V. Vinay. SODA 1999.
On clusterings: good, bad and spectral, R. Kannan, S. Vempala, and A. Vetta. FOCS 2000.
Normalized Cuts and Image Segmentation, J. Shi and J. Malik. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905, August 2000.
On Spectral Clustering: Analysis and an algorithm, A. Y. Ng, M. Jordan, and Y. Weiss. NIPS 2000.
Optimal outlier removal in high-dimensional spaces, J. Dunagan and S. Vempala. STOC 2001.

Conceptual Clustering

Criteria for Polynomial-Time (Conceptual) Clustering, L. Pitt and R. Reinke. Machine Learning 2:371-396 (1988).
A New Conceptual Clustering Framework. N. Mishra, D. Ron, and R. Swaminathan. Machine Learning, 2004.

Biclustering

A Monte Carlo Algorithm for Fast Projective Clustering. C. Procopiuc, M. Jones, P. Agarwal, T. Murali. SIGMOD 2002.
Biclustering algorithms for biological data analysis: A survey. S. Madeira and A. Oliveira. IEEE/ACM Transactions on Computational Biology and Bioinformatics
Discovering statistically significant biclusters in gene expression data. A. Tanay, R. Sharan, and R. Shamir. Bioinformatics. 18(1):136-144. 2002
Biclustering of Expression Data. Y. Cheng and G. Church. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology. pp 93-103. 2000.

Correlation Clustering

Correlation Clustering. A. Blum, N. Bansal, S. Chawla. FOCS'02.
Clustering with Qualitative Information. M. Charikar, V. Guruswami, A. Wirth. FOCS 2003, pages 524-533. Slides.
Correlation Clustering with Partial Information. E. Demaine and N. Immorlica. APPROX 2003.

Clustering with Outliers

Algorithms for Facility Location with Outliers, M. Charikar, S. Khuller, D. Mount and G. Narasimhan, SODA 2001.
Testing of Clustering. N. Alon, S. Dar, M. Parnas, D. Ron. FOCS 2000.

Clustering Moving Points

Discrete Mobile Centers. J. Gao, L. Guibas, J. Hershberger, L. Zhang, and A. Zhu. 17th Symposium on Computational Geometry (SoCG), 2001. Journal version: Discrete & Computational Geometry, 30(1), 2003.
Clustering Motion. S. Har-Peled. FOCS, 2001.

SVM Clustering

Support Vector Clustering, A. Ben-Hur, D. Horn, H. Siegelmann, V. Vapnik. Journal of Machine Learning Research, vol 2, pages 125-137. 2001.

Catalog Segmentation

Approximating the 2-Catalog Segmentation Problem Using Semidefinite Programming Relaxations. D. Xu, Y. Ye, J. Zhang. 2002.
The 2-Catalog Segmentation Problem, Y. Dodis, V. Guruswami, S. Khanna, Symposium on Discrete Algorithms, 1999.
On Two Segmentation Problems. N. Alon and B. Sudakov. Journal of Algorithms. 33:173-184. 1999.
Segmentation Problems. J. Kleinberg, C. Papadimitriou, and V. Raghavan. pages 473-482. STOC 1998.

Community Discovery

Trawling the web for emerging cyber-communities. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins. Computer Networks, 1999.
Inferring Web Communities from Link Topology. D. Gibson, J. Kleinberg, P. Raghavan. UK Conference on Hypertext. 1998.

Axioms of Clustering

An Impossibility Theorem for Clustering. J. Kleinberg, NIPS 2002.

Cluster Evaluation

Comparing Clusterings. M. Meila. COLT'03.

Model-based Clustering

Learning mixtures of arbitrary gaussians, S. Arora, R. Kannan, STOC 2001
Learning mixtures of Gaussians, S. Dasgupta, FOCS 1999
Maximum Likelihood from Incomplete Data via the EM Algorithm. A.P. Dempster, N.M. Laird, and D.B. Rubin. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1--38, 1977.

Categorical Clustering

ROCK: A Robust Clustering Algorithm for Categorical Attributes. S. Guha, R. Rastogi, K. Shim. ICDE'99. p. 512.
CACTUS: clustering categorical data using summaries. V. Ganti, J. Gehrke, R. Ramakrishnan. KDD'99. Pages: 73 - 83.
Clustering Categorical Data: An Approach Based on Dynamical Systems. D. Gibson, J. Kleinberg, P. Raghavan. VLDB'98. pages 311 - 322.
A New Conceptual Clustering Framework. N. Mishra, D. Ron, and R. Swaminathan. Machine Learning, 2004.

Projective Clustering

On the complexity of locating linear facilities in the plane, N. Megiddo and A. Tamir. Operations Research Letters, 1:194-197, 1982.
Approximation algorithms for projective clustering. P. Agarwal and C. Procopiuc, Journal of Algorithms, 46 (2003), 115-139.
An approximation algorithm for computing the two-line center. P. Agarwal, C. Procopiuc and K. Varadarajan, Computat. Geom.: Theory and Appls., 26 (2003), 119-28.
Approximation algorithms for k-line center. P. Agarwal, C. Procopiuc and K. Varadarajan, to appear in Algorithmica.
Projective clustering in high dimensions using core-sets. S. Har-Peled, K. Varadarajan, SOCG 2002.

Dimension Reduction

Embeddings: Lectures on Discrete Geometry. J. Matousek. Springer, May 2002.
PCA: Pattern Classification (2nd ed.). R. Duda, P. Hart and D. Stork

Scatter/Gather

Constant Interaction-Time Scatter/Gather Browsing of Large Document Collections. D. Cutting, D. Karger, and J. Pedersen. Proceedings of the 16th Annual International ACM/SIGIR Conference, Pittsburgh, PA, 1993.
Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, M. Hearst and J. Pedersen, Proceedings of the 19th Annual International ACM/SIGIR Conference, Zurich, August 1996.
Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, D. Cutting, D. Karger, J. Pedersen, and J. Tukey. Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen, 1992.

Text Clustering

Information-theoretic co-clustering. I. Dhillon, S. Mallela, D. Modha. KDD 2003.
Document clustering using word clusters via the information bottleneck method. N. Slonim, N. Tishby. Research and Development in Information Retrieval. 2000.
A Probabilistic Framework for Semi-Supervised Clustering. S. Basu, M. Bilenko, and R. Mooney. KDD 2004.