CS369C: Clustering Algorithms
Nina Mishra
Course Overview
One of the consequences of fast computers, the Internet and inexpensive storage is the widespread collection of data from a variety of sources and of a variety of types. Sources of data include web click streams, financial transactions, and observational science data. Data types include categorical vs. numerical, static vs. dynamic, points in a metric space vs. vertices in a graph. The nagging question often posed about these data sets is: Can we find something interesting that we did not already know? The first answer to this question is often: Let's try clustering the data! Indeed, clustering is one of the most widely used tools for analyzing data sets. Some modern applications of clustering include clustering the web, clustering search results, clustering click streams, customer segmentation, and community discovery in social networks.
Because clustering has recently become so widely applicable, the field has undergone a major revolution over the last few decades, characterized by advances in approximation and randomized algorithms, novel formulations of the clustering problem, algorithms for clustering massively large data sets, algorithms for clustering data streams, and dimension reduction techniques. This course will cover these major advances, particularly in the context of modern applications.
This course should be of interest to graduate students in computer science and related fields, especially those with an interest in seeing how the field of clustering has evolved and how much further it has to go.
Familiarity with basic material in algorithms, databases, probability,
etc., at the level of the core undergraduate courses on these topics
is assumed.
Grading: Students taking the course for a letter grade must complete a class project and scribe one lecture. Those signing up for the pass/fail option will need to scribe a lecture or complete a class mini-project.
Class Project:
Deadlines:
4/12: Proposal: 1 page.
5/10: Progress Report: at most 5 pages.
5/24, 5/31: Project Presentations. 30 minutes.
6/7: Final Project Report.
Scribe: Each registered student will sign up as the official scribe for a specific lecture. This involves taking detailed notes, reading the background papers, and preparing a set of lecture notes that will be distributed on the web.
Time and Location:
Office Hours: Wednesday, 4-5:30pm, 484 Gates
Overview of Lectures (Tentative)
Date | Topic | Lecture Notes | Scribe
March 29, 2005 | Introduction, Overview of the Class, Clustering Preliminaries, k-Center | Lecture 1 | Scribe 1
April 5, 2005 | k-Median Clustering | Lecture 2 | Scribe 2
April 12, 2005 | Squared Error Distortion/k-Median-squared | Lecture 3 | Scribe 3
April 19, 2005 | Hierarchical Clustering | Lecture 4 | Scribe 4
April 26, 2005 | Correlation Clustering | Lecture 5 | Scribe 5
May 3, 2005 | Spectral Clustering: Guest Lecture: Frank McSherry | Lecture 6 | Scribe 6
May 10, 2005 | Sublinear Clustering | Lecture 7 | Scribe 7
May 17, 2005 | Clustering Data Streams | Lecture 8 | Scribe 8
May 24, 2005 | Class Project Presentations Part I | |
May 31, 2005 | Class Project Presentations Part II | |
Reading List
k-Center
k-Median/k-Median-squared/Facility Location
- Local Search Heuristics for k-median and Facility Location Problems. V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. STOC 2001. Rohit Khandekar's slides.
- Improved Combinatorial Algorithms for the Facility Location and k-Median Problems. M. Charikar and S. Guha. FOCS 1999.
- Clustering in large graphs and matrices. P. Drineas, R. Kannan, A. Frieze, S. Vempala, and V. Vinay. SODA 1999.
- A generalized convergence theorem and characterization of local optimality. S. Selim and M. Ismail. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(1):81-86, 1984.
- A simple linear time (1+eps)-approximation algorithm for k-Means clustering in any dimensions. A. Kumar, Y. Sabharwal, and S. Sen. FOCS 2004.
- On coresets for k-means and k-median clustering. S. Har-Peled and S. Mazumdar. STOC 2004.
- How fast is the k-means Method? B. Sadri and S. Har-Peled. Algorithmica, 41(3):185-202, 2005. Also in SODA 05.
- A Local Search Approximation Algorithm for k-Means Clustering. T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu. Computational Geometry: Theory and Applications, 28 (2004), 89-112.
Hierarchical Clustering
Clustering Large Data Sets
- Testing of Clustering. N. Alon, S. Dar, M. Parnas, and D. Ron. FOCS 2000.
- Sublinear Time Algorithms for Metric Space Problems. P. Indyk. STOC 1999.
- CURE: An Efficient Clustering Algorithm for Large Databases. S. Guha, R. Rastogi, and K. Shim. SIGMOD 1998. Note: this PDF file requires a huge amount of temp space (over 200Mb).
- Clustering Large Datasets in Arbitrary Metric Spaces. V. Ganti, R. Ramakrishnan, J. Gehrke, A. L. Powell, and J. C. French. ICDE 1999.
- BIRCH: an efficient data clustering method for very large databases. T. Zhang, R. Ramakrishnan, and M. Livny. SIGMOD 1996.
- Sublinear Time Approximate Clustering. N. Mishra, D. Oblinger, and L. Pitt. SODA 2001.
- A Framework for Statistical Clustering with a Constant Time Approximation Algorithms for K-Median Clustering. S. Ben-David. COLT 2004, pp. 415-426.
Clustering Data Streams
- Incremental Clustering and Dynamic Information Retrieval. M. Charikar, C. Chekuri, T. Feder, and R. Motwani. STOC 1997.
- Clustering Data Streams. S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. FOCS 2000.
- Clustering Data Streams: Theory and Practice. S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. IEEE TKDE, 15(3):515-528, 2003.
- Maintaining Variance and k-Medians over Data Stream Windows. B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan. PODS 2003.
- Better Streaming Algorithms for Clustering Problems. M. Charikar, L. O'Callaghan, and R. Panigrahy. STOC 2003.
Spectral Clustering
- Spectral Partitioning Works: Planar graphs and finite element meshes. Daniel A. Spielman and Shang-Hua Teng. FOCS 1996.
- On the Quality of Spectral Separators. Stephen Guattery and Gary L. Miller. SIAM J. Matrix Anal. Appl. 19(3):701-719.
- Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations. A. Frieze, R. Kannan, and S. Vempala. FOCS 1998.
- Clustering in large graphs and matrices. P. Drineas, R. Kannan, A. Frieze, S. Vempala, and V. Vinay. SODA 1999.
- On clusterings: good, bad and spectral. R. Kannan, S. Vempala, and A. Vetta. FOCS 2000.
- Normalized Cuts and Image Segmentation. J. Shi and J. Malik. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, August 2000.
- On Spectral Clustering: Analysis and an algorithm. A. Y. Ng, M. Jordan, and Y. Weiss. NIPS 2000.
- Optimal outlier removal in high-dimensional spaces. J. Dunagan and S. Vempala. STOC 2001.
Conceptual Clustering
Biclustering
Correlation Clustering
Clustering with Outliers
Clustering Moving Points
SVM Clustering
- Support Vector Clustering. A. Ben-Hur, D. Horn, H. Siegelmann, and V. Vapnik. Journal of Machine Learning Research, vol. 2, pages 125-137, 2001.
Catalog Segmentation
Community Discovery
Axioms of Clustering
Cluster Evaluation
Model-based Clustering
- Learning mixtures of arbitrary gaussians. S. Arora and R. Kannan. STOC 2001.
- Learning mixtures of Gaussians. S. Dasgupta. FOCS 1999.
- Maximum Likelihood from Incomplete Data via the EM Algorithm. A. P. Dempster, N. M. Laird, and D. B. Rubin. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
Categorical Clustering
Projective Clustering
- On the complexity of locating linear facilities in the plane. N. Megiddo and A. Tamir. Operations Research Letters, 1:194-197, 1982.
- Approximation algorithms for projective clustering. P. Agarwal and C. Procopiuc. Journal of Algorithms, 46 (2003), 115-139.
- An approximation algorithm for computing the two-line center. P. Agarwal, C. Procopiuc, and K. Varadarajan. Computat. Geom.: Theory and Appls., 26 (2003), 119-28.
- Approximation algorithms for k-line center. P. Agarwal, C. Procopiuc, and K. Varadarajan. To appear in Algorithmica.
- Projective clustering in high dimensions using core-sets. S. Har-Peled and K. Varadarajan. SOCG 2002.
Dimension Reduction
Scatter/Gather
- Constant Interaction-Time Scatter/Gather Browsing of Large Document Collections. D. Cutting, D. Karger, and J. Pedersen. Proceedings of the 16th Annual International ACM/SIGIR Conference, Pittsburgh, PA, 1993.
- Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. M. Hearst and J. Pedersen. Proceedings of the 19th Annual International ACM/SIGIR Conference, Zurich, August 1996.
- Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. D. Cutting, D. Karger, J. Pedersen, and J. Tukey. Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen, 1992.
Text Clustering