CS369C: Clustering Algorithms
Nina Mishra


Course Overview

One consequence of fast computers, the Internet, and inexpensive storage is the widespread collection of data from a variety of sources and in a variety of forms. Sources of data include web click streams, financial transactions, and observational science data. Data types include categorical vs. numerical, static vs. dynamic, and points in a metric space vs. vertices in a graph. The nagging question often posed about these data sets is: can we find something interesting that we did not already know? The first answer to this question is often: let's try clustering the data! Indeed, clustering is one of the most widely used tools for analyzing data sets. Modern applications of clustering include clustering the web, clustering search results, clustering click streams, customer segmentation, and community discovery in social networks.

Because of this increasingly ubiquitous applicability, the field of clustering has undergone a major revolution over the last few decades, characterized by advances in approximation and randomized algorithms, novel formulations of the clustering problem, algorithms for clustering massive data sets, algorithms for clustering data streams, and dimension reduction techniques. This course will cover these major advances, particularly in the context of modern applications.

This course should be of interest to graduate students in computer science and related fields, especially those with an interest in seeing how the field of clustering has evolved and how much further it has to go.  Familiarity with basic material in algorithms, databases, probability, etc., at the level of the core undergraduate courses on these topics is assumed.



Grading:  Students taking the course for a letter grade must complete a class project and scribe one lecture.  Students taking the course pass/fail must either scribe a lecture or complete a class mini-project.

Class Project Deadlines:
4/12: Proposal (1 page)
5/10: Progress report (at most 5 pages)
5/24, 5/31: Project presentations (30 minutes)
6/7: Final project report


Scribe:  Each registered student will sign up as the official scribe for a specific lecture.  This involves taking detailed notes, reading the background papers, and preparing a set of lecture notes that will be distributed on the web.



Time and Location:

Tuesday:  2:15-4:05pm
Room: 200-013

Office Hours:

Wednesday:  4-5:30pm, 484 Gates


Overview of Lectures (Tentative)

Date | Topic | Lecture Notes | Scribe
March 29, 2005 | Introduction, Overview of the Class, Clustering Preliminaries, k-Center | Lecture 1 | Scribe 1
April 5, 2005 | k-Median Clustering | Lecture 2 | Scribe 2
April 12, 2005 | Squared Error Distortion/k-Median-squared | Lecture 3 | Scribe 3
April 19, 2005 | Hierarchical Clustering | Lecture 4 | Scribe 4
April 26, 2005 | Correlation Clustering | Lecture 5 | Scribe 5
May 3, 2005 | Spectral Clustering (Guest Lecture: Frank McSherry) | Lecture 6 | Scribe 6
May 10, 2005 | Sublinear Clustering | Lecture 7 | Scribe 7
May 17, 2005 | Clustering Data Streams | Lecture 8 | Scribe 8
May 24, 2005 | Class Project Presentations, Part I | |
May 31, 2005 | Class Project Presentations, Part II | |





Reading List

k-Center
k-Median/k-Median-squared/Facility Location
Hierarchical Clustering
Clustering Large Data Sets
Clustering Data Streams
Spectral Clustering
Conceptual Clustering
Biclustering
Correlation Clustering
Clustering with Outliers
Clustering Moving Points
SVM Clustering
Catalog Segmentation
Community Discovery
Axioms of Clustering
Cluster Evaluation
Model-based Clustering
Categorical Clustering
Projective Clustering
Dimension Reduction
Scatter/Gather
Text Clustering