CS 361A (Advanced Algorithms for Internet Applications)
News Flash
Midterm Exam
We have prepared a take-home midterm exam, which is due
in class on Wednesday, Nov 13. The exam is available here in
postscript and pdf formats.
Administrivia
Instructors: Rajeev Motwani and Nina Mishra
Teaching Assistant: Mayur Datar
Class Schedule: Mon/Wed, 3:15-4:30, Gates B12
Office Hours:
[Mayur Datar] Thu/Fri, 1:30-2:30, Gates 482
[Nina Mishra] Thu, 3:00-4:00, Gates 484
[Rajeev Motwani] Tue, 1:30-2:30, Gates 474
Class Mailing List
We have set up a class mailing list. Please subscribe to it to get the latest
information regarding the class. The list address is
cs361a-class@lists.stanford.edu. You can subscribe by sending an email to
majordomo@lists.stanford.edu
with the following text in the body of the message:
subscribe cs361a-class
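For those who prefer to script the subscription, the request above can be composed as a plain-text message. This is only a minimal sketch using Python's standard `email` package. The sender address is a placeholder, and actually sending the message (for example with `smtplib`) is left out because the outgoing mail server depends on your own setup.

```python
from email.message import EmailMessage


def build_subscribe_request(sender: str) -> EmailMessage:
    """Compose the subscription request described above."""
    msg = EmailMessage()
    msg["From"] = sender  # placeholder: use your own address
    msg["To"] = "majordomo@lists.stanford.edu"
    # The command goes in the body of the mail, as the instructions state.
    msg.set_content("subscribe cs361a-class")
    return msg


# Hypothetical sender address, for illustration only.
request = build_subscribe_request("student@example.edu")
print(request)
```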
Grading Policy Revision
Several MS students want to take this course for a letter grade to satisfy
their specialization requirements. We have modified our sign-up policy to
allow this, so feel free to sign up for a letter grade. We will use the scribe
notes (see below), a couple of homeworks, and class participation to determine
the grade for these students. So, if you sign up for a letter grade, be sure
to serve as a scribe for at least one lecture.
Sign-Up
To sign up for this course, please send email to Mayur Datar
with the following information:
name, department, status (PhD/MS/UG, year), area of specialization
(Databases, Systems, Theory, etc.), and email address.
Course
Overview
With the maturing of the
Internet, the field of algorithms is undergoing an interesting transformation.
For one thing, new areas and applications requiring an algorithmic mind-set
have emerged, such as information retrieval and web searching, massive and
streaming data, data mining, machine learning, distributed systems (including
so-called P2P networks), and network algorithms. To service these, novel
algorithmic techniques have been and are being developed. Furthermore, new
applications have led to new models for algorithms, most prominent of which is
the field of algorithms for data streams. This course will give an overview of
such topics with an eye towards identifying interesting research directions.
Since Stanford people have played a prominent role in these new developments,
wherever possible we will attempt to bring in as guest lecturers the original
authors of the papers being covered. This course should be of interest to
graduate students in computer science and related fields, especially those
with a mathematical bent of mind. We will assume familiarity with basic
material in algorithms, databases, probability, etc. (at the level of the
core undergraduate courses on these topics).
Grading: Since this
course will be treated as a graduate research seminar, we expect that students will
register pass/fail (and not for a letter grade). There will be little by way
of formal exams, although we may have occasional homework assignments.
In fact, most of the grade will depend on class participation and the scribe
notes prepared by students (see below).
Scribe: Each registered
student will sign up as the official scribe for a specific lecture. This
involves taking detailed notes, reading the background papers, and preparing a
set of lecture notes that will be handed out to the entire course.
Topics
Schedule
Week | Dates | Topic | Lecturer | Slides | Scribe Notes
1 | Wed, Sep 25 | Introduction: Computing Distinct Values | Rajeev Motwani | Slides 1 (ppt) | Scribe 1 (ps, pdf)
2 | Mon, Sep 30; Wed, Oct 2 | Data Streams 1 (Sampling/Sketching/Synopses) | Rajeev Motwani, Mayur Datar | Slides 2 (ppt), Slides 3 (ppt) | Scribe 2 (ps, pdf)
3 | Mon, Oct 7; Wed, Oct 9 | Data Streams 2 (Histograms/Quantiles) | Rajeev Motwani, Gurmeet Manku | Slides 4 (ppt), Slides 5 (ppt) | Scribe 4.1 (doc, pdf), Scribe 4.2 (ps, pdf)
4 | Mon, Oct 14; Wed, Oct 16 | Association Rules | Rajeev Motwani, Nina Mishra | Slides 6 (ppt), Slides 7 (ppt) | Scribe 6.1 (doc, pdf), Scribe 6.2 (ps, pdf), Scribe 7 (ps, pdf)
5 | Mon, Oct 21; Wed, Oct 23 | Clustering | Nina Mishra | Slides 8 (ppt), Slides 9 (ppt) | Scribe 8 (doc, pdf), Scribe 9.1 (ps, pdf), Scribe 9.2 (doc, pdf)
6 | Mon, Oct 28; Wed, Oct 30 | Machine Learning | Nina Mishra | Slides 10 (ppt), Slides 11 (ppt) | Scribe 10 (ps, pdf), Scribe 11 (ps, pdf)
7 | Mon, Nov 4; Wed, Nov 6 | Nearest Neighbors and Similarity | Aris Gionis, Rajeev Motwani | Slides 12 (ps, pdf), Slides 13 (ppt) | Scribe 12.1 (doc, pdf), Scribe 12.2 (doc, pdf), Scribe 13 (ps, pdf)
8 | Mon, Nov 11; Wed, Nov 13 | External Memory Algorithms | Rajeev Motwani, Kamesh Munagala | Slides 14 (ppt), Slides 15 (ppt) | Scribe 15 (doc, pdf)
9 | Mon, Nov 18; Wed, Nov 20 | Web Graph and Link Analysis | Monika Henzinger, Glen Jeh, Taher Haveliwala | Slides 16 (ppt), Slides 17 (ppt) | Scribe 16 (doc, pdf), Scribe 16.2 (doc, ps), Scribe 17.1 (doc, pdf), Scribe 17.2 (doc, pdf), Scribe 17.3 (doc, pdf)
10 | Mon, Nov 25; Wed, Nov 27 | Network Algorithms | Balaji Prabhakar | Slides 18 (ps), Slides 19 (ps) | Scribe 19.1 (ps, pdf), Scribe 19.2 (doc, pdf)
11 | Mon, Dec 2; Wed, Dec 4 | Distributed Hashing and P2P Networks; Removing Duplicates | Datar/Motwani, Andrei Broder | Slides 20 (ppt) |
Reading
List
Introduction: Computing Distinct Values - Rajeev
Motwani
- Towards
Estimation Error Guarantees for Distinct Values, M. Charikar, S.
Chaudhuri, R. Motwani, and V. Narasayya. PODS 2000.
- Probabilistic counting algorithms for data base applications. P. Flajolet
and G. N. Martin. JCSS 31, 2, 1985.
- A
Linear Time Probabilistic Counting Algorithm for Database Applications,
K-Y. Whang, B. V. Zanden, and H. Taylor. TODS 15, 2, 1990.
- The
space complexity of approximating the frequency moments, N. Alon, Y.
Matias, and M. Szegedy. STOC 1996.
- Distinct
Sampling for Highly-Accurate Answers to Distinct Values Queries and Event
Reports, P. B. Gibbons. VLDB 2001.
Data
Streams 1 (Sampling/Sketching/Synopses) - Rajeev
Motwani/Mayur
Datar
Introduction
Sampling
- Random Sampling with a
Reservoir, J. S. Vitter. Trans. on Mathematical Software 11(1):37-57
(1985).
- Sampling from
a Moving Window over Streaming Data, B. Babcock, M. Datar, and R. Motwani.
SODA 2002.
- On
Random Sampling over Joins, S. Chaudhuri, R. Motwani, and V.
Narasayya. SIGMOD 1999.
- Towards
Estimation Error Guarantees for Distinct Values, M. Charikar, S.
Chaudhuri, R. Motwani, and V. Narasayya. PODS 2000.
- Distinct
Sampling for Highly-Accurate Answers to Distinct Values Queries and Event
Reports, P. B. Gibbons. VLDB 2001.
- Sampling
algorithms: lower bounds and applications, Z. Bar-Yossef, S. Ravi Kumar, and
D. Sivakumar. STOC 2001.
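As a concrete taste of the sampling readings, the basic reservoir scheme (commonly called Algorithm R) keeps a uniform random sample of k items from a stream of unknown length in one pass. The sketch below is an illustration only, not code from any of the papers listed. The function name and the optional `rng` parameter are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return k items chosen uniformly at random from a one-pass stream."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item number i+1 replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, random.Random(0))
print(sample)
```

Only k items are held in memory at any time, which is what makes the scheme suitable for streams too large to store.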
Sketching
- Probabilistic counting algorithms for data base applications. P. Flajolet
and G. N. Martin. JCSS 31, 2, 1985.
- A
Linear Time Probabilistic Counting Algorithm for Database Applications,
K-Y. Whang, B. V. Zanden, and H. Taylor. TODS 15, 2, 1990.
- The
space complexity of approximating the frequency moments, N. Alon, Y.
Matias, and M. Szegedy. STOC 1996.
- Finding
Frequent Items in Data Streams, M. Charikar, K. Chen, and M.
Farach-Colton. ICALP 2002.
- An
Approximate L1-Difference Algorithm for Massive Data Streams, J.
Feigenbaum, S. Kannan, M. Strauss, M. Viswanathan. FOCS 1999.
- Stable Distributions,
Pseudorandom Generators, Embeddings and Data Stream Computation, P. Indyk.
FOCS 2000.
Synopses
- Tracking
Join and Self-Join Sizes in Limited Storage, N. Alon, P. Gibbons, Y.
Matias and M. Szegedy. PODS 1999.
- Join
Synopses for Approximate Query Answering, S. Acharya, P. Gibbons, V.
Poosala, and S. Ramaswamy. SIGMOD 1999.
- Congressional
Samples for Approximate Answering of Group-By Queries, S. Acharya,
P. Gibbons, and V. Poosala. SIGMOD 2000.
- Overcoming
Limitations of Sampling for Aggregation Queries,
S. Chaudhuri, G. Das, M. Datar, R. Motwani and V. Narasayya. ICDE
2001.
- A
Robust Optimization-Based Approach for Approximate Answering of Aggregate
Queries, S. Chaudhuri, G. Das and V.
Narasayya. SIGMOD 2001.
- Synopsis
data structures for massive data sets, P. B. Gibbons and Y. Matias,
DIMACS 1999.
Data
Streams 2 (Synopses/Algorithms) - Rajeev
Motwani/Gurmeet Manku
Quantiles and Histograms
- The
New Jersey data reduction report, D. Barbara et al. IEEE Data
Engineering Bulletin 1997.
- Optimal
Histograms with Quality Guarantees, H.V. Jagadish, N. Koudas, S.
Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel. VLDB 1998.
- Random
Sampling Techniques for Space Efficient Online Computation of Order Statistics
of Large Datasets, G. S. Manku, S. Rajagopalan, and B. G. Lindsay.
SIGMOD 1999.
- Space-efficient
online computation of quantile summaries, M. Greenwald and S. Khanna.
SIGMOD 2001.
- Dynamic
multidimensional histograms, N. Thaper, S. Guha, P. Indyk, and N. Koudas.
SIGMOD 2002.
- Fast,
small-space algorithms for approximate histogram maintenance, A. Gilbert,
S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. Strauss. STOC
2002.
Sliding Window Algorithms
- Maintaining
Stream Statistics Over Sliding Windows, M. Datar, A. Gionis, P. Indyk, R.
Motwani. SODA 2002.
- Distributed
Streams Algorithms for Sliding Windows, P. Gibbons, S. Tirthapura. SPAA
2002.
- Maintaining Variance and K-Medians over Data Stream Windows. B. Babcock,
M. Datar, R. Motwani, and L. O'Callaghan. Manuscript, 2002.
Association Rules - Rajeev
Motwani/Nina
Mishra
Association Rule Mining and
Generalizations
- Mining
Association Rules between Sets of Items in Large Databases, R. Agrawal, T.
Imielinski, and A. Swami. SIGMOD 1993.
- Fast
Algorithms for Mining Association Rules, R. Agrawal and R. Srikant.
VLDB 1994.
- An
Effective Hash-Based Algorithm for Mining Association Rules, J. S. Park,
M.-S. Chen, and P. S. Yu. SIGMOD 1995.
- An Efficient
Algorithm for Mining Association Rules in Large Databases,
A. Savasere, E. Omiecinski, and S. Navathe. The VLDB
Journal 1995.
- Sampling
Large Databases for Association Rules, H. Toivonen. VLDB 1996.
- Dynamic Itemset
Counting and Implication Rules for Market Basket Data, S. Brin, R.
Motwani, S. Tsur, and J.D. Ullman. SIGMOD 1997.
- Query Flocks: A
Generalization of Association-Rule Mining, D. Tsur, J.D. Ullman, S.
Abiteboul, C. Clifton, R. Motwani, S. Nestorov and A. Rosenthal. SIGMOD
1998.
- Finding
Interesting Associations without Support Pruning, E. Cohen, M. Datar, S.
Fujiwara, A. Gionis, P. Indyk, R. Motwani, J.D. Ullman, and C. Yang. ICDE
2000.
- Dynamic
Miss-Counting Algorithms: Finding Implication and Similarity Rules with
Confidence Pruning, S. Fujiwara, R. Motwani, and J.D. Ullman. ICDE
2000.
Combinatorics of Association Rules
- New
results on monotone dualization and generating hypergraph transversals, T.
Eiter, G. Gottlob, and K. Makino. STOC 2002.
- Efficient
Read-Restricted Monotone CNF/DNF Dualization by Learning with Membership
Queries, C. Domingo, N. Mishra, and L. Pitt. Machine Learning
37(1):89-110 (1999).
- Identifying the Minimal
Transversals of a Hypergraph and Related Problems, T. Eiter, and G.
Gottlob. SIAM Journal on Computing 24(6): 1278-1304 (1995).
- On the Complexity of
Dualization of Monotone Disjunctive Normal Forms, M.L. Fredman and L.
Khachiyan.
Journal of Algorithms 21(3): 618-628 (1996).
- Levelwise search and borders of theories in knowledge
discovery, H. Mannila and H. Toivonen. Data Mining and
Knowledge Discovery 1: 241-258 (1997).
Frequency Counting
Clustering - Nina
Mishra
Basic Clustering Algorithms
- Clustering to
minimize the maximum intercluster distance, T.F. Gonzalez. Theoretical
Computer Science, 38: 293-306, (1985).
- Criteria for Polynomial-Time (Conceptual)
Clustering, L. Pitt and Reinke. Machine Learning 2:371-396
(1988).
- Local
Search Heuristics for k-median and Facility Location Problems, V. Arya, N.
Garg, R. Khandekar, A. Meyerson, K. Munagala and V. Pandit. STOC 2001.
- Improved
Combinatorial Algorithms for the Facility Location and k-Median Problems,
M. Charikar and S. Guha. FOCS 1999.
- Data Clustering: a review, A. K. Jain, M. N.
Murty, and P. J. Flynn, ACM Computing
Surveys, 31(3), 1999.
Clustering Large Data Sets and Streams
- Incremental
Clustering and Dynamic Information Retrieval, M. Charikar, C. Chekuri, T.
Feder and R. Motwani. STOC 1997.
- Clustering
Data Streams, S. Guha, N. Mishra, R. Motwani and L. O'Callaghan. FOCS
2000.
- Sublinear
Time Approximate Clustering, N. Mishra, D. Oblinger, and L. Pitt.
SODA 2001.
- High-Performance Clustering
of Streams and Large Data Sets, S. Guha, L. O'Callaghan, N. Mishra, A.
Meyerson, and R. Motwani. ICDE 2002.
Database Clustering
- Scaling
Clustering Algorithms to Large Databases, P. Bradley, U. Fayyad, and C.
Reina. KDD 1998.
- CURE:
An Efficient Clustering Algorithm for Large Databases, S. Guha, R.
Rastogi, and K. Shim. SIGMOD 1998. Note: this PDF file requires a
huge amount of temp space (over 200Mb).
- Clustering
Large Datasets in Arbitrary Metric Spaces, V. Ganti, R. Ramakrishnan, J.
Gehrke, A. L. Powell, and J.C. French. ICDE 1999.
- BIRCH: an efficient data clustering method for very large
databases, T. Zhang, R. Ramakrishnan, and M. Livny. SIGMOD 1996.
Spectral Clustering
- Fast
Monte-Carlo Algorithms for Finding Low-Rank Approximations, A. Frieze, R.
Kannan, and S. Vempala, FOCS 1998.
- Clustering in
large graphs and matrices, P. Drineas, R. Kannan, A. Frieze, S. Vempala,
and V. Vinay. SODA 1999.
- On clusterings:
good, bad and spectral, R. Kannan, S. Vempala, and A. Vetta. FOCS
2000.
- Normalized
Cuts and Image Segmentation, J. Shi and J. Malik. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 22(8), 888-905, August 2000.
- On
Spectral Clustering: Analysis and an algorithm, A. Y. Ng, M. Jordan, and
Y. Weiss. NIPS 2000.
- Optimal outlier
removal in high-dimensional spaces, J. Dunagan and S. Vempala. STOC
2001.
Machine Learning -
Nina
Mishra
Similarity and Nearest Neighbors - Aris Gionis/Rajeev Motwani
Similarity
- On
the Resemblance and Containment of Documents, A. Broder. SEQUENCES
1997.
- Syntactic
Clustering of the Web, A. Broder, S. Glassman, M. Manasse, and G.
Zweig, WWW6 1997.
- Min-Wise
Independent Permutations, A. Broder, M. Charikar, A. Frieze and M.
Mitzenmacher, JCSS 60(3): 630-659 (2000).
- Identifying and Filtering Near-Duplicate Documents, Andrei Broder. CPM
2000.
- Finding
Interesting Associations without Support Pruning, E. Cohen, M. Datar, S.
Fujiwara, A. Gionis, P. Indyk, R. Motwani, J.D. Ullman, and C. Yang. ICDE
2000.
- Similarity
Estimation Techniques from Rounding Algorithms, M. Charikar, STOC
2002.
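To make the resemblance idea in these readings concrete, the sketch below estimates the Jaccard resemblance |A ∩ B| / |A ∪ B| of two sets by comparing minimum hash values. It is an illustration under stated assumptions, not the construction from any one paper: salted SHA-1 hashes stand in for min-wise independent permutations, and the function names and the default of 128 hash functions are choices made for this example.

```python
import hashlib
from typing import List, Set


def _hash(seed: int, item: str) -> int:
    """Salted 64-bit hash, used here in place of a random permutation."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items: Set[str], num_hashes: int = 128) -> List[int]:
    """One minimum per hash function; items must be non-empty."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_resemblance(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of positions where the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_resemblance(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard resemblance, for comparison with the estimate."""
    return len(a & b) / len(a | b)


doc_a = {str(i) for i in range(0, 100)}
doc_b = {str(i) for i in range(50, 150)}
estimate = estimate_resemblance(minhash_signature(doc_a), minhash_signature(doc_b))
print(exact_resemblance(doc_a, doc_b), estimate)
```

Under an ideal random permutation, the two minima coincide with probability equal to the resemblance, so the fraction of agreeing positions is an estimate of it. Signatures are far smaller than the sets themselves, which is what makes pairwise comparison of many documents feasible.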
Nearest Neighbors
Random Projections
- Extensions of Lipschitz maps into a
Hilbert space, W. Johnson and J. Lindenstrauss. Contemp.
Math., 26 (1984), 189-206.
- The Johnson-Lindenstrauss lemma and the sphericity of some
graphs, P. Frankl and H. Maehara. Journal of
Combinatorial Theory Series B, 44:355-362, 1988.
- An
elementary proof of the Johnson Lindenstrauss lemma, S. Dasgupta and
Anupam Gupta. ICSI TR-99-006.
- Fast
Monte-Carlo Algorithms for Finding Low-Rank Approximations, A. Frieze, R.
Kannan, and S. Vempala. FOCS 1998.
- Database-friendly
Random Projections, D. Achlioptas, PODS 2001.
External
Memory Algorithms - Kamesh Munagala
- External
Memory Algorithms and Data Structures, J. S. Vitter. DIMACS Series.
- The
input/output complexity of sorting and related problems, A. Aggarwal and
J. Vitter. CACM 31(9), 1988.
- External-Memory
Graph Algorithms, Y.-J. Chiang, M. T. Goodrich, E. F. Grove, R.
Tamassia, D. E. Vengroff, and J. S. Vitter. SODA 1995.
- I/O-Complexity of
Graph Algorithms, K. Munagala and A. Ranade. SODA 1999.
Web Graph
and Link Analysis - Monika Henzinger/Glen Jeh/Taher Haveliwala
PageRank and
Hubs-Authorities
- The
PageRank Citation Ranking: Bringing Order to the Web, L. Page, S. Brin, R.
Motwani, T. Winograd. Stanford Digital Libraries Working Paper, 1998.
- What
can you do with a Web in your Pocket?, S. Brin, R. Motwani, L. Page, and
T. Winograd, Bulletin of the Technical Committee on Data Engineering,
21(1998): 37-47.
- Authoritative
sources in a hyperlinked environment, J. Kleinberg. SODA 1998.
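For orientation before reading the PageRank paper, the sketch below computes ranks on a tiny graph by repeated redistribution of rank along links. It is a simplified illustration, not the authors' implementation. The damping value 0.85, the fixed iteration count, and the uniform treatment of pages with no out-links are assumptions made for this example.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power iteration over a graph given as page -> list of linked pages."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every page receives a baseline share from the random jump.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new_rank[v] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]})
print(ranks)
```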
Personalized PageRank
Network Algorithms -
Balaji
Prabhakar
- Network
Algorithms, A. Goel, N. McKeown, and B. Prabhakar. (PowerPoint slides of
tutorial at SIGCOMM 2001).
Distributed
Hashing and P2P Networks - Mayur Datar
- Consistent Hashing and
Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the
World Wide Web, D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin,
and R. Panigrahy. STOC 1997.
- Web
Caching and Consistent Hashing, D. Karger, A. Sherman, A. Berkheimer, B.
Bogstad, R. Dhanidina, K. Iwamoto, B. Kim, L. Matkins, and Y. Yerushalmi.
WWW 1999.
- Chord:
A Scalable Peer-to-peer Lookup Service for Internet Applications, I.
Stoica, R. Morris, D. Karger, M. Frans Kaashoek, and H. Balakrishnan.
SIGCOMM 2001.
- Analysis of the
Evolution of Peer to Peer Systems, D. Liben-Nowell, H. Balakrishnan, and
D. Karger.
- A
Scalable Content-Addressable Network, S. Ratnasamy, P. Francis, M.
Handley, R. Karp, and S. Shenker. SIGCOMM 2001.
- Pastry:
Scalable, distributed object location and routing for large-scale peer-to-peer
systems, A. Rowstron and P. Druschel. Middleware 2001.
- Viceroy: A
Scalable and Dynamic Emulation of the Butterfly, D. Malkhi, M. Naor, and
D. Ratajczak. PODC 2002.
- Tapestry: An
Infrastructure for Fault-Tolerant Wide-Area Location and Routing, B. Y.
Zhao, J. D. Kubiatowicz, and A. D. Joseph, Berkeley Report UCB/CSD 01/1141
(2001).
- Distributed
Object Location in a Dynamic Network, K. Hildrum, J. D.
Kubiatowicz, S. Rao, and B. Y. Zhao. SPAA 2002.
- Accessing
Nearby Copies of Replicated Objects in a Distributed Environment, C.G.
Plaxton, R. Rajaraman, A. W. Richa. SPAA 1997.
- Search
and replication in unstructured peer-to-peer networks, Q. Lv, P. Cao, E.
Cohen, K. Li, and S. Shenker. SIGMETRICS 2002.
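The consistent hashing idea behind several of these readings can be sketched in a few lines: nodes and keys are hashed onto a ring, and each key belongs to the first node point that follows it. The code below is a minimal illustration, not the protocol of any one paper. The class name `HashRing`, the use of MD5 as the hash, and the default of 50 points per node are assumptions made for this example.

```python
import bisect
import hashlib
from typing import Dict, List


def _point(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, replicas: int = 50) -> None:
        self.replicas = replicas          # ring points per node
        self._points: List[int] = []      # sorted ring positions
        self._owner: Dict[int, str] = {}  # position -> node name

    def add(self, node: str) -> None:
        for r in range(self.replicas):
            p = _point(f"{node}#{r}")
            if p in self._owner:
                continue  # skip the (very unlikely) position collision
            self._owner[p] = node
            bisect.insort(self._points, p)

    def remove(self, node: str) -> None:
        for r in range(self.replicas):
            p = _point(f"{node}#{r}")
            if self._owner.get(p) == node:
                del self._owner[p]
                self._points.remove(p)

    def lookup(self, key: str) -> str:
        """Return the node owning the first point clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        i = bisect.bisect_right(self._points, _point(key)) % len(self._points)
        return self._owner[self._points[i]]


ring = HashRing()
for name in ("cache-a", "cache-b", "cache-c"):
    ring.add(name)
print(ring.lookup("http://example.org/index.html"))
```

The property worth noticing is that adding a node moves keys only onto that new node, and removing it restores the earlier assignment. All other keys stay where they were, unlike with a plain `hash(key) % n` scheme.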
Additional Topics
Epidemics, Gossiping, and Rumor
Mongering
- Epidemic Algorithms for Replicated Database Maintenance. A.J. Demers, D.H.
Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H.E. Sturgis, D.C.
Swinehart, and D.B. Terry. Operating Systems Review 22(1): 8-32
(1988).
- Scalable and secure resource
location, R. van Renesse. HICSS 2000.
- Randomized Rumor
Spreading, R. Karp, C. Schindelhauer, S. Shenker, and B. Vocking.
FOCS 2000.
- Spatial gossip and
resource location protocols, D. Kempe, J. Kleinberg, and A. Demers.
STOC 2001.
OLAP and Datacubes
Fuzzy Information and
Aggregation Algorithms
Compression
- Notes
on Compression by Guy Blelloch from CMU.
- Managing Gigabytes: Compressing and
Indexing Documents and Images, I.H. Witten, A. Moffat and T. C. Bell,
Morgan Kaufmann, 1999.
- Data
Compression: The Complete Reference, D. Salomon, Springer Verlag,
1998.
- Introduction
to Data Compression, Second Edition, K. Sayood, Morgan Kaufmann, 2000.
- Compressed
Bloom Filters, M. Mitzenmacher, PODC 2001.
Indexing and Searching:
Linear Algebra in Information
Retrieval
Error-correcting Codes
- Michael Mitzenmacher's slides on
codes (Shannon's theorem and introduction to error correcting codes: ps,
Reed-Solomon Codes: ps)
- Practical Loss-Resilient
Codes, M. Luby, M. Mitzenmacher, A. Shokrollahi, D. Spielman,
and V. Stemann, STOC 1997. (ps, pdf, slides (ppt))
- Analysis of Random Processes via
And-Or Tree Evaluation, M. Luby, M. Mitzenmacher, and A. Shokrollahi,
SODA 1998. (ps, pdf)
- Analysis of Low Density Codes and
Improved Designs Using Irregular Graphs, M. Luby, M. Mitzenmacher, A.
Shokrollahi, and D. Spielman, STOC 1998. (ps, pdf, slides (ppt))