CS 361A (Advanced Algorithms for Internet Applications)
News Flash
Midterm Exam
We have prepared a take-home midterm exam, which is
due in class on Wednesday, Nov 13. The exam is available in PostScript
and PDF formats.
Administrivia
Instructors: Rajeev Motwani and Nina Mishra
Teaching Assistant: Mayur Datar
Class Schedule: Mon/Wed, 3:15-4:30, Gates B12
Office Hours:
[Mayur Datar] Thu/Fri, 1:30-2:30, Gates 482
[Nina Mishra] Thu, 3:00-4:00, Gates 484
[Rajeev Motwani] Tue, 1:30-2:30, Gates 474
Class Mailing List
We have set up a class mailing list. Please subscribe to it to get the latest
information regarding the class. The email address is cs361aclass@lists.stanford.edu.
You can subscribe by sending mail to majordomo@lists.stanford.edu
with the following text in the body of the
message: subscribe cs361aclass
Grading Policy Revision
Several MS students want to take this course for a letter grade to satisfy
their specialization requirements. We have modified our signup policy to
allow this, so feel free to sign up for a letter grade. We will use the scribe
notes (see below), a couple of homeworks, and class participation to determine
the grade for these students. So, if you sign up for a letter grade, be sure
to serve as a scribe for at least one lecture.
Sign-Up
To sign up for this course, please send email to Mayur Datar with the following
information: name, department, status (PhD/MS/UG, year), area of specialization
(Databases, Systems, Theory, etc.), and email address.
Course Overview
With the maturing of the
Internet, the field of algorithms is undergoing an interesting transformation.
For one thing, new areas and applications requiring an algorithmic mindset
have emerged, such as information retrieval and web searching, massive and
streaming data, data mining, machine learning, distributed systems (including
so-called P2P networks), and network algorithms. To service these, novel
algorithmic techniques have been and are being developed. Furthermore, new
applications have led to new models for algorithms, most prominent of which is
the field of algorithms for data streams. This course will give an overview of
such topics with an eye towards identifying interesting research directions.
Since Stanford people have played a prominent role in these new developments,
wherever possible we will attempt to bring in as guest lecturers the original
authors of the papers being covered. This course should be of interest to
graduate students in computer science and related fields, especially those
with a mathematical bent of mind. We will assume familiarity with basic
material in algorithms, databases, probability, etc. (at the level of the
core undergraduate courses on these topics).
Grading
Since this
course will be treated as a graduate research seminar, we expect students will
register pass/fail (and not for a letter grade). There will be little by way
of formal exams, although we may have occasional homework assignments.
In fact, most of the grade will depend on class participation and the scribe
notes prepared by students (see below).
Scribe
Each registered
student will sign up as the official scribe for a specific lecture. This
involves taking detailed notes, reading the background papers, and preparing a
set of lecture notes that will be handed out to the entire course.
Topics Schedule
Week | Dates | Topic | Lecturer | Slides | Scribe Notes
1 | Wed, Sep 25 | Introduction: Computing Distinct Values | Rajeev Motwani | Slides 1 (ppt) | Scribe 1 (ps, pdf)
2 | Mon, Sep 30; Wed, Oct 2 | Data Streams 1 (Sampling/Sketching/Synopses) | Rajeev Motwani; Mayur Datar | Slides 2 (ppt); Slides 3 (ppt) | Scribe 2 (ps, pdf)
3 | Mon, Oct 7; Wed, Oct 9 | Data Streams 2 (Histograms/Quantiles) | Rajeev Motwani; Gurmeet Manku | Slides 4 (ppt); Slides 5 (ppt) | Scribe 4.1 (doc, pdf); Scribe 4.2 (ps, pdf)
4 | Mon, Oct 14; Wed, Oct 16 | Association Rules | Rajeev Motwani; Nina Mishra | Slides 6 (ppt); Slides 7 (ppt) | Scribe 6.1 (doc, pdf); Scribe 6.2 (ps, pdf); Scribe 7 (ps, pdf)
5 | Mon, Oct 21; Wed, Oct 23 | Clustering | Nina Mishra | Slides 8 (ppt); Slides 9 (ppt) | Scribe 8 (doc, pdf); Scribe 9.1 (ps, pdf); Scribe 9.2 (doc, pdf)
6 | Mon, Oct 28; Wed, Oct 30 | Machine Learning | Nina Mishra | Slides 10 (ppt); Slides 11 (ppt) | Scribe 10 (ps, pdf); Scribe 11 (ps, pdf)
7 | Mon, Nov 4; Wed, Nov 6 | Nearest Neighbors and Similarity | Aris Gionis; Rajeev Motwani | Slides 12 (ps, pdf); Slides 13 (ppt) | Scribe 12.1 (doc, pdf); Scribe 12.2 (doc, pdf); Scribe 13 (ps, pdf)
8 | Mon, Nov 11; Wed, Nov 13 | External Memory Algorithms | Rajeev Motwani; Kamesh Munagala | Slides 14 (ppt); Slides 15 (ppt) | Scribe 15 (doc, pdf)
9 | Mon, Nov 18; Wed, Nov 20 | Web Graph and Link Analysis | Monika Henzinger; Glen Jeh; Taher Haveliwala | Slides 16 (ppt); Slides 17 (ppt) | Scribe 16 (doc, pdf); Scribe 16.2 (doc, ps); Scribe 17.1 (doc, pdf); Scribe 17.2 (doc, pdf); Scribe 17.3 (doc, pdf)
10 | Mon, Nov 25; Wed, Nov 27 | Network Algorithms | Balaji Prabhakar | Slides 18 (ps); Slides 19 (ps) | Scribe 19.1 (ps, pdf); Scribe 19.2 (doc, pdf)
11 | Mon, Dec 2; Wed, Dec 4 | Distributed Hashing and P2P Networks; Removing Duplicates | Datar/Motwani; Andrei Broder | Slides 20 (ppt) |
Reading List
Introduction: Computing Distinct Values - Rajeev Motwani
 Towards
Estimation Error Guarantees for Distinct Values, M. Charikar, S.
Chaudhuri, R. Motwani, and V. Narasayya. PODS 2000.
 Probabilistic counting algorithms for data base applications. P. Flajolet
and G. N. Martin. JCSS 31, 2, 1985.
 A
Linear Time Probabilistic Counting Algorithm for Database Applications,
K.-Y. Whang, B. V. Zanden, and H. Taylor. TODS 15, 2, 1990.
 The
space complexity of approximating the frequency moments, N. Alon, Y.
Matias, and M. Szegedy. STOC 1996.
 Distinct
Sampling for Highly-Accurate Answers to Distinct Values Queries and Event
Reports, P. B. Gibbons. VLDB 2001.
Data Streams 1 (Sampling/Sketching/Synopses) - Rajeev Motwani/Mayur Datar
Introduction
Sampling
 Random Sampling with a
Reservoir, J. S. Vitter. Trans. on Mathematical Software 11(1):37-57
(1985).
 Sampling from
a Moving Window over Streaming Data, B. Babcock, M. Datar, and R. Motwani.
SODA 2002.
 On
Random Sampling over Joins, S. Chaudhuri, R. Motwani, and V.
Narasayya. SIGMOD 1999.
 Towards
Estimation Error Guarantees for Distinct Values, M. Charikar, S.
Chaudhuri, R. Motwani, and V. Narasayya. PODS 2000.
 Distinct
Sampling for Highly-Accurate Answers to Distinct Values Queries and Event
Reports, P. B. Gibbons. VLDB 2001.
 Sampling
algorithms: lower bounds and applications, Z. Bar-Yossef, S. Ravi Kumar,
and D. Sivakumar. STOC 2001.
Sketching
 Probabilistic counting algorithms for data base applications. P. Flajolet
and G. N. Martin. JCSS 31, 2, 1985.
 A
Linear Time Probabilistic Counting Algorithm for Database Applications,
K.-Y. Whang, B. V. Zanden, and H. Taylor. TODS 15, 2, 1990.
 The
space complexity of approximating the frequency moments, N. Alon, Y.
Matias, and M. Szegedy. STOC 1996.
 Finding
Frequent Items in Data Streams, M. Charikar, K. Chen, and M.
Farach-Colton. ICALP 2002.
 An
Approximate L1-Difference Algorithm for Massive Data Streams, J.
Feigenbaum, S. Kannan, M. Strauss, M. Viswanathan. FOCS 1999.
 Stable Distributions,
Pseudorandom Generators, Embeddings and Data Stream Computation, P. Indyk.
FOCS 2000.
Synopses
 Tracking
Join and Self-Join Sizes in Limited Storage, N. Alon, P. Gibbons, Y.
Matias and M. Szegedy. PODS 1999.
 Join
Synopses for Approximate Query Answering, S. Acharya, P. Gibbons, V.
Poosala, and S. Ramaswamy. SIGMOD 1999.
 Congressional
Samples for Approximate Answering of Group-By Queries, S. Acharya,
P. Gibbons, and V. Poosala. SIGMOD 2000.
 Overcoming
Limitations of Sampling for Aggregation Queries,
S. Chaudhuri, G. Das, M. Datar, R. Motwani and V. Narasayya. ICDE
2001.
 A
Robust Optimization-Based Approach for Approximate Answering of Aggregate
Queries, S. Chaudhuri, G. Das and V.
Narasayya. SIGMOD 2001.
 Synopsis
data structures for massive data sets, P. B. Gibbons and Y. Matias,
DIMACS 1999.
Data Streams 2 (Synopses/Algorithms) - Rajeev Motwani/Gurmeet Manku
Quantiles and Histograms
 The
New Jersey data reduction report, D. Barbara et al. IEEE Data
Engineering Bulletin 1997.
 Optimal
Histograms with Quality Guarantees, H.V. Jagadish, N. Koudas, S.
Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel. VLDB 1998.
 Random
Sampling Techniques for Space Efficient Online Computation of Order Statistics
of Large Datasets, G. S. Manku, S. Rajagopalan, and B. G. Lindsay.
SIGMOD 1999.
 Space-efficient
online computation of quantile summaries, M. Greenwald and S. Khanna.
SIGMOD 2001.
 Dynamic
multidimensional histograms, N. Thaper, S. Guha, P. Indyk, and N. Koudas.
SIGMOD 2002.
 Fast,
small-space algorithms for approximate histogram maintenance, A. Gilbert,
S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. Strauss. STOC
2002.
Sliding Window Algorithms
 Maintaining
Stream Statistics Over Sliding Windows, M. Datar, A. Gionis, P. Indyk, R.
Motwani. SODA 2002.
 Distributed
Streams Algorithms for Sliding Windows, P. Gibbons, S. Tirthapura. SPAA
2002.
 Maintaining Variance and K-Medians over Data Stream Windows. B. Babcock,
M. Datar, R. Motwani, and L. O'Callaghan. Manuscript, 2002.
Association Rules - Rajeev Motwani/Nina Mishra
Association Rule Mining and
Generalizations
 Mining
Associations between Sets of Items in Massive Databases, R. Agrawal, T.
Imielinski, and A. Swami. SIGMOD 1993.
 Fast
Algorithms for Mining Association Rules, R. Agrawal and R. Srikant.
VLDB 1994.
 An
Effective Hash-Based Algorithm for Mining Association Rules, J. S. Park,
M.-S. Chen, and P. S. Yu. SIGMOD 1995.
 An Efficient
Algorithm for Mining Association Rules in Large Databases,
A. Savasere, E. Omiecinski, and S. Navathe.
The VLDB Journal 1995.
 Sampling
Large Databases for Association Rules, H. Toivonen. VLDB 1996.
 Dynamic Itemset
Counting and Implication Rules for Market Basket Data, S. Brin, R.
Motwani, S. Tsur, and J.D. Ullman. SIGMOD 1997.
 Query Flocks: A
Generalization of Association-Rule Mining, D. Tsur, J.D. Ullman, S.
Abiteboul, C. Clifton, R. Motwani, S. Nestorov and A. Rosenthal. SIGMOD
1998.
 Finding
Interesting Associations without Support Pruning, E. Cohen, M. Datar, S.
Fujiwara, A. Gionis, P. Indyk, R. Motwani, J.D. Ullman, and C. Yang. ICDE
2000.
 Dynamic
Miss-Counting Algorithms: Finding Implication and Similarity Rules with
Confidence Pruning, S. Fujiwara, R. Motwani, and J.D. Ullman. ICDE
2000.
Combinatorics of Association Rules
 New
results on monotone dualization and generating hypergraph transversals, T.
Eiter, G. Gottlob, and K. Makino. STOC 2002.
 Efficient
Read-Restricted Monotone CNF/DNF Dualization by Learning with Membership
Queries, C. Domingo, N. Mishra, and L. Pitt. Machine Learning
37(1):89-110 (1999).
 Identifying the Minimal
Transversals of a Hypergraph and Related Problems, T. Eiter and G.
Gottlob. SIAM Journal on Computing 24(6): 1278-1304 (1995).
 On the Complexity of
Dualization of Monotone Disjunctive Normal Forms, M.L. Fredman and L.
Khachiyan.
Journal of Algorithms 21(3): 618-628 (1996).
 Levelwise search and borders of theories in knowledge
discovery, H. Mannila and H. Toivonen. Data Mining and
Knowledge Discovery 1: 241-258 (1997).
Frequency Counting
Clustering - Nina Mishra
Basic Clustering Algorithms
 Clustering to
minimize the maximum intercluster distance, T.F. Gonzalez. Theoretical
Computer Science, 38: 293-306 (1985).
 Criteria for Polynomial-Time (Conceptual) Clustering, L.
Pitt and Reinke. Machine Learning 2:371-396 (1988).
 Local
Search Heuristics for k-median and Facility Location Problems, V. Arya, N.
Garg, R. Khandekar, A. Meyerson, K. Munagala and V. Pandit. STOC 2001.
 Improved
Combinatorial Algorithms for the Facility Location and k-Median Problems,
M. Charikar and S. Guha. FOCS 1999.
 Data Clustering: a review, A. K. Jain, M. N.
Murty, and P. J. Flynn, ACM Computing
Surveys, 31(3), 1999.
Clustering Large Data Sets and Streams
 Incremental
Clustering and Dynamic Information Retrieval, M. Charikar, C. Chekuri, T.
Feder and R. Motwani. STOC 1997.
 Clustering
Data Streams, S. Guha, N. Mishra, R. Motwani and L. O'Callaghan. FOCS
2000.
 Sublinear
Time Approximate Clustering, N. Mishra, D. Oblinger, and L. Pitt.
SODA 2001.
 High-Performance Clustering
of Streams and Large Data Sets, S. Guha, L. O'Callaghan, N. Mishra, A.
Meyerson, and R. Motwani. ICDE 2002.
Database Clustering
 Scaling
Clustering Algorithms to Large Databases, P. Bradley, U. Fayyad, and C.
Reina. KDD 1998.
 CURE:
An Efficient Clustering Algorithm for Large Databases, S. Guha, R.
Rastogi, and K. Shim. SIGMOD 1998. Note: this PDF file requires a
huge amount of temp space (over 200 MB).
 Clustering
Large Datasets in Arbitrary Metric Spaces, V. Ganti, R. Ramakrishnan, J.
Gehrke, A. L. Powell, and J.C. French. ICDE 1999.
 BIRCH: an efficient data clustering method for very large
databases, T. Zhang, R. Ramakrishnan, and M. Livny. SIGMOD 1996.
Spectral Clustering
 Fast
Monte-Carlo Algorithms for Finding Low-Rank Approximations, A. Frieze, R.
Kannan, and S. Vempala, FOCS 1998.
 Clustering in
large graphs and matrices, P. Drineas, R. Kannan, A. Frieze, S. Vempala,
and V. Vinay. SODA 1999.
 On clusterings:
good, bad and spectral, R. Kannan, S. Vempala, and A. Vetta. FOCS
2000.
 Normalized
Cuts and Image Segmentation, J. Shi and J. Malik. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 22(8), 888-905, August 2000.
 On
Spectral Clustering: Analysis and an algorithm, A. Y. Ng, M. Jordan, and
Y. Weiss. NIPS 2000.
 Optimal outlier
removal in high-dimensional spaces, J. Dunagan and S. Vempala. STOC
2001.
Machine Learning - Nina Mishra
Similarity and Nearest Neighbors - Aris Gionis/Rajeev Motwani
Similarity
 On
the Resemblance and Containment of Documents, A. Broder. SEQUENCES
1997.

Syntactic
Clustering of the Web, A. Broder, S. Glassman, M. Manasse, and G.
Zweig, WWW6 1997.
 Min-Wise
Independent Permutations, A. Broder, M. Charikar, A. Frieze and M.
Mitzenmacher, JCSS 60(3): 630-659 (2000).
 Identifying and Filtering Near-Duplicate Documents, Andrei Broder. CPM
2000.
 Finding
Interesting Associations without Support Pruning, E. Cohen, M. Datar, S.
Fujiwara, A. Gionis, P. Indyk, R. Motwani, J.D. Ullman, and C. Yang. ICDE
2000.
 Similarity
Estimation Techniques from Rounding Algorithms, M. Charikar, STOC
2002.
Nearest Neighbors
Random Projections
 Extensions of Lipschitz maps into a
Hilbert space, W. Johnson and J. Lindenstrauss. Contemp.
Math., 26 (1984), 189-206.
 The Johnson-Lindenstrauss lemma and the sphericity of some
graphs, P. Frankl and H. Maehara. Journal of
Combinatorial Theory Series B, 44:355-362, 1988.
 An
elementary proof of the Johnson-Lindenstrauss lemma, S. Dasgupta and
A. Gupta. ICSI TR-99-006.
 Fast
Monte-Carlo Algorithms for Finding Low-Rank Approximations, A. Frieze, R.
Kannan, and S. Vempala. FOCS 1998.
 Database-friendly
Random Projections, D. Achlioptas, PODS 2001.
External Memory Algorithms - Kamesh Munagala
 External
Memory Algorithms and Data Structures, J. S. Vitter. DIMACS Series.
 The
input/output complexity of sorting and related problems, A. Aggarwal and
J. Vitter. CACM 31(9), 1988.
 External-Memory
Graph Algorithms, Y.-J. Chiang, M. T. Goodrich, E. F. Grove, R.
Tamassia, D. E. Vengroff, and J. S. Vitter. SODA 1995.
 I/O-Complexity of
Graph Algorithms, K. Munagala and A. Ranade. SODA 1999.
Web Graph and Link Analysis - Monika Henzinger/Glen Jeh/Taher Haveliwala
PageRank and Hubs-Authorities
 The
PageRank Citation Ranking: Bringing Order to the Web, L. Page, S. Brin, R.
Motwani, T. Winograd. Stanford Digital Libraries Working Paper, 1998.
 What
can you do with a Web in your Pocket?, S. Brin, R. Motwani, L. Page, and
T. Winograd, Bulletin of the Technical Committee on Data Engineering,
21 (1998): 37-47.
 Authoritative
sources in a hyperlinked environment, J. Kleinberg. SODA 1998.
Personalized PageRank
Network Algorithms - Balaji Prabhakar

Network
Algorithms, A. Goel, N. McKeown, and B. Prabhakar. (PowerPoint slides of
tutorial at SIGCOMM 2001).
Distributed Hashing and P2P Networks - Mayur Datar
 Consistent Hashing and
Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the
World Wide Web, D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin,
and R. Panigrahy. STOC 1997.
 Web
Caching and Consistent Hashing, D. Karger, A. Sherman, A. Berkheimer, B.
Bogstad, R. Dhanidina, K. Iwamoto, B. Kim, L. Matkins, and Y. Yerushalmi.
WWW 1999.
 Chord:
A Scalable Peer-to-peer Lookup Service for Internet Applications, I.
Stoica, R. Morris, D. Karger, M. Frans Kaashoek, and H. Balakrishnan.
SIGCOMM 2001.
 Analysis of the
Evolution of Peer to Peer Systems, D. Liben-Nowell, H. Balakrishnan, and
D. Karger.
 A
Scalable Content-Addressable Network, S. Ratnasamy, P. Francis, M.
Handley, R. Karp, and S. Shenker. SIGCOMM 2001.
 Pastry:
Scalable, distributed object location and routing for large-scale peer-to-peer
systems, A. Rowstron and P. Druschel. Middleware 2001.
 Viceroy: A
Scalable and Dynamic Emulation of the Butterfly, D. Malkhi, M. Naor, and
D. Ratajczak. PODC 2002.
 Tapestry: An
Infrastructure for Fault-Tolerant Wide-Area Location and Routing, B. Y.
Zhao, J. D. Kubiatowicz, and A. D. Joseph, Berkeley Report UCB/CSD 01/1141
(2001).
 Distributed
Object Location in a Dynamic Network, K. Hildrum, J. D.
Kubiatowicz, S. Rao, and B. Y. Zhao. SPAA 2002.
 Accessing
Nearby Copies of Replicated Objects in a Distributed Environment, C.G.
Plaxton, R. Rajaraman, A. W. Richa. SPAA 1997.
 Search
and replication in unstructured peer-to-peer networks, Q. Lv, P. Cao, E.
Cohen, K. Li, and S. Shenker. SIGMETRICS 2002.
Additional Topics
Epidemics, Gossiping, and Rumor Mongering
 Epidemic Algorithms for Replicated Database Maintenance. A.J. Demers, D.H.
Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H.E. Sturgis, D.C.
Swinehart, and D.B. Terry. Operating Systems Review 22(1): 8-32
(1988).
 Scalable and secure resource
location, R. van Renesse. HICSS 2000.
 Randomized Rumor
Spreading, R. Karp, C. Schindelhauer, S. Shenker, and B. Vocking.
FOCS 2000.
 Spatial gossip and
resource location protocols, D. Kempe, J. Kleinberg, and A. Demers.
STOC 2001.
OLAP and Datacubes
Fuzzy Information and Aggregation Algorithms
Compression
 Notes
on Compression by Guy Blelloch from CMU.
 Managing Gigabytes: Compressing and
Indexing Documents and Images, I.H. Witten, A. Moffat and T. C. Bell,
Morgan Kaufmann, 1999.
 Data
Compression: The Complete Reference, D. Salomon, Springer-Verlag,
1998.
 Introduction
to Data Compression, Second Edition, K. Sayood, Morgan Kaufmann, 2000.
 Compressed
Bloom Filters, M. Mitzenmacher, PODC 2001.
Indexing and Searching
Linear Algebra in Information Retrieval
Error-correcting Codes

Michael Mitzenmacher's slides on
codes (Shannon's theorem and introduction to error correcting codes: ps,
Reed-Solomon Codes: ps)

Practical Loss-Resilient
Codes, M. Luby, M. Mitzenmacher, A. Shokrollahi, D. Spielman,
and V. Stemann, STOC 1997. (ps, pdf, slides (ppt))

Analysis of Random Processes via
And-Or Tree Evaluation, M. Luby, M. Mitzenmacher, and A. Shokrollahi,
SODA 1998. (ps, pdf)

Analysis of Low Density Codes and
Improved Designs Using Irregular Graphs, M. Luby, M. Mitzenmacher, A.
Shokrollahi, and D. Spielman, STOC 1998. (ps, pdf, slides (ppt))