CS 361A

(Advanced Algorithms for Internet Applications)


News Flash

Midterm Exam: We have prepared a take-home midterm exam, which is due in class on Wednesday, Nov 13. The exam is available in PostScript and PDF formats.

Administrivia
        Instructors:   Rajeev Motwani and Nina Mishra
        Teaching Assistant: Mayur Datar
        Class Schedule: Mon/Wed, 3:15-4:30, Gates B12
        Office Hours:
                                [Mayur Datar]            Thu/Fri, 1:30-2:30, Gates 482
                                [Nina Mishra]             Thu, 3:00-4:00, Gates 484
                                [Rajeev Motwani]        Tue, 1:30-2:30, Gates 474

Class Mailing List: We have set up a class mailing list. Please subscribe to it to get the latest information regarding the class. The email address is cs361a-class@lists.stanford.edu. You can subscribe by sending an email to majordomo@lists.stanford.edu with the following text in the body of the message: subscribe cs361a-class

Grading Policy Revision: Several MS students want to take this course for a letter grade to satisfy their specialization requirements. We have modified our sign-up policy to allow this, so feel free to sign up for a letter grade. We will use the scribe notes (see below), a couple of homeworks, and class participation to determine the grade for these students. So, if you sign up for a letter grade, be sure to serve as a scribe for at least one lecture.

Sign-Up

To sign up for this course, please send email to Mayur Datar with the following information:
name, department, status (PhD/MS/UG, year), area of specialization (Databases, Systems, Theory, etc.), and email address.

Course Overview

With the maturing of the Internet, the field of algorithms is undergoing an interesting transformation. For one thing, new areas and applications requiring an algorithmic mind-set have emerged, such as information retrieval and web searching, massive and streaming data, data mining, machine learning, distributed systems (including so-called P2P networks), and network algorithms. To service these, novel algorithmic techniques have been and are being developed. Furthermore, new applications have led to new models for algorithms, the most prominent of which is the field of algorithms for data streams. This course will give an overview of such topics with an eye towards identifying interesting research directions. Since Stanford people have played a prominent role in these new developments, wherever possible we will attempt to bring in as guest lecturers the original authors of the papers being covered. This course should be of interest to graduate students in computer science and related fields, especially those with a mathematical bent of mind. We will assume familiarity with basic material in algorithms, databases, probability, etc. (at the level of the core undergraduate courses on these topics).

Grading: Since this course will be treated as a graduate research seminar, we expect students will register pass/fail (and not for a letter grade). There will be little by way of formal exams, although we may have occasional homework assignments. In fact, most of the grade will depend on class participation and the scribe notes prepared by students (see below).

Scribe: Each registered student will sign up as the official scribe for a specific lecture. This involves taking detailed notes, reading the background papers, and preparing a set of lecture notes that will be handed out to the entire course.

Topics

Introduction: Computing Distinct Values (1 week)
Streaming algorithms (2 weeks)
Data Streams 1 (Sampling/Sketches/Synopses)
Data Streams 2 (Histograms/Algorithms)

Data Mining algorithms (3 weeks)
Association Rules
Clustering
Machine Learning

Nearest Neighbors and Similarity (1 week)
External Memory Algorithms (1 week)
Web Graph and Link Analysis (1 week)
Network Algorithms (1 week)
Distributed Hashing and P2P Networks (1 week)
Additional Topics
Epidemics, Gossiping, and Rumor Mongering (1 week)
OLAP and Data Cubes (1 week)
Fuzzy Information and Aggregation Algorithms (1 week)
Compression (1 week)
Indexing and Searching (1 week)
Linear Algebra in Information Retrieval (1 week)
Error-Correcting Codes (1 week)

Schedule

Week	Dates	Topic	Lecturer	Slides	Scribe Notes
1	Wed, Sep 25	Introduction: Computing Distinct Values	Rajeev Motwani	Slides 1 (ppt)	Scribe 1 (ps, pdf)
2	Mon, Sep 30	Data Streams 1 (Sampling/Sketching/Synopses)	Rajeev Motwani Mayur Datar	Slides 2 (ppt) Slides 3 (ppt)	Scribe 2 (ps, pdf)
	Wed, Oct 2
3	Mon, Oct 7	Data Streams 2 (Histograms/Quantiles)	Rajeev Motwani Gurmeet Manku	Slides 4 (ppt) Slides 5 (ppt)	Scribe 4.1 (doc, pdf) Scribe 4.2 (ps, pdf)
	Wed, Oct 9
4	Mon, Oct 14	Association Rules	Rajeev Motwani Nina Mishra	Slides 6 (ppt) Slides 7 (ppt)	Scribe 6.1 (doc, pdf) Scribe 6.2 (ps, pdf) Scribe 7 (ps, pdf)
	Wed, Oct 16
5	Mon, Oct 21	Clustering	Nina Mishra	Slides 8 (ppt) Slides 9 (ppt)	Scribe 8 (doc, pdf) Scribe 9.1 (ps, pdf) Scribe 9.2 (doc, pdf)
	Wed, Oct 23
6	Mon, Oct 28	Machine Learning	Nina Mishra	Slides 10 (ppt) Slides 11 (ppt)	Scribe 10 (ps, pdf) Scribe 11 (ps, pdf)
	Wed, Oct 30
7	Mon, Nov 4	Nearest Neighbors and Similarity	Aris Gionis Rajeev Motwani	Slides 12 (ps, pdf) Slides 13 (ppt)	Scribe 12.1 (doc, pdf) Scribe 12.2 (doc, pdf) Scribe 13 (ps, pdf)
	Wed, Nov 6
8	Mon, Nov 11	External Memory Algorithms	Rajeev Motwani Kamesh Munagala	Slides 14 (ppt) Slides 15 (ppt)	Scribe 15 (doc, pdf)
	Wed, Nov 13
9	Mon, Nov 18	Web Graph and Link Analysis	Monika Henzinger Glen Jeh Taher Haveliwala	Slides 16 (ppt) Slides 17 (ppt)	Scribe 16 (doc, pdf) Scribe 16.2 (doc, ps) Scribe 17.1 (doc, pdf) Scribe 17.2 (doc, pdf) Scribe 17.3 (doc, pdf)
	Wed, Nov 20
10	Mon, Nov 25	Network Algorithms	Balaji Prabhakar	Slides 18 (ps) Slides 19 (ps)	Scribe 19.1 (ps, pdf) Scribe 19.2 (doc, pdf)
	Wed, Nov 27
11	Mon, Dec 2	Distributed Hashing and P2P Networks Removing Duplicates	Datar/Motwani Andrei Broder	Slides 20 (ppt)
	Wed, Dec 4

Reading List

Introduction: Computing Distinct Values - Rajeev Motwani

Towards Estimation Error Guarantees for Distinct Values, M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. PODS 2000.
Probabilistic counting algorithms for data base applications. P. Flajolet and G. N. Martin. JCSS 31, 2, 1985.
A Linear Time Probabilistic Counting Algorithm for Database Applications, K-Y. Whang, B. V. Zanden, and H. Taylor. TODS 15, 2, 1990.
The space complexity of approximating the frequency moments, N. Alon, Y. Matias, and M. Szegedy. STOC 1996.
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports, P. B. Gibbons. VLDB 2001.

Data Streams 1 (Sampling/Sketching/Synopses) - Rajeev Motwani/Mayur Datar

Introduction

Models and Issues in Data Stream Systems. Talk by Rajeev Motwani at SIGMOD/PODS 2002.
- Slides in PowerPoint
- Slides in pdf
Models and Issues in Data Stream Systems, B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. PODS 2002.
STREAM: STanford stREam datA Manager. Web page of the Stanford Stream Project (check out the relevant papers there).
Computing on Data Streams, M. Henzinger, P. Raghavan, and S. Rajagopalan. SRC Technical Note 1998-011.

Sampling

Random Sampling with a Reservoir, J. S. Vitter. Trans. on Mathematical Software 11(1):37-57 (1985).
Sampling from a Moving Window over Streaming Data, B. Babcock, M. Datar, and R. Motwani. SODA 2002.
On Random Sampling over Joins, S. Chaudhuri, R. Motwani, and V. Narasayya. SIGMOD 1999.
Towards Estimation Error Guarantees for Distinct Values, M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. PODS 2000.
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports, P. B. Gibbons. VLDB 2001.
Sampling algorithms: lower bounds and applications, Z. Bar-Yossef, S. Ravi Kumar, and D. Sivakumar. STOC 2001.

Sketching

Probabilistic counting algorithms for data base applications. P. Flajolet and G. N. Martin. JCSS 31, 2, 1985.
A Linear Time Probabilistic Counting Algorithm for Database Applications, K-Y. Whang, B. V. Zanden, and H. Taylor. TODS 15, 2, 1990.
The space complexity of approximating the frequency moments, N. Alon, Y. Matias, and M. Szegedy. STOC 1996.
Finding Frequent Items in Data Streams, M. Charikar, K. Chen, and M. Farach-Colton. ICALP 2002.
An Approximate L1-Difference Algorithm for Massive Data Streams, J. Feigenbaum, S. Kannan, M. Strauss, M. Viswanathan. FOCS 1999.
Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation, P. Indyk. FOCS 2000.

Synopses

Tracking Join and Self-Join Sizes in Limited Storage, N. Alon, P. Gibbons, Y. Matias and M. Szegedy. PODS 1999.

Join Synopses for Approximate Query Answering, S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. SIGMOD 1999.

Congressional Samples for Approximate Answering of Group-By Queries, S. Acharya, P. Gibbons, and V. Poosala. SIGMOD 2000.

Overcoming Limitations of Sampling for Aggregation Queries, S. Chaudhuri, G. Das, M. Datar, R. Motwani and V. Narasayya. ICDE 2001.
A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries, S. Chaudhuri, G. Das and V. Narasayya. SIGMOD 2001.
Synopsis data structures for massive data sets, P. B. Gibbons and Y. Matias, DIMACS 1999.

Data Streams 2 (Synopses/Algorithms) - Rajeev Motwani/Gurmeet Manku

Quantiles and Histograms

The New Jersey data reduction report, D. Barbara et al. IEEE Data Engineering Bulletin 1997.
Optimal Histograms with Quality Guarantees, H.V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel. VLDB 1998.
Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets, G. S. Manku, S. Rajagopalan, and B. G. Lindsay. SIGMOD 1999.
Space-efficient online computation of quantile summaries, M. Greenwald and S. Khanna. SIGMOD 2001.
Dynamic multidimensional histograms, N. Thaper, S. Guha, P. Indyk, and N. Koudas. SIGMOD 2002.
Fast, small-space algorithms for approximate histogram maintenance, A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. Strauss. STOC 2002.

Sliding Window Algorithms

Maintaining Stream Statistics Over Sliding Windows, M. Datar, A. Gionis, P. Indyk, R. Motwani. SODA 2002.
Distributed Streams Algorithms for Sliding Windows, P. Gibbons, S. Tirthapura. SPAA 2002.
Maintaining Variance and K-Medians over Data Stream Windows. B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan. Manuscript, 2002.

Association Rules - Rajeev Motwani/Nina Mishra

Association Rule Mining and Generalizations

Mining Associations between Sets of Items in Massive Databases, R. Agrawal, T. Imielinski, and A. Swami. SIGMOD 1993.
Fast Algorithms for Mining Association Rules, R. Agrawal and R. Srikant. VLDB 1994.
An Effective Hash-Based Algorithm for Mining Association Rules, J. S. Park, M.-S. Chen, and P. S. Yu. SIGMOD 1995.
An Efficient Algorithm for Mining Association Rules in Large Databases, A. Savasere, E. Omiecinski, and S. Navathe. The VLDB Journal 1995.
Sampling Large Databases for Association Rules, H. Toivonen. VLDB 1996.
Dynamic Itemset Counting and Implication Rules for Market Basket Data, S. Brin, R. Motwani, S. Tsur, and J.D. Ullman. SIGMOD 1997.
Query Flocks: A Generalization of Association-Rule Mining, D. Tsur, J.D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov and A. Rosenthal. SIGMOD 1998.
Finding Interesting Associations without Support Pruning, E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J.D. Ullman, and C. Yang. ICDE 2000.
Dynamic Miss-Counting Algorithms: Finding Implication and Similarity Rules with Confidence Pruning, S. Fujiwara, R. Motwani, and J.D. Ullman. ICDE 2000.

Combinatorics of Association Rules

New results on monotone dualization and generating hypergraph transversals, T. Eiter, G. Gottlob, and K. Makino. STOC 2002.
Efficient Read-Restricted Monotone CNF/DNF Dualization by Learning with Membership Queries, C. Domingo, N. Mishra, and L. Pitt. Machine Learning 37(1):89-110 (1999).
Identifying the Minimal Transversals of a Hypergraph and Related Problems, T. Eiter, and G. Gottlob. SIAM Journal on Computing 24(6): 1278-1304 (1995).
On the Complexity of Dualization of Monotone Disjunctive Normal Forms, M.L. Fredman and L. Khachiyan. Journal of Algorithms 21(3): 618-628 (1996).
Levelwise search and borders of theories in knowledge discovery, H. Mannila and H. Toivonen. Data Mining and Knowledge Discovery 1: 241-258 (1997).

Frequency Counting

Computing iceberg queries efficiently, M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani and J.D. Ullman. VLDB 1998.
Approximate Frequency Counts over Data Streams, G. Manku, and R. Motwani. VLDB 2002.

Clustering - Nina Mishra

Basic Clustering Algorithms

Clustering to minimize the maximum intercluster distance, T.F. Gonzalez. Theoretical Computer Science, 38: 293-306, (1985).
Criteria for Polynomial-Time (Conceptual) Clustering, L. Pitt and Reinke. Machine Learning 2:371-396 (1988).
Local Search Heuristics for k-median and Facility Location Problems, V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala and V. Pandit. STOC 2001.
Improved Combinatorial Algorithms for the Facility Location and k-Median Problems, M. Charikar and S. Guha. FOCS 1999.
Data Clustering: a review, A. K. Jain, M. N. Murty, and P. J. Flynn, ACM Computing Surveys, 31(3), 1999.

Clustering Large Data Sets and Streams

Incremental Clustering and Dynamic Information Retrieval, M. Charikar, C. Chekuri, T. Feder and R. Motwani. STOC 1997.
Clustering Data Streams, S. Guha, N. Mishra, R. Motwani and L. O'Callaghan. FOCS 2000.
Sublinear Time Approximate Clustering, N. Mishra, D. Oblinger, and L. Pitt. SODA 2001.
High-Performance Clustering of Streams and Large Data Sets, S. Guha, L. O'Callaghan, N. Mishra, A. Meyerson, and R. Motwani. ICDE 2002.

Database Clustering

Scaling Clustering Algorithms to Large Databases, P. Bradley, U. Fayyad, and C. Reina. KDD 1998.

CURE: An Efficient Clustering Algorithm for Large Databases, S. Guha, R. Rastogi, and K. Shim. SIGMOD 1998. Note: this PDF file requires a huge amount of temp space (over 200Mb).

Clustering Large Datasets in Arbitrary Metric Spaces, V. Ganti, R. Ramakrishnan, J. Gehrke, A. L. Powell, and J.C. French. ICDE 1999.

BIRCH: an efficient data clustering method for very large databases, T. Zhang, R. Ramakrishnan, and M. Livny. SIGMOD 1996.

Spectral Clustering

Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations, A. Frieze, R. Kannan, and S. Vempala, FOCS 1998.
Clustering in large graphs and matrices, P. Drineas, R. Kannan, A. Frieze, S. Vempala, and V. Vinay. SODA 1999.
On clusterings: good, bad and spectral, R. Kannan, S. Vempala, and A. Vetta. FOCS 2000.
Normalized Cuts and Image Segmentation, J. Shi and J. Malik. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905, August 2000.
On Spectral Clustering: Analysis and an algorithm, A. Y. Ng, M. Jordan, and Y. Weiss. NIPS 2000.
Optimal outlier removal in high-dimensional spaces, J. Dunagan and S. Vempala. STOC 2001.

Machine Learning - Nina Mishra

The Weighted Majority Algorithm, N. Littlestone and M.K. Warmuth. FOCS 1992.
Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm, N. Littlestone. Machine Learning, 2(4):285-318, 1988.
Machine Learning, T. Mitchell. McGraw Hill, 1997. (The website for the book has additional materials such as slides).
An Introduction to Computational Learning Theory, M. Kearns and U. Vazirani. MIT Press, 1994.

Similarity and Nearest Neighbors - Aris Gionis/Rajeev Motwani

Similarity

On the Resemblance and Containment of Documents, A. Broder. SEQUENCES 1997.
Syntactic Clustering of the Web, A. Broder, S. Glassman, M. Manasse, and G. Zweig, WWW6 1997.
Min-Wise Independent Permutations, A. Broder, M. Charikar, A. Frieze and M. Mitzenmacher, JCSS 60(3): 630-659 (2000).
Identifying and Filtering Near-Duplicate Documents, Andrei Broder. CPM 2000.
Finding Interesting Associations without Support Pruning, E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J.D. Ullman, and C. Yang. ICDE 2000.
Similarity Estimation Techniques from Rounding Algorithms, M. Charikar, STOC 2002.

Nearest Neighbors

Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality, P. Indyk and R. Motwani. STOC 1998.
A Replacement for Voronoi Diagrams of Near Linear Size, S. Har-Peled. FOCS 2001.
Similarity Search in High Dimensions via Hashing, A. Gionis, P. Indyk, and R. Motwani. VLDB 1999.

Random Projections

Extensions of Lipschitz maps into a Hilbert space, W. Johnson and J. Lindenstrauss. Contemp. Math., 26 (1984), 189-206.
The Johnson-Lindenstrauss lemma and the sphericity of some graphs, P. Frankl and H. Maehara. Journal of Combinatorial Theory Series B, 44:355-362, 1988.
An elementary proof of the Johnson-Lindenstrauss lemma, S. Dasgupta and Anupam Gupta. ICSI TR-99-006.
Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations, A. Frieze, R. Kannan, and S. Vempala. FOCS 1998.
Database-friendly Random Projections, D. Achlioptas, PODS 2001.

External Memory Algorithms - Kamesh Munagala

External Memory Algorithms and Data Structures, J. S. Vitter. DIMACS Series.
The input/output complexity of sorting and related problems, A. Aggarwal and J. Vitter. CACM 31(9), 1988.
External-Memory Graph Algorithms, Y.-J. Chiang, M. T. Goodrich, E. F. Grove, R. Tamassia, D. E. Vengroff, and J. S. Vitter. SODA 1995.
I/O-Complexity of Graph Algorithms, K. Munagala and A. Ranade. SODA 1999.

Web Graph and Link Analysis - Monika Henzinger/Glen Jeh/Taher Haveliwala

PageRank and Hubs-Authorities

The PageRank Citation Ranking: Bringing Order to the Web, L. Page, S. Brin, R. Motwani, T. Winograd. Stanford Digital Libraries Working Paper, 1998.
What can you do with a Web in your Pocket?, S. Brin, R. Motwani, L. Page, and T. Winograd, Bulletin of the Technical Committee on Data Engineering, 21(1998): 37-47.
Authoritative sources in a hyperlinked environment, J. Kleinberg. SODA 1998.

Personalized PageRank

Topic-Sensitive PageRank, Taher H. Haveliwala. WWW Conference, 2002.
Scaling Personalized Web Search, G. Jeh and J. Widom. Technical Report, 2002.

Network Algorithms - Balaji Prabhakar

Network Algorithms, A. Goel, N. McKeown, and B. Prabhakar. (PowerPoint slides of tutorial at SIGCOMM 2001).

Distributed Hashing and P2P Networks - Mayur Datar

Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web, D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, and R. Panigrahy. STOC 1997.
Web Caching and Consistent Hashing, D. Karger, A. Sherman, A. Berkheimer, B. Bogstad, R. Dhanidina, K. Iwamoto, B. Kim, L. Matkins, and Y. Yerushalmi. WWW 1999.
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, I. Stoica, R. Morris, D. Karger, M. Frans Kaashoek, and H. Balakrishnan. SIGCOMM 2001.
Analysis of the Evolution of Peer to Peer Systems, D. Liben-Nowell, H. Balakrishnan, and D. Karger.
A Scalable Content-Addressable Network, S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. SIGCOMM 2001.
Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems, A. Rowstron and P. Druschel. Middleware 2001.
Viceroy: A Scalable and Dynamic Emulation of the Butterfly, D. Malkhi, M. Naor, and D. Ratajczak. PODC 2002.
Tapestry: An Infrastructure for Fault-Tolerant Wide-Area Location and Routing, B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph, Berkeley Report UCB/CSD 01/1141 (2001).
Distributed Object Location in a Dynamic Network, K. Hildrum, J. D. Kubiatowicz, S. Rao, and B. Y. Zhao. SPAA 2002.
Accessing Nearby Copies of Replicated Objects in a Distributed Environment, C.G. Plaxton, R. Rajaraman, A. W. Richa. SPAA 1997.
Search and replication in unstructured peer-to-peer networks, Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. SIGMetrics 2002.

Additional Topics

Epidemics, Gossiping, and Rumor Mongering

Epidemic Algorithms for Replicated Database Maintenance. A.J. Demers, D.H. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H.E. Sturgis, D.C. Swinehart, and D.B. Terry. Operating Systems Review 22(1): 8-32 (1988).
Scalable and secure resource location, R. van Renesse. HICSS 2000.
Randomized Rumor Spreading, R. Karp, C. Schindelhauer, S. Shenker, and B. Vocking. FOCS 2000.
Spatial gossip and resource location protocols, D. Kempe, J. Kleinberg, and A. Demers. STOC 2001.

OLAP and Data Cubes

Implementing Data Cubes Efficiently, V. Harinarayan, A. Rajaraman, and J.D. Ullman. SIGMOD 1996.
Index Selection for OLAP, H. Gupta, V. Harinarayan, A. Rajaraman, and J.D. Ullman. ICDE 1997.
On the complexity of the view selection problem, H. Karloff and M. Mihail. PODS 1999.
Efficient Implementation of Data Cubes Via Materialized Views, J.D. Ullman. KDD 1996.

Fuzzy Information and Aggregation Algorithms

Comparing top k lists, R. Fagin, R. Kumar and D. Sivakumar. SODA 2003 (to appear).
Combining fuzzy information: an overview, R. Fagin. SIGMOD Record 31(2), June 2002, pp. 109-118.
Optimal aggregation algorithms for middleware, R. Fagin, A. Lotem, and M. Naor. PODS 2001.

Compression

Notes on Compression by Guy Blelloch from CMU.
Managing Gigabytes: Compressing and Indexing Documents and Images, I.H. Witten, A. Moffat and T. C. Bell, Morgan Kaufmann, 1999.
Data Compression: The Complete Reference, D. Salomon, Springer Verlag, 1998.
Introduction to Data Compression, Second Edition, K. Sayood, Morgan Kaufmann, 2000.
Compressed Bloom Filters, M. Mitzenmacher, PODC 2001.

Indexing and Searching

Managing Gigabytes: Compressing and Indexing Documents and Images, I.H. Witten, A. Moffat and T. C. Bell, Morgan Kaufmann, 1999.
Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto, Addison-Wesley, 1999.

Linear Algebra in Information Retrieval

Using Linear Algebra for Intelligent Information Retrieval, Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien. SIAM Review 37(4):573-595, 1995.
Latent Semantic Indexing: A Probabilistic Analysis, Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki and Santosh Vempala. PODS 1998.

Error-Correcting Codes

Michael Mitzenmacher's slides on codes (Shannon's theorem and introduction to error correcting codes: ps, Reed-Solomon Codes: ps)
Practical Loss-Resilient Codes, M. Luby, M. Mitzenmacher, A. Shokrollahi, D. Spielman, and V. Stemann, STOC 1997. (ps, pdf, slides(ppt))
Analysis of Random Processes via And-Or Tree Evaluation, M. Luby, M. Mitzenmacher, and A. Shokrollahi, SODA 1998. (ps, pdf)
Analysis of Low Density Codes and Improved Designs Using Irregular Graphs, M. Luby, M. Mitzenmacher, A. Shokrollahi, and D. Spielman, STOC 1998. (ps, pdf, slides(ppt))