CS362: Algorithmic Frontiers: Effective Algorithms for Large [and Small] Data



Wednesday, 6/4, come prepared to give a ~10-minute presentation on your final project, which will be due Tuesday, 6/10.



Instructor: Gregory Valiant (Office hours: Thursday 3-4:15, Gates 470. Send email to my last name at stanford dot edu).

Time/location: Mon/Wed 4:15-5:30, Building 60, room 120.

Course description: With the increasing role of datasets in science and society at large, the algorithmic question of how to efficiently extract desired information from these datasets has taken center stage. This course will explore some of the recent work from the Theory community that addresses this general question. There will be two main directions: algorithms for estimating properties/structure of the objects underlying the data, and algorithms that restructure the data so as to enable more efficient access to the information contained in the data. The first vein captures some of the challenge of trying to understand extremely large and complex objects---from social networks and gene interactions to complex distributions over enormous domains: in such settings, the data available often only represents a tiny fraction of the underlying object we hope to understand. What properties of graphs and distributions can be inferred from such sparse samples? The second vein will explore approaches to representing complex datasets via smaller objects that preserve key features of the original dataset. We will also cover a variety of efficient algorithms for computing useful functions of the data. The course will provide some background on the questions and tools considered, though the focus will be on the research frontier of this area, with an emphasis on developing a high-level understanding of the lay of the land of relevant techniques and open problems/directions for future work.

Topics: Distributional property testing and estimation, testing properties of graphs, the method of moments and tensor methods, ICA (independent component analysis), dictionary learning and population recovery, distance oracles, nearest neighbor search (including recent work on non-oblivious hashing), Fast JL transforms, compressed sensing/sparse recovery, fast sparse Fourier transforms, and a variety of tools from probability theory for tackling these topics. The topics covered will vary slightly in accordance with the interests of the class.

Prerequisites: Some background in probability theory, linear algebra, and algorithms will be crucial. A high level of mathematical maturity will be assumed.

Evaluation: In-class discussions will be an important component of this class, and hence attendance will be expected. Students may also be expected to give an in-class presentation or complete scribe notes for a lecture, in addition to a final research project.



Lecture 1 (3/31): Introduction/course overview.

Lecture 2 (4/2): Birth of distributional property testing. Summary of what is known w.r.t. learning arbitrary distributions over support size n, and testing "identity" w.r.t. L1 and L2 distances, both in the case where one of the distributions is known (i.e. given a description of p and samples from an unknown q, decide whether p=q or p,q are far) and in the case where both distributions are unknown and samples are drawn from each.
References: testing the uniform distribution (motivated by testing expansion in graphs): [Goldreich and Ron '00], instance optimal testing (tight bounds on number of samples required as a function of the distribution p, as opposed to worst-case bounds in terms of an upper bound on support size): [VV'14], Batu et al. O(n^(2/3)) algorithmic result for the two unknown distribution case [BFRSW'10].
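A toy illustration (Python) of the classic collision-based approach to uniformity testing, in the spirit of [Goldreich and Ron '00]; the threshold and constants below are illustrative rather than the optimized ones from lecture:

import random
from itertools import combinations

def collision_uniformity_test(samples, n, eps=0.5):
    """Crude tester: the collision probability sum_i p_i^2 equals 1/n for the
    uniform distribution over [n], and is at least (1 + eps^2)/n for any p that
    is eps-far from uniform in L1 (via Cauchy-Schwarz).  Estimate it from the
    pairwise collisions among the samples and threshold at the midpoint."""
    m = len(samples)
    collisions = sum(1 for x, y in combinations(samples, 2) if x == y)
    collision_rate = collisions / (m * (m - 1) / 2)
    return "looks uniform" if collision_rate <= (1 + eps**2 / 2) / n else "far from uniform"

n, m = 1000, 2000
uniform_samples = [random.randrange(n) for _ in range(m)]
skewed_samples = [random.randrange(n // 2) for _ in range(m)]   # supported on half the domain
print(collision_uniformity_test(uniform_samples, n))   # looks uniform
print(collision_uniformity_test(skewed_samples, n))    # far from uniform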

Lecture 3 (4/7): Algorithmic characterization/proof approach for generalizations of Holder/Lp monotonicity inequalities, which frequently arise when analyzing the performance of various testers/estimators via Chebyshev's inequality.
References: first part of the paper [VV'14].
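To give the flavor of the kind of inequality at play (a simple special case, not the generalization from [VV'14]): applying Chebyshev's inequality to a collision-type statistic X requires showing Var[X] << E[X]^2, and the variance typically involves higher moments of p; monotonicity facts such as sum_i p_i^3 <= (sum_i p_i^2)^(3/2), i.e. ||p||_3 <= ||p||_2, then let these higher-moment terms be controlled by powers of the quantity being estimated.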

Lecture 4 (4/9): Proof of the Omega(n^(2/3)) lower bound for testing identity between two unknown distributions (given sample access to both), and a quick proof of the folklore O(n/eps^2) sample complexity for learning an unknown distribution of support size at most n. The sketch of the proof of the n^(2/3) lower bound is as follows: define p,q to each have n^(2/3) 'heavy' elements and O(n) 'light' elements, where the sets of light elements are disjoint in the case that ||p-q|| = 1/4, and are equal in the case that p=q. The analysis proceeds by first noting that we can assume we are given a Poisson-distributed number of samples, which makes the number of occurrences of each domain element independent. In this case, the distribution over the number of elements seen i times from p and j times from q is a "generalized multinomial" distribution, and hence can be approximated by a multivariate Poisson distribution, using a theorem of Roos. The independence of the different coordinates then lets one bound the total variation distance between the distribution of the samples in the p=q case and in the ||p-q|| = 1/4 case by analyzing each coordinate separately.
References: Roos' theorem comparing generalized multinomial distributions to the multivariate Poisson distribution [Roos'99]
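A quick numerical sanity check (Python) of the folklore learning bound: the empirical distribution of m = O(n/eps^2) samples is within L1 distance eps of the unknown distribution with high probability (the constant 4 below is illustrative, not optimized):

import numpy as np

rng = np.random.default_rng(0)

def empirical_learner_demo(n=500, eps=0.1, trials=20):
    m = int(4 * n / eps**2)                      # O(n/eps^2) samples
    errors = []
    for _ in range(trials):
        p = rng.dirichlet(np.ones(n))            # an arbitrary unknown distribution over [n]
        counts = rng.multinomial(m, p)           # m i.i.d. samples, tallied
        p_hat = counts / m                       # the empirical distribution
        errors.append(np.abs(p - p_hat).sum())   # L1 distance to the truth
    print(f"m = {m}: max L1 error over {trials} trials = {max(errors):.3f} (target eps = {eps})")

empirical_learner_demo()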

Lecture 5 (4/14): Estimating the "histogram" of a distribution (the set of probability values of the elements), from which properties such as entropy, support size, and (via the two-distribution generalization of the "histogram") distance metrics between distributions can all be estimated; we sketched the algorithm and proof approach for the (tight) O(n/log n) bound on the sample complexity.
References: this paper ['13] has both the theory and some actual implementation results.
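A small illustration (Python): computing the "fingerprint" of a sample (the number of domain elements seen exactly i times), which is the only information such estimators need, together with the naive "plug-in" entropy estimate, whose bias when the sample is smaller than the support is what the O(n/log n)-sample estimators correct for:

import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

def fingerprint(samples):
    # F[i] = number of distinct domain elements appearing exactly i times;
    # symmetric properties (entropy, support size, ...) depend on the sample only through F.
    return dict(sorted(Counter(Counter(samples).values()).items()))

def plugin_entropy(samples):
    # entropy of the empirical distribution -- badly biased when #samples <~ support size
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p_hat = counts / counts.sum()
    return -(p_hat * np.log(p_hat)).sum()

n, m = 10_000, 2_000                            # support size much larger than sample size
samples = rng.integers(0, n, size=m).tolist()   # uniform over [n]
print("fingerprint:", fingerprint(samples))
print(f"plug-in entropy = {plugin_entropy(samples):.2f} nats vs. true entropy = {np.log(n):.2f} nats")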

Lecture 6 (4/16): A brief intro to Stein's method, and some general approaches to transferring earthmover distance bounds to L1 distance bounds. (And a "second-moment" analog of Roos' theorem characterizing generalized multinomial distributions via their column expectations.)
References: For a thorough treatment, check out Chen and Shao's book/lecture notes on Stein's method for normal approximation; a free pre-print is available here.
Scribes: Hamsa and Osbert. Scribe notes here.
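A quick Monte Carlo check (Python) of the identity at the heart of Stein's method for the normal -- E[f'(X)] = E[X f(X)] for smooth f when X ~ N(0,1) -- which holds for the Gaussian and visibly fails for a non-Gaussian distribution with the same mean and variance (just the starting point of the method, not the earthmover-to-L1 machinery from lecture):

import numpy as np

rng = np.random.default_rng(2)
m = 1_000_000
f, fprime = np.sin, np.cos                           # any smooth test function works

gaussian = rng.standard_normal(m)
uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), m)    # mean 0, variance 1, but not normal

for name, x in [("N(0,1) ", gaussian), ("uniform", uniform)]:
    gap = np.mean(fprime(x)) - np.mean(x * f(x))
    print(f"{name}: E[f'(X)] - E[X f(X)] = {gap:+.4f}")   # ~0 only for the Gaussian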

Lecture 7 (4/21): An intro to the method of moments, via the progression of work on learning mixtures of Gaussians.
Scribe: Katelyn. Scribe notes here.
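Not the Gaussian-mixture algorithms themselves, but a toy 1-D instance of the method of moments (Python): recovering a mixture of two point masses, w*delta_a + (1-w)*delta_b, exactly from its first three moments, which shows the basic mechanic of setting moments equal to their parametric expressions and solving:

import numpy as np

# If a and b are the roots of x^2 + c1*x + c0, then m_{k+2} + c1*m_{k+1} + c0*m_k = 0
# for k = 0, 1 (with m_0 = 1), giving a 2x2 linear system for (c1, c0).
def recover_two_point_mixture(m1, m2, m3):
    c1, c0 = np.linalg.solve([[m1, 1.0], [m2, m1]], [-m2, -m3])
    a, b = np.roots([1.0, c1, c0])
    w = (m1 - b) / (a - b)                 # solve w*a + (1-w)*b = m1 for the weight
    return w, a, b

w, a, b = 0.3, 2.0, -1.0                   # ground truth
moments = [w * a**k + (1 - w) * b**k for k in (1, 2, 3)]
print(recover_two_point_mixture(*moments)) # recovers (0.3, 2.0, -1.0) up to reordering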

Lecture 8 (4/23): Tensor Methods!
Scribes: Michael and Yongxing. Scribe notes here.
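As a concrete starting point (Python/numpy), here is a compact sketch of Jennrich's simultaneous-diagonalization idea for decomposing a third-order tensor with linearly independent factors; this is one of the basic tools in this area, written for exact low-rank tensors with no attention to noise robustness:

import numpy as np

rng = np.random.default_rng(3)

def jennrich(T, k):
    # T = sum_r a_r (x) b_r (x) c_r with the a_r, b_r linearly independent and the c_r generic.
    # Contract the third mode with two random vectors; the a_r are the eigenvectors of M_x pinv(M_y).
    x, y = rng.standard_normal(T.shape[2]), rng.standard_normal(T.shape[2])
    M_x = np.einsum('ijl,l->ij', T, x)     # = sum_r <c_r,x> a_r b_r^T
    M_y = np.einsum('ijl,l->ij', T, y)
    eigvals, eigvecs = np.linalg.eig(M_x @ np.linalg.pinv(M_y))
    top = np.argsort(-np.abs(eigvals))[:k] # the k nonzero eigenvalues <c_r,x>/<c_r,y>
    return np.real(eigvecs[:, top])        # columns are the a_r, up to scaling/permutation

d, k = 8, 3
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))
T = np.einsum('ir,jr,lr->ijl', A, B, C)    # synthetic rank-3 tensor
A_hat = jennrich(T, k)
cosines = np.abs((A / np.linalg.norm(A, axis=0)).T @ (A_hat / np.linalg.norm(A_hat, axis=0)))
print(np.round(cosines.max(axis=1), 4))    # each entry ~1.0: every true factor is recovered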

Lecture 9 (4/28): Tensor Methods, part 2: the blessing of dimensionality.
Scribe: Jason. Scribe notes coming soon....

Lecture 10 (4/30): Independent Component Analysis.
Scribe: Bhaswar. Scribe notes coming soon....

Lecture 11 (5/5): Thorup/Zwick Distance Oracles.

Lecture 12 (5/7): Characterizing graphs: Szemeredi's regularity lemma, and Newman/Sohler's characterization of bounded degree planar/hyperfinite graphs.

Lecture 13 (5/12): Intro to LSH, and the spherical bucketing trick of Andoni/Indyk'06 for the L2 metric.
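For reference, a minimal Python sketch of the most basic LSH family for the L2 metric (random projection plus quantization, in the style of the p-stable scheme of Datar et al.); this is NOT the spherical bucketing scheme from lecture, just a way to see how nearby points collide more often than far ones:

import numpy as np

rng = np.random.default_rng(4)

class L2LSH:
    # h(v) = floor((g.v + b) / w) for a random Gaussian g and a random offset b
    def __init__(self, dim, num_hashes=4, w=1.0):
        self.G = rng.standard_normal((num_hashes, dim))
        self.b = rng.uniform(0, w, size=num_hashes)
        self.w = w

    def hash(self, v):
        return tuple(np.floor((self.G @ v + self.b) / self.w).astype(int))

dim, trials = 50, 200
p = rng.standard_normal(dim)
near, far = p + 0.05 * rng.standard_normal(dim), p + 2.0 * rng.standard_normal(dim)
near_hits = far_hits = 0
for _ in range(trials):
    lsh = L2LSH(dim)                       # a fresh random hash function each trial
    near_hits += lsh.hash(p) == lsh.hash(near)
    far_hits += lsh.hash(p) == lsh.hash(far)
print(f"collision rate: near pair {near_hits/trials:.2f}, far pair {far_hits/trials:.2f}")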

Lecture 14 (5/14): "Beyond LSH"--guest lecture by Alexandr Andoni from MSR!!
Abstract: Locality Sensitive Hashing (LSH) has emerged as a powerful tool for solving the Nearest Neighbor Search problem in high-dimensional spaces. The original scheme, introduced in 1998 by Indyk and Motwani, works for the Hamming space and was later proven to be the optimal LSH algorithm.

We present a new data structure for approximate nearest neighbor search improving the query time Q of Indyk-Motwani to essentially Q^{7/8}. (In technical terms, Q~=n^{1/c} for n points and approximation c.) Thus, this is not only the first improvement over the original LSH scheme, but it in fact circumvents the LSH lower bound itself!

Unlike previous algorithms, the new algorithm considers *data-dependent* hashing of the space, showing one can gain from a careful exploitation of the geometry of (any worst-case) dataset at hand. We expect that this technique will lead to more improvements in the future.

(Joint work with Piotr Indyk, Huy Nguyen, and Ilya Razenshteyn.)

Lecture 15 (5/19): Finding correlations and close points:
Today, we will begin by talking about the closely related question of finding the closest pair of vectors from among a set of vectors. We will see a very different approach to this problem that does NOT proceed by hashing. We will then see some connections between this question and the question of learning "juntas"--namely, learning a function that only depends on a small set of "relevant" variables, where the bulk of the challenge is in figuring out which variables are relevant. [For example, imagine one has lots of human genomes and wants to determine a small set (say, 10 bases) that largely determines whether or not someone has a given condition.] This question of learning juntas, and the related question of "learning parity with noise", are some of my favorite largely open research questions....
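To make the junta/parity-with-noise problem concrete, here is a tiny Python illustration: labels are the XOR of a hidden set of k out of n bits, flipped with probability eta, and the hidden set is recovered by brute force over all ~n^k candidate subsets -- exactly the exhaustive-search cost that the algorithms discussed in lecture aim to beat:

import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)

def gen_noisy_parity(n, secret, m, eta=0.1):
    # m examples (x, y): x uniform in {0,1}^n, y = parity of x on the secret coordinates,
    # flipped independently with probability eta
    X = rng.integers(0, 2, size=(m, n))
    y = (X[:, secret].sum(axis=1) % 2) ^ (rng.random(m) < eta)
    return X, y

def brute_force_parity(X, y, k):
    # try every size-k subset; return the one whose parity best agrees with the labels
    n = X.shape[1]
    return max(combinations(range(n), k),
               key=lambda S: np.mean((X[:, list(S)].sum(axis=1) % 2) == y))

n, k, m = 20, 3, 2000
secret = sorted(rng.choice(n, size=k, replace=False).tolist())
X, y = gen_noisy_parity(n, secret, m)
print("secret:", secret, " recovered:", list(brute_force_parity(X, y, k)))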

Lecture 16 (5/21): More on learning parity with noise and learning juntas.

Lecture 17 (5/28): Matrix Completion: Guest lecture by Moritz Hardt from IBM Almaden!!
Abstract: Alternating minimization is a popular non-convex heuristic for matrix completion and related matrix factorization problems. This lecture is a gentle introduction to recent theoretical advances in understanding alternating minimization.
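A bare-bones numpy sketch of the heuristic in question -- alternating least squares for matrix completion on a synthetic low-rank matrix (no regularization, incoherence checks, or careful initialization; purely to fix ideas, not the analyzed variants from the lecture):

import numpy as np

rng = np.random.default_rng(6)

def als_complete(M_obs, mask, rank, iters=50):
    # model M ~ U V^T; alternately solve least squares for each row of U (V fixed)
    # and each row of V (U fixed), using only the observed entries (mask == True)
    m, n = M_obs.shape
    U = rng.standard_normal((m, rank))
    V = rng.standard_normal((n, rank))
    for _ in range(iters):
        for i in range(m):
            U[i] = np.linalg.lstsq(V[mask[i]], M_obs[i, mask[i]], rcond=None)[0]
        for j in range(n):
            V[j] = np.linalg.lstsq(U[mask[:, j]], M_obs[mask[:, j], j], rcond=None)[0]
    return U @ V.T

m, n, r = 60, 50, 3
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # ground-truth rank-3 matrix
mask = rng.random((m, n)) < 0.5                                 # observe ~half the entries
M_hat = als_complete(np.where(mask, M, 0.0), mask, rank=r)
print("relative error on the unobserved entries:",
      np.linalg.norm((M - M_hat)[~mask]) / np.linalg.norm(M[~mask]))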

Lecture 18 (6/2): Fast Dimension Reduction (Fast Johnson-Lindenstrauss).
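One standard instantiation, sketched in Python with illustrative parameters: the subsampled randomized Hadamard transform (random sign flips, a fast Walsh-Hadamard transform, then coordinate subsampling), which approximately preserves L2 norms while taking O(d log d) time per vector; the dimension d must be a power of 2 here:

import numpy as np

rng = np.random.default_rng(7)

def fwht(x):
    # iterative fast Walsh-Hadamard transform, O(d log d), for d a power of 2
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))             # orthonormal normalization

def srht(x, m):
    # x -> sqrt(d/m) * (subsample m coords of H D x): D = random signs, H = normalized Hadamard
    d = len(x)
    signs = rng.choice([-1.0, 1.0], size=d)
    coords = rng.choice(d, size=m, replace=False)
    return np.sqrt(d / m) * fwht(signs * x)[coords]

d, m = 1024, 64
x = rng.standard_normal(d)
y = srht(x, m)
print(f"||x|| = {np.linalg.norm(x):.2f}   ||SHDx|| = {np.linalg.norm(y):.2f}")   # approximately equal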

Lecture 19 (6/4): Final Project Presentations!!!