For a broad class of practically relevant distribution properties, which includes entropy and support size, nearly all of the proposed estimators have an especially simple form. Given a set of independent samples from a discrete distribution, these estimators tally the vector of summary statistics---the number of domain elements seen once, twice, etc. in the sample---and output the dot product between this vector and a fixed vector of coefficients. We term such estimators linear.
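To make this form concrete, here is a minimal sketch (with hypothetical helper names; this is not code from the paper) that tallies the fingerprint of a sample and evaluates a linear estimator as a dot product with a given coefficient vector:

```python
from collections import Counter

def fingerprint(samples):
    """Fingerprint of a sample: fp[i] = number of distinct domain
    elements that appear exactly i times in the sample."""
    multiplicities = Counter(samples)        # element -> # occurrences
    return Counter(multiplicities.values())  # # occurrences -> # elements

def linear_estimate(samples, coeffs):
    """Evaluate a linear estimator: the dot product of the sample's
    fingerprint with a fixed mapping coeffs[i] from multiplicity i
    to coefficient (coeffs must cover every multiplicity that occurs)."""
    return sum(coeffs[i] * count for i, count in fingerprint(samples).items())
```

For instance, taking coeffs[i] = 1 for every i makes linear_estimate simply count the distinct elements observed, which is the naive estimate of support size.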
This historical proclivity towards linear estimators is slightly perplexing since, despite many efforts spanning nearly 60 years, all such proposed estimators have convergence rates that are significantly suboptimal compared to the bounds shown in [Valiant and Valiant, 2011].
Our main result, in some sense vindicating this insistence on linear estimators, is that for any property in this broad class, there exists a near-optimal linear estimator. Additionally, we give a practical and polynomial-time algorithm for constructing such estimators for any given parameters.
While this result does not yield explicit bounds on the sample complexities of these estimation tasks, we leverage the insights it provides to give explicit constructions of near-optimal linear estimators for three properties: entropy, L1 distance to uniformity, and, for pairs of distributions, L1 distance.
Our entropy estimator, when given O(n/(c log n)) independent samples from a distribution of support at most n, will estimate the entropy of the distribution to within additive accuracy c, with probability of failure o(1/poly(n)). By the recent lower bounds of [Valiant and Valiant, 2011], this estimator is optimal, up to constant factors, both in its dependence on n and in its dependence on c. In particular, the inverse-linear convergence rate of this estimator resolves the main open question of [Valiant and Valiant, 2011], which left open the possibility that the error decays only as the inverse square root of the number of samples.
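For contrast (an illustrative baseline of ours, not the estimator constructed in the paper), the standard plug-in entropy estimate is itself linear in the fingerprint: an element seen i times out of k samples is assigned empirical probability i/k and contributes -(i/k) log(i/k), so its coefficients can be written down directly:

```python
import math

def plugin_entropy_coeffs(k):
    """Coefficients of the naive plug-in entropy estimator on k samples:
    an element observed i times contributes -(i/k) * log(i/k) (in nats).
    The near-optimal estimator described above has the same linear form
    but different, carefully constructed coefficients."""
    return {i: -(i / k) * math.log(i / k) for i in range(1, k + 1)}
```

With the helpers above, linear_estimate(samples, plugin_entropy_coeffs(len(samples))) computes the plug-in estimate; that estimate is only accurate once the sample size is roughly linear in the support size n, whereas the estimator above needs only O(n/(c log n)) samples.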
Our distance-to-uniformity estimator, when given O(m/(c^2 log m)) independent samples from any distribution, returns a c-accurate estimate of the L1 distance to the uniform distribution of support m. This is the first sublinear-sample estimator for this problem, and it is constant-factor optimal for constant c.
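Again purely for illustration (not the paper's estimator), the plug-in estimate of distance to uniformity also fits the linear-estimator mold, up to an additive term accounting for the unseen elements:

```python
def plugin_dist_to_uniform(samples, m):
    """Plug-in estimate of the L1 distance between the empirical
    distribution of the sample and the uniform distribution on a domain
    of m elements (the samples are assumed to lie in that domain):
    each element seen i times contributes |i/k - 1/m|, and each of the
    m - (# distinct elements seen) unseen elements contributes 1/m."""
    k = len(samples)
    fp = fingerprint(samples)
    unseen = m - sum(fp.values())
    return sum(count * abs(i / k - 1 / m) for i, count in fp.items()) + unseen / m
```

The plug-in estimate is only reliable once most of the domain has been observed, which is what makes the sublinear sample complexity above notable.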
Finally, our framework extends naturally to properties of pairs of distributions, including estimating the L1 distance and KL-divergence between pairs of distributions. We give an explicit linear estimator for estimating the L1 distance to accuracy c using O(n/(c^2 log n)) samples from each distribution, which is constant-factor optimal for constant c.
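The two-distribution analogue replaces the fingerprint with a joint fingerprint F[(i, j)], the number of domain elements seen i times in the first sample and j times in the second; a linear estimator is then a dot product with coefficients indexed by such pairs. A minimal sketch of the plug-in L1 distance in this form (again an illustrative baseline, not the paper's near-optimal estimator):

```python
def joint_fingerprint(samples1, samples2):
    """Joint fingerprint: jf[(i, j)] = number of distinct elements seen
    i times in samples1 and j times in samples2 (with i + j >= 1)."""
    m1, m2 = Counter(samples1), Counter(samples2)
    jf = Counter()
    for x in set(m1) | set(m2):
        jf[(m1[x], m2[x])] += 1  # Counter returns 0 for missing keys
    return jf

def plugin_l1_distance(samples1, samples2):
    """Plug-in estimate of the L1 distance between the two empirical
    distributions, written as a linear function of the joint fingerprint:
    each element seen (i, j) times contributes |i/k1 - j/k2|."""
    k1, k2 = len(samples1), len(samples2)
    jf = joint_fingerprint(samples1, samples2)
    return sum(count * abs(i / k1 - j / k2) for (i, j), count in jf.items())
```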