Paper Title
Bias Reduction for Sum Estimation
Paper Authors
Paper Abstract
In classical statistics and distribution testing, it is often assumed that elements can be sampled from some distribution $P$, and that when an element $x$ is sampled, the probability $P(x)$ of sampling $x$ is also known. Recent work in distribution testing has shown that many algorithms are robust in the sense that they still produce correct output if the elements are drawn from any distribution $Q$ that is sufficiently close to $P$. This phenomenon raises interesting questions: under what conditions is a "noisy" distribution $Q$ sufficient, and what is the algorithmic cost of coping with this noise? We investigate these questions for the problem of estimating the sum of a multiset of $N$ real values $x_1, \ldots, x_N$. This problem is well studied in the statistical literature in the case $P = Q$, where the Hansen-Hurwitz estimator is frequently used. We assume that for some known distribution $P$, values are sampled from a distribution $Q$ that is pointwise $\gamma$-close to $P$. For every positive integer $k$, we define an estimator $\zeta_k$ for $\mu = \sum_i x_i$ whose bias is proportional to $\gamma^k$ (where our $\zeta_1$ reduces to the classical Hansen-Hurwitz estimator). As a special case, we show that if $Q$ is pointwise $\gamma$-close to uniform and all $x_i \in \{0, 1\}$, then for any $\epsilon > 0$, we can estimate $\mu$ to within additive error $\epsilon N$ using $m = \Theta\left(N^{1-\frac{1}{k}} / \epsilon^{2/k}\right)$ samples, where $k = \left\lceil (\log \epsilon)/(\log \gamma) \right\rceil$. We show that this sample complexity is essentially optimal. Our bounds show that the sample complexity need not vary uniformly with the desired error parameter $\epsilon$: for some values of $\epsilon$, perturbations in its value have no asymptotic effect on the sample complexity, while for other values, any decrease in its value results in an asymptotically larger sample complexity.
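The abstract centers on the classical Hansen-Hurwitz estimator $\zeta_1$, which averages the inverse-probability-weighted values, $\hat{\mu} = \frac{1}{m}\sum_{j=1}^{m} x_{s_j}/P(s_j)$. The minimal Python sketch below (not code from the paper; the function name, the perturbed distribution `Q`, and all parameter values are illustrative assumptions) demonstrates the bias that arises when samples come from a distribution $Q$ that is pointwise $\gamma$-close to the assumed $P$, and evaluates the paper's bound $k = \lceil (\log \epsilon)/(\log \gamma) \rceil$ for the toy parameters:

```python
import math
import random

def hansen_hurwitz(values, P, samples):
    # Classical Hansen-Hurwitz estimator (the paper's zeta_1):
    # average the inverse-probability-weighted values x_s / P(s).
    # It is unbiased when samples truly come from P; sampling from a
    # nearby Q instead introduces a bias proportional to gamma.
    return sum(values[s] / P[s] for s in samples) / len(samples)

N, m, gamma = 1000, 5000, 0.1
# All-ones first half: correlating the values with the perturbation
# below makes the bias of zeta_1 clearly visible.
values = [1] * (N // 2) + [0] * (N // 2)
P = [1.0 / N] * N  # known (here: uniform) design distribution
# Q is pointwise gamma-close to P: Q(i)/P(i) lies in [1 - gamma, 1 + gamma].
Q = [(1 + gamma) / N] * (N // 2) + [(1 - gamma) / N] * (N // 2)

samples = random.choices(range(N), weights=Q, k=m)
print("true sum:", sum(values))                         # 500
print("zeta_1  :", hansen_hurwitz(values, P, samples))  # concentrates near 550

eps = 0.02
k = math.ceil(math.log(eps) / math.log(gamma))  # order of estimator needed; here k = 2
print("k =", k, "; sample complexity ~", N ** (1 - 1 / k) / eps ** (2 / k))
```

With these toy values the estimate concentrates near $(1 + \gamma)\,\mu = 550$ rather than the true $\mu = 500$: exactly the order-$\gamma$ bias that the paper's higher-order estimators $\zeta_k$ are designed to suppress to order $\gamma^k$.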