Paper Title
No Repetition: Fast Streaming with Highly Concentrated Hashing
Paper Authors
Paper Abstract
To get estimators that work within a certain error bound with high probability, a common strategy is to design one that works with constant probability, and then boost the probability using independent repetitions. Important examples of this approach are small-space algorithms for estimating the number of distinct elements in a stream, or estimating the set similarity between large sets. Using standard strongly universal hashing to process each element, we get a sketch-based estimator where the probability of too large an error is, say, 1/4. By performing $r$ independent repetitions and taking the median of the estimators, the error probability falls exponentially in $r$. However, running $r$ independent experiments increases the processing time by a factor $r$. Here we make the point that if we have a hash function with strong concentration bounds, then we get the same high-probability bounds without any need for repetitions. Instead of $r$ independent sketches, we have a single sketch that is $r$ times bigger, so the total space is the same. However, we only apply a single hash function, so we save a factor $r$ in time, and the overall algorithms just get simpler. Fast practical hash functions with strong concentration bounds were recently proposed by Aamand et al. (to appear in STOC 2020). Using their hashing schemes, the algorithms thus become very fast and practical, suitable for online processing of high-volume data streams.
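To make the two strategies concrete, here is a minimal Python sketch of distinct-elements estimation with a KMV (k-th minimum value) estimator, contrasting median-of-$r$ repetitions against a single sketch that is $r$ times bigger. Everything below is illustrative rather than the paper's implementation: the helper names `make_hash` and `kmv_estimate` and the parameters $r = 9$, $k = 64$ are our own choices, and the toy modular hash merely stands in for the strongly universal (baseline) and strongly concentrated (single-sketch) hashing the abstract refers to.

```python
import heapq
import random
import statistics

MERSENNE_P = (1 << 61) - 1  # Mersenne prime 2^61 - 1

def make_hash(seed):
    # Toy modular hash mapping integer keys into (0, 1).
    # Illustration only: the paper assumes strongly universal hashing
    # for the baseline, and the strongly concentrated hashing of
    # Aamand et al. (STOC 2020) for the single-sketch variant.
    rng = random.Random(seed)
    a = rng.randrange(1, MERSENNE_P)
    b = rng.randrange(MERSENNE_P)
    return lambda x: ((a * x + b) % MERSENNE_P + 1) / (MERSENNE_P + 2)

def kmv_estimate(stream, h, k):
    # KMV (k-th minimum value) estimator: keep the k smallest distinct
    # hash values seen; if the k-th smallest is v, estimate the number
    # of distinct elements as (k - 1) / v.
    # (Assumes the stream has at least k distinct elements.)
    tracked = set()   # the k smallest hash values seen so far
    heap = []         # max-heap (negated values) over `tracked`
    for x in stream:
        v = h(x)
        if v in tracked:
            continue                      # duplicate of a tracked value
        if len(heap) < k:
            tracked.add(v)
            heapq.heappush(heap, -v)
        elif v < -heap[0]:                # beats the current k-th smallest
            tracked.discard(-heapq.heappushpop(heap, -v))
            tracked.add(v)
    return (k - 1) / -heap[0]

stream = [random.randrange(1_000_000) for _ in range(200_000)]
r, k = 9, 64

# Strategy 1 (baseline): r independent sketches of size k, each with its
# own hash function; the median boosts a constant success probability.
median_est = statistics.median(
    kmv_estimate(stream, make_hash(i), k) for i in range(r))

# Strategy 2 (this paper): one sketch of size r * k and a single hash
# function. With strongly concentrated hashing, this achieves the same
# high-probability error bound while hashing each element only once.
single_est = kmv_estimate(stream, make_hash(1234), r * k)

print(len(set(stream)), round(median_est), round(single_est))
```

Note that both variants store $r \cdot k$ hash values in total, so the space is the same; the saving is that the single-sketch variant evaluates one hash function per stream element instead of $r$.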