The cluster structure function

Paper title

The cluster structure function

Authors

Cohen, Andrew R.; Vitányi, Paul M. B.

Abstract

Among the partitions of a data set into a given number of parts, there is one in which every part is, as far as possible, a good model (an "algorithmic sufficient statistic") for the data in that part. Since this can be done for every number of parts between one and the number of data items, the result is a function, the cluster structure function. It maps the number of parts of a partition to values related to how far the parts fall short of being good models. This function starts at a value of at least zero when the data set is not partitioned and descends to zero for the partition of the data set into singleton parts. The optimal clustering is the one chosen to minimize the cluster structure function. The theory behind the method is expressed in algorithmic information theory (Kolmogorov complexity). In practice, the Kolmogorov complexities involved are approximated by a concrete compressor. We give examples using real data sets: the MNIST handwritten digits and the segmentation of real cells as used in stem cell research.
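
The abstract says the Kolmogorov complexities are approximated by a concrete compressor, but it does not name the compressor or give the formulas. As a rough illustration only, the sketch below uses Python's zlib as a stand-in compressor. The helper names `approx_K` and `approx_K_cond` are hypothetical. The conditional estimate C(yx) - C(y) is a common heuristic in compression-based methods and is assumed here, not taken from the paper. The sketch does not compute the cluster structure function itself.

```python
import random
import zlib


def approx_K(data: bytes) -> int:
    """Compressed length in bytes: a computable stand-in for K(data).

    Assumption: zlib at level 9 plays the role of the "concrete compressor";
    the paper's actual compressor is not stated in the abstract.
    """
    return len(zlib.compress(data, 9))


def approx_K_cond(x: bytes, y: bytes) -> int:
    """Heuristic stand-in for K(x | y): extra bytes needed for x once y is seen.

    Assumption: uses the common estimate C(y + x) - C(y), clamped at zero.
    """
    return max(0, approx_K(y + x) - approx_K(y))


if __name__ == "__main__":
    rng = random.Random(0)
    regular = b"ab" * 500                                    # highly regular data
    noisy = bytes(rng.getrandbits(8) for _ in range(1000))   # nearly incompressible

    print("regular:", approx_K(regular))
    print("noisy:", approx_K(noisy))
    print("noisy given itself:", approx_K_cond(noisy, noisy))
```

Under these assumptions, regular data gets a much smaller estimate than noisy data of the same length, and an object is cheap to describe when a copy of it is already given. Real compressors have finite windows and fixed overheads, so such estimates are only approximations of the underlying complexities.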
