Paper title
Visualizing the Finer Cluster Structure of Large-Scale and High-Dimensional Data
Paper authors
Paper abstract
Dimension reduction and visualization of high-dimensional data have become very important research topics because of the rapid growth of large databases in data science. In this paper, we propose using a generalized sigmoid function to model the distance similarity in both high- and low-dimensional spaces. In particular, the parameter b is introduced to the generalized sigmoid function in low-dimensional space, so that we can adjust the heaviness of the function tail by changing the value of b. Using both simulated and real-world data sets, we show that our proposed method can generate visualization results comparable to those of uniform manifold approximation and projection (UMAP), which is a newly developed manifold learning technique with fast running speed, better global structure, and scalability to massive data sets. In addition, according to the purpose of the study and the data structure, we can decrease or increase the value of b to either reveal the finer cluster structure of the data or maintain the neighborhood continuity of the embedding for better visualization. Finally, we use domain knowledge to demonstrate that the finer subclusters revealed with small values of b are meaningful.
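
As a rough illustration of how a single exponent can control the heaviness of a similarity function's tail, the sketch below assumes a UMAP-style low-dimensional kernel of the form 1 / (1 + a * d^(2b)). The abstract does not state the paper's exact formula, so this functional form, the function name `low_dim_similarity`, and the default a = 1 are assumptions made only for illustration.

```python
def low_dim_similarity(d, a=1.0, b=1.0):
    """Assumed UMAP-style kernel 1 / (1 + a * d^(2b)).

    Hypothetical form for illustration; not taken from the paper.
    d is a non-negative distance between two embedded points.
    """
    return 1.0 / (1.0 + a * d ** (2.0 * b))


# Compare how fast the similarity decays with distance for several b values.
distances = [0.5, 1.0, 2.0, 4.0]
for b in (0.5, 1.0, 2.0):
    row = [round(low_dim_similarity(d, b=b), 4) for d in distances]
    print(f"b={b}: {row}")
```

Under this assumed form, for distances greater than 1 a smaller b leaves more similarity at long range (a heavier tail), while a larger b makes the similarity fall off more sharply.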