Learned Indexing in Proteins: Extended Work on Substituting Complex Distance Calculations with Embedding and Clustering Techniques

Paper Title

Learned Indexing in Proteins: Extended Work on Substituting Complex Distance Calculations with Embedding and Clustering Techniques

Authors

Jaroslav Oľha, Terézia Slanináková, Martin Gendiar, Matej Antol, Vlastislav Dohnal

Abstract

Despite the constant evolution of similarity searching research, it continues to face the same challenges stemming from the complexity of the data, such as the curse of dimensionality and computationally expensive distance functions. Various machine learning techniques have proven capable of replacing elaborate mathematical models with combinations of simple linear functions, often gaining speed and simplicity at the cost of formal guarantees of accuracy and correctness of querying. The authors explore the potential of this research trend by presenting a lightweight solution for the complex problem of 3D protein structure search. The solution consists of three steps: (i) transformation of 3D protein structural information into very compact vectors, (ii) use of a probabilistic model to group these vectors and respond to queries by returning a given number of similar objects, and (iii) a final filtering step which applies basic vector distance functions to refine the result.
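
The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the embedding, the probabilistic model, or the distance function, so everything here is an assumption chosen for illustration. The hypothetical `embed` function builds a compact vector from distances between chain-section centroids, `cluster_probs` is a softmax over distances to k-means centres standing in for a learned probabilistic model, and `search` finishes with a plain Euclidean filter. The protein chains are random toy coordinates, not real structures.

```python
import math
import random

def embed(coords, n_sections=4):
    """Step (i), toy stand-in: split the chain into sections and use the
    pairwise distances between section centroids as a compact vector."""
    size = len(coords) / n_sections
    centroids = []
    for s in range(n_sections):
        part = coords[int(s * size):int((s + 1) * size)]
        centroids.append([sum(axis) / len(part) for axis in zip(*part)])
    return [math.dist(centroids[i], centroids[j])
            for i in range(n_sections) for j in range(i + 1, n_sections)]

def kmeans(vectors, k, iters=10, seed=0):
    """Plain k-means, used only to obtain cluster centres."""
    rng = random.Random(seed)
    centres = rng.sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: math.dist(v, centres[c]))
            groups[nearest].append(v)
        # Keep the old centre if a cluster ends up empty.
        centres = [[sum(a) / len(g) for a in zip(*g)] if g else centres[i]
                   for i, g in enumerate(groups)]
    return centres

def cluster_probs(v, centres):
    """Step (ii), assumed model: softmax over negative squared distances."""
    scores = [-math.dist(v, c) ** 2 for c in centres]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def build_index(vectors, centres):
    """Store each vector's position in its most probable cluster."""
    buckets = {c: [] for c in range(len(centres))}
    for idx, v in enumerate(vectors):
        p = cluster_probs(v, centres)
        buckets[p.index(max(p))].append(idx)
    return buckets

def search(query, vectors, centres, buckets, k=3, budget=20):
    """Visit clusters in order of probability until `budget` candidates
    are collected, then apply step (iii): a Euclidean distance filter."""
    probs = cluster_probs(query, centres)
    order = sorted(range(len(centres)), key=lambda c: -probs[c])
    candidates = []
    for c in order:
        candidates.extend(buckets[c])
        if len(candidates) >= budget:
            break
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:k]

# Toy data: 200 random "chains" of 40 points each (not real proteins).
rng = random.Random(42)
chains = [[(rng.uniform(0, 50), rng.uniform(0, 50), rng.uniform(0, 50))
           for _ in range(40)] for _ in range(200)]
vectors = [embed(c) for c in chains]
centres = kmeans(vectors, k=8)
buckets = build_index(vectors, centres)
hits = search(vectors[0], vectors, centres, buckets, k=3)
print(hits)
```

Because the query only inspects the most probable clusters, the result is approximate: a true nearest neighbour stored in an unvisited cluster is missed. This is the trade of formal guarantees for speed that the abstract describes, and the `budget` parameter (an assumption of this sketch) controls how far the search leans either way.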
