Paper Title
FRAPPE: $\underline{\text{F}}$ast $\underline{\text{Ra}}$nk $\underline{\text{App}}$roximation with $\underline{\text{E}}$xplainable Features for Tensors
Paper Authors
Paper Abstract
Tensor decompositions have proven effective for analyzing the structure of multidimensional data. However, most of these methods require a key parameter: the number of desired components. For the CANDECOMP/PARAFAC decomposition (CPD), the ideal number of components is known as the canonical rank, and it greatly affects the quality of the decomposition results. Existing methods estimate this value with heuristics or Bayesian approaches that repeatedly compute the CPD, making them extremely computationally expensive. In this work, we propose FRAPPE, the first method to estimate the canonical rank of a tensor without having to compute the CPD. The method rests on two key ideas. First, generating synthetic data with known rank is much cheaper than computing the CPD. Second, we can greatly improve the generalization ability and speed of our model by generating synthetic data that matches a given input tensor in size and sparsity. We can then train a specialized single-use regression model on a set of synthetic tensors engineered to match the given input tensor and use it to estimate the tensor's canonical rank, all without computing the expensive CPD. FRAPPE is over 24 times faster than the best-performing baseline and improves MAPE by 10% on a synthetic dataset. It also performs as well as or better than the baselines on real-world datasets.
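The sketch below illustrates the pipeline described in the abstract: build synthetic tensors of known CP rank that match the input tensor's shape and density, extract cheap summary features, fit a single-use regressor, and predict the input's rank. This is a minimal illustration only; the feature set, the random rank-one generator, the candidate rank range, and the choice of `RandomForestRegressor` are assumptions for demonstration, not the paper's actual features or model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def random_cp_tensor(shape, rank, density=1.0, rng=None):
    """Synthesize a 3-way tensor of known CP rank by summing rank-one terms,
    then sparsify to roughly the target density.
    Hypothetical helper, not the paper's exact generator."""
    rng = rng or np.random.default_rng()
    factors = [rng.standard_normal((dim, rank)) for dim in shape]
    tensor = np.einsum('ir,jr,kr->ijk', *factors)  # 3-way case for brevity
    if density < 1.0:
        tensor = tensor * (rng.random(shape) < density)
    return tensor


def simple_features(tensor):
    """Cheap, explainable summary features of a tensor (illustrative choice)."""
    flat = tensor.ravel()
    return np.array([
        *tensor.shape,
        np.count_nonzero(flat) / flat.size,  # density
        flat.mean(),
        flat.std(),
        np.abs(flat).max(),
    ])


def estimate_rank(input_tensor, candidate_ranks=range(1, 21), n_per_rank=20):
    """Train a single-use regressor on synthetic tensors matched to the
    input tensor's shape and sparsity, then predict its canonical rank."""
    rng = np.random.default_rng(0)
    density = np.count_nonzero(input_tensor) / input_tensor.size
    X, y = [], []
    for r in candidate_ranks:
        for _ in range(n_per_rank):
            t = random_cp_tensor(input_tensor.shape, r, density, rng)
            X.append(simple_features(t))
            y.append(r)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(np.array(X), np.array(y))
    return float(model.predict(simple_features(input_tensor)[None, :])[0])
```

The key cost trade-off the abstract highlights shows up directly here: the training set is produced by cheap tensor synthesis rather than by running CPD at many candidate ranks, and the regressor is discarded after a single prediction because it is tailored to one input tensor's size and sparsity.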