Improving Group Lasso for High-Dimensional Categorical Data

Paper title

Improving Group Lasso for high-dimensional categorical data

Authors

Szymon Nowakowski, Piotr Pokarowski, Wojciech Rejchel, Agnieszka Sołtys

Abstract

Sparse modelling or model selection with categorical data is challenging even for a moderate number of variables, because roughly one parameter is needed to encode each category or level. The Group Lasso is a well-known, efficient algorithm for selecting continuous or categorical variables, but the estimates for the levels of a selected factor usually all differ. As a result, a fitted model may not be sparse, which makes it difficult to interpret. To obtain a sparse solution from the Group Lasso, we propose the following two-step procedure: first, we reduce data dimensionality using the Group Lasso; then, to choose the final model, we apply an information criterion to a small family of models prepared by clustering the levels of individual factors. We investigate the selection correctness of the algorithm in a sparse high-dimensional scenario. We also test our method on synthetic as well as real datasets and show that it performs better than state-of-the-art algorithms with respect to prediction accuracy or model dimension.
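
The two-step procedure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it is a minimal illustration under several assumptions that the abstract does not state: a linear model with squared loss, a plain proximal-gradient Group Lasso solver, a nested family of models built by merging adjacent level estimates in order of increasing gap (single linkage in one dimension), and BIC as the information criterion. The function names `group_lasso` and `select_partition`, the penalty value `lam=0.15`, and the synthetic data are all hypothetical.

```python
import numpy as np

def one_hot(codes, n_levels):
    """Full indicator matrix for an integer-coded factor."""
    return (codes[:, None] == np.arange(n_levels)).astype(float)

def group_lasso(factors, n_levels, y, lam, n_iter=500):
    """Step 1 (assumed solver): proximal-gradient Group Lasso, one group per factor."""
    X = np.hstack([one_hot(f, k) for f, k in zip(factors, n_levels)])
    X = X - X.mean(axis=0)                  # centering absorbs the intercept
    yc = y - y.mean()
    n = len(y)
    edges = np.cumsum([0] + list(n_levels))
    step = n / np.linalg.norm(X, 2) ** 2    # 1 / Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - yc)) / n
        for g, k in enumerate(n_levels):    # group-wise soft thresholding
            sl = slice(edges[g], edges[g + 1])
            norm = np.linalg.norm(z[sl])
            thr = step * lam * np.sqrt(k)
            z[sl] = 0.0 if norm <= thr else z[sl] * (1 - thr / norm)
        beta = z
    return [beta[edges[g]:edges[g + 1]] for g in range(len(factors))]

def select_partition(factors, coefs, y):
    """Step 2 (assumed variant): merge levels of selected factors, pick the model by BIC."""
    n = len(y)
    selected = [g for g, b in enumerate(coefs) if np.any(b != 0)]
    orders = {g: np.argsort(coefs[g]) for g in selected}
    # Gaps between adjacent sorted level estimates, pooled over selected factors.
    gaps = sorted((d, g, i) for g in selected
                  for i, d in enumerate(np.diff(coefs[g][orders[g]])))
    best = None
    for m in range(len(gaps) + 1):          # model m merges the m smallest gaps
        merged = {(g, i) for _, g, i in gaps[:m]}
        cols, labels = [np.ones(n)], {}
        for g in selected:
            k = len(orders[g])
            cuts = [0] + [0 if (g, i) in merged else 1 for i in range(k - 1)]
            lab = np.empty(k, dtype=int)
            lab[orders[g]] = np.cumsum(cuts)   # cluster label of each level
            labels[g] = lab
            obs = lab[factors[g]]
            # Cluster 0 is the reference; one indicator column per other cluster.
            cols += [(obs == c).astype(float) for c in range(1, lab.max() + 1)]
        D = np.column_stack(cols)
        resid = y - D @ np.linalg.lstsq(D, y, rcond=None)[0]
        bic = n * np.log(resid @ resid / n) + np.log(n) * D.shape[1]
        if best is None or bic < best[0]:
            best = (bic, labels)
    return selected, best[1]

# Synthetic check: only factor 0 matters, and its levels {0, 1} and {2, 3} share effects.
rng = np.random.default_rng(0)
n, n_levels = 600, [4, 4, 4]
factors = [rng.integers(0, k, size=n) for k in n_levels]
y = 3.0 * (factors[0] >= 2) + 0.5 * rng.standard_normal(n)
coefs = group_lasso(factors, n_levels, y, lam=0.15)
selected, labels = select_partition(factors, coefs, y)
print("selected factors:", selected)
print("level clusters:", labels)
```

In this sketch the Group Lasso step only screens factors, and the refits in the second step are unpenalized least squares, so the final model can assign one shared estimate to several levels of a factor. The penalty level is fixed by hand here; in practice it would have to be tuned.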
