Paper title
Optimal selection of sample-size dependent common subsets of covariates for multi-task regression prediction
Paper authors
Paper abstract
An analyst is given a training set consisting of regression datasets $D_j$ of different sizes, distributed according to some $G_j$, $j=1,\ldots,\mathcal{J}$, where the distributions $G_j$ are assumed to form a random sample generated by some common source. In particular, the $D_j$'s have a common set of covariates, and they are all labeled. The analyst uses the training set to select subsets of covariates, denoted by $P^*(n)$, whose role is described next. The multi-task problem we consider is as follows: given a number of random labeled datasets $D_{J_k}$ of size $n_k$, $k=1,\ldots,K$ (which may or may not be in the training set), estimate separately for each dataset the regression coefficients on the subset of covariates $P^*(n_k)$, and then predict future dependent variables given their covariates. Naturally, a larger sample size $n_k$ of $D_{J_k}$ allows a larger subset of covariates, and the dependence of the size of the selected covariate subset on $n_k$ is needed in order to achieve good prediction and avoid overfitting. Subset selection is notoriously difficult and computationally demanding, and it requires large samples; using all the regression datasets in the training set together amounts to borrowing strength toward better selection under suitable assumptions. Furthermore, using a common subset for all regressions with a given sample size standardizes and simplifies data collection, and avoids having to select and use a different subset for each prediction task. Our approach is efficient when the covariates relevant for prediction are common to the different regressions, while the models' coefficients may vary between regressions.
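To make the two-stage workflow concrete, the following is a minimal Python sketch, not the paper's actual procedure. The selection rule used here (ranking covariates by their average absolute marginal correlation with the response across the training datasets, then keeping the top $k(n)$ with $k(n)\propto\sqrt{n}$) is an illustrative placeholder for the paper's selection criterion; the function names, the size rule $k(n)$, and the simulated data are all assumptions made for demonstration.

```python
# Minimal sketch of the two-stage workflow: (1) use all training regressions
# to choose a common, sample-size-dependent covariate subset P*(n);
# (2) fit each new regression on P*(n_k) only, then predict.
import numpy as np

def common_subset(train_sets, n, c=1.0):
    """Return P*(n): indices of a common covariate subset whose size grows with n.

    train_sets : list of (X, y) pairs, all sharing the same p covariates.
    n          : sample size of the dataset the subset will be used for.
    c          : tuning constant for the (assumed) size rule k(n) = c * sqrt(n).
    """
    p = train_sets[0][0].shape[1]
    # Borrow strength: aggregate a per-covariate relevance score over all
    # training regressions (here, mean absolute marginal correlation --
    # an illustrative stand-in for the paper's criterion).
    score = np.zeros(p)
    for X, y in train_sets:
        for j in range(p):
            score[j] += abs(np.corrcoef(X[:, j], y)[0, 1])
    score /= len(train_sets)
    k = min(p, max(1, int(c * np.sqrt(n))))  # subset size depends on n
    return np.sort(np.argsort(score)[::-1][:k])

def fit_and_predict(X, y, X_new, subset):
    """Fit OLS on the selected covariates only, then predict for X_new."""
    beta, *_ = np.linalg.lstsq(X[:, subset], y, rcond=None)
    return X_new[:, subset] @ beta

# Usage: simulate J training regressions whose relevant covariates are common
# but whose coefficients vary, then predict for a new dataset of size n_k.
rng = np.random.default_rng(0)
p, relevant = 20, [0, 1, 2]

def make_dataset(n):
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[relevant] = rng.normal(1.0, 0.3, size=len(relevant))  # coefficients vary
    return X, X @ beta + rng.standard_normal(n)

train_sets = [make_dataset(rng.integers(50, 200)) for _ in range(10)]
n_k = 100
X_k, y_k = make_dataset(n_k)
subset = common_subset(train_sets, n_k)               # P*(n_k)
preds = fit_and_predict(X_k, y_k, X_k[:5], subset)    # predict for 5 new points
print("selected subset:", subset, "predictions:", np.round(preds, 2))
```

The key design point the sketch illustrates is that the subset is chosen once, from the pooled training regressions, as a function of sample size alone; each individual regression then contributes only its own coefficient estimates on that common subset.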