Paper Title

Interpreting Black-box Machine Learning Models for High Dimensional Datasets

Authors

Karim, Md. Rezaul, Shajalal, Md., Graß, Alex, Döhmen, Till, Chala, Sisay Adugna, Boden, Alexander, Beecks, Christian, Decker, Stefan

Abstract

Deep neural networks (DNNs) have been shown to outperform traditional machine learning algorithms in a broad variety of application domains due to their effectiveness in modeling complex problems and handling high-dimensional datasets. Many real-life datasets, however, are of increasingly high dimensionality, where a large number of features may be irrelevant for both supervised and unsupervised learning tasks. The inclusion of such features not only introduces unwanted noise but also increases computational complexity. Furthermore, due to high non-linearity and dependency among a large number of features, DNN models tend to be unavoidably opaque and are perceived as black-box methods because their internal functioning is not well understood. Their algorithmic complexity is often simply beyond the capacity of humans to understand the interplay among myriads of hyperparameters. A well-interpretable model can identify statistically significant features and explain the way they affect the model's outcome. In this paper, we propose an efficient method to improve the interpretability of black-box models for classification tasks in the case of high-dimensional datasets. First, we train a black-box model on a high-dimensional dataset to learn the embeddings on which the classification is performed. To decompose the inner working principles of the black-box model and to identify the top-k important features, we employ different probing and perturbing techniques. We then approximate the behavior of the black-box model by means of an interpretable surrogate model on the top-k feature space. Finally, we derive decision rules and local explanations from the surrogate model to explain individual decisions. Our approach outperforms state-of-the-art methods such as TabNet and XGBoost when tested on datasets with dimensionality varying between 50 and 20,000, with respect to both classification metrics and explainability.
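To make the four-step pipeline in the abstract concrete, here is a minimal sketch using scikit-learn stand-ins: an MLP plays the black-box model, permutation importance serves as the perturbation-based feature-ranking step, and a shallow decision tree acts as the interpretable surrogate. The paper's actual architectures, probing techniques, and the choice of k differ; everything below (the synthetic data, k=10, the specific estimators) is an illustrative assumption, not the authors' implementation.

```python
# Hedged sketch of the abstract's pipeline; all model/parameter choices
# here are illustrative assumptions, not the paper's actual method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic high-dimensional data: 500 features, few of them informative.
X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: train the black-box model on the high-dimensional dataset.
black_box = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300,
                          random_state=0).fit(X_train, y_train)

# Step 2: perturb inputs to rank features; keep the top-k.
k = 10  # assumed value for illustration
imp = permutation_importance(black_box, X_test, y_test,
                             n_repeats=5, random_state=0)
top_k = np.argsort(imp.importances_mean)[::-1][:k]

# Step 3: fit an interpretable surrogate on the top-k feature space,
# using the black box's *predictions* as targets so the tree mimics it.
surrogate = DecisionTreeClassifier(max_depth=4, random_state=0).fit(
    X_train[:, top_k], black_box.predict(X_train))

# Step 4: derive global decision rules from the surrogate.
print(export_text(surrogate, feature_names=[f"f{i}" for i in top_k]))

# Fidelity: how often the surrogate agrees with the black box on test data.
fidelity = (surrogate.predict(X_test[:, top_k])
            == black_box.predict(X_test)).mean()
print(f"surrogate fidelity: {fidelity:.2%}")
```

Note the design choice in step 3: the surrogate is trained on the black box's predictions rather than the ground-truth labels, so the derived rules approximate the black box's behavior (measured by fidelity) rather than the data directly. Local explanations for individual decisions would follow the same idea, e.g. by tracing a single sample's path through the surrogate tree.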
