Paper Title
Noise Pollution in Hospital Readmission Prediction: Long Document Classification with Reinforcement Learning
Paper Authors
Paper Abstract
This paper presents a reinforcement learning approach to extract noise from long clinical documents for the task of readmission prediction after kidney transplant. We face the challenge of developing robust models on a small dataset in which each document may contain over 10K tokens and is full of noise, including tabular text and task-irrelevant sentences. We first experiment with four types of encoders to empirically determine the best document representation, and then apply reinforcement learning to remove noisy text from the long documents, modeling the noise extraction process as a sequential decision problem. Our results show that the old bag-of-words encoder outperforms deep learning-based encoders on this task, and that reinforcement learning improves upon the baseline while pruning out 25% of the text segments. Our analysis shows that reinforcement learning is able to identify both typical noisy tokens and task-specific noisy text.
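The sequential decision framing in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a logistic keep/drop policy over bag-of-words token weights, trained with a plain REINFORCE update, and it swaps the readmission classifier's feedback for a hypothetical `toy_reward` that scores actions against hand-labeled relevance flags. All function names, the learning rate, and the moving-average baseline are illustrative assumptions.

```python
import math
import random
from collections import Counter


def tokenize(segment):
    """Whitespace tokenization; an assumption made for this sketch."""
    return segment.lower().split()


def bag_of_words(segments):
    """Token counts over the segments that survive pruning."""
    counts = Counter()
    for seg in segments:
        counts.update(tokenize(seg))
    return counts


def keep_probability(segment, weights, bias):
    """Logistic policy: probability of keeping a segment, given its tokens."""
    score = bias + sum(weights.get(tok, 0.0) for tok in tokenize(segment))
    return 1.0 / (1.0 + math.exp(-score))


def sample_actions(segments, weights, bias, rng):
    """Walk the document once, sampling keep (1) or drop (0) per segment."""
    return [1 if rng.random() < keep_probability(s, weights, bias) else 0
            for s in segments]


def reinforce_update(segments, actions, advantage, weights, bias, lr=0.1):
    """One REINFORCE step; (action - p) is d log pi(action) / d score."""
    # Evaluate all probabilities before changing any parameter.
    probs = [keep_probability(s, weights, bias) for s in segments]
    for seg, action, p in zip(segments, actions, probs):
        step = lr * advantage * (action - p)
        for tok in tokenize(seg):
            weights[tok] = weights.get(tok, 0.0) + step
        bias += step
    return bias


def toy_reward(actions, is_relevant):
    """Hypothetical stand-in reward in [-1, 1]; not the paper's reward."""
    hits = sum(1 for a, r in zip(actions, is_relevant) if a == int(r))
    return 2.0 * hits / len(actions) - 1.0


def train(segments, is_relevant, episodes=500, seed=0):
    """Repeat sample -> reward -> update, with a moving-average baseline."""
    rng = random.Random(seed)
    weights, bias, baseline = {}, 0.0, 0.0
    for _ in range(episodes):
        actions = sample_actions(segments, weights, bias, rng)
        reward = toy_reward(actions, is_relevant)
        bias = reinforce_update(segments, actions, reward - baseline,
                                weights, bias)
        baseline = 0.9 * baseline + 0.1 * reward
    return weights, bias
```

In a realistic setting the reward would come from the downstream readmission classifier evaluated on the pruned document, and `bag_of_words` would be applied only to the kept segments before classification.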