Paper title
Is More Data All You Need? A Causal Exploration
Paper authors
Paper abstract
Curating a large-scale medical imaging dataset for machine learning applications is both time-consuming and expensive. Balancing the workload between model development, data collection, and annotation is difficult for machine learning practitioners, especially under time constraints. Causal analysis is often used in medicine and economics to gain insights into the effects of actions and policies. In this paper, we explore the effect of dataset interventions on the output of image classification models. Through a causal approach, we investigate how the quantity and type of data incorporated into a dataset affect performance on specific subtasks. The main goal of this paper is to highlight the potential of causal analysis as a tool for resource optimization when developing medical imaging ML applications. We explore this concept with a synthetic dataset and an exemplary use case for diabetic retinopathy image analysis.