Title
Large Data and (Not Even Very) Complex Ecological Models: When Worlds Collide
Authors
Abstract
We consider the challenges that arise when fitting complex ecological models to 'large' data sets. In particular, we focus on random effect models, which are commonly used to describe the individual heterogeneity often present in ecological populations under study. In general, these models lead to a likelihood that is expressible only as an analytically intractable integral. Common techniques for fitting such models to data include, for example, the use of numerical approximations for the integral, or a Bayesian data augmentation approach. However, as the size of the data set increases (i.e. the number of individuals increases), these tools may become computationally infeasible. We present an efficient Bayesian model-fitting approach, whereby we initially sample from the posterior distribution of a smaller subsample of the data, before correcting this sample to obtain estimates of the posterior distribution of the full data set, using an importance sampling approach. We consider several practical issues, including the subsampling mechanism, computational efficiencies (including the ability to parallelise the algorithm) and combining estimates from multiple subsampled data sets. We demonstrate the approach in relation to individual heterogeneity capture-recapture models. We initially demonstrate the feasibility of the approach via simulated data before considering a challenging real data set of approximately 30,000 guillemots, and obtain posterior estimates in substantially reduced computational time.
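The core idea described in the abstract — sample from the posterior of a data subsample, then reweight those draws by the likelihood of the remaining data — can be illustrated with a minimal sketch. This is not the authors' capture-recapture model; it is a toy conjugate-normal example (known unit variance, flat prior) chosen so the subsample posterior is available in closed form and the importance weights reduce to a simple expression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y_i ~ N(theta, 1) with a flat prior on theta
# (illustrative assumption, not the paper's ecological model).
theta_true = 2.0
y = rng.normal(theta_true, 1.0, size=10_000)

# Step 1: sample from the posterior based on a small subsample.
# For N(theta, 1) data and a flat prior, this posterior is
# N(mean(y_sub), 1 / n_sub).
n_sub = 500
y_sub = y[:n_sub]
draws = rng.normal(y_sub.mean(), 1.0 / np.sqrt(n_sub), size=5_000)

# Step 2: importance-sampling correction — weight each draw by the
# likelihood of the data *not* in the subsample. On the log scale,
# log p(y_rest | theta) = theta * sum(y_rest) - 0.5 * n_rest * theta^2
# up to a theta-free constant, which cancels after normalisation.
y_rest = y[n_sub:]
log_w = draws * y_rest.sum() - 0.5 * y_rest.size * draws**2
log_w -= log_w.max()          # stabilise before exponentiating
w = np.exp(log_w)
w /= w.sum()

# The weighted draws now approximate the full-data posterior;
# its mean should sit close to mean(y), the exact full-data
# posterior mean in this conjugate setup.
full_post_mean = float((w * draws).sum())
```

The attraction, as in the abstract, is that the expensive posterior sampling (Step 1) involves only the subsample; the correction (Step 2) is a cheap reweighting, and multiple subsampled data sets can be processed in parallel and their corrected estimates combined.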