Paper title
Addressing Census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements
Paper authors
Abstract
Prediction of individuals' race and ethnicity plays an important role in social science and public health research. Examples include studies of racial disparities in health and voting. Recently, Bayesian Improved Surname Geocoding (BISG), which uses Bayes' rule to combine information from Census surname files with the geocoding of an individual's residence, has emerged as a leading methodology for this prediction task. Unfortunately, BISG suffers from two Census data problems that contribute to unsatisfactory predictive performance for minorities. First, the decennial Census often contains zero counts for minority racial groups in the Census blocks where some members of those groups reside. Second, because the Census surname files include only frequent names, many surnames, especially those of minorities, are missing from the list. To address the zero-count problem, we introduce a fully Bayesian Improved Surname Geocoding (fBISG) methodology that accounts for potential measurement error in Census counts by extending the naive Bayesian inference of the BISG methodology to full posterior inference. To address the missing-surname problem, we supplement the Census surname data with additional data on last, first, and middle names taken from the voter files of six Southern states where self-reported race is available. Our empirical validation shows that the fBISG methodology and name supplements significantly improve the accuracy of race imputation across all racial groups, and especially for Asians. The proposed methodology, together with the additional name data, is available via the open-source software wru.
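To make the Bayes' rule step and the zero-count problem concrete, here is a minimal sketch in Python. It assumes the standard BISG simplification that surname and location are conditionally independent given race, so that P(race | surname, block) is proportional to P(race | surname) × P(block | race). The function name, the race categories, and every number below are hypothetical and chosen only for illustration. This is not the wru implementation, and it does not implement fBISG.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Naive-Bayes BISG update.

    Computes P(race | surname, block), which is proportional to
    P(race | surname) * P(block | race), then normalizes over races.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("every race has zero probability; cannot normalize")
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical surname table entry: a surname held mostly by Asian individuals.
p_race_given_surname = {"white": 0.05, "black": 0.02, "hispanic": 0.03, "asian": 0.90}

# Hypothetical Census counts: the block records zero Asian residents.
block_counts = {"white": 40, "black": 10, "hispanic": 10, "asian": 0}
total_counts = {"white": 1000, "black": 500, "hispanic": 400, "asian": 100}

# P(block | race) estimated as the share of each group living in this block.
p_geo_given_race = {
    race: block_counts[race] / total_counts[race] for race in block_counts
}

posterior = bisg_posterior(p_race_given_surname, p_geo_given_race)
for race, prob in posterior.items():
    print(f"{race}: {prob:.3f}")
```

In this toy case the surname alone puts 0.90 probability on "asian", yet the posterior for that group is exactly zero because the block count is zero. That is the failure mode the abstract attributes to treating Census counts as error-free.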