Paper Title
Computing the Relative Value of Spatio-Temporal Data in Wholesale and Retail Data Marketplaces
Paper Authors
Abstract
Spatio-temporal information drives a plethora of intelligent transportation, smart-city, and crowd-sensing applications. Since data is now considered a valuable production factor, data marketplaces have appeared to help individuals and enterprises bring it to market and satisfy the ever-growing demand. In such marketplaces, several sources may need to combine their data in order to meet the requirements of different applications. In this paper, we study the problem of estimating the relative value of different spatio-temporal datasets combined in wholesale and retail marketplaces for the purpose of predicting demand in metropolitan areas. Using large datasets of taxi rides from Chicago and New York as case studies, we ask questions such as "When does it make sense for different taxi companies to combine their data?" and "How should different companies be compensated for the data that they share?". We then turn to the even harder problem of establishing the relative value of the data brought to retail marketplaces by individual drivers. Overall, we show that simplistic but popular approaches for estimating the relative value of data, such as using volume or the "leave-one-out" heuristic, are inaccurate. Instead, more complex notions of value from economics and game theory, such as the Shapley value, need to be employed if one wishes to capture the complex effects of mixing different datasets on the accuracy of forecasting algorithms. Applying the Shapley value to large datasets from many sources is, of course, computationally challenging. We resort to structured sampling and manage to compute accurately the importance of thousands of data sources. We show that the relative value of the data held by different taxi companies and drivers may differ substantially, and that its relative ranking may change from district to district within a metropolitan area.
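The contrast the abstract draws between the leave-one-out heuristic and the Shapley value can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the three sources, the `COVERAGE` table, and the utility `value` (distinct zones covered) are hypothetical stand-ins for the paper's forecasting accuracy, and the estimator uses plain random permutation sampling, not the structured sampling the abstract refers to.

```python
import itertools
import random

# Hypothetical toy data: each source "covers" a set of city zones.
# Sources A and B hold redundant data; C holds a small unique piece.
COVERAGE = {
    "A": {1, 2, 3},
    "B": {1, 2, 3},
    "C": {4},
}


def value(coalition):
    """Toy utility (an assumption): number of distinct zones covered."""
    zones = set()
    for source in coalition:
        zones |= COVERAGE[source]
    return len(zones)


def leave_one_out(sources):
    """Value of each source as the drop in utility when it alone is removed."""
    full = value(sources)
    return {s: full - value([t for t in sources if t != s]) for s in sources}


def _add_marginals(order, totals):
    """Accumulate each source's marginal contribution along one ordering."""
    seen, previous = [], 0
    for source in order:
        seen.append(source)
        current = value(seen)
        totals[source] += current - previous
        previous = current


def exact_shapley(sources):
    """Exact Shapley value: average marginal contribution over all orderings."""
    totals = {s: 0.0 for s in sources}
    orderings = list(itertools.permutations(sources))
    for order in orderings:
        _add_marginals(order, totals)
    return {s: totals[s] / len(orderings) for s in sources}


def sampled_shapley(sources, n_samples, seed=0):
    """Monte Carlo estimate from uniformly random orderings."""
    rng = random.Random(seed)
    totals = {s: 0.0 for s in sources}
    for _ in range(n_samples):
        order = list(sources)
        rng.shuffle(order)
        _add_marginals(order, totals)
    return {s: totals[s] / n_samples for s in sources}


if __name__ == "__main__":
    sources = list(COVERAGE)
    print("leave-one-out:", leave_one_out(sources))
    print("exact Shapley:", exact_shapley(sources))
    print("sampled Shapley:", sampled_shapley(sources, n_samples=2000))
```

In this toy setting, leave-one-out assigns zero value to both A and B because each is fully replaceable by the other, while the Shapley value splits their joint contribution evenly (1.5 each) and credits C with 1. Exact enumeration grows factorially with the number of sources, which is why some form of sampling becomes necessary at the scale of thousands of sources.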