Paper Title
Calibration: A Simple Trick for Wide-table Delta Analytics
Paper Authors
Paper Abstract
Data analytics over normalized databases typically requires computing and materializing expensive joins (wide tables). Factorized query execution models execution as message passing between relations in the join graph and pushes aggregations through joins to reduce intermediate result sizes. Although this accelerates query execution, it only optimizes a single wide-table query. In contrast, wide-table analytics is usually interactive, and users want to apply deltas to the initial query structure. For instance, users want to slice, dice, and drill down dimensions, update part of the tables, and join with new tables for enrichment. Such Wide-table Delta Analytics offers novel work-sharing opportunities. This work shows that carefully materializing messages during query execution can accelerate Wide-table Delta Analytics by >10^5x compared to factorized execution, while incurring only a constant-factor overhead. The key challenge is that messages are sensitive to the message passing ordering. To address this challenge, we borrow the concept of calibration from probabilistic graphical models to materialize sufficient messages to support any ordering. We manifest these ideas in the novel Calibrated Junction Hypertree (CJT) data structure, which is fast to build, aggressively re-uses messages to accelerate future queries, and is incrementally maintainable under updates. We further show how CJTs benefit applications such as OLAP, query explanation, streaming data, and data augmentation for ML. Our experiments evaluate three versions of the CJT that run in a single-threaded custom engine, on cloud DBs, and in Pandas, and show 30x - 10^5x improvements over state-of-the-art factorized execution algorithms on the above applications.
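The aggregation-pushdown idea behind factorized execution can be illustrated with a minimal Pandas sketch. This is not the paper's implementation; the toy tables and column names here are invented for illustration. The point is that pre-aggregating the fact table per join key (the "message" sent along the join graph) yields the same answer as aggregating over the materialized join, while shrinking the intermediate result.

```python
import pandas as pd

# Toy star schema (illustrative, not from the paper):
# a fact table joined to a small dimension table.
fact = pd.DataFrame({"prod_id": [1, 1, 2, 2, 2],
                     "sales":   [10, 20, 5, 5, 5]})
dim = pd.DataFrame({"prod_id": [1, 2],
                    "category": ["A", "B"]})

# Naive plan: materialize the full join, then aggregate.
naive = fact.merge(dim, on="prod_id").groupby("category")["sales"].sum()

# Factorized plan: push SUM(sales) below the join, i.e. send a
# pre-aggregated "message" from the fact table to the dimension table,
# so the join only sees one row per prod_id.
msg = fact.groupby("prod_id", as_index=False)["sales"].sum()
factorized = msg.merge(dim, on="prod_id").groupby("category")["sales"].sum()

# Both plans compute the same answer.
assert naive.equals(factorized)
```

Materializing and re-using such messages across subsequent delta queries is the work-sharing opportunity the CJT exploits.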