论文标题
了解Azure Cosmos DB与TLA+的不一致性
Understanding Inconsistency in Azure Cosmos DB with TLA+
论文作者
论文摘要
除了实现分布式系统的正确性之外,确切了解用户应该从该系统中看到的内容同样重要。即使系统本身可以按设计,对其用户可见语义的理解不足也可能导致其依赖性错误。通过将正式的规格集中在微软的Azure Cosmos DB服务的预期面向用户的行为上,我们能够对数据库进行正式规范,该数据库比COSMOS DB的其他任何其他规范都要小得多,并且比任何其他有效的用户行为范围更广泛的范围要比现有的范围更广泛。我们记录的许多其他行为以前在Cosmos DB开发团队之外也很少理解,甚至非正式地导致了依赖它的Microsoft产品的数据一致性错误。使用此模型,我们能够在Cosmos DB的公开文档中提出两个关键问题,此后已经解决了这些问题。我们还能够为另一个依赖宇宙数据库的Azure服务中以前的高影响力中断提供基本解决方案。
Beyond implementation correctness of a distributed system, it is equally important to understand exactly what users should expect to see from that system. Even if the system itself works as designed, insufficient understanding of its user-visible semantics can cause bugs in its dependencies. By focusing a formal specification effort on precisely defining the expected user-facing behaviors of the Azure Cosmos DB service at Microsoft, we were able to write a formal specification of the database that was significantly smaller and conceptually simpler than any other specification of Cosmos DB, while representing a wider range of valid user-observable behaviors than existing more detailed specifications. Many of the additional behaviors we documented were previously poorly understood outside of the Cosmos DB development team, even informally, leading to data consistency errors in Microsoft products that depend on it. Using this model, we were able to raise two key issues in Cosmos DB's public-facing documentation, which have since been addressed. We were also able to offer a fundamental solution to a previous high-impact outage within another Azure service that depends on Cosmos DB.