micahlerner.com
Log-structured Protocols in Delos
Published November 23, 2021
Found something wrong? Submit a pull request!
The papers over the next few weeks will be from SOSP. As always, feel free to reach out on Twitter with feedback or suggestions about papers to read! These paper reviews can be delivered weekly to your inbox, or you can subscribe to the Atom feed.
Log-structured Protocols in Delos
This week’s paper review, “Log-structured Protocols in Delos” discusses a critical component of Delos, Facebook’s system for storing control plane data, like scheduler metadata and configurationA previous paper on the system, Virtual Consensus in Delos, won a best paper award at OSDI 2020. There are great overviews of this paper from Murat Demirbas and The Morning Paper, and a great talk from Mahesh at @scale. - according to the authors, Delos is replacing ZookeeperFrom the Zookeeper site: “ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.” inside of FacebookThe authors call the Zookeeper implementation on Delos Zelos - more on this later on in this paper review. .
Storage systems for control plane data are placed under different constraints than systems that store application data - for example, control plane systems must be highly available and strive for zero-dependencies. At the same time, it is not sufficient to provide a single API (like a simple key-value store) for control plane databases, meaning that several systems need to be implemented according to these requirements. Delos aims to limit duplicate solutions to the problems that control plane databases face by providing a common platform for control plane databases.
A key feature of Delos is a replicated log - many systems use log replication to maintain multiple copies of a dataset or to increase fault toleranceTwo examples are MySQL replication, covered in previous papers reviews on TAO: Facebook’s Distributed Data Store for the Social Graph and Scaling Memcache at Facebook. . Consumers of a replicated log execute logic on the data in log entries to produce a “state of the world”. Each node that has consumed the log up to the same point will have the same “state of the world” (assuming that the log consumption code is deterministic!). The name for this technique is state machine replicationThe paper cites Raft, which I wrote about in a previous paper review. I also highly recommend like this overview from Eli Bendersky. (aka SMR).
The authors note that many systems taking advantage of state machine replication unnecessarily re-implement similar functionality (like batching writes to the log). To enable code reuse, Delos implements common functionality in reusable building blocks that run under higher-level, application-specific logic. The authors call this stack-like approach log-structured protocols, and discuss how the technique simplifies the development and deployment of SMR systems through code-reuse, upgradability, and implementation flexibility.
What are the paper’s contributions?’
The paper makes three main contributions: the design for log-structured protocols, implementations of nine log-structured protocols and two production databases using the abstraction, and the evaluation of the implementations scaled to a production environment.
Log Structured Protocol Design
Each log-structured protocol has four primary components:
- Application logic: unique functionality that often represents the interface between the replicated state machine and an external system. On example is application logic that converts log entries into SQL statements that write to a database table.
- Engines: implement common functionality like batching writes to the log or backing up log entries to external storage. More information on the various engines in a later section.
- Local store: contains the state of the world. Engines and application logic read/write to the local store, which is implemented using RocksDB.
- Shared log: the lowest level of the stack. A common base engine handles writes and reads to the shared log.
Engines are a key building block of each log-structured protocol - they allow developers to compose existing functionality and to focus on implementing a small set of custom logic.
Each engine interacts with the layers above or below through an API that relies on proposals:
propose
, used to send messages down the stack, towards the shared log.apply
, used by lower level engines to transfer messages up the stack.
While responding to calls, the engines can also read or write to the LocalStore, which maintains the current state of the system. Additional calls setup the layering in a log-structured protocol (registerUpcall
), coordinate trimming the log (setTrimPrefix
), request all entries from a lower level engine (sync
), and allow an engine to respond to events (using a postApply
callback).
Two Databases and Nine Engines
In addition to outlining the structure of log-structured protocols, the paper describes the implementation of a set of databases and engines using the approach.
Databases
The paper discusses the implementation of two databases using the Delos infrastructure: DelosTable and Zelos.
Existing research from FB describes how DelosTable, “offers a rich API, with support for transactions, secondary indexes, and range queries. It provides strong guarantees on consistency, durability, and availability.” DelosTable is used in Facebook’s, “Tupperware Resource BrokerI haven’t had a chance to read it yet, but Tupperware is mentioned in Facebook’s paper on resource management - Twine: A Unified Cluster Management System for Shared Infrastructure. , which maintains a ledger of all machines in our data centers and their allocation status”.
Zelos provides a Zookeeper-like interface that supports CRUD operations on a hiearchical structure of nodesSee the Zookeeper documentation for more details. (among other, more advanced functions).
When covering Zelos, the paper discusses how internal customer needs stemming from the Zookeeper-port shaped the Delos designThe paper also notes another pivot: “The Delos project was initially conceived with the goal of adding a quorum-replicated Table store to this menagerie of distributed systems, filling a gap for applications that required the fault-tolerance of ZooKeeper with the relational API of MySQL. However, a secondary goal soon emerged: could we implement the ZooKeeper API on the same codebase as this new Table store, eliminating the need to maintain and operate a separate ZooKeeper service?” :
Our initial design for Delos involved a reusable platform layer exposing an SMR API, allowing any arbitrary application black box to be replicated above it. The platform itself is also a replicated state machine, containing functionality generic to applications…Unfortunately, structuring the platform as a monolithic state machine limited its reusability. When the ZooKeeper team at Facebook began building Zelos on the Delos platform, they needed to modify the platform layer to obtain additional properties such as session ordering guarantees, batching / group commit, and nonvoting modesThese unique properties of Zookeeper are discussed in a later section .
Because these unique features of Zookeeper were too difficult to implement in a monolithic architecture, the Delos design pivoted to a stack-like, engine-based approach.
Engines
The paper describes nine different engines that comprise common functionality. I focus on three that highlight Delos’ strengths: the ObserverEngine, SessionOrderEngine, and BatchingEngine.
The ObserverEngine is placed between different layers of a Delos stacks, and provides reusable monitoring functionality by tracking the time spent in a given engine.
The SessionOrderEngine implements the idea of Zookeeper sessionsThe original Zookeeper paper seems to discuss this idea in the Zookeeper guarantees section. Documentation on mechanics of Zookeeper sessions is here. :
ZooKeeper provides a session-ordering guarantee: within a session, if a client first issues a write and then a concurrent read (without waiting for the write to complete), the read must reflect the write. This property is stronger than linearizability, which allows concurrent writes and reads to be ordered arbitrarily; and encompasses exactly-once semanticsThe Delos paper references Implementing Linearizability at Large Scale and Low Latency when discussing how Zookeeper’s guarantees are stronger than linearizability (definition of linearizability here). The system proposed by this paper uses log-based recovery on the server to ensure that if clients retry a request after crashing, the system will preserve linearizability. When reading the referenced paper, I also found this talk from the authors and this review from The Morning Paper.
Delos implements these semantics in the SessionOrderEngine by assigning sequence numbers (essentially autoincrementing IDs) to outgoing writes. When other nodes read from the log, they check that the writes are ordered based on sequence number, reordering them into the correct sequence as necessaryThe Delos paper mentions that “disorder can occur due to leader changes within the log implementation, or due to code changes in the Delos stack”. .
The BatchingEngine groups entries into a single transaction write to the LocalStore. This approach enables higher performance and provides a common implementation that both DelosTable and Zelos use (related to Delos’ design goal of code re-use).
Evaluation
The paper evaluates Delos log-structured protocols on two dimensions: the overhead (if any) inherent to the design, and the performance/productivity gains that the design allows.
When evaluating overhead, the paper considers the apply
thread (as this upcall relates to the different transitions between each engine). The paper notes that of the CPU consumed in the fleet, apply only makes up 10% of the utilization.
The second main category of results is related to the benefits of code-reuse. One example that the paper cites is the introduction of the BatchingEngine discussed in the previous section. The deployment of the BatchingEngine was relatively straightfoward and contributed to a 2X throughput improvement. Furthermore, the engine could be rolled out to other protocols.
Conclusion
I greatly enjoyed this paper! The paper’s authors have been researching related topics for some timeMahesh published a paper on building data structures from a shared log at SOSP’13. , and seeing their expertise applied to a new production setting was quite interesting. Additionally, the newest Delos papers share production-focused experiences, and a design guided by collaboration with internal customers - it is always fun to read about rubber-meets-the-road approaches!
As always, feel free to reach out on Twitter with feedback! Until next time.