Verifying Distributed Systems with Verdi

After attending OPLSS last year, working through a good chunk of Software Foundations a few months back, and reading a bunch of papers since then, I feel like I’m getting a grasp on the formal verification research program.¹ Which is great! The research is interesting and feels a bit magical, and it could help mitigate certain kinds of failures. The especially interesting part to me, though, is that it provides a new way to think about software engineering as a mathematically sound discipline.

Here’s the big idea: if we have a formally defined specification of how our program should behave, and if we have a mechanism for translating between programs and specifications, we should be able to prove that our program matches the specification (“formally verifying” it).

In practice, this often means writing our specification in a language like Coq, then using its extraction facility to synthesize a program—usually in OCaml—that has the same semantics as our specification.²

Anyway, I’ve been plowing through these papers, and I just read an especially interesting and approachable one on a system called Verdi. It came out of the PLSE group at the University of Washington and was published as Verdi: A Framework for Implementing and Formally Verifying Distributed Systems at PLDI ‘15. The ideas in the paper are pretty interesting, and I’d like to summarize them here as notes for my future self.

Implementing distributed systems accurately is hard. Even a correct specification is difficult to implement. An implementation needs to account for the many ways a network can fail: packets could be reordered, duplicated, or just dropped, and nodes can crash and reboot. Writing a formal specification of all these failure cases for every system is intractable.

We’d rather not think about these sorts of problems while implementing our system. It would be lovely if we could write a verified implementation of our algorithm that was provably correct under ideal network conditions (e.g., with no dropped packets or crashed nodes), then use a tool to automatically produce a new implementation that correctly handled the possible failure cases. That’s exactly what Verdi provides.

Let’s introduce some terms:

A system is a specification of our algorithm written in Coq. Since Coq proofs can be extracted into OCaml programs, this doubles as an implementation.

A system contains a collection of nodes. Each node has a name, some internal state, and behavior governing how it sends and receives messages.

At any given time the system will be in a state \((P, \Sigma, T)\), where \(P\) is the collection of packets in flight, \(\Sigma\) is the state of the nodes in the network, and \(T\) is the history (“trace”) of I/O actions (messages).

For each fault model (e.g., reordered packets, crashing nodes, etc), the authors develop a network semantics that formally describes the behavior of that system. Each semantics is a step relation (denoted with a “\(\rightsquigarrow\)”) between states of the network.

As an example, the rule for duplicating a packet looks like this:

\[\frac{ p \in P }{ (P, \Sigma, T) \rightsquigarrow_\text{dup} (P \uplus \{p\}, \Sigma, T) }\]

This says: if we suppose that there’s a packet \(p\) on the wire, then duplicating the packet means updating the state of the system to a new state where an extra copy of \(p\) has been added to the set of all packets \(P\). (The “\(\uplus\)” there is a bag union operator, since \(P\) has to be a multiset to handle duplicates.)

So, we’ve got a semantics for each of our fault models.³ How can we use these to transform our ideal specification into a more resilient one that can operate under these fault models?

The distributed systems community has developed effective solutions to each of these fault models. For example, a system might handle packet duplication by assigning a sequence number to each packet and ignoring ones it’s already seen. We’d like to be able to apply these solutions.

Verdi addresses this by providing a mechanism for building verified system transformers (VSTs). A VST is a function that maps from a system operating under one set of network semantics to a new system operating under a different semantics. I’ve been intuitively thinking of VSTs as formal encodings of distributed systems design patterns (like packet sequencing).

So, if you’ve built a specification of your algorithm in Coq which is provably correct under ideal network conditions, you can apply a series of VSTs to it to generate a new specification. That specification will maintain all the old properties of your original specification, but will be automatically augmented by the patterns applied by the VSTs. Your relatively simple specification can be transformed into one that can handle node failure and packet reordering with no additional work on your part, and still remain provably correct! And, since this is all in Coq, we can extract an OCaml implementation of the new specification automatically.

That’s so cool!

The authors applied this to a few well-known distributed systems algorithms like a simple key-value store and a primary-backup replication system. In a really impressive move, though, they also used Verdi to generate the first verified implementation of the Raft consensus protocol. Again, so cool! Consensus is a huge pain to implement, and extremely hard to get correct, so this seems like a really big deal.

Verdi’s available on GitHub, by the way (where it deserves more stars), as is the Verdi-derived Raft implementation.

Or sometimes “deep specification.” It’s an emerging field, so the way it talks about itself is still a little loose. ↩
Note that this solves the problem of verification by determining if we’re building what we intended to build. It doesn’t address the more intractable problem of validation, i.e., “are we building the right thing?” Both problems need attention, though, so I don’t really see this as a failure. ↩
The Coq toolchain that comes with Verdi also includes mechanisms for developing your own network semantics to handle novel fault models. ↩

You might like these textually similar articles: