Starting Points in Distributed Systems, Pt. III: Epicyclic Explorations

This is the third and final part of a three-part miniseries about building a conceptual foundation in distributed systems through independent study. In this series, I sketch out the map that I wish I’d had when I started studying last year, drawing from my own experience and currently available resources. My hope is that this informal guide will assist fellow students on their own journeys.

Epicyclic Explorations

Once you’ve made it this far, you should have at least a broad sense of the complexity and richness of distributed systems as a field of research and practice, and a basic ability to locate yourself and navigate within those spaces. From here, your options open up and things really start to get interesting.

I’ve tried to organize my own studies from this point in terms of deep dives into either a particular concept or system. In my experience, the closer I come to grokking a concept, the less conceptual or topical divisions make sense and the more interconnected everything appears. That said, as an initial approach, this divide-and-conquer strategy has worked well for me.

The basic procedure is to begin with a foundational exposition or two, and then follow connections to other work as either your curiosity leads or comprehension demands. Sometimes this will mean following explicit citations or hyperlinks. Other times, it will mean trying to better understand a key term (e.g., linearizability) or context (e.g., Google’s architecture at a particular point) in a given discussion by independently searching for more information.

At the risk of both mixing and straining metaphors, I think of this kind of reading a bit like tracing a variety of nearer and more distant orbits around your starting document-sun, jumping from paper-planet to paper-planet, but always circling back and changing your viewing perspectives in the process. The mental pathways might resemble a rough sketch of Ptolemaic epicycles or even epicyclic gearing. You’re drafting and re-drafting a model of a system of knowledge.

Two essential aspects of this activity are the cyclical movement and ongoing addition of new information and perspectives. The goal is not to totally understand everything you read the first time, but to keep returning with a deeper understanding, or, to borrow a phrase from one of my own best teachers, to be “confused at a higher level.”

I do realize that that was very abstract. Let’s look at some concrete examples of how you could begin a concept-focused exploration and a system-focused exploration. For this first case, we’ll take a look at the ubiquitous but still frequently-misunderstood CAP theorem. For the second, let’s look at Riak as a Dynamo system.

Example start for a concept- or topic-focused exploration: the CAP theorem

“CAP,” as Michael Bernstein reminds us, “is an acronym, which is a super easy thing to make sh*t up about.” Let’s try to avoid that temptation through education.

Either Henry Robinson’s FAQ or Mikito Takada’s treatment, both cited in an earlier post, would make a good starting point.

You could then proceed directly from the latter to Seth Gilbert and Nancy Lynch’s 2012 paper, “Perspectives on the CAP theorem”, and/or to Eric Brewer’s own “CAP Twelve Years Later: How the ‘Rules’ Have Changed”.

From there, you might…

  • …go to the proof itself: Seth Gilbert and Nancy Lynch’s “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services.”
  • …notice that both the “Perspectives on the CAP theorem” and “Twelve Years Later” pieces initially appeared in the same 2012 issue of IEE Computer, which, in fact, presents a CAP retrospective. A little digging might lead you to Daniel Abadi’s blog post on that issue and, from there, his work on PACELC.
  • …better contextualize CAP consistency (i.e., “atomic,” or “linearizable” consistency ) by reading more about consistency models in Doug Terry’s “Replicated Data Consistency Explained Through Baseball.”
  • …learn more about how systems respond to network partitions by watching and/or reading Kyle Kingsbury’s Jepsen series of talks and blog posts. A brief, more formal presentation of the first part of the Jepsen project is also available here.
  • …or follow any other path as your interest leads.

Notice that any of these options would help to contextualize the others and help improve your mental model of the distributed systems space. Also notice that any of these choices would present further connections to follow and open up still further perspectives. For instance, to follow up on CAP consistency, you might go read the original paper that defined linearizability, and that paper in turn might be a starting point for a new exploration.

As an exercise, try reading either Gilbert and Lynch’s “Perspectives on the CAP theorem” or Brewer’s “Twelve Years Later” a second time after reading three of four other articles, or even a good blog series, and see how much more you can get out of a second reading. Now we’re confused at a higher level!

Example start for a system-focused exploration: Riak as a Dynamo System

In this type of exploration, you’d pick a system of interest to you and work through the most relevant formal description(s) and the best documentation for at least one implementation. If possible, you’d do some work with the implementation yourself as part of the learning process. I’ve found this type of exploration invaluable in building an understanding of how algorithms and concepts come together in real systems.

We’ll take Riak as an example here. Riak is a highly available, eventually consistent distributed database descended from the influential Dynamo paper. Among Dynamo systems, Riak is a good choice for an independent study due to the relatively clear lines of development from Dynamo to Riak, the especially high quality of Basho’s documentation, and the open-source availability of the software.

In the context of these “starting points” posts, Riak is also a good choice since Mikito Takada’s fifth chapter provides an introduction to Dynamo as well as the CRDT research supporting the new data types soon to be available in Riak 2.0.

Here, you might start historically, with the Dynamo paper itself:

Giuseppe DeCandia et al., “Dynamo: Amazon’s Highly Available Key-Value Store,” in Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.

You might also start with Eric Redmond’s excellent Little Riak Book.

After those, I’d move to Basho’s Riak documentation, with special attention to the “Theory and Concepts” section.

I’d cap off my start by setting up a Riak cluster and client on my local machine.

Once you’ve made it this far, you’d be well positioned to either branch off into papers on topics like consistent hashing and CRDTs with a deeper sense of context, or move to a comparative study of another system. You could also dig into the Riak code itself.

Series Conclusion

“We have not succeeded in answering all our problems—indeed we sometimes feel we have not completely answered any of them. The answers we have found have only served to raise a whole set of new questions. In some ways we feel that we are as confused as ever, but we think we are confused on a higher level, and about more important things.”
Earl C. Kelley, The Workshop Way of Learning (1951)

In these last few posts, I’ve described the primary processes through which I’ve built my own conceptual foundation in distributed systems. I know that I still have a lot to learn. While I certainly don’t claim to have mastered this material yet, I will claim to have achieved confusion at a higher level. The resources and model I’ve described here have helped me advance. I’ve taken the time to share them in the hope that they can help other independent learners too.

I encourage anyone reading this to do your own exploring and let one topic lead to another. Keep reading and thinking. You can source topics, systems, and articles from Mikito Takada’s book, Chris Meiklejohn’s “Readings in Distributed Systems” list or Think Distributed podcasts, and groups like the Distributed Systems Reading Group at MIT.

As a final exhortation, don’t forget the human element! Note the people writing the blog posts, books, articles, and code you read and follow them on Twitter. Join a mailing list or two. See if you can attend a relevant conference or meet-up and make some contacts in person. Reach out to people in general. I’ve personally found that most people in the distributed systems community can be quite kind and helpful to newcomers, especially if you’ve bothered to do some homework ahead of time. Distributed systems are hard, but you might be surprised by how many people out there are willing to help you find your way.