starting points in distributed systems

Starting Points in Distributed Systems, Pt. III: Epicyclic Explorations

This is the third and final part of a three-part miniseries about building a conceptual foundation in distributed systems through independent study. In this series, I sketch out the map that I wish I’d had when I started studying last year, drawing from my own experience and currently available resources. My hope is that this informal guide will assist fellow students on their own journeys.

Epicyclic Explorations

Once you’ve made it this far, you should have at least a broad sense of the complexity and richness of distributed systems as a field of research and practice, and a basic ability to locate yourself and navigate within those spaces. From here, your options open up and things really start to get interesting.

I’ve tried to organize my own studies from this point in terms of deep dives into either a particular concept or system. In my experience, the closer I come to grokking a concept, the less conceptual or topical divisions make sense and the more interconnected everything appears. That said, as an initial approach, this divide-and-conquer strategy has worked well for me.

The basic procedure is to begin with a foundational exposition or two, and then follow connections to other work as either your curiosity leads or comprehension demands. Sometimes this will mean following explicit citations or hyperlinks. Other times, it will mean trying to better understand a key term (e.g., linearizability) or context (e.g., Google’s architecture at a particular point) in a given discussion by independently searching for more information.

At the risk of both mixing and straining metaphors, I think of this kind of reading a bit like tracing a variety of nearer and more distant orbits around your starting document-sun, jumping from paper-planet to paper-planet, but always circling back and changing your viewing perspectives in the process. The mental pathways might resemble a rough sketch of Ptolemaic epicycles or even epicyclic gearing. You’re drafting and re-drafting a model of a system of knowledge.

Two essential aspects of this activity are the cyclical movement and ongoing addition of new information and perspectives. The goal is not to totally understand everything you read the first time, but to keep returning with a deeper understanding, or, to borrow a phrase from one of my own best teachers, to be “confused at a higher level.”

I do realize that that was very abstract. Let’s look at some concrete examples of how you could begin a concept-focused exploration and a system-focused exploration. For this first case, we’ll take a look at the ubiquitous but still frequently-misunderstood CAP theorem. For the second, let’s look at Riak as a Dynamo system.

Example start for a concept- or topic-focused exploration: the CAP theorem

“CAP,” as Michael Bernstein reminds us, “is an acronym, which is a super easy thing to make sh*t up about.” Let’s try to avoid that temptation through education.

Either Henry Robinson’s FAQ or Mikito Takada’s treatment, both cited in an earlier post, would make a good starting point.

You could then proceed directly from the latter to Seth Gilbert and Nancy Lynch’s 2012 paper, “Perspectives on the CAP theorem”, and/or to Eric Brewer’s own “CAP Twelve Years Later: How the ‘Rules’ Have Changed”.

From there, you might…

…go to the proof itself: Seth Gilbert and Nancy Lynch’s “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services.”
…notice that both the “Perspectives on the CAP theorem” and “Twelve Years Later” pieces initially appeared in the same 2012 issue of IEE Computer, which, in fact, presents a CAP retrospective. A little digging might lead you to Daniel Abadi’s blog post on that issue and, from there, his work on PACELC.
…better contextualize CAP consistency (i.e., “atomic,” or “linearizable” consistency ) by reading more about consistency models in Doug Terry’s “Replicated Data Consistency Explained Through Baseball.”
…learn more about how systems respond to network partitions by watching and/or reading Kyle Kingsbury’s Jepsen series of talks and blog posts. A brief, more formal presentation of the first part of the Jepsen project is also available here.
…or follow any other path as your interest leads.

Notice that any of these options would help to contextualize the others and help improve your mental model of the distributed systems space. Also notice that any of these choices would present further connections to follow and open up still further perspectives. For instance, to follow up on CAP consistency, you might go read the original paper that defined linearizability, and that paper in turn might be a starting point for a new exploration.

As an exercise, try reading either Gilbert and Lynch’s “Perspectives on the CAP theorem” or Brewer’s “Twelve Years Later” a second time after reading three of four other articles, or even a good blog series, and see how much more you can get out of a second reading. Now we’re confused at a higher level!

Example start for a system-focused exploration: Riak as a Dynamo System

In this type of exploration, you’d pick a system of interest to you and work through the most relevant formal description(s) and the best documentation for at least one implementation. If possible, you’d do some work with the implementation yourself as part of the learning process. I’ve found this type of exploration invaluable in building an understanding of how algorithms and concepts come together in real systems.

We’ll take Riak as an example here. Riak is a highly available, eventually consistent distributed database descended from the influential Dynamo paper. Among Dynamo systems, Riak is a good choice for an independent study due to the relatively clear lines of development from Dynamo to Riak, the especially high quality of Basho’s documentation, and the open-source availability of the software.

In the context of these “starting points” posts, Riak is also a good choice since Mikito Takada’s fifth chapter provides an introduction to Dynamo as well as the CRDT research supporting the new data types soon to be available in Riak 2.0.

Here, you might start historically, with the Dynamo paper itself:

Giuseppe DeCandia et al., “Dynamo: Amazon’s Highly Available Key-Value Store,” in Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.

You might also start with Eric Redmond’s excellent Little Riak Book.

After those, I’d move to Basho’s Riak documentation, with special attention to the “Theory and Concepts” section.

I’d cap off my start by setting up a Riak cluster and client on my local machine.

Once you’ve made it this far, you’d be well positioned to either branch off into papers on topics like consistent hashing and CRDTs with a deeper sense of context, or move to a comparative study of another system. You could also dig into the Riak code itself.

Series Conclusion

“We have not succeeded in answering all our problems—indeed we sometimes feel we have not completely answered any of them. The answers we have found have only served to raise a whole set of new questions. In some ways we feel that we are as confused as ever, but we think we are confused on a higher level, and about more important things.”
— Earl C. Kelley, The Workshop Way of Learning (1951)

In these last few posts, I’ve described the primary processes through which I’ve built my own conceptual foundation in distributed systems. I know that I still have a lot to learn. While I certainly don’t claim to have mastered this material yet, I will claim to have achieved confusion at a higher level. The resources and model I’ve described here have helped me advance. I’ve taken the time to share them in the hope that they can help other independent learners too.

I encourage anyone reading this to do your own exploring and let one topic lead to another. Keep reading and thinking. You can source topics, systems, and articles from Mikito Takada’s book, Chris Meiklejohn’s “Readings in Distributed Systems” list or Think Distributed podcasts, and groups like the Distributed Systems Reading Group at MIT.

As a final exhortation, don’t forget the human element! Note the people writing the blog posts, books, articles, and code you read and follow them on Twitter. Join a mailing list or two. See if you can attend a relevant conference or meet-up and make some contacts in person. Reach out to people in general. I’ve personally found that most people in the distributed systems community can be quite kind and helpful to newcomers, especially if you’ve bothered to do some homework ahead of time. Distributed systems are hard, but you might be surprised by how many people out there are willing to help you find your way.

Starting Points in Distributed Systems, Pt. II: Orientation and a Survey

This is the second part of a three-part miniseries about building a conceptual foundation in distributed systems through independent study. In this series, I sketch out the map that I wish I’d had when I started studying last year, drawing from my own experience and currently available resources. My hope is that this informal guide will assist fellow students on their own journeys.

Orientation

To begin the next part of your study, I’d start with some definitions, commonly cited reference points, and a bit of history. This might seem like a lot to start with, but each of the following pieces is fairly brief and accessible.

For definitions, you might simply start with the Wikipedia entry on “Distributed Computing” and follow links from there as interest or comprehension demands. Note that the scope of distributed computing as a topic exceeds the scope of these “Starting Points” posts, as well as most resources that you’ll probably use in your self-education.

Continue building your working definition of a distributed system with the first chapter of Mikito Takada’s book, Distributed Systems for Fun and Profit. We’ll return to this book later.

To prime your thinking as you go deeper into this space, familiarize yourself with the “Fallacies of Distributed Computing” as Takada suggests in his book, and check out a few different treatments like those by Arnon Rotem-Gal-Oz (“Fallacies of Distributed Computing Explained”) or Brian Doll (“The Fallacies of Distributed Computing Reborn: The Cloud Era”).

Read Jeff Hodges’ emerging classic, “Notes on Distributed Systems for Young Bloods.”

Watch and/or read Michael Bernstein’s “Distributed Systems Archaeology” to help gain an appreciation for the history of the field, as well as inspiration for taking on your own extended reading project.

Surveying the Landscape

At this point, you should be sufficiently acclimatized to the definition, practice, and history of distributed computing to really benefit from a survey of the landscape. This next step will give you a framework with which to relate your subsequent, deeper trips into more specific topics and studies.

Here, I highly recommend reading through the entirety of Mikito Takada’s Distributed Systems for Fun and Profit. This is the integrative, overarching survey that I most wish I’d had when starting out. I strongly recommend reading (and re-reading) this book. Takada has many excellent citations and links to follow, but before diving into too many of those, I’d try to get a broad view by reading the book once through.

If, like me, you’re particularly interested in distributed databases, you might also like to read Pramod J. Sadalage and Martin Fowler’s helpful book, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence (ISBN 978-0321826626). Like Takada’s work, this is a high-level survey that can help you organize and relate the next phases of your learning. Try not to get caught up in often-unhelpful “NoSQL” label and be sure to remember that a system composed of clients communicating with even a single database server over a network is still a distributed system, whether that database is relational or not.

In the next post, I introduce more resources for continuing your journey.

Starting Points in Distributed Systems, Pt. I: Background

This is the first part of a three-part miniseries about building a conceptual foundation in distributed systems through independent study. In this series, I sketch out the map that I wish I’d had when I started studying last year, drawing from my own experience and currently available resources. My hope is that this informal guide will assist fellow students on their own journeys.

Background: my experience

During the second half of 2013 I undertook an independent study of distributed computer systems, with a particular focus on distributed databases. Like many others–or so I imagine–I started with a few conversations with a knowledgeable friend and a few citations to follow: Leslie Lamport’s “Time, Clocks, and the Ordering of Events in a Distributed System”, the Dynamo paper, and a recommended FAQ or two about something called the CAP theorem. Over the following months, I expanded my knowledge by branching outward from those starting points. I built a conceptual foundation in the field primarily through a process of reading papers, articles, blog posts, and product documentation, and by following the citation chains between sources. In addition to internal citations, I also found more personal direction through helpful fellow attendees of conferences and meet-ups, and more programmatic direction through university course websites and reading lists.

Although my academic training in another field prepared me to approach a new field by tracing webs of citations and joining networks of students, researchers, and teachers, I still found this to be an especially challenging process. In particular, I often felt a need for more integrative and intermediate resources for students. By volume, most of the resources on distributed systems that I found fell into one of two categories: i) documentation and marketing presentations of a particular software product or service, or ii) formal papers by academic researchers and/or industry practitioners. Working with resources of the first type, I had difficulty identifying broader contexts for particular systems, their relationships and interconnections, and their conceptual foundations. In papers of the second type, I struggled with unfamiliar formalisms and the high level of assumed background in computer science and mathematics. I found myself wanting an overarching structure to help integrate concepts and approaches across systems.

Since I started learning in the spring of 2013, new resources have been created to help students, both by simplifying the task of assembling and navigating the relevant literature, and by offering an accessible introduction to distributed systems as a field. To really engage the distributed systems literature, you’ll still need venture into the deep, dark, forest of formal papers for and by specialists, but a decent map or two can dramatically facilitate your progress.

In that spirit, the rest of this series describes where I would tell a friend–or my past self–to start today. I begin here, with a few prerequisites, continue next time with an orientation in, and survey of part of the distributed systems landscape, and conclude with a third post, in which I sketch out an exploratory method for continuing the journey.

Prerequisites, or a prelude

This may go without saying, but to make substantial progress towards understanding distributed systems, you’ll need at least a working knowledge of computer programming, systems, and networking.

For help building that knowledge in computer networking, I can personally recommend the Computer Networks course on Coursera offered by the University of Washington’s David Wetherall, Arvind Krishnamurthy, and John Zahorjan. This free course is taught “at the upper-undergraduate level.” I found the lectures, problem sets, and exams to be challenging but accessible. Since this course focuses in large part on the Internet and web applications, you’ll actually be learning quite a bit about distributed systems and some key algorithms already. In that sense, you might think of this, or a similar course, as a prelude to your more extended study.

If you push through to formal computer science papers in later parts of your studies, you’ll also want to be able to read and think with formal proofs and descriptions of algorithms. When I started, I wasn’t anywhere near the level I needed to be on to work through papers by researchers like Nancy Lynch and Leslie Lamport. I found that I could learn quite a bit from the introductory and concluding sections of papers, but I needed to do some substantial catch-up work in mathematics to be able to (start to) follow the meatier sections.

If you’re in a similar position, I can personally recommend Daniel J. Velleman’s How to Prove It: A Structured Approach (2nd ed., ISBN: 978-0521675994) on proofs and Tom Jenkyns and Ben Stephenson’s Fundamentals of Discrete Math for Computer Science: A Problem-Solving Primer (ISBN: 978-1447140689) on algorithms. Unlike the overwhelming majority of resources I’ll be mentioning in these posts, these aren’t free, but they are relatively inexpensive paperbacks that I consider to have been excellent investments.

In the next post, we’ll start to better orient ourselves in the distributed systems landscape and take on a survey.