Thoughts on Systems

Emil Sit

Mar 1, 2009 - 3 minute read - Research Technology chord dht implementation

On implementing Chord

The Chord protocol dynamically constructs robust and scalable overlay networks that map a given key to an active node. The MIT PDOS Chord implementation has served as a reference implementation of Chord, and over the years has accumulated many tweaks and improvements. While the theoretical highlights have largely been documented in our publications, the implementation details may have been omitted. Given the relative multitude of re-implementations I know of, it seemed worthwhile to try and identify some of these lessons instead of leaving them to bit-rot inside our implementation. A basic understanding of Chord is assumed. It is also important to realize that MIT Chord was largely run over the wide-area, not in a data-center.

MIT Chord is implemented in C++ using libasync. We did not like Java because (at least a few years ago) PlanetLab did not make it easy to push a JRE to all your nodes and would also kill processes with large memory footprints (which seemed to often be Java VMs). So, using C++ made deployment to PlanetLab easier and the code more efficient. Today, I might use the STL and select bits from Boost, which would make the implementation accessible to a broader audience. I’d probably also use threads.

Because of its libasync roots, MIT Chord uses XDR and SunRPC for its inter-node communcation. It would be better today to use something like Protocol Buffers or Thrift; the front-end should definitely support XML-RPC for maximum ease-of-application development. Other than NFS, basically no one uses SunRPC.

Chord uses its own transport called the Striped Transport Protocol (STP). Read the section, think about the problem, decide if STP is right for you. For actual data transfers, depending on your object size, it may be simpler to go with TCP. UDP without flow-control may be just fine for the routing RPCs, though managing retransmission timers and mapping it to node liveness will be lots of fun. Also read about Accordion’s approach to this issue.

Much of a DHT implementation revolves around keeping track of other nodes (e.g., your neighbors). The Chord paper describes the abstract successor list and finger table; our implementation keeps a bucket of nodes (the locationtable) that calculates successors and fingers on the fly from this table. The Chord stabilization protocol runs periodically to keep this bucket filled with relevant live nodes. This approach also let us easily implement Accordion in our codebase. A single reference-counted object is used to proxy remote nodes: this keeps all state about a remote node consistent, even if it is part of multiple in-flight operations.

Nodes often fail and MIT Chord works hard to stop using failed nodes, despite some Chord operations that pass node lists around, by tagging the node wire representation with the time since the node was last contacted directly. If node A has discovered for itself that node B has failed, it will ignore any mention of B from its neighbors that have not directly heard from B more recently. (This doesn’t quite work right in the case of asymmetrically reachable nodes, but it is unclear what to do with such nodes anyway.)

Definitely include tools to view a live node’s state. If you’re feeling fancy, your nodes will speak HTTP on a special port and produce pretty graphics (MIT Chord doesn’t). This will significantly improve your ability to debug weird routing issues (remotely).

These thoughts are a bit haphazard but hopefully they contain some tidbits of interest. I’ve actually had relatively little interaction with other implementors of Chord so if you’re one of them, I’m curious how your experiences compare.