Distributed Hash Table Part One

Introducing the theory of a DHT and creating a simple process ring.

In part one we introduce the concept of a distributed hash table and build a simple process ring. In part two we'll simulate a DHT with processes running in the same node. Finally, in part three we'll extend the process DHT to a full-blown, multi-node DHT.

Complete code for this tutorial is available in the GitHub repository.

What is a Distributed Hash Table?

Most developers are familiar with hash tables. Store a value, get a key. Look up a key, get a value. But what happens when you need to scale this idea to a second computer? How do you implement a hash table across multiple physical machines? Enter the distributed hash table, which has some intriguing characteristics that are impossible with a vanilla hash table:

- Decentralization: the keys and values are spread across many machines, with no central coordinator in charge of the table.
- Scalability: the system keeps operating efficiently as more and more nodes are added.
- Fault tolerance: nodes can join, leave, or fail without taking the whole table down.

These properties make DHTs an indispensable tool in modern software systems. They are the reason why BitTorrent can find files without using a centralized server, and how modern non-relational (so-called "NoSQL") data stores can achieve massive scale.

Consistent Hashing and the Keyspace

The central idea that allows DHTs to fulfill all of the goals above is consistent hashing within a finite keyspace. In the spirit of the humble stack, let's deal with those ideas 'last things first and first things last': the keyspace, then consistent hashing.

As with a regular hash table, the keys in a DHT are fixed-length hashes of the value being stored. A common choice for hashing would be a 160-bit SHA-1 hash, and that's what we will be using in this tutorial series. Given that we've chosen a 160-bit key, it follows that the total number of possible keys is 2^160. So, you can think of the keyspace as the set of all possible keys. All 2^160 of them. It should now be obvious that the keyspace must be finite in size, since we know exactly how many keys it contains. (Note: if you're familiar with linear algebra, the keyspace is somewhat akin to a finite vector space.)
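
As a purely illustrative aside, here is how such a key could be derived in Erlang, the language used for the code later in this series. The hash_key function is invented for this example and is not part of the tutorial code:

%% Derive a 160-bit key for a value. crypto:hash(sha, Data) returns a
%% 20-byte (160-bit) binary digest; decoding it as an unsigned integer
%% gives a number between 0 and 2^160 - 1, i.e. one point in the keyspace.
hash_key(Value) when is_binary(Value) ->
    binary:decode_unsigned(crypto:hash(sha, Value)).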

Since our data is stored across many nodes, one characteristic of a regular hash table that we do not want to borrow is its approach to key rebalancing, and this is where consistent hashing comes into play. When a bucket is added to a regular hash table, the bucket assignment of every stored key must be recomputed, which means that most of the keys and values get moved to different buckets. In a DHT, each physical node (machine) acts like a bucket in a regular hash table, so we must avoid such a costly rebalancing operation between machines that may be geographically very far apart. Fortunately, consistent hashing solves this problem.
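
To see why adding a bucket is so disruptive, consider the common scheme of placing a key in bucket "Key rem NumBuckets". This throwaway Erlang shell snippet (not part of the tutorial code) shows each key of a toy table alongside its bucket with 4 buckets and then with 5:

> [{Key, Key rem 4, Key rem 5} || Key <- lists:seq(0, 9)].
[{0,0,0},{1,1,1},{2,2,2},{3,3,3},{4,0,4},{5,1,0},{6,2,1},{7,3,2},{8,0,3},{9,1,4}]

Six of the ten keys land in a different bucket merely because the bucket count changed from 4 to 5. Scale that up to a full keyspace with real values attached, and it is exactly the kind of mass migration we cannot afford between distant machines.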

Each node in the system will be given responsibility for a contiguous subset of this keyspace. So, hypothetically, if we had n nodes, each node would be the authoritative location for 2^160/n keys. And which node is responsible for storing a given key? The first node whose ID value (more on this later) is greater than or equal to the key. Such a node is called the "successor node" of the key.

A Simple Example

To clarify the point, imagine the following small-scale scenario. Suppose the keys are the integers 0, 1, ..., 7, so the keyspace contains only those 8 possible keys (the equivalent of choosing a 3-bit hash instead of a 160-bit one). We can imagine the keyspace as a ring with the keys spaced around it clockwise, starting with 0 at the 12 o'clock position.

[Figure: the 8-key ring, with nodes at IDs 0, 1 and 3. Credit: http://www.cs.nyu.edu/courses/fall07/G22.2631-001/Chord.ppt]

As each node joins the network, it chooses a key at random to use as its ID. To locate the successor node responsible for a given key, we begin at the first node (with ID 0, in this case) and move clockwise around the ring until we reach a node whose ID is greater than or equal to the key value. (Of course, we must take care to handle the wrap-around point on the ring when moving past key 7 back to key 0: a key with no node ID at or above it belongs to the first node after the wrap.)

From the diagram it's easy to understand why rebalancing is so much easier under this scheme. Suppose node 1 leaves the network. To preserve its data, key 1 would be moved to node 3, but the keys already stored at nodes 0 and 3 would be unaffected. Likewise, if we added a node with ID 6, then key 6 would have to be moved from node 0 to node 6, but the keys stored on nodes 1 and 3 would be unaffected. This is far superior to the rebalancing situation in a conventional hash table, where virtually all of the keys would be moved between buckets.
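
To make the lookup rule concrete, here is a small helper written just for this article (it is not part of the tutorial code): given a key and a sorted list of node IDs, it returns the successor node, handling the wrap-around case.

%% Find the successor node for Key on a ring whose node IDs are given as a
%% sorted list. If no node ID is greater than or equal to Key, we have
%% wrapped past the top of the ring and the smallest ID is the successor.
successor(Key, SortedNodeIds) ->
    case [Id || Id <- SortedNodeIds, Id >= Key] of
        [Successor | _] -> Successor;
        []              -> hd(SortedNodeIds)
    end.

With the nodes from the diagram, successor(2, [0, 1, 3]) returns 3 and successor(6, [0, 1, 3]) wraps around to 0, which is exactly why key 6 lives on node 0 until a node with ID 6 joins the ring.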

This idea of locating the successor node by comparing node IDs with keys (possible because both are hashes in the same keyspace) is the central concept behind a working DHT. Virtually all of the operations - joining the ring, storing values, finding values, replication, etc. - depend on being able to locate the successor node of a given key.

Implementation Baby Steps

Even with all of this theory in hand, building a full-blown distributed hash table is a confusing affair. Rather than diving right in, we're instead going to build up a DHT in three stages, starting with something that seems kind of unrelated: a very simple ring of processes.

In terms of architecture, this process ring will essentially be a circular doubly-linked list. After spawning some processes, we'll link each process with its neighbours, called Next and Prev, making sure to link the first process and last process together in the same way (much like a dog chasing its own tail).

When the ring is linked-up and ready to go, we will then pass messages around the ring in either direction (clockwise or anti-clockwise) and use a counter to expire the message after it has been handled by a set number of processes in the ring.

Let's look at the code, starting with the obvious stuff: spawning the processes.
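
The complete module is in the repository; the snippets that follow are a sketch of how it might be written, reconstructed from the descriptions below and from the shell session at the end of this post. In particular, the loop/2 function, the set_next/set_prev messages and the undefined placeholders are illustrative choices rather than a verbatim copy of the real code.

-module(ring).
-export([start/1, send_cw/3, send_anti_cw/3]).

%% Spawn the requested number of processes, link each process to its
%% neighbours, then close the ring by linking the last process back to
%% the first. Returns the pid of the first process in the ring.
start(NumProcesses) ->
    Pids = [spawn(fun() -> loop(undefined, undefined) end)
            || _ <- lists:seq(1, NumProcesses)],
    link_all(Pids),
    First = hd(Pids),
    Last = lists:last(Pids),
    link(Last, First),
    First.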

The only argument to the start/1 function is the number of processes to create in the ring. A list comprehension is used to spawn the desired number of processes, then we link each process to its neighbours with link_all/1 and join the ends of the list together with the call link(Last, First).

Linking any two adjacent processes together is as simple as sending each a message indicating that the other process is a neighbour.
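
Here's a sketch of link/2 (the set_next and set_prev message names are assumed):

%% Tell Pid1 that Pid2 is its next neighbour, and Pid2 that Pid1 is its
%% previous neighbour.
link(Pid1, Pid2) ->
    Pid1 ! {set_next, Pid2},
    Pid2 ! {set_prev, Pid1},
    ok.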

And linking all the processes in a list (except for linking the first and last process) simply works through the list two items at a time, calling link/2.
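
A sketch of link_all/1, in the same vein:

%% Walk the list pairwise, linking each process to the one that follows it.
%% The first and last processes are joined separately in start/1.
link_all([Pid1, Pid2 | Rest]) ->
    link(Pid1, Pid2),
    link_all([Pid2 | Rest]);
link_all(_) ->
    ok.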

At this point we've got a ring of processes, each of which can communicate with its neighbours. The server loop in which each process waits is very simple. The patterns that handle setting the neighbour references and stopping the process don't warrant much attention. Instead, let's look at how messages get passed around the ring.
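
Below is a sketch of the server loop. The shape of the relay tuple is taken from the shell session at the end of the post; the rest is an illustrative reconstruction.

%% Each process waits here with references to its Next and Prev neighbours.
loop(Next, Prev) ->
    receive
        {set_next, Pid} ->
            loop(Pid, Prev);
        {set_prev, Pid} ->
            loop(Next, Pid);
        %% The message has already been relayed NumRelays times: expire it.
        {relay, _Direction, NumRelays, Count, Msg} when Count > NumRelays ->
            io:format("~p Dropping message ~p after ~p relays.~n",
                      [self(), Msg, NumRelays]),
            loop(Next, Prev);
        %% Otherwise print a trace, pick a neighbour based on the direction,
        %% and pass the message along with an incremented relay counter.
        {relay, Direction, NumRelays, Count, Msg} ->
            io:format("~p Relaying ~p ~p~n", [self(), Count, Msg]),
            Target = case Direction of
                         clockwise -> Next;
                         anti_clockwise -> Prev
                     end,
            Target ! {relay, Direction, NumRelays, Count + 1, Msg},
            loop(Next, Prev);
        stop ->
            ok
    end.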

When a message has been handled, or 'relayed', at least NumRelays times, we print a notice of the message's expiry and discard it. For a message that has not yet expired, the relaying process selects its Next or Prev neighbour, depending on the desired Direction, and then passes the message along to that process (making sure to increment the relay counter).
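
The send_cw/3 and send_anti_cw/3 helpers just wrap the message in a relay tuple, with the counter starting at 1, and deliver it to the given process - again, a sketch whose tuple format is pinned down by the shell session below:

%% Start a message traveling clockwise or anti-clockwise around the ring,
%% to be dropped after NumRelays hops.
send_cw(Msg, Pid, NumRelays) ->
    Pid ! {relay, clockwise, NumRelays, 1, Msg}.

send_anti_cw(Msg, Pid, NumRelays) ->
    Pid ! {relay, anti_clockwise, NumRelays, 1, Msg}.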

We can start a message traveling around the ring by calling send_cw/3 or send_anti_cw/3. Here's how it looks in practice:

> c(ring).
> FirstProcess = ring:start(5).
> ring:send_anti_cw("Hi", FirstProcess, 10).
{relay,anti_clockwise,10,1,"Hi"}
<0.38.0> Relaying 1 "Hi"
<0.42.0> Relaying 2 "Hi"
<0.41.0> Relaying 3 "Hi"
<0.40.0> Relaying 4 "Hi"
<0.39.0> Relaying 5 "Hi"
<0.38.0> Relaying 6 "Hi"
<0.42.0> Relaying 7 "Hi"
<0.41.0> Relaying 8 "Hi"
<0.40.0> Relaying 9 "Hi"
<0.39.0> Relaying 10 "Hi"
<0.38.0> Dropping message "Hi" after 10 relays.

This process ring may seem overly simple, but it's a critical step in creating a functioning distributed hash table. In the next tutorial we're going to build a simulated DHT. It will have the store and lookup features of a DHT, but rather than being distributed across physical machines, it will use a process ring similar to this one to simulate a network, with each process acting as a separate DHT node. In the third tutorial we'll take the features from that simulated DHT and package them into a system that can be deployed across multiple Erlang nodes and physical machines to work as a complete distributed hash table.