Distributed Consensus Algorithms

Philip Rehberger · Jan 29, 2026 · 7 min read

Understand how distributed systems agree on state. Learn Paxos, Raft, and their practical applications.


Distributed consensus algorithms enable multiple nodes to agree on values despite failures and network partitions. This agreement is fundamental to distributed systems: electing leaders, committing transactions, and maintaining replicated state all require consensus. Understanding these algorithms helps you choose and configure distributed systems appropriately.

The consensus problem sounds simple: get all nodes to agree on a value. But in distributed systems, nodes can fail, messages can be lost or delayed, and network partitions can split the cluster. Algorithms must reach agreement despite these failures while remaining available and making progress.

The Challenge of Consensus

The FLP impossibility result proves that no deterministic consensus algorithm can guarantee termination in an asynchronous system where even one node might crash: safety can always be preserved, but progress cannot be assured. This theoretical limit shapes all practical consensus algorithms.

Practical systems work around FLP using timeouts to detect failures, randomization to break symmetry, and accepting that consensus might not always succeed. The goal becomes making consensus highly probable rather than guaranteed.

Consensus algorithms must satisfy three properties. Agreement means all nodes that decide choose the same value. Validity means the decided value was proposed by some node. Termination means all non-faulty nodes eventually decide. Balancing these properties against failure tolerance defines each algorithm's tradeoffs.

Paxos

Paxos, developed by Leslie Lamport, was the first proven consensus algorithm for asynchronous systems. Despite its historical importance and theoretical elegance, Paxos is notoriously difficult to understand and implement correctly.

Paxos divides nodes into proposers, acceptors, and learners. A value is chosen when a majority of acceptors accept it. The algorithm proceeds in two phases: prepare and accept.

In the prepare phase, a proposer selects a proposal number and sends prepare requests to acceptors. Acceptors promise not to accept proposals with lower numbers and report any values they've already accepted.

In the accept phase, if the proposer receives promises from a majority, it sends accept requests with the highest-numbered value it received (or its own value if none). Acceptors accept unless they've promised to a higher proposal number. The following diagram illustrates this message flow between a proposer and the acceptor nodes.

Proposer                    Acceptors
    |                           |
    |--- Prepare(n) ----------->|  Phase 1
    |<-- Promise(n, v?) --------|
    |                           |
    |--- Accept(n, v) --------->|  Phase 2
    |<-- Accepted(n, v) --------|
    |                           |
    |--- Decided(v) ----------->|  Notify learners

Multi-Paxos extends basic Paxos for multiple values by electing a stable leader who can skip the prepare phase for subsequent proposals. This significantly improves performance for sequences of values.
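The acceptor side of the protocol can be sketched in a few lines of Go. This is a toy single-value model with illustrative names, not a production Paxos: an acceptor tracks the highest proposal number it has promised and any value it has accepted.

```go
package main

import "fmt"

// Acceptor is a minimal sketch of the Paxos acceptor role
// (field names are illustrative).
type Acceptor struct {
	promisedN int    // highest proposal number promised
	acceptedN int    // proposal number of the accepted value, 0 if none
	acceptedV string // accepted value, if any
}

// Prepare handles a Phase 1 request: promise not to accept
// lower-numbered proposals, and report any value already accepted.
func (a *Acceptor) Prepare(n int) (ok bool, acceptedN int, acceptedV string) {
	if n <= a.promisedN {
		return false, 0, ""
	}
	a.promisedN = n
	return true, a.acceptedN, a.acceptedV
}

// Accept handles a Phase 2 request: accept unless a higher-numbered
// proposal has been promised in the meantime.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promisedN {
		return false
	}
	a.promisedN = n
	a.acceptedN = n
	a.acceptedV = v
	return true
}

func main() {
	a := &Acceptor{}
	ok, _, _ := a.Prepare(1)
	fmt.Println(ok) // true: first promise
	fmt.Println(a.Accept(1, "x")) // true: accepted
	ok, prevN, prevV := a.Prepare(2)
	fmt.Println(ok, prevN, prevV) // true 1 x: reports the prior acceptance
	fmt.Println(a.Accept(1, "y")) // false: superseded by the promise for 2
}
```

The key behavior is in the second `Prepare`: a later proposer learns about the already-accepted value and must re-propose it, which is what prevents two different values from being chosen.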

Raft

Raft was designed as an understandable alternative to Paxos. It achieves the same consensus guarantees but with clearer separation of concerns: leader election, log replication, and safety.

In Raft, one node is the leader; all others are followers. Clients send requests to the leader, who appends them to its log and replicates to followers. Once a majority acknowledges, the entry is committed. This flow creates a clear, linear path that's easier to reason about than Paxos.

Client          Leader          Followers
  |               |                 |
  |-- Request --->|                 |  Client request
  |               |-- AppendEntries>|  Replicate log entry
  |               |<-- Success -----|  Majority acknowledges
  |               |                 |
  |<-- Response --|   [committed]   |  Entry committed
  |               |                 |
  |               |-- AppendEntries>|  Notify followers of commit

Leader election occurs when followers don't hear from the leader within an election timeout. A candidate requests votes from other nodes. The first to receive a majority becomes leader. Randomized timeouts prevent split votes.

Raft's log matching property ensures consistency. If two logs contain an entry with the same index and term, all preceding entries are identical. This allows simple consistency checking during replication.

The following Go code shows a simplified Raft node handling AppendEntries RPCs. You can see how the node validates the term, checks log consistency, and updates its state accordingly.

// Simplified Raft state. The supporting types (NodeID, LogEntry, and
// the RPC structs) are sketched here for completeness.
type NodeID string

type LogEntry struct {
    Term    int
    Command []byte
}

type AppendEntriesArgs struct {
    Term         int
    LeaderID     NodeID
    PrevLogIndex int
    PrevLogTerm  int
    Entries      []LogEntry
    LeaderCommit int
}

type AppendEntriesReply struct {
    Term    int
    Success bool
}

type RaftNode struct {
    // Persistent state (must survive restarts)
    currentTerm int
    votedFor    *NodeID
    log         []LogEntry

    // Volatile state (can be reconstructed)
    commitIndex int
    lastApplied int

    // Leader-only state
    nextIndex  map[NodeID]int
    matchIndex map[NodeID]int
}

func (n *RaftNode) becomeFollower() {
    // Revert to follower role; details omitted in this sketch.
}

func (n *RaftNode) AppendEntries(args AppendEntriesArgs) AppendEntriesReply {
    // Reject RPCs from stale leaders
    if args.Term < n.currentTerm {
        return AppendEntriesReply{Term: n.currentTerm, Success: false}
    }

    // Step down if the leader's term is newer
    if args.Term > n.currentTerm {
        n.currentTerm = args.Term
        n.votedFor = nil
        n.becomeFollower()
    }

    // Log consistency check: the entry preceding the new ones must match
    if args.PrevLogIndex > 0 {
        if len(n.log) < args.PrevLogIndex ||
            n.log[args.PrevLogIndex-1].Term != args.PrevLogTerm {
            return AppendEntriesReply{Term: n.currentTerm, Success: false}
        }
    }

    // Append new entries. (Simplified: the full algorithm truncates only
    // at the first conflicting entry, which matters for stale RPCs.)
    n.log = append(n.log[:args.PrevLogIndex], args.Entries...)

    // Advance the commit index, but never past the end of our own log
    if args.LeaderCommit > n.commitIndex {
        n.commitIndex = min(args.LeaderCommit, len(n.log))
    }

    return AppendEntriesReply{Term: n.currentTerm, Success: true}
}

Notice how the node separates persistent state (must survive restarts) from volatile state (can be reconstructed). The persistent state ensures correctness across failures, while volatile state optimizes performance.

Practical Implementations

Real systems use these algorithms in various forms. etcd uses Raft for Kubernetes cluster state. ZooKeeper uses Zab (ZooKeeper Atomic Broadcast), a protocol similar to Paxos. Consul uses Raft for its key-value store and service catalog.

Configuring consensus systems involves tradeoffs. Cluster size affects fault tolerance and performance. A 3-node cluster tolerates 1 failure; 5 nodes tolerate 2. More nodes increase fault tolerance but slow consensus due to additional communication.
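The arithmetic behind those numbers is simple: a majority is ⌊n/2⌋+1 nodes, so the cluster tolerates ⌊(n−1)/2⌋ failures.

```go
package main

import "fmt"

// quorum returns the majority size for an n-node cluster.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many node failures an n-node cluster survives.
func faultTolerance(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 4, 5, 7} {
		fmt.Printf("%d nodes: quorum=%d, tolerates %d failure(s)\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

Note that 4 nodes tolerate only 1 failure, the same as 3, which is why consensus clusters are almost always sized at odd numbers: an even-sized cluster pays extra communication cost without gaining fault tolerance.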

When configuring a consensus-based system like etcd, you'll need to balance timing parameters carefully. The heartbeat interval and election timeout directly affect how quickly the cluster detects failures and elects new leaders.

# etcd cluster configuration (etcd.conf.yml)
name: node1
initial-cluster: node1=https://node1:2380,node2=https://node2:2380,node3=https://node3:2380
initial-cluster-state: new
initial-cluster-token: my-cluster

# Timing parameters (milliseconds) affect the availability/consistency tradeoff
heartbeat-interval: 100
election-timeout: 1000

Heartbeat and election timeouts affect responsiveness and stability. Short timeouts detect failures quickly but risk false positives in slow networks. Long timeouts are more stable but slower to recover from actual failures.

CAP Theorem Implications

The CAP theorem states that during a network partition, a distributed system must choose between consistency and availability; the popular "pick two of three" framing is a simplification, since partition tolerance is not optional in a real network. Consensus algorithms choose consistency over availability during partitions.

When a network partition occurs, nodes in the minority partition cannot reach consensus because they can't contact a majority. They become unavailable rather than serving potentially inconsistent data. The following diagram shows how a partition affects cluster behavior.

Before partition:
[A]---[B]---[C]---[D]---[E]
     (5 nodes, majority = 3)

After partition:
[A]---[B]  |  [C]---[D]---[E]
(2 nodes)  |  (3 nodes = majority)

Right side can still reach consensus.
Left side cannot - it lacks majority.

This behavior is intentional. Consensus systems prefer unavailability to inconsistency. Applications must handle the case where consensus operations fail during partitions.
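One common pattern is to fail fast when quorum is lost, so callers can retry or degrade gracefully rather than block indefinitely. The sketch below is a toy model, not any particular system's API; the names and the reachability bookkeeping are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrNoQuorum = errors.New("cannot reach majority; refusing to serve")

// Cluster is a toy model of one node's view of the cluster.
type Cluster struct {
	size      int // total nodes in the cluster
	reachable int // nodes this node can currently contact, including itself
}

// canDecide reports whether this node's partition holds a majority.
func (c *Cluster) canDecide() bool {
	return c.reachable >= c.size/2+1
}

// Read serves a strongly consistent read only when quorum holds;
// otherwise it fails fast instead of returning possibly stale data.
func (c *Cluster) Read(key string) (string, error) {
	if !c.canDecide() {
		return "", ErrNoQuorum
	}
	return "value-for-" + key, nil // placeholder for the real lookup
}

func main() {
	minority := &Cluster{size: 5, reachable: 2}
	if _, err := minority.Read("x"); err != nil {
		fmt.Println(err) // cannot reach majority; refusing to serve
	}
	majority := &Cluster{size: 5, reachable: 3}
	v, _ := majority.Read("x")
	fmt.Println(v) // value-for-x
}
```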

Performance Considerations

Consensus performance depends on network latency, disk I/O, and cluster size. Each operation requires communication with a majority of nodes. In a 5-node cluster spread across three data centers, every operation pays the round-trip latency needed to assemble a majority, which necessarily includes at least one remote data center.

Write operations require disk syncs for durability. The leader must persist the log entry before sending to followers. Followers must persist before acknowledging. These syncs dominate latency in low-latency networks.

Batching improves throughput by amortizing the fixed overhead of consensus across multiple operations. Instead of one consensus round per request, you can batch requests and perform a single consensus round for the entire batch.

// Batching improves throughput. Request, Result, and ConsensusModule
// stand in for the application's own types.
type requestWithResult struct {
    req    Request
    result chan Result
}

type Batcher struct {
    mu      sync.Mutex
    pending []requestWithResult
    timer   *time.Timer

    maxBatch  int
    maxWait   time.Duration
    consensus *ConsensusModule // underlying consensus layer
}

func (b *Batcher) Submit(req Request) <-chan Result {
    result := make(chan Result, 1)

    b.mu.Lock()
    defer b.mu.Unlock()
    b.pending = append(b.pending, requestWithResult{req, result})

    // Flush when the batch is full, or arm a timer so a lone request
    // is never delayed longer than maxWait
    if len(b.pending) >= b.maxBatch {
        b.flushLocked()
    } else if b.timer == nil {
        b.timer = time.AfterFunc(b.maxWait, func() {
            b.mu.Lock()
            defer b.mu.Unlock()
            b.flushLocked()
        })
    }

    return result
}

// flushLocked must be called with b.mu held.
func (b *Batcher) flushLocked() {
    if len(b.pending) == 0 {
        return
    }
    batch := b.pending
    b.pending = nil
    if b.timer != nil {
        b.timer.Stop()
        b.timer = nil
    }

    // Single consensus round for all requests in the batch
    b.consensus.Propose(batch)
}

The tradeoff is slightly increased latency for individual requests, since each may wait up to maxWait for its batch to fill, in exchange for substantially higher overall throughput.

Conclusion

Distributed consensus algorithms enable agreement despite failures. Paxos provided the theoretical foundation. Raft made consensus understandable and implementable. Both achieve the same guarantees with different approaches.

Consensus systems choose consistency over availability during partitions. They require majority agreement, limiting fault tolerance to less than half the nodes. Performance depends on network latency, disk speed, and cluster size.

Understanding consensus helps you configure systems like etcd, ZooKeeper, and Consul appropriately. Choose cluster sizes that match your fault tolerance needs. Set timeouts that balance responsiveness and stability. Accept that consensus operations may fail during partitions and design applications accordingly.
