# Distributed Algorithms

## Problem framing
Distributed databases need routing stability, fast writes, bounded memory, and safe coordination under failure. No single algorithm solves all of this. Production systems combine probabilistic data structures, storage layouts, and consensus primitives to balance correctness, latency, and cost.
This page explains every algorithm listed in `../ddia-learning-app/ddia_app/algorithms` and how they compose into end-to-end data paths.
## Core idea / pattern
| Algorithm | Primary role | Typical placement | Cross-link |
|---|---|---|---|
| Consistent Hashing | Stable key-to-node routing with minimal remapping | Shard routers, partition maps | Storage architecture |
| Vector Clocks | Causality tracking for concurrent updates | Multi-writer replicas and conflict resolution | Consistency models |
| Quorum Replication | Read/write consistency tuning via N, R, W | Replicated key-value and document stores | Replication patterns |
| LSM Tree | Write-optimized storage with compaction | High-ingest NoSQL engines | Database internals |
| Raft Election | Leader election by majority vote | Metadata services and control planes | Coordination and commit |
| Bloom Filter | Probabilistic membership test to avoid wasted reads | LSM read path and cache admission | Storage engine internals |
| HyperLogLog | Approximate distinct counting with fixed memory | Cardinality estimation and analytics | Stream and batch analytics |
## Architecture diagram

```mermaid
flowchart LR
    Client[Client] --> Router[Consistent Hash Router]
    Router --> ReplicaA[Replica Set A]
    Router --> ReplicaB[Replica Set B]
    ReplicaA --> Quorum[Quorum Read and Write]
    ReplicaB --> Quorum
    Quorum --> LSM[LSM Storage Engine]
    LSM --> Bloom[Bloom Filter]
    LSM --> HLL[HyperLogLog Counters]
    Quorum --> VC[Vector Clock Metadata]
    Control[Control Plane] --> Raft[Raft Leader Election]
    Raft --> Router
```
## Step-by-step flow
- Route the key with consistent hashing so scaling events remap only a small key subset.
- Write to a replica subset and require quorum acknowledgments for target consistency.
- Persist writes into an LSM memtable, then flush immutable SSTables and compact in background.
- Use Bloom filters during reads to skip SSTables that definitely do not contain the key.
- Attach vector clock metadata when concurrent writes are possible across regions.
- Use Raft election for metadata leadership so membership and configuration updates stay coordinated.
- Track large-scale distinct metrics with HyperLogLog instead of exact sets.
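The routing step above can be sketched with a minimal hash ring. This is an illustrative Python sketch, not the app's implementation: the `ConsistentHashRing` name, the SHA-256-derived 64-bit ring positions, and the vnode count are all assumptions.

```python
import bisect
import hashlib

def _hash64(key: str) -> int:
    # Illustrative choice: first 8 bytes of SHA-256 as a 64-bit ring position.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Hash ring with virtual nodes; a key belongs to the first vnode clockwise."""

    def __init__(self, vnodes: int = 64):
        self.vnodes = vnodes
        self.ring = []  # sorted (position, node) pairs

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash64(f"{node}#vn{i}"), node))

    def lookup(self, key: str) -> str:
        pos = _hash64(key)
        # First vnode at or after the key's position, wrapping past the top.
        idx = bisect.bisect_left(self.ring, (pos, ""))
        if idx == len(self.ring):
            idx = 0
        return self.ring[idx][1]

# Adding a node remaps only the keys the new node now owns.
ring = ConsistentHashRing()
for node in ("node-a", "node-b", "node-c"):
    ring.add_node(node)
before = {f"key-{i}": ring.lookup(f"key-{i}") for i in range(1000)}
ring.add_node("node-d")
moved = [k for k in before if ring.lookup(k) != before[k]]
```

With four nodes, roughly a quarter of the keys move, and every moved key lands on `node-d`; a modulo-based scheme would instead reshuffle most of the keyspace.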
## Visual reference: key routing and quorum

```mermaid
sequenceDiagram
    participant C as Client
    participant R as Hash Router
    participant P1 as Replica 1
    participant P2 as Replica 2
    participant P3 as Replica 3
    C->>R: write(key, value)
    R->>P1: route by consistent hash
    R->>P2: replicate
    R->>P3: replicate
    P1-->>R: ack
    P2-->>R: ack
    R-->>C: write success (W quorum reached)
```
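The quorum arithmetic behind the diagram fits in a few lines. A minimal sketch; the function names are illustrative:

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """R + W > N forces every read quorum to intersect every write quorum."""
    return r + w > n

def write_succeeds(acks: int, w: int) -> bool:
    """The write is acknowledged to the client once W replicas accept it."""
    return acks >= w

# N=3, R=2, W=2, as in the diagram: two acks commit the write,
# and any later 2-replica read must include at least one fresh copy.
strong = quorums_overlap(3, 2, 2)   # True
weak = quorums_overlap(3, 1, 1)     # False: read and write quorums can miss
committed = write_succeeds(2, 2)    # two acks meet W=2
```

This is why the sequence above returns success after two acks: the third replica can catch up later via read repair or anti-entropy without violating the R + W > N guarantee.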
## Visual reference: LSM read with Bloom filter

```mermaid
flowchart LR
    Read[Read key] --> Mem[Check memtable]
    Mem --> Bloom1[Bloom filter SSTable L0]
    Bloom1 -->|No| Skip1[Skip table]
    Bloom1 -->|Maybe| Probe1[Probe SSTable]
    Probe1 --> Bloom2[Bloom filter SSTable L1]
    Bloom2 -->|No| Skip2[Skip table]
    Bloom2 -->|Maybe| Probe2[Probe SSTable]
    Probe2 --> Result[Return newest value]
```
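The read path above can be sketched by pairing each flushed table with a Bloom filter. Everything here is an illustrative assumption, not a production engine: `BloomFilter` and `TinyLSM` are hypothetical names, the double-hashing scheme and the 4-entry memtable limit are chosen for brevity, and compaction is omitted.

```python
import hashlib

class BloomFilter:
    """Bitset with k probes: 'absent' is definite, 'present' may be false."""

    def __init__(self, m_bits: int = 8192, k: int = 4):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k bit positions via double hashing over one SHA-256 digest.
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))

class TinyLSM:
    """Mutable memtable plus flushed, immutable tables guarded by Bloom filters."""

    def __init__(self, memtable_limit: int = 4):
        self.memtable = {}
        self.sstables = []  # newest last: (bloom, sorted dict)
        self.limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self) -> None:
        bloom = BloomFilter()
        for k in self.memtable:
            bloom.add(k)
        self.sstables.append((bloom, dict(sorted(self.memtable.items()))))
        self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        for bloom, table in reversed(self.sstables):  # newest table first
            if not bloom.might_contain(key):
                continue  # definitely absent: skip the probe entirely
            if key in table:
                return table[key]
        return None
```

The `continue` on a negative filter answer is the whole point of the diagram: a miss costs a few hash computations instead of a disk probe per SSTable.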
## Visual reference: Raft election

```mermaid
sequenceDiagram
    participant N1 as Node 1
    participant N2 as Node 2
    participant N3 as Node 3
    participant N4 as Node 4
    participant N5 as Node 5
    N1->>N2: RequestVote(term=9)
    N1->>N3: RequestVote(term=9)
    N1->>N4: RequestVote(term=9)
    N1->>N5: RequestVote(term=9)
    N2-->>N1: vote granted
    N3-->>N1: vote granted
    N4-->>N1: vote granted
    N1-->>N1: majority reached, leader elected
```
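The vote-counting rule in the diagram can be sketched as follows. `RaftNode` and `run_election` are hypothetical names, and the sketch deliberately omits Raft's log up-to-date check, election timeouts, and heartbeats; it shows only the one-vote-per-term and strict-majority rules.

```python
class RaftNode:
    """Follower-side vote state: at most one vote granted per term."""

    def __init__(self):
        self.current_term = 0
        self.voted_for = None

    def request_vote(self, term: int, candidate: str) -> bool:
        if term < self.current_term:
            return False  # stale candidate from an old term
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None  # a new term resets the vote
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False

def run_election(candidate: str, term: int, peers) -> bool:
    # The candidate votes for itself, then needs a strict cluster majority.
    votes = 1 + sum(p.request_vote(term, candidate) for p in peers)
    return votes > (len(peers) + 1) // 2
```

In the 5-node diagram, three grants plus the candidate's self-vote give 4 of 5: a majority, so even one silent node cannot block the election. Because each node grants at most one vote per term, two candidates can never both win the same term.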
## Algorithm deep dive
| Algorithm | Problem | Pattern | Trade-offs | Failure modes |
|---|---|---|---|---|
| Consistent Hashing | Node adds/removals causing full key reshuffles | Hash ring with virtual nodes and clockwise ownership | Low remap cost vs imperfect balance without enough virtual nodes | Hot partitions from skewed keys or poor vnode configuration |
| Vector Clocks | Concurrent multi-writer updates with unclear causal order | Per-writer counters merged by max and compared by dominance | Causality clarity vs metadata growth with writer count | Unbounded clock size and frequent conflict branches |
| Quorum Replication | Balancing availability and consistency across replicas | Tunable N/R/W with read-repair and anti-entropy | Flexible guarantees vs quorum latency and partial failures | Stale reads when quorum math is weak or repairs lag |
| LSM Tree | High write rates overwhelm random-write storage | Memtable + immutable SSTables + background compaction | Excellent ingest vs read/write amplification from compaction | Compaction backlog, tombstone buildup, and tail-latency spikes |
| Raft Election | No safe coordinator for metadata and critical decisions | Term-based majority voting to elect one leader | Strong leadership safety vs election pauses during instability | Leader flapping and progress stalls under repeated timeouts |
| Bloom Filter | Too many negative disk lookups on read paths | Bitset with multiple hashes for probabilistic membership tests | Tiny memory footprint vs false positives | Bad sizing causes high false-positive rates and wasted I/O |
| HyperLogLog | Exact distinct counts are too expensive at scale | Register-based cardinality estimation from hash leading zeros | Fixed memory vs approximate results | Using low precision where exact billing-grade counts are required |
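The vector-clock pattern in the table above, per-writer counters merged by element-wise max and compared by dominance, can be sketched directly. The function names are illustrative:

```python
def vc_increment(clock: dict, writer: str) -> dict:
    """Bump this writer's counter before it issues a write."""
    out = dict(clock)
    out[writer] = out.get(writer, 0) + 1
    return out

def vc_merge(a: dict, b: dict) -> dict:
    """Element-wise max: the smallest clock that dominates both inputs."""
    return {w: max(a.get(w, 0), b.get(w, 0)) for w in a.keys() | b.keys()}

def vc_compare(a: dict, b: dict) -> str:
    """'before', 'after', 'equal', or 'concurrent' (neither dominates)."""
    keys = a.keys() | b.keys()
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

A `"concurrent"` result is exactly the conflict case: neither update saw the other, so the store must keep both siblings or apply an application-level merge. Note the metadata-growth trade-off from the table: the dict gains one entry per distinct writer.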
## Failure modes
- Consistent hashing with too few virtual nodes creates uneven shard utilization.
- Quorum settings with `R + W <= N` allow stale reads after acknowledged writes.
- LSM compaction debt causes read amplification and tail-latency spikes.
- Bloom filters sized too small increase false positives and wasted disk probes.
- Vector clocks grow in size with many writers and can increase conflict metadata overhead.
- Raft leader flapping under unstable networks can stall control-plane progress.
- HyperLogLog misuse for low-cardinality sets can produce avoidable estimation error.
## Trade-offs
| Algorithm | What it optimizes | What it gives up |
|---|---|---|
| Consistent Hashing | Minimal remapping during rebalancing | Perfectly even distribution without tuning |
| Vector Clocks | Causality visibility | Metadata size and merge complexity |
| Quorum Replication | Tunable consistency/availability balance | Higher latency at stronger quorum levels |
| LSM Tree | High write throughput | Compaction cost and read amplification |
| Raft Election | Safe single-leader coordination | Majority dependency and election pauses |
| Bloom Filter | Small-memory negative lookups | False positives and no deletion in basic form |
| HyperLogLog | Low-memory cardinality estimates | Approximate answers instead of exact counts |
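The HyperLogLog row above, fixed memory in exchange for approximate counts, can be illustrated with a small register array. This simplified sketch uses SHA-256-derived 64-bit hashes and the standard small-range linear-counting correction; production implementations (e.g. HLL++) add further bias corrections that are omitted here.

```python
import hashlib
import math

class HyperLogLog:
    """2^p registers, each storing the max leading-zero rank for its bucket."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p          # 4096 registers at p=12: a few KB, fixed
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        bucket = h >> (64 - self.p)            # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining bits give the rank
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if raw <= 2.5 * self.m:  # small-range correction: linear counting
            zeros = self.registers.count(0)
            if zeros:
                return self.m * math.log(self.m / zeros)
        return raw
```

Because `add` only takes a max, re-inserting a duplicate never changes the estimate, which is what makes the structure safe to feed from at-least-once streams.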
## Real-world usage
| Production scenario | Algorithm combination | Reason |
|---|---|---|
| Distributed key-value store | Consistent Hashing + Quorum + Vector Clocks | Stable routing with tunable consistency and conflict tracking |
| Write-heavy document store | LSM Tree + Bloom Filter + Compaction | Sustains ingest while containing read cost |
| Cluster metadata service | Raft Election + majority quorum | Keeps leadership and config changes coherent |
| Large-scale analytics | HyperLogLog + exact drill-down on demand | Low-cost approximate cardinality at high scale |