# Distributed Algorithms

## Problem framing
Distributed databases need routing stability, fast writes, bounded memory, and safe coordination under failure. No single algorithm solves all of this. Production systems combine probabilistic data structures, storage layouts, and consensus primitives to balance correctness, latency, and cost.
This page explains every algorithm listed in `../ddia-learning-app/ddia_app/algorithms` and how they compose into end-to-end data paths.
## Core idea / pattern
| Algorithm | Primary role | Typical placement | Cross-link |
|---|---|---|---|
| Consistent Hashing | Stable key-to-node routing with minimal remapping | Shard routers, partition maps | Storage architecture |
| Vector Clocks | Causality tracking for concurrent updates | Multi-writer replicas and conflict resolution | Consistency models |
| Quorum Replication | Read/write consistency tuning via N, R, W | Replicated key-value and document stores | Replication patterns |
| LSM Tree | Write-optimized storage with compaction | High-ingest NoSQL engines | Database internals |
| Raft Election | Leader election by majority vote | Metadata services and control planes | Coordination and commit |
| Bloom Filter | Probabilistic membership test to avoid wasted reads | LSM read path and cache admission | Storage engine internals |
| HyperLogLog | Approximate distinct counting with fixed memory | Cardinality estimation and analytics | Stream and batch analytics |
## Architecture diagram

```mermaid
flowchart LR
    Client[Client] --> Router[Consistent Hash Router]
    Router --> ReplicaA[Replica Set A]
    Router --> ReplicaB[Replica Set B]
    ReplicaA --> Quorum[Quorum Read and Write]
    ReplicaB --> Quorum
    Quorum --> LSM[LSM Storage Engine]
    LSM --> Bloom[Bloom Filter]
    LSM --> HLL[HyperLogLog Counters]
    Quorum --> VC[Vector Clock Metadata]
    Control[Control Plane] --> Raft[Raft Leader Election]
    Raft --> Router
```
## Step-by-step flow
- Route the key with consistent hashing so scaling events remap only a small key subset.
- Write to a replica subset and require quorum acknowledgments for target consistency.
- Persist writes into an LSM memtable, then flush immutable SSTables and compact in background.
- Use Bloom filters during reads to skip SSTables that definitely do not contain the key.
- Attach vector clock metadata when concurrent writes are possible across regions.
- Use Raft election for metadata leadership so membership and configuration updates stay coordinated.
- Track large-scale distinct metrics with HyperLogLog instead of exact sets.
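The routing step above can be sketched with a minimal hash ring. This is an illustrative Python sketch, not the app's implementation: the `ConsistentHashRing` name, the SHA-256-derived 64-bit ring positions, and the vnode count are all assumptions.

```python
import bisect
import hashlib

def _hash64(key: str) -> int:
    # Illustrative choice: first 8 bytes of SHA-256 as a 64-bit ring position.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Hash ring with virtual nodes; a key belongs to the first vnode clockwise."""

    def __init__(self, vnodes: int = 64):
        self.vnodes = vnodes
        self.ring = []  # sorted (position, node) pairs

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash64(f"{node}#vn{i}"), node))

    def lookup(self, key: str) -> str:
        pos = _hash64(key)
        # First vnode at or after the key's position, wrapping past the top.
        idx = bisect.bisect_left(self.ring, (pos, ""))
        if idx == len(self.ring):
            idx = 0
        return self.ring[idx][1]

# Adding a node remaps only the keys the new node now owns.
ring = ConsistentHashRing()
for node in ("node-a", "node-b", "node-c"):
    ring.add_node(node)
before = {f"key-{i}": ring.lookup(f"key-{i}") for i in range(1000)}
ring.add_node("node-d")
moved = [k for k in before if ring.lookup(k) != before[k]]
```

With four nodes, roughly a quarter of the keys move, and every moved key lands on `node-d`; a modulo-based scheme would instead reshuffle most of the keyspace.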
## Visual reference: key routing and quorum

```mermaid
sequenceDiagram
    participant C as Client
    participant R as Hash Router
    participant P1 as Replica 1
    participant P2 as Replica 2
    participant P3 as Replica 3
    C->>R: write(key, value)
    R->>P1: route by consistent hash
    R->>P2: replicate
    R->>P3: replicate
    P1-->>R: ack
    P2-->>R: ack
    R-->>C: write success (W quorum reached)
```
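The quorum arithmetic behind the diagram fits in a few lines. A minimal sketch; the function names are illustrative:

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """R + W > N forces every read quorum to intersect every write quorum."""
    return r + w > n

def write_succeeds(acks: int, w: int) -> bool:
    """The write is acknowledged to the client once W replicas accept it."""
    return acks >= w

# N=3, R=2, W=2, as in the diagram: two acks commit the write,
# and any later 2-replica read must include at least one fresh copy.
strong = quorums_overlap(3, 2, 2)   # True
weak = quorums_overlap(3, 1, 1)     # False: read and write quorums can miss
committed = write_succeeds(2, 2)    # two acks meet W=2
```

This is why the sequence above returns success after two acks: the third replica can catch up later via read repair or anti-entropy without violating the R + W > N guarantee.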
## Visual reference: LSM read with Bloom filter

```mermaid
flowchart LR
    Read[Read key] --> Mem[Check memtable]
    Mem --> Bloom1[Bloom filter SSTable L0]
    Bloom1 -->|No| Skip1[Skip table]
    Bloom1 -->|Maybe| Probe1[Probe SSTable]
    Probe1 --> Bloom2[Bloom filter SSTable L1]
    Bloom2 -->|No| Skip2[Skip table]
    Bloom2 -->|Maybe| Probe2[Probe SSTable]
    Probe2 --> Result[Return newest value]
```
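The read path above can be sketched by pairing each flushed table with a Bloom filter. Everything here is an illustrative assumption, not a production engine: `BloomFilter` and `TinyLSM` are hypothetical names, the double-hashing scheme and the 4-entry memtable limit are chosen for brevity, and compaction is omitted.

```python
import hashlib

class BloomFilter:
    """Bitset with k probes: 'absent' is definite, 'present' may be false."""

    def __init__(self, m_bits: int = 8192, k: int = 4):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k bit positions via double hashing over one SHA-256 digest.
        d = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))

class TinyLSM:
    """Mutable memtable plus flushed, immutable tables guarded by Bloom filters."""

    def __init__(self, memtable_limit: int = 4):
        self.memtable = {}
        self.sstables = []  # newest last: (bloom, sorted dict)
        self.limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self) -> None:
        bloom = BloomFilter()
        for k in self.memtable:
            bloom.add(k)
        self.sstables.append((bloom, dict(sorted(self.memtable.items()))))
        self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        for bloom, table in reversed(self.sstables):  # newest table first
            if not bloom.might_contain(key):
                continue  # definitely absent: skip the probe entirely
            if key in table:
                return table[key]
        return None
```

The `continue` on a negative filter answer is the whole point of the diagram: a miss costs a few hash computations instead of a disk probe per SSTable.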
## Visual reference: Raft election

```mermaid
sequenceDiagram
    participant N1 as Node 1
    participant N2 as Node 2
    participant N3 as Node 3
    participant N4 as Node 4
    participant N5 as Node 5
    N1->>N2: RequestVote(term=9)
    N1->>N3: RequestVote(term=9)
    N1->>N4: RequestVote(term=9)
    N1->>N5: RequestVote(term=9)
    N2-->>N1: vote granted
    N3-->>N1: vote granted
    N4-->>N1: vote granted
    N1-->>N1: majority reached, leader elected
```
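The vote-counting rule in the diagram can be sketched as follows. `RaftNode` and `run_election` are hypothetical names, and the sketch deliberately omits Raft's log up-to-date check, election timeouts, and heartbeats; it shows only the one-vote-per-term and strict-majority rules.

```python
class RaftNode:
    """Follower-side vote state: at most one vote granted per term."""

    def __init__(self):
        self.current_term = 0
        self.voted_for = None

    def request_vote(self, term: int, candidate: str) -> bool:
        if term < self.current_term:
            return False  # stale candidate from an old term
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None  # a new term resets the vote
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False

def run_election(candidate: str, term: int, peers) -> bool:
    # The candidate votes for itself, then needs a strict cluster majority.
    votes = 1 + sum(p.request_vote(term, candidate) for p in peers)
    return votes > (len(peers) + 1) // 2
```

In the 5-node diagram, three grants plus the candidate's self-vote give 4 of 5: a majority, so even one silent node cannot block the election. Because each node grants at most one vote per term, two candidates can never both win the same term.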
## Algorithm deep dive
| Algorithm | Problem | Pattern | Trade-offs | Failure modes |
|---|---|---|---|---|
| Consistent Hashing | Node adds/removals causing full key reshuffles | Hash ring with virtual nodes and clockwise ownership | Low remap cost vs imperfect balance without enough virtual nodes | Hot partitions from skewed keys or poor vnode configuration |
| Vector Clocks | Concurrent multi-writer updates with unclear causal order | Per-writer counters merged by max and compared by dominance | Causality clarity vs metadata growth with writer count | Unbounded clock size and frequent conflict branches |
| Quorum Replication | Balancing availability and consistency across replicas | Tunable N/R/W with read-repair and anti-entropy | Flexible guarantees vs quorum latency and partial failures | Stale reads when quorum math is weak or repairs lag |
| LSM Tree | High write rates overwhelm random-write storage | Memtable + immutable SSTables + background compaction | Excellent ingest vs read/write amplification from compaction | Compaction backlog, tombstone buildup, and tail-latency spikes |
| Raft Election | No safe coordinator for metadata and critical decisions | Term-based majority voting to elect one leader | Strong leadership safety vs election pauses during instability | Leader flapping and progress stalls under repeated timeouts |
| Bloom Filter | Too many negative disk lookups on read paths | Bitset with multiple hashes for probabilistic membership tests | Tiny memory footprint vs false positives | Bad sizing causes high false-positive rates and wasted I/O |
| HyperLogLog | Exact distinct counts are too expensive at scale | Register-based cardinality estimation from hash leading zeros | Fixed memory vs approximate results | Using low precision where exact billing-grade counts are required |
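The vector-clock pattern in the table above, per-writer counters merged by element-wise max and compared by dominance, can be sketched directly. The function names are illustrative:

```python
def vc_increment(clock: dict, writer: str) -> dict:
    """Bump this writer's counter before it issues a write."""
    out = dict(clock)
    out[writer] = out.get(writer, 0) + 1
    return out

def vc_merge(a: dict, b: dict) -> dict:
    """Element-wise max: the smallest clock that dominates both inputs."""
    return {w: max(a.get(w, 0), b.get(w, 0)) for w in a.keys() | b.keys()}

def vc_compare(a: dict, b: dict) -> str:
    """'before', 'after', 'equal', or 'concurrent' (neither dominates)."""
    keys = a.keys() | b.keys()
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

A `"concurrent"` result is exactly the conflict case: neither update saw the other, so the store must keep both siblings or apply an application-level merge. Note the metadata-growth trade-off from the table: the dict gains one entry per distinct writer.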
## Failure modes
- Consistent hashing with too few virtual nodes creates uneven shard utilization.
- Quorum settings with `R + W <= N` allow stale reads after acknowledged writes.
- LSM compaction debt causes read amplification and tail-latency spikes.
- Bloom filters sized too small increase false positives and wasted disk probes.
- Vector clocks grow in size with many writers and can increase conflict metadata overhead.
- Raft leader flapping under unstable networks can stall control-plane progress.
- HyperLogLog misuse for low-cardinality sets can produce avoidable estimation error.
## Trade-offs
| Algorithm | What it optimizes | What it gives up |
|---|---|---|
| Consistent Hashing | Minimal remapping during rebalancing | Perfectly even distribution without tuning |
| Vector Clocks | Causality visibility | Metadata size and merge complexity |
| Quorum Replication | Tunable consistency/availability balance | Higher latency at stronger quorum levels |
| LSM Tree | High write throughput | Compaction cost and read amplification |
| Raft Election | Safe single-leader coordination | Majority dependency and election pauses |
| Bloom Filter | Small-memory negative lookups | False positives and no deletion in basic form |
| HyperLogLog | Low-memory cardinality estimates | Approximate answers instead of exact counts |
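The HyperLogLog row above, fixed memory in exchange for approximate counts, can be illustrated with a small register array. This simplified sketch uses SHA-256-derived 64-bit hashes and the standard small-range linear-counting correction; production implementations (e.g. HLL++) add further bias corrections that are omitted here.

```python
import hashlib
import math

class HyperLogLog:
    """2^p registers, each storing the max leading-zero rank for its bucket."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p          # 4096 registers at p=12: a few KB, fixed
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        bucket = h >> (64 - self.p)            # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining bits give the rank
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if raw <= 2.5 * self.m:  # small-range correction: linear counting
            zeros = self.registers.count(0)
            if zeros:
                return self.m * math.log(self.m / zeros)
        return raw
```

Because `add` only takes a max, re-inserting a duplicate never changes the estimate, which is what makes the structure safe to feed from at-least-once streams.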
## Real-world usage
| Production scenario | Algorithm combination | Reason |
|---|---|---|
| Distributed key-value store | Consistent Hashing + Quorum + Vector Clocks | Stable routing with tunable consistency and conflict tracking |
| Write-heavy document store | LSM Tree + Bloom Filter + Compaction | Sustains ingest while containing read cost |
| Cluster metadata service | Raft Election + majority quorum | Keeps leadership and config changes coherent |
| Large-scale analytics | HyperLogLog + exact drill-down on demand | Low-cost approximate cardinality at high scale |