Distributed Transactions

Problem framing

Once data is sharded or split across services, a single-node transaction cannot enforce invariants end to end. Distributed transactions coordinate multiple participants so updates are atomic and isolated, but every extra round trip adds latency and reduces availability under failure. This page focuses on coordination protocols, while storage layout lives in storage patterns and engine internals live in database systems internals.

Core idea / pattern

Two-phase commit (2PC)

Problem: commit a transaction across multiple shards without partial writes. Pattern: a coordinator asks participants to prepare (durable intent + locks), then issues a global commit only if all vote yes.
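The prepare/decide split above can be sketched in a few lines of Python. `FakeParticipant` and its `prepare`/`commit`/`abort` methods are hypothetical stand-ins for real shard RPCs and a durable log, not any particular system's API:

```python
class FakeParticipant:
    """In-memory stand-in for a shard; a real participant would fsync a
    prepare record and hold locks between prepare() and commit()/abort()."""
    def __init__(self, vote):
        self.vote = vote
        self.state = "init"

    def prepare(self):
        self.state = "prepared"
        return self.vote

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: every participant durably prepares and votes.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare())
        except Exception:
            votes.append(False)  # treat an unreachable participant as a no vote

    # Phase 2: commit only on a unanimous yes; any no (or timeout) aborts all.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

Note that before sending any phase-2 message, a real coordinator must first persist its decision; that record is what participants consult during recovery.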

Three-phase commit (3PC)

Problem: avoid blocking when the coordinator fails after the prepare phase. Pattern: add a pre-commit phase that lets participants decide to commit without the coordinator, assuming bounded network delays.
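The extra phase can be sketched as follows, again with a hypothetical participant API; the point is that `pre_commit` spreads the outcome to every participant before anyone applies it:

```python
class Participant3PC:
    """Stand-in participant. The pre-committed state means 'the outcome is
    commit, not yet applied', which survivors can use to finish on their own."""
    def __init__(self, vote):
        self.vote = vote
        self.state = "init"

    def prepare(self):
        self.state = "prepared"
        return self.vote

    def pre_commit(self):
        self.state = "pre-committed"

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def three_phase_commit(participants):
    # Phase 1: collect votes, as in 2PC.
    if not all(p.prepare() for p in participants):
        for p in participants:
            p.abort()
        return "aborted"
    # Phase 2: pre-commit spreads the decision before anyone applies it.
    for p in participants:
        p.pre_commit()
    # Phase 3: do-commit. A participant that saw pre-commit can reach this
    # state on its own after a timeout, assuming bounded message delays.
    for p in participants:
        p.commit()
    return "committed"
```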

Consensus-backed transactions

Problem: keep decisions durable even if leaders fail or replicas restart. Pattern: each shard is a consensus group (Raft or Paxos), and transaction intents plus commit decisions are replicated before being applied, often with 2PC at the coordinator.
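A rough sketch of the replication rule, where `Replica` is a toy stand-in for one member of a Raft or Paxos group rather than a real consensus implementation:

```python
class Replica:
    """Toy log replica: append() succeeds only while the replica is up."""
    def __init__(self, up=True):
        self.up = up
        self.log = []

    def append(self, entry):
        if self.up:
            self.log.append(entry)
        return self.up


def replicate(entry, group, quorum):
    # Succeed once a quorum of the group has durably appended the entry.
    acks = sum(1 for replica in group if replica.append(entry))
    return acks >= quorum


def consensus_backed_commit(shard_groups, txn_log_group, txn_id):
    # Replicate the transaction intent on every shard's consensus group first.
    if not all(replicate(("intent", txn_id), g, quorum=2) for g in shard_groups):
        return "aborted"
    # Replicate the commit decision itself before applying, so the decision
    # survives coordinator and single-replica failures.
    if not replicate(("commit", txn_id), txn_log_group, quorum=2):
        return "aborted"
    return "committed"
```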

Saga workflows

Problem: complete long-running, multi-service workflows without locking resources. Pattern: sequence local transactions and rely on compensating actions for rollback, trading strict atomicity for availability.
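The compensation discipline can be sketched as a list of (action, compensate) callables; a real orchestrator would persist progress durably so compensation also survives the orchestrator's own crashes:

```python
def run_saga(steps):
    """Run each (action, compensate) pair in order; on any failure,
    compensate the completed steps in reverse order."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for comp in reversed(completed):
                comp()  # best effort: compensation itself can also fail
            return "failed"
    return "completed"
```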

Deterministic ordering (Calvin-style)

Problem: avoid distributed deadlocks and reduce coordination on commit. Pattern: a sequencer orders transactions ahead of execution, then participants apply updates in that order without waiting on two-phase prepare rounds.
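A minimal sequencer sketch; for simplicity it broadcasts every transaction to all shards, whereas Calvin ships each transaction only to the shards holding affected keys:

```python
from itertools import count


class Sequencer:
    """Assigns one global order up front; shards then execute strictly in
    that order, so no prepare round or deadlock detection is needed."""
    def __init__(self, shard_logs):
        self.next_seq = count(1)
        self.shard_logs = shard_logs

    def submit(self, txn):
        seq = next(self.next_seq)
        for log in self.shard_logs:  # every shard sees the same ordered log
            log.append((seq, txn))
        return seq
```

Because every shard applies the same sequence, two shards can never acquire conflicting locks in different orders, which is what eliminates distributed deadlock.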

| Approach | Guarantee focus | Latency impact | Failure behavior |
| --- | --- | --- | --- |
| 2PC | Atomic commit | Two coordinator rounds | Blocking on coordinator loss |
| 3PC | Atomic commit | Three coordinator rounds | Assumes bounded delays |
| Consensus + 2PC | Atomic commit + durability | Extra quorum writes | Survives replica loss |
| Sagas | Eventual consistency | Variable, async steps | Compensation can fail |

When to use sagas vs 2PC

| Choice | Best fit | Why it works | Watch outs |
| --- | --- | --- | --- |
| 2PC | Short, transactional updates across a few shards | Atomic commit with strong invariants | Coordinator loss blocks progress |
| Sagas | Long-running workflows across services | No global locks, higher availability | Compensation may be incomplete |

Architecture diagram

Coordinator and participants

flowchart LR
  Client[Client] --> Coord[Txn Coordinator]
  Coord --> A[Shard A: Txn + Lock Manager]
  Coord --> B[Shard B: Txn + Lock Manager]
  Coord --> C[Shard C: Txn + Lock Manager]
  A --> LogA[Replicated Log A]
  B --> LogB[Replicated Log B]
  C --> LogC[Replicated Log C]
  Coord --> TxnLog[Coordinator Log]

Step-by-step flow

  1. The client submits a multi-shard transaction to the coordinator.
  2. The coordinator identifies participant shards from the routing map.
  3. Each participant writes a durable prepare record and locks or versions the affected keys.
  4. If all participants vote yes, the coordinator persists the commit decision.
  5. Participants apply changes, release locks, and acknowledge completion.
  6. Any no vote or timeout triggers a global abort and rollback.
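The participant side of steps 3, 5, and 6 can be sketched as follows; the lock set and log list are in-memory stand-ins for a real lock table and an fsynced write-ahead log:

```python
class Participant:
    def __init__(self):
        self.locks = set()  # keys locked by in-flight prepares
        self.log = []       # stand-in for a durable (fsynced) log

    def prepare(self, txn_id, keys):
        # Step 3: vote no on a lock conflict; otherwise lock the keys
        # and write a durable prepare record before voting yes.
        if self.locks & set(keys):
            return False
        self.locks |= set(keys)
        self.log.append(("prepare", txn_id, sorted(keys)))
        return True

    def finish(self, txn_id, keys, decision):
        # Steps 5-6: record the commit or abort decision, then release locks.
        self.log.append((decision, txn_id))
        self.locks -= set(keys)
```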

2PC commit dataflow

sequenceDiagram
  participant Client
  participant Coord as Coordinator
  participant A as Shard A
  participant B as Shard B
  participant LogA as Log A
  participant LogB as Log B
  participant TLog as Txn Log

  Client->>Coord: Begin transaction
  Coord->>A: Prepare
  A->>LogA: Write prepare record
  LogA-->>A: Fsync ack
  A-->>Coord: Vote yes
  Coord->>B: Prepare
  B->>LogB: Write prepare record
  LogB-->>B: Fsync ack
  B-->>Coord: Vote yes
  Coord->>TLog: Write commit decision
  TLog-->>Coord: Commit durable
  Coord->>A: Commit
  Coord->>B: Commit
  A-->>Coord: Ack
  B-->>Coord: Ack
  Coord-->>Client: Commit ack

Consensus-backed commit dataflow

sequenceDiagram
  participant Client
  participant Coord as Coordinator
  participant RaftA as Shard A Raft
  participant RaftB as Shard B Raft
  participant Txn as Txn Log Raft

  Client->>Coord: Begin transaction
  Coord->>RaftA: Replicate intent
  RaftA-->>Coord: Quorum ack
  Coord->>RaftB: Replicate intent
  RaftB-->>Coord: Quorum ack
  Coord->>Txn: Replicate commit decision
  Txn-->>Coord: Quorum ack
  Coord->>RaftA: Apply commit
  Coord->>RaftB: Apply commit
  Coord-->>Client: Commit ack

Saga compensation dataflow

sequenceDiagram
  participant Client
  participant Saga as Saga Orchestrator
  participant A as Service A
  participant B as Service B
  participant C as Service C

  Client->>Saga: Start workflow
  Saga->>A: Execute step A
  A-->>Saga: Step A ok
  Saga->>B: Execute step B
  B-->>Saga: Step B ok
  Saga->>C: Execute step C
  C-->>Saga: Step C failed
  Saga->>B: Compensate step B
  B-->>Saga: Compensation ok
  Saga->>A: Compensate step A
  A-->>Saga: Compensation ok
  Saga-->>Client: Workflow failed

Deterministic ordering dataflow

sequenceDiagram
  participant Client
  participant Seq as Sequencer
  participant A as Shard A
  participant B as Shard B
  participant LogA as Log A
  participant LogB as Log B

  Client->>Seq: Submit transaction
  Seq->>Seq: Assign global order
  Seq->>A: Order entry
  Seq->>B: Order entry
  A->>LogA: Append ordered txn
  B->>LogB: Append ordered txn
  LogA-->>A: Ack
  LogB-->>B: Ack
  A->>A: Execute in order
  B->>B: Execute in order
  A-->>Seq: Ack commit
  B-->>Seq: Ack commit
  Seq-->>Client: Commit result

Coordinator crash recovery timeline

sequenceDiagram
  participant Coord as Coordinator
  participant A as Shard A
  participant B as Shard B
  participant Log as Txn Log

  Coord->>A: Prepare
  Coord->>B: Prepare
  A-->>Coord: Vote yes
  B-->>Coord: Vote yes
  Coord--x Coord: Crash before commit
  A->>Log: Query decision
  B->>Log: Query decision
  Log-->>A: No decision found
  Log-->>B: No decision found
  A-->>B: Timeout triggers abort
  B-->>A: Rollback prepared state
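The termination rule in the timeline above can be sketched as a participant-side function. The names here are hypothetical, and the sketch assumes a presumed-abort coordinator so that timing out into abort is safe:

```python
def recover(txn_log, txn_id, peers_prepared):
    """Decide a prepared transaction's fate after losing the coordinator.
    `txn_log` maps txn_id -> decision once a decision is durable."""
    decision = txn_log.get(txn_id)
    if decision is not None:
        return decision  # a durable decision always wins: just apply it
    if not all(peers_prepared):
        return "abort"   # some peer never prepared, so commit is impossible
    # All peers prepared but no decision was logged: classic 2PC blocks here.
    # Under presumed abort, participants time out into abort, as in the timeline.
    return "abort"
```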
        

Failure modes

Trade-offs

Real-world usage