Distributed Transactions
Problem framing
Once data is sharded or split across services, a single-node transaction cannot enforce invariants end to end. Distributed transactions coordinate multiple participants so updates are atomic and isolated, but every extra round trip adds latency and reduces availability under failure. This page focuses on coordination protocols, while storage layout lives in storage patterns and engine internals live in database systems internals.
Core idea / pattern
Two-phase commit (2PC)
Problem: commit a transaction across multiple shards without partial writes. Pattern: a coordinator asks participants to prepare (durable intent + locks), then issues a global commit only if all vote yes.
Three-phase commit (3PC)
Problem: avoid blocking when the coordinator fails after a prepare. Pattern: add a pre-commit phase that lets participants decide commit without the coordinator, assuming bounded network delays.
Consensus-backed transactions
Problem: keep decisions durable even if leaders fail or replicas restart. Pattern: each shard is a consensus group (Raft or Paxos), and transaction intents plus commit decisions are replicated before being applied, often with 2PC at the coordinator.
Saga workflows
Problem: complete long-running, multi-service workflows without locking resources. Pattern: sequence local transactions and rely on compensating actions for rollback, trading strict atomicity for availability.
Deterministic ordering (Calvin-style)
Problem: avoid distributed deadlocks and reduce coordination on commit. Pattern: a sequencer orders transactions ahead of execution, then participants apply updates in that order without waiting on two-phase prepare rounds.
| Approach | Guarantee focus | Latency impact | Failure behavior |
|---|---|---|---|
| 2PC | Atomic commit | Two coordinator rounds | Blocking on coordinator loss |
| 3PC | Atomic commit | Three coordinator rounds | Assumes bounded delays |
| Consensus + 2PC | Atomic commit + durability | Extra quorum writes | Survives replica loss |
| Sagas | Eventual consistency | Variable, async steps | Compensation can fail |
When to use sagas vs 2PC
| Choice | Best fit | Why it works | Watch outs |
|---|---|---|---|
| 2PC | Short, transactional updates across a few shards | Atomic commit with strong invariants | Coordinator loss blocks progress |
| Sagas | Long-running workflows across services | No global locks, higher availability | Compensation may be incomplete |
Architecture diagram
Coordinator and participants
```mermaid
flowchart LR
Client[Client] --> Coord[Txn Coordinator]
Coord --> A[Shard A: Txn + Lock Manager]
Coord --> B[Shard B: Txn + Lock Manager]
Coord --> C[Shard C: Txn + Lock Manager]
A --> LogA[Replicated Log A]
B --> LogB[Replicated Log B]
C --> LogC[Replicated Log C]
Coord --> TxnLog[Coordinator Log]
```
Step-by-step flow
- The client submits a multi-shard transaction to the coordinator.
- The coordinator identifies participant shards from the routing map.
- Each participant writes a durable prepare record and locks or versions the affected keys.
- If all participants vote yes, the coordinator persists the commit decision.
- Participants apply changes, release locks, and acknowledge completion.
- Any no vote or timeout triggers a global abort and rollback.
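The flow above can be sketched as a minimal in-memory 2PC loop. The `Participant` class and plain-list logs are illustrative stand-ins, not a real storage engine; production code adds timeouts, fsync, and crash recovery on top of this skeleton.

```python
class Participant:
    """Toy shard participant. vote_yes simulates the local prepare outcome."""
    def __init__(self, vote_yes=True):
        self.vote_yes = vote_yes
        self.log = []            # stands in for the durable participant log
        self.state = "idle"

    def prepare(self, txn_id):
        self.log.append((txn_id, "prepare"))   # durable intent + locks
        self.state = "prepared"
        return self.vote_yes

    def commit(self, txn_id):
        self.log.append((txn_id, "commit"))
        self.state = "committed"               # apply changes, release locks

    def abort(self, txn_id):
        self.log.append((txn_id, "abort"))
        self.state = "aborted"                 # roll back, release locks


def two_phase_commit(txn_id, participants, coord_log):
    # Phase 1: every participant durably records intent and votes.
    votes = [p.prepare(txn_id) for p in participants]
    if all(votes):
        # Persist the global decision BEFORE telling anyone to commit,
        # so recovery can replay it after a coordinator crash.
        coord_log.append((txn_id, "commit"))
        for p in participants:
            p.commit(txn_id)
        return "committed"
    # Any no vote (or, in a real system, a timeout) aborts everywhere.
    coord_log.append((txn_id, "abort"))
    for p in participants:
        p.abort(txn_id)
    return "aborted"
```

Note the ordering: the coordinator's decision record is the commit point; participant commits are replayable from it.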
2PC commit dataflow
```mermaid
sequenceDiagram
participant Client
participant Coord as Coordinator
participant A as Shard A
participant B as Shard B
participant LogA as Log A
participant LogB as Log B
participant TLog as Txn Log
Client->>Coord: Begin transaction
Coord->>A: Prepare
A->>LogA: Write prepare record
LogA-->>A: Fsync ack
A-->>Coord: Vote yes
Coord->>B: Prepare
B->>LogB: Write prepare record
LogB-->>B: Fsync ack
B-->>Coord: Vote yes
Coord->>TLog: Write commit decision
TLog-->>Coord: Commit durable
Coord->>A: Commit
Coord->>B: Commit
A-->>Coord: Ack
B-->>Coord: Ack
Coord-->>Client: Commit ack
```
Consensus-backed commit dataflow
```mermaid
sequenceDiagram
participant Client
participant Coord as Coordinator
participant RaftA as Shard A Raft
participant RaftB as Shard B Raft
participant Txn as Txn Log Raft
Client->>Coord: Begin transaction
Coord->>RaftA: Replicate intent
RaftA-->>Coord: Quorum ack
Coord->>RaftB: Replicate intent
RaftB-->>Coord: Quorum ack
Coord->>Txn: Replicate commit decision
Txn-->>Coord: Quorum ack
Coord->>RaftA: Apply commit
Coord->>RaftB: Apply commit
Coord-->>Client: Commit ack
```
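The quorum acks in this flow can be sketched as a majority-write check. The `Replica` objects below are illustrative stand-ins for Raft followers, not a real consensus implementation; the point is that an intent or decision only counts once a majority has durably appended it.

```python
class Replica:
    """Toy log replica; healthy=False simulates a down follower."""
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.log = []

    def append(self, entry):
        if self.healthy:
            self.log.append(entry)   # stands in for a durable append
        return self.healthy


def replicate(entry, replicas):
    # The write is durable for the group only once a majority acks it;
    # a minority of replica failures does not stall the commit.
    acks = sum(1 for r in replicas if r.append(entry))
    return acks > len(replicas) // 2
```

This is why the table above lists "extra quorum writes" as the latency cost: both the intent and the commit decision pay this round.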
Saga compensation dataflow
```mermaid
sequenceDiagram
participant Client
participant Saga as Saga Orchestrator
participant A as Service A
participant B as Service B
participant C as Service C
Client->>Saga: Start workflow
Saga->>A: Execute step A
A-->>Saga: Step A ok
Saga->>B: Execute step B
B-->>Saga: Step B ok
Saga->>C: Execute step C
C-->>Saga: Step C failed
Saga->>B: Compensate step B
B-->>Saga: Compensation ok
Saga->>A: Compensate step A
A-->>Saga: Compensation ok
Saga-->>Client: Workflow failed
```
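The orchestrator loop in this dataflow can be sketched as follows. The step and compensation callables are assumptions for illustration; a real orchestrator persists progress after each step and retries compensations, since compensations themselves can fail.

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs.
    Each action returns True on success, False on failure."""
    done = []
    for action, compensate in steps:
        if action():
            done.append(compensate)
        else:
            # Undo completed steps in reverse order (B before A,
            # matching the diagram above).
            for comp in reversed(done):
                comp()
            return "failed"
    return "completed"
```

Notice there is no prepare phase and no global locks: each step commits locally, which is exactly the atomicity-for-availability trade the pattern makes.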
Deterministic ordering dataflow
```mermaid
sequenceDiagram
participant Client
participant Seq as Sequencer
participant A as Shard A
participant B as Shard B
participant LogA as Log A
participant LogB as Log B
Client->>Seq: Submit transaction
Seq->>Seq: Assign global order
Seq->>A: Order entry
Seq->>B: Order entry
A->>LogA: Append ordered txn
B->>LogB: Append ordered txn
LogA-->>A: Ack
LogB-->>B: Ack
A->>A: Execute in order
B->>B: Execute in order
A-->>Seq: Ack commit
B-->>Seq: Ack commit
Seq-->>Client: Commit result
```
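A minimal sketch of the sequencer side of this flow, with illustrative class names: the global order is fixed before execution, so every shard (and every replica of a shard) applies the same transactions in the same order with no prepare round.

```python
import itertools

class Sequencer:
    """Assigns a single global sequence number per transaction."""
    def __init__(self):
        self._seq = itertools.count()

    def order(self, txn):
        return (next(self._seq), txn)


class Shard:
    def __init__(self):
        self.applied = []   # transactions applied, in global order

    def apply(self, seq_no, txn):
        # Deterministic input order means deterministic state:
        # no locks are negotiated across shards at commit time.
        self.applied.append((seq_no, txn))


def submit(sequencer, shards, txn):
    seq_no, txn = sequencer.order(txn)
    for shard in shards:        # broadcast the ordered entry
        shard.apply(seq_no, txn)
    return seq_no
```

A real Calvin-style system batches ordered transactions and ships each shard only the operations touching its keys, but the invariant is the same: order first, execute second.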
Coordinator crash recovery timeline
```mermaid
sequenceDiagram
participant Coord as Coordinator
participant A as Shard A
participant B as Shard B
participant Log as Txn Log
Coord->>A: Prepare
Coord->>B: Prepare
A-->>Coord: Vote yes
B-->>Coord: Vote yes
Coord--x Coord: Crash before commit
A->>Log: Query decision
B->>Log: Query decision
Log-->>A: No decision found
Log-->>B: No decision found
A-->>B: Timeout triggers abort
B-->>A: Rollback prepared state
```
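The participant-side termination rule in this timeline can be sketched as a small decision function, assuming a presumed-abort protocol where no durable decision record means the transaction never committed. The dict-based decision lookup is an assumption for illustration.

```python
def resolve_in_doubt(txn_id, txn_log, timed_out):
    """txn_log: mapping of txn_id -> 'commit' | 'abort' decisions.
    Called by a prepared participant that lost the coordinator."""
    decision = txn_log.get(txn_id)
    if decision is not None:
        return decision        # replay the durable decision
    if timed_out:
        # Presumed abort: no decision record was persisted before
        # the crash, so no participant can have committed; aborting
        # and releasing locks is safe.
        return "abort"
    return "wait"              # keep prepared state and locks, retry
```

The "wait" branch is the classic 2PC blocking window: prepared participants hold locks until a decision (or timeout) resolves them.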
Failure modes
- Coordinator loss after prepare blocks participants until recovery or manual resolution.
- Slow or failed participants hold locks and drive tail latency.
- Replica quorum loss stalls commits even if the coordinator is healthy.
- Clock skew or network partitions invalidate 3PC timing assumptions.
- Compensating actions in sagas can fail, leaving partial outcomes.
Trade-offs
- Atomicity requires extra log writes and coordination round trips.
- Higher isolation improves correctness but extends lock or version lifetimes.
- Consensus-backed commits reduce data loss but add quorum latency.
- Sagas improve availability but move consistency management into application logic.
- Cross-region coordination increases tail latency unless locality is enforced.
Real-world usage
- Spanner and CockroachDB use 2PC across consensus-backed shards for strong consistency.
- MySQL XA and PostgreSQL prepared transactions (`PREPARE TRANSACTION`) expose 2PC to external transaction managers.
- YugabyteDB and TiDB coordinate distributed transactions over Raft-replicated storage groups.
- Microservice workflows often use sagas with an outbox or event log.
- For distributed consistency guarantees, see consistency models.