Distributed Transactions
Problem framing
Once data is sharded or split across services, a single-node transaction cannot enforce invariants end to end. Distributed transactions coordinate multiple participants so updates are atomic and isolated, but every extra round trip adds latency and reduces availability under failure. This page focuses on coordination protocols, while storage layout lives in storage patterns and engine internals live in database systems internals.
Core idea / pattern
Two-phase commit (2PC)
Problem: commit a transaction across multiple shards without partial writes. Pattern: a coordinator asks participants to prepare (durable intent + locks), then issues a global commit only if all vote yes.
Three-phase commit (3PC)
Problem: avoid blocking when the coordinator fails after a prepare. Pattern: add a pre-commit phase that lets participants decide commit without the coordinator, assuming bounded network delays.
Consensus-backed transactions
Problem: keep decisions durable even if leaders fail or replicas restart. Pattern: each shard is a consensus group (Raft or Paxos), and transaction intents plus commit decisions are replicated before being applied, often with 2PC at the coordinator.
Saga workflows
Problem: complete long-running, multi-service workflows without locking resources. Pattern: sequence local transactions and rely on compensating actions for rollback, trading strict atomicity for availability.
Deterministic ordering (Calvin-style)
Problem: avoid distributed deadlocks and reduce coordination on commit. Pattern: a sequencer orders transactions ahead of execution, then participants apply updates in that order without waiting on two-phase prepare rounds.
| Approach | Guarantee focus | Latency impact | Failure behavior |
|---|---|---|---|
| 2PC | Atomic commit | Two coordinator rounds | Blocking on coordinator loss |
| 3PC | Atomic commit | Three coordinator rounds | Assumes bounded delays |
| Consensus + 2PC | Atomic commit + durability | Extra quorum writes | Survives replica loss |
| Sagas | Eventual consistency | Variable, async steps | Compensation can fail |
When to use sagas vs 2PC
| Choice | Best fit | Why it works | Watch outs |
|---|---|---|---|
| 2PC | Short, transactional updates across a few shards | Atomic commit with strong invariants | Coordinator loss blocks progress |
| Sagas | Long-running workflows across services | No global locks, higher availability | Compensation may be incomplete |
Architecture diagram
Coordinator and participants
```mermaid
flowchart LR
Client[Client] --> Coord[Txn Coordinator]
Coord --> A[Shard A: Txn + Lock Manager]
Coord --> B[Shard B: Txn + Lock Manager]
Coord --> C[Shard C: Txn + Lock Manager]
A --> LogA[Replicated Log A]
B --> LogB[Replicated Log B]
C --> LogC[Replicated Log C]
Coord --> TxnLog[Coordinator Log]
```
Step-by-step flow
- The client submits a multi-shard transaction to the coordinator.
- The coordinator identifies participant shards from the routing map.
- Each participant writes a durable prepare record and locks or versions the affected keys.
- If all participants vote yes, the coordinator persists the commit decision.
- Participants apply changes, release locks, and acknowledge completion.
- Any no vote or timeout triggers a global abort and rollback.
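The flow above can be sketched as a minimal in-memory 2PC loop. The `Participant` class and plain-list logs are illustrative stand-ins, not a real storage engine; production code adds timeouts, fsync, and crash recovery on top of this skeleton.

```python
class Participant:
    """Toy shard participant. vote_yes simulates the local prepare outcome."""
    def __init__(self, vote_yes=True):
        self.vote_yes = vote_yes
        self.log = []            # stands in for the durable participant log
        self.state = "idle"

    def prepare(self, txn_id):
        self.log.append((txn_id, "prepare"))   # durable intent + locks
        self.state = "prepared"
        return self.vote_yes

    def commit(self, txn_id):
        self.log.append((txn_id, "commit"))
        self.state = "committed"               # apply changes, release locks

    def abort(self, txn_id):
        self.log.append((txn_id, "abort"))
        self.state = "aborted"                 # roll back, release locks


def two_phase_commit(txn_id, participants, coord_log):
    # Phase 1: every participant durably records intent and votes.
    votes = [p.prepare(txn_id) for p in participants]
    if all(votes):
        # Persist the global decision BEFORE telling anyone to commit,
        # so recovery can replay it after a coordinator crash.
        coord_log.append((txn_id, "commit"))
        for p in participants:
            p.commit(txn_id)
        return "committed"
    # Any no vote (or, in a real system, a timeout) aborts everywhere.
    coord_log.append((txn_id, "abort"))
    for p in participants:
        p.abort(txn_id)
    return "aborted"
```

Note the ordering: the coordinator's decision record is the commit point; participant commits are replayable from it.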
2PC commit dataflow
```mermaid
sequenceDiagram
participant Client
participant Coord as Coordinator
participant A as Shard A
participant B as Shard B
participant LogA as Log A
participant LogB as Log B
participant TLog as Txn Log
Client->>Coord: Begin transaction
Coord->>A: Prepare
A->>LogA: Write prepare record
LogA-->>A: Fsync ack
A-->>Coord: Vote yes
Coord->>B: Prepare
B->>LogB: Write prepare record
LogB-->>B: Fsync ack
B-->>Coord: Vote yes
Coord->>TLog: Write commit decision
TLog-->>Coord: Commit durable
Coord->>A: Commit
Coord->>B: Commit
A-->>Coord: Ack
B-->>Coord: Ack
Coord-->>Client: Commit ack
```
Consensus-backed commit dataflow
```mermaid
sequenceDiagram
participant Client
participant Coord as Coordinator
participant RaftA as Shard A Raft
participant RaftB as Shard B Raft
participant Txn as Txn Log Raft
Client->>Coord: Begin transaction
Coord->>RaftA: Replicate intent
RaftA-->>Coord: Quorum ack
Coord->>RaftB: Replicate intent
RaftB-->>Coord: Quorum ack
Coord->>Txn: Replicate commit decision
Txn-->>Coord: Quorum ack
Coord->>RaftA: Apply commit
Coord->>RaftB: Apply commit
Coord-->>Client: Commit ack
```
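The quorum acks in this flow can be sketched as a majority-write check. The `Replica` objects below are illustrative stand-ins for Raft followers, not a real consensus implementation; the point is that an intent or decision only counts once a majority has durably appended it.

```python
class Replica:
    """Toy log replica; healthy=False simulates a down follower."""
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.log = []

    def append(self, entry):
        if self.healthy:
            self.log.append(entry)   # stands in for a durable append
        return self.healthy


def replicate(entry, replicas):
    # The write is durable for the group only once a majority acks it;
    # a minority of replica failures does not stall the commit.
    acks = sum(1 for r in replicas if r.append(entry))
    return acks > len(replicas) // 2
```

This is why the table above lists "extra quorum writes" as the latency cost: both the intent and the commit decision pay this round.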
Saga compensation dataflow
```mermaid
sequenceDiagram
participant Client
participant Saga as Saga Orchestrator
participant A as Service A
participant B as Service B
participant C as Service C
Client->>Saga: Start workflow
Saga->>A: Execute step A
A-->>Saga: Step A ok
Saga->>B: Execute step B
B-->>Saga: Step B ok
Saga->>C: Execute step C
C-->>Saga: Step C failed
Saga->>B: Compensate step B
B-->>Saga: Compensation ok
Saga->>A: Compensate step A
A-->>Saga: Compensation ok
Saga-->>Client: Workflow failed
```
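The orchestrator loop in this dataflow can be sketched as follows. The step and compensation callables are assumptions for illustration; a real orchestrator persists progress after each step and retries compensations, since compensations themselves can fail.

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs.
    Each action returns True on success, False on failure."""
    done = []
    for action, compensate in steps:
        if action():
            done.append(compensate)
        else:
            # Undo completed steps in reverse order (B before A,
            # matching the diagram above).
            for comp in reversed(done):
                comp()
            return "failed"
    return "completed"
```

Notice there is no prepare phase and no global locks: each step commits locally, which is exactly the atomicity-for-availability trade the pattern makes.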
Deterministic ordering dataflow
```mermaid
sequenceDiagram
participant Client
participant Seq as Sequencer
participant A as Shard A
participant B as Shard B
participant LogA as Log A
participant LogB as Log B
Client->>Seq: Submit transaction
Seq->>Seq: Assign global order
Seq->>A: Order entry
Seq->>B: Order entry
A->>LogA: Append ordered txn
B->>LogB: Append ordered txn
LogA-->>A: Ack
LogB-->>B: Ack
A->>A: Execute in order
B->>B: Execute in order
A-->>Seq: Ack commit
B-->>Seq: Ack commit
Seq-->>Client: Commit result
```
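A minimal sketch of the sequencer side of this flow, with illustrative class names: the global order is fixed before execution, so every shard (and every replica of a shard) applies the same transactions in the same order with no prepare round.

```python
import itertools

class Sequencer:
    """Assigns a single global sequence number per transaction."""
    def __init__(self):
        self._seq = itertools.count()

    def order(self, txn):
        return (next(self._seq), txn)


class Shard:
    def __init__(self):
        self.applied = []   # transactions applied, in global order

    def apply(self, seq_no, txn):
        # Deterministic input order means deterministic state:
        # no locks are negotiated across shards at commit time.
        self.applied.append((seq_no, txn))


def submit(sequencer, shards, txn):
    seq_no, txn = sequencer.order(txn)
    for shard in shards:        # broadcast the ordered entry
        shard.apply(seq_no, txn)
    return seq_no
```

A real Calvin-style system batches ordered transactions and ships each shard only the operations touching its keys, but the invariant is the same: order first, execute second.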
Coordinator crash recovery timeline
```mermaid
sequenceDiagram
participant Coord as Coordinator
participant A as Shard A
participant B as Shard B
participant Log as Txn Log
Coord->>A: Prepare
Coord->>B: Prepare
A-->>Coord: Vote yes
B-->>Coord: Vote yes
Coord--x Coord: Crash before commit
A->>Log: Query decision
B->>Log: Query decision
Log-->>A: No decision found
Log-->>B: No decision found
A-->>B: Timeout triggers abort
B-->>A: Rollback prepared state
```
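The participant-side termination rule in this timeline can be sketched as a small decision function, assuming a presumed-abort protocol where no durable decision record means the transaction never committed. The dict-based decision lookup is an assumption for illustration.

```python
def resolve_in_doubt(txn_id, txn_log, timed_out):
    """txn_log: mapping of txn_id -> 'commit' | 'abort' decisions.
    Called by a prepared participant that lost the coordinator."""
    decision = txn_log.get(txn_id)
    if decision is not None:
        return decision        # replay the durable decision
    if timed_out:
        # Presumed abort: no decision record was persisted before
        # the crash, so no participant can have committed; aborting
        # and releasing locks is safe.
        return "abort"
    return "wait"              # keep prepared state and locks, retry
```

The "wait" branch is the classic 2PC blocking window: prepared participants hold locks until a decision (or timeout) resolves them.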
Failure modes
- Coordinator loss after prepare blocks participants until recovery or manual resolution.
- Slow or failed participants hold locks and drive tail latency.
- Replica quorum loss stalls commits even if the coordinator is healthy.
- Clock skew or network partitions invalidate 3PC timing assumptions.
- Compensating actions in sagas can fail, leaving partial outcomes.
Trade-offs
- Atomicity requires extra log writes and coordination round trips.
- Higher isolation improves correctness but extends lock or version lifetimes.
- Consensus-backed commits reduce data loss but add quorum latency.
- Sagas improve availability but move consistency management into application logic.
- Cross-region coordination increases tail latency unless locality is enforced.
Real-world usage
- Spanner and CockroachDB use 2PC across consensus-backed shards for strong consistency.
- MySQL XA and PostgreSQL prepared transactions (`PREPARE TRANSACTION`) expose 2PC to external transaction managers.
- YugabyteDB and TiDB coordinate distributed transactions over Raft-replicated storage groups.
- Microservice workflows often use sagas with an outbox or event log.
- For distributed consistency guarantees, see consistency models.