# System Design Foundations

## Problem framing
Real systems are constrained by latency, throughput, correctness, and failure recovery at the same time. A design that looks clean at low traffic often breaks under bursty load, partial outages, or data growth. The goal is to choose simple components first, then add coordination only where correctness or scale requires it.
## Core idea / pattern

Start with clear workload boundaries, then map each boundary to compute, storage, and networking decisions. Use this page as the control plane, and follow the deep dives into compute, storage, database internals, and consistency models for specifics.
| Dimension | Problem | Pattern | Trade-offs | Failure modes |
|---|---|---|---|---|
| Request handling | Unpredictable traffic and latency spikes | Stateless services behind load balancers | Simplicity vs cross-service chatter | Retry storms, queue buildup, tail latency |
| Data ownership | Shared mutable state across services | Per-domain data ownership with explicit APIs | Autonomy vs cross-domain joins | Inconsistent reads, duplicate writes, drift |
| Reliability | Node or zone failures | Replication, quorum, health-based failover | Higher durability vs write latency | Split brain, stale replicas, quorum loss |
| Coordination | Conflicting updates and leadership decisions | Consensus-backed metadata and leases | Stronger guarantees vs coordination overhead | Leader flapping, lock leaks, stalled progress |
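The "retry storms" failure mode in the request-handling row is typically mitigated with capped exponential backoff plus jitter, so that retries spread out instead of synchronizing. A minimal sketch (the parameter names and limits here are illustrative assumptions, not from the text):

```python
import random

def backoff_delays(base=0.1, cap=5.0, attempts=5, rng=random.random):
    """Capped exponential backoff with full jitter.

    Attempt i waits a random duration in [0, min(cap, base * 2**i)].
    Randomizing within the window prevents many clients from retrying
    in lockstep after a shared outage (a retry storm).
    """
    return [rng() * min(cap, base * (2 ** i)) for i in range(attempts)]

# Deterministic rng shows the upper bound of each retry window
delays = backoff_delays(rng=lambda: 1.0)
```

With a real `random.random`, each delay falls anywhere inside its window, which is what actually de-synchronizes the retrying clients.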
## Architecture diagram

```mermaid
flowchart LR
Client[Client] --> Edge[Edge Gateway]
Edge --> LB[Load Balancer]
LB --> API1[Service A]
LB --> API2[Service B]
API1 --> Cache[(Cache)]
API2 --> Cache
API1 --> DB[(Primary Database)]
API2 --> DB
DB --> Repl[(Replicas)]
API1 --> MQ[(Event Log / Queue)]
API2 --> MQ
MQ --> Worker[Async Workers]
Worker --> DB
```
## Step-by-step flow
- Define product SLOs: p95 latency, availability target, and correctness invariants.
- Model core entities and assign a clear owner for each write path.
- Choose compute topology: synchronous APIs for user paths, async workers for long tasks.
- Select storage shape based on query pattern and consistency requirements.
- Add replication and failover with explicit read and write consistency goals.
- Instrument latency, saturation, and error rates before scaling decisions.
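The last step, instrumenting latency before making scaling decisions, only requires a percentile over recent samples. A small nearest-rank sketch (the sample values are made up for illustration):

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample that is greater
    than or equal to p percent of all samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(p/100 * n) as a 1-based rank, via floor-division trick
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[int(rank) - 1]

# Hypothetical request latencies in milliseconds
latencies_ms = [12, 15, 11, 240, 13, 14, 16, 18, 17, 300]
p95 = percentile(latencies_ms, 95)
```

Note how the p95 here is dominated by the two slow outliers while the median stays low; that gap between median and tail is exactly what averages hide.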
## Request lifecycle reference

```mermaid
sequenceDiagram
participant U as User
participant G as Gateway
participant S as Service
participant C as Cache
participant D as Database
participant Q as Queue
U->>G: Request
G->>S: Route with auth and limits
S->>C: Read-through cache lookup
C-->>S: Hit or miss
S->>D: Query or write
D-->>S: Result
S->>Q: Emit async event
S-->>U: Response
```
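The read-through lookup in the lifecycle above (service asks the cache; on a miss it queries the database and populates the cache) can be sketched in a few lines. All names here are illustrative, not a real library API:

```python
class ReadThroughCache:
    """Read-through cache: on a miss, load from the backing store
    and populate the cache before returning the value."""

    def __init__(self, loader):
        self._store = {}
        self._loader = loader  # called only on a cache miss
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._loader(key)   # the "query" step against the database
        self._store[key] = value    # populate so the next read is a hit
        return value

# Hypothetical backing "database"
db = {"user:1": "alice"}
cache = ReadThroughCache(loader=db.__getitem__)
cache.get("user:1")  # miss: loaded from db, then cached
cache.get("user:1")  # hit: served from the cache
```

A production version would also need expiry and an invalidation path for writes; this sketch only shows the miss-then-populate flow.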
## Failure modes
- Shared databases without ownership boundaries create coupling and unsafe schema changes.
- Synchronous chains with many dependencies amplify latency and outage blast radius.
- No backpressure policy leads to queue growth, timeout cascades, and partial data loss.
- Region failover without tested runbooks causes extended recovery times.
- Weak observability hides slow degradation until user-visible failure occurs.
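One concrete backpressure policy for the queue-growth failure mode above is a bounded queue that rejects at admission time, surfacing overload to callers immediately instead of letting timeouts cascade later. A minimal sketch (class and method names are assumptions for illustration):

```python
from collections import deque

class BoundedQueue:
    """Bounded queue with explicit rejection (load shedding).

    Refusing work at enqueue time keeps the queue's wait time bounded;
    an unbounded queue converts overload into ever-growing latency."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()
        self.rejected = 0

    def offer(self, item):
        if len(self._items) >= self.capacity:
            self.rejected += 1
            return False  # caller should back off or fail fast
        self._items.append(item)
        return True

    def poll(self):
        return self._items.popleft() if self._items else None

q = BoundedQueue(capacity=2)
results = [q.offer(i) for i in range(4)]  # last two offers are shed
```

Callers that receive `False` can retry with backoff or degrade gracefully; either outcome is better than silently queuing work that will time out.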
## Trade-offs
| Decision | Benefit | Cost |
|---|---|---|
| Monolith first | Fast delivery and simpler ops | Harder independent scaling later |
| Microservices early | Clear domain boundaries | Network and observability complexity |
| Strong consistency | Simpler correctness reasoning | Higher coordination latency |
| Eventual consistency | High availability and throughput | Conflict handling and stale reads |
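The strong-versus-eventual rows can be made concrete with the quorum overlap rule: with N replicas, a read quorum R, and a write quorum W, every read is guaranteed to see the latest acknowledged write whenever R + W > N, because any read quorum must intersect any write quorum. A quick check (the replica counts are illustrative):

```python
def quorum_overlap(n, r, w):
    """True when every read quorum must intersect every write quorum,
    so reads are guaranteed to observe the latest acknowledged write."""
    return r + w > n

# N=3 replicas: majority quorums overlap; single-replica quorums do not
strong = quorum_overlap(3, 2, 2)    # 2 + 2 > 3: overlapping quorums
eventual = quorum_overlap(3, 1, 1)  # 1 + 1 <= 3: reads may be stale
```

Choosing R=1, W=1 buys the availability and throughput in the table at the cost of stale reads; R=2, W=2 buys the simpler correctness reasoning at the cost of coordination latency.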
## Real-world usage
| System type | Common architecture choices | Primary concern |
|---|---|---|
| E-commerce | Stateless APIs + relational core + async inventory workflows | Correctness for orders and payments |
| Messaging | Partitioned logs + fanout workers + presence cache | Low latency and durability |
| Analytics platform | Streaming ingest + columnar storage + batch compute | Throughput and cost efficiency |
| AI retrieval service | Metadata store + vector index + background embedding updates | Recall, freshness, and tail latency |