# System Design Foundations

## Problem framing
Real systems are constrained by latency, throughput, correctness, and failure recovery at the same time. A design that looks clean at low traffic often breaks under bursty load, partial outages, or data growth. The goal is to choose simple components first, then add coordination only where correctness or scale requires it.
## Core idea / pattern

Start with clear workload boundaries, then map each boundary to compute, storage, and networking decisions. Use this page as the control plane, and follow the deep dives into compute, storage, database internals, and consistency models for specifics.
| Dimension | Problem | Pattern | Trade-offs | Failure modes |
|---|---|---|---|---|
| Request handling | Unpredictable traffic and latency spikes | Stateless services behind load balancers | Simplicity vs cross-service chatter | Retry storms, queue buildup, tail latency |
| Data ownership | Shared mutable state across services | Per-domain data ownership with explicit APIs | Autonomy vs cross-domain joins | Inconsistent reads, duplicate writes, drift |
| Reliability | Node or zone failures | Replication, quorum, health-based failover | Higher durability vs write latency | Split brain, stale replicas, quorum loss |
| Coordination | Conflicting updates and leadership decisions | Consensus-backed metadata and leases | Stronger guarantees vs coordination overhead | Leader flapping, lock leaks, stalled progress |
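The "retry storms" failure mode in the request-handling row is typically mitigated with capped exponential backoff plus jitter, so that retries spread out instead of synchronizing. A minimal sketch (the parameter names and limits here are illustrative assumptions, not from the text):

```python
import random

def backoff_delays(base=0.1, cap=5.0, attempts=5, rng=random.random):
    """Capped exponential backoff with full jitter.

    Attempt i waits a random duration in [0, min(cap, base * 2**i)].
    Randomizing within the window prevents many clients from retrying
    in lockstep after a shared outage (a retry storm).
    """
    return [rng() * min(cap, base * (2 ** i)) for i in range(attempts)]

# Deterministic rng shows the upper bound of each retry window
delays = backoff_delays(rng=lambda: 1.0)
```

With a real `random.random`, each delay falls anywhere inside its window, which is what actually de-synchronizes the retrying clients.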
## Architecture diagram

```mermaid
flowchart LR
Client[Client] --> Edge[Edge Gateway]
Edge --> LB[Load Balancer]
LB --> API1[Service A]
LB --> API2[Service B]
API1 --> Cache[(Cache)]
API2 --> Cache
API1 --> DB[(Primary Database)]
API2 --> DB
DB --> Repl[(Replicas)]
API1 --> MQ[(Event Log / Queue)]
API2 --> MQ
MQ --> Worker[Async Workers]
Worker --> DB
```
## Step-by-step flow
- Define product SLOs: p95 latency, availability target, and correctness invariants.
- Model core entities and assign a clear owner for each write path.
- Choose compute topology: synchronous APIs for user paths, async workers for long tasks.
- Select storage shape based on query pattern and consistency requirements.
- Add replication and failover with explicit read and write consistency goals.
- Instrument latency, saturation, and error rates before scaling decisions.
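The last step, instrumenting latency before making scaling decisions, only requires a percentile over recent samples. A small nearest-rank sketch (the sample values are made up for illustration):

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample that is greater
    than or equal to p percent of all samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(p/100 * n) as a 1-based rank, via floor-division trick
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[int(rank) - 1]

# Hypothetical request latencies in milliseconds
latencies_ms = [12, 15, 11, 240, 13, 14, 16, 18, 17, 300]
p95 = percentile(latencies_ms, 95)
```

Note how the p95 here is dominated by the two slow outliers while the median stays low; that gap between median and tail is exactly what averages hide.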
## Request lifecycle reference

```mermaid
sequenceDiagram
participant U as User
participant G as Gateway
participant S as Service
participant C as Cache
participant D as Database
participant Q as Queue
U->>G: Request
G->>S: Route with auth and limits
S->>C: Read-through cache lookup
C-->>S: Hit or miss
S->>D: Query or write
D-->>S: Result
S->>Q: Emit async event
S-->>U: Response
```
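The read-through lookup in the lifecycle above (service asks the cache; on a miss it queries the database and populates the cache) can be sketched in a few lines. All names here are illustrative, not a real library API:

```python
class ReadThroughCache:
    """Read-through cache: on a miss, load from the backing store
    and populate the cache before returning the value."""

    def __init__(self, loader):
        self._store = {}
        self._loader = loader  # called only on a cache miss
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._loader(key)   # the "query" step against the database
        self._store[key] = value    # populate so the next read is a hit
        return value

# Hypothetical backing "database"
db = {"user:1": "alice"}
cache = ReadThroughCache(loader=db.__getitem__)
cache.get("user:1")  # miss: loaded from db, then cached
cache.get("user:1")  # hit: served from the cache
```

A production version would also need expiry and an invalidation path for writes; this sketch only shows the miss-then-populate flow.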
## Failure modes
- Shared databases without ownership boundaries create coupling and unsafe schema changes.
- Synchronous chains with many dependencies amplify latency and outage blast radius.
- No backpressure policy leads to queue growth, timeout cascades, and partial data loss.
- Region failover without tested runbooks causes extended recovery times.
- Weak observability hides slow degradation until user-visible failure occurs.
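One concrete backpressure policy for the queue-growth failure mode above is a bounded queue that rejects at admission time, surfacing overload to callers immediately instead of letting timeouts cascade later. A minimal sketch (class and method names are assumptions for illustration):

```python
from collections import deque

class BoundedQueue:
    """Bounded queue with explicit rejection (load shedding).

    Refusing work at enqueue time keeps the queue's wait time bounded;
    an unbounded queue converts overload into ever-growing latency."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()
        self.rejected = 0

    def offer(self, item):
        if len(self._items) >= self.capacity:
            self.rejected += 1
            return False  # caller should back off or fail fast
        self._items.append(item)
        return True

    def poll(self):
        return self._items.popleft() if self._items else None

q = BoundedQueue(capacity=2)
results = [q.offer(i) for i in range(4)]  # last two offers are shed
```

Callers that receive `False` can retry with backoff or degrade gracefully; either outcome is better than silently queuing work that will time out.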
## Trade-offs
| Decision | Benefit | Cost |
|---|---|---|
| Monolith first | Fast delivery and simpler ops | Harder independent scaling later |
| Microservices early | Clear domain boundaries | Network and observability complexity |
| Strong consistency | Simpler correctness reasoning | Higher coordination latency |
| Eventual consistency | High availability and throughput | Conflict handling and stale reads |
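The strong-versus-eventual rows can be made concrete with the quorum overlap rule: with N replicas, a read quorum R, and a write quorum W, every read is guaranteed to see the latest acknowledged write whenever R + W > N, because any read quorum must intersect any write quorum. A quick check (the replica counts are illustrative):

```python
def quorum_overlap(n, r, w):
    """True when every read quorum must intersect every write quorum,
    so reads are guaranteed to observe the latest acknowledged write."""
    return r + w > n

# N=3 replicas: majority quorums overlap; single-replica quorums do not
strong = quorum_overlap(3, 2, 2)    # 2 + 2 > 3: overlapping quorums
eventual = quorum_overlap(3, 1, 1)  # 1 + 1 <= 3: reads may be stale
```

Choosing R=1, W=1 buys the availability and throughput in the table at the cost of stale reads; R=2, W=2 buys the simpler correctness reasoning at the cost of coordination latency.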
## Real-world usage
| System type | Common architecture choices | Primary concern |
|---|---|---|
| E-commerce | Stateless APIs + relational core + async inventory workflows | Correctness for orders and payments |
| Messaging | Partitioned logs + fanout workers + presence cache | Low latency and durability |
| Analytics platform | Streaming ingest + columnar storage + batch compute | Throughput and cost efficiency |
| AI retrieval service | Metadata store + vector index + background embedding updates | Recall, freshness, and tail latency |