System Design Foundations

Problem framing

Real systems are constrained by latency, throughput, correctness, and failure recovery at the same time. A design that looks clean at low traffic often breaks under bursty load, partial outages, or data growth. The goal is to choose simple components first, then add coordination only where correctness or scale requires it.

Core idea / pattern

Start with clear workload boundaries, then map each boundary to compute, storage, and networking decisions. Use this page as the control plane and deep-dive into compute, storage, database internals, and consistency models.

| Dimension | Problem | Pattern | Trade-offs | Failure modes |
| --- | --- | --- | --- | --- |
| Request handling | Unpredictable traffic and latency spikes | Stateless services behind load balancers | Simplicity vs cross-service chatter | Retry storms, queue buildup, tail latency |
| Data ownership | Shared mutable state across services | Per-domain data ownership with explicit APIs | Autonomy vs cross-domain joins | Inconsistent reads, duplicate writes, drift |
| Reliability | Node or zone failures | Replication, quorum, health-based failover | Higher durability vs write latency | Split brain, stale replicas, quorum loss |
| Coordination | Conflicting updates and leadership decisions | Consensus-backed metadata and leases | Stronger guarantees vs coordination overhead | Leader flapping, lock leaks, stalled progress |
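Retry storms are worth a concrete countermeasure: a minimal sketch of capped exponential backoff with full jitter (the function name and parameters are illustrative, not from any particular library). Jitter spreads retries randomly in time so clients that fail together do not retry together.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, cap=2.0):
    """Retry a flaky operation with capped exponential backoff and full jitter.

    Full jitter picks a random delay in [0, delay), so a burst of clients
    that fail at the same moment does not retry at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

Pairing backoff like this with a server-side request budget (or circuit breaker) is what actually breaks the feedback loop behind retry storms.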

Architecture diagram

```mermaid
flowchart LR
  Client[Client] --> Edge[Edge Gateway]
  Edge --> LB[Load Balancer]
  LB --> API1[Service A]
  LB --> API2[Service B]
  API1 --> Cache[(Cache)]
  API2 --> Cache
  API1 --> DB[(Primary Database)]
  API2 --> DB
  DB --> Repl[(Replicas)]
  API1 --> MQ[(Event Log / Queue)]
  API2 --> MQ
  MQ --> Worker[Async Workers]
  Worker --> DB
```

Step-by-step flow

  1. Define product SLOs: p95 latency, availability target, and correctness invariants.
  2. Model core entities and assign a clear owner for each write path.
  3. Choose compute topology: synchronous APIs for user paths, async workers for long tasks.
  4. Select storage shape based on query pattern and consistency requirements.
  5. Add replication and failover with explicit read and write consistency goals.
  6. Instrument latency, saturation, and error rates before making scaling decisions.
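Step 5's "explicit read and write consistency goals" often reduce to quorum arithmetic. A minimal sketch of the overlap condition (the function name is illustrative): with N replicas, a write acknowledged by W nodes and a read that contacts R nodes are guaranteed to share at least one replica when R + W > N.

```python
def quorum_overlaps(n, w, r):
    """True when every read quorum must intersect every write quorum.

    If R + W > N, any R replicas contacted on a read must include at
    least one of the W replicas that acknowledged the latest write,
    so the read can observe that write.
    """
    return r + w > n

# N=3 with majority reads and writes overlaps; W=1, R=1 does not.
assert quorum_overlaps(3, 2, 2) is True
assert quorum_overlaps(3, 1, 1) is False
```

Overlap guarantees visibility of an acknowledged write, not full linearizability; real systems still need version ordering (e.g. timestamps or log positions) to pick the newest value among the R responses.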

Request lifecycle reference

```mermaid
sequenceDiagram
  participant U as User
  participant G as Gateway
  participant S as Service
  participant C as Cache
  participant D as Database
  participant Q as Queue

  U->>G: Request
  G->>S: Route with auth and limits
  S->>C: Read-through cache lookup
  C-->>S: Hit or miss
  S->>D: Query or write
  D-->>S: Result
  S->>Q: Emit async event
  S-->>U: Response
```
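The read-through step in the lifecycle above can be sketched in a few lines. This is a minimal in-process illustration, not a production cache client: `DictCache` is a hypothetical stand-in, the TTL is ignored, and the database is modeled as a plain dict.

```python
class DictCache:
    """Minimal in-process stand-in for a cache client (no eviction, no TTL)."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value, ttl):
        self._data[key] = value  # ttl ignored in this sketch

def read_through(key, cache, db):
    """Serve from cache on a hit; on a miss, read the authoritative
    store and populate the cache so the next lookup is served locally."""
    value = cache.get(key)
    if value is not None:
        return value
    value = db.get(key)  # authoritative source on a miss
    if value is not None:
        cache.set(key, value, ttl=60)
    return value
```

The design choice hidden here is freshness: once populated, reads are served from the cache until expiry or invalidation, which is exactly where stale reads enter the system.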

Trade-offs

| Decision | Benefit | Cost |
| --- | --- | --- |
| Monolith first | Fast delivery and simpler ops | Harder independent scaling later |
| Microservices early | Clear domain boundaries | Network and observability complexity |
| Strong consistency | Simpler correctness reasoning | Higher coordination latency |
| Eventual consistency | High availability and throughput | Conflict handling and stale reads |
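"Conflict handling" under eventual consistency has to be made concrete somewhere. A minimal sketch of last-writer-wins merging, one common (and lossy) policy; the tuple layout is an assumption for illustration, not a standard format.

```python
def lww_merge(a, b):
    """Pick the surviving version under last-writer-wins.

    Versions are (timestamp, replica_id, value) tuples. Tuple comparison
    orders by timestamp first and breaks ties deterministically by
    replica_id, so every replica converges to the same value.
    """
    return max(a, b)

# Two replicas concurrently update the same key:
v1 = (1700000001, "replica-a", "blue")
v2 = (1700000002, "replica-b", "green")
assert lww_merge(v1, v2)[2] == "green"          # later write wins everywhere
assert lww_merge(v1, v2) == lww_merge(v2, v1)   # merge order does not matter
```

Note the cost named in the table: the earlier write is silently discarded, which is acceptable for some data (presence, counters you can rebuild) and unacceptable for others (orders, payments).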

Real-world usage

| System type | Common architecture choices | Primary concern |
| --- | --- | --- |
| E-commerce | Stateless APIs + relational core + async inventory workflows | Correctness for orders and payments |
| Messaging | Partitioned logs + fanout workers + presence cache | Low latency and durability |
| Analytics platform | Streaming ingest + columnar storage + batch compute | Throughput and cost efficiency |
| AI retrieval service | Metadata store + vector index + background embedding updates | Recall, freshness, and tail latency |