Compute Patterns

Problem framing

Distributed systems must execute work under unpredictable traffic, tight latency budgets, and evolving data models. Compute paradigms describe how work is placed, scaled, and coordinated across machines. Strong designs choose a model based on product goals, failure modes, and trade-offs rather than naming technologies.

Problem: Workload shape changes faster than infrastructure can be rebuilt.
Pattern: Pick a compute model, then fit storage, networking, and scaling to it.
Trade-offs: Latency vs. flexibility, simplicity vs. scale, speed vs. consistency.
Failure modes: Bottlenecks, hot partitions, or state loss under failure.

Core idea / pattern

Modern systems blend client-server foundations with stateless services, stateful pipelines, and event-driven workflows. Each pattern below includes real-world usage, trade-offs, and failure modes to guide architecture decisions.

Client-server model

A centralized server handles computation while clients focus on presentation and input. This is the default model for web apps and SaaS dashboards, especially when requirements are clear and latency targets are modest.

flowchart LR
  Client --> LB[Load Balancer]
  LB --> App[Application Server]
  App --> DB[(Database)]
      
Strengths: Simple architecture, centralized control, easy security model.
Weaknesses: Server bottlenecks, single-region latency, scaling pressure.
Real-world examples: Admin dashboards, enterprise tools, monolithic web apps.
Problem: Many clients need a shared source of truth and logic.
Pattern: Route traffic to a central service tier with a consistent data store.
Trade-offs: Simplicity vs. limited horizontal scale and higher latency at distance.
Failure modes: Overloaded app servers or a single-region outage.
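The central-server shape fits in a few lines of Python. This is a minimal sketch, not a production server: the in-memory `STORE` dict stands in for the database, and the endpoint names are illustrative.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

STORE = {"greeting": "hello"}  # stand-in for the central database


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every client hits the same server, which owns the shared state.
        value = STORE.get(self.path.lstrip("/"), "not found")
        body = value.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging for the demo
        pass


# Bind to an ephemeral port and serve in a background thread.
server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
reply = urllib.request.urlopen(f"http://127.0.0.1:{port}/greeting").read()
server.shutdown()
```

Because all logic and state live behind one address, clients stay thin, which is exactly the strength and the bottleneck the table above describes.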

Stateless compute (cloud-native services)

Stateless services treat each request as independent. All shared state lives in external systems such as caches, queues, or databases. This is the core model for most cloud-native API stacks.

flowchart LR
  Client --> LB[Load Balancer]
  LB --> S1[Stateless Service]
  S1 --> Cache[(Cache)]
  S1 --> DB[(Database)]
      
Why it scales well: Instances scale horizontally and recover quickly.
Trade-offs: External state increases network I/O and cache complexity.
Real systems: Netflix APIs, Stripe payments, serverless backends.
Problem: Spiky traffic needs fast scaling without session pinning.
Pattern: Push state to Redis or databases and keep services disposable.
Trade-offs: Elasticity vs. increased latency and cache consistency risk.
Failure modes: Cache stampedes or state store saturation.
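A stateless instance typically reads through a cache-aside path, with a per-key lock to damp the stampedes listed above. This sketch uses in-process dicts as stand-ins for Redis and the database; `TTL` and the key names are illustrative.

```python
import threading
import time

DB = {"user:1": {"name": "Ada"}}  # stand-in for the database
CACHE = {}                        # stand-in for Redis: key -> (expiry, value)
TTL = 30.0
_locks = {}
_locks_guard = threading.Lock()


def get(key):
    """Cache-aside read. Any instance can serve it, since no state is local."""
    entry = CACHE.get(key)
    if entry and entry[0] > time.monotonic():
        return entry[1]                      # cache hit
    with _locks_guard:                       # per-key lock damps stampedes
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        entry = CACHE.get(key)               # re-check after acquiring the lock
        if entry and entry[0] > time.monotonic():
            return entry[1]
        value = DB.get(key)                  # only one loader hits the database
        CACHE[key] = (time.monotonic() + TTL, value)
        return value
```

The double-check inside the lock is what keeps a cold key from turning every waiting request into a database read.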

Stateful compute systems

Stateful compute co-locates data and processing to minimize hops. It is critical for databases, streaming engines, and low-latency systems where the data is hot and frequently accessed. See storage and data patterns for replication and sharding details.

flowchart LR
  Client --> Node[Stateful Node]
  Node --> Disk[(Local Data)]
      
Strengths: Low-latency access, efficient for hot data.
Challenges: Complex failover, replication lag, careful sharding.
Examples: Kafka brokers, Redis, stateful ML inference.
Problem: Hot data needs fast access with minimal network hops.
Pattern: Keep compute and state together; replicate for durability.
Trade-offs: Speed vs. operational complexity and failover risk.
Failure modes: Data loss, replication lag, or hot shard overload.
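The co-location trade-off can be sketched as a leader node that writes locally and replicates synchronously to followers. This is a toy model, not a real replication protocol: the in-memory `log` list stands in for local disk, and real systems add quorums, leader election, and catch-up.

```python
class Node:
    """A stateful node: data lives next to compute, replicated to followers."""

    def __init__(self, followers=()):
        self.log = []                    # local state (simulated disk)
        self.followers = list(followers)

    def append(self, record):
        self.log.append(record)          # local write, no network hop
        acks = 1
        for follower in self.followers:  # synchronous replication for durability
            follower.log.append(record)
            acks += 1
        return acks                      # caller can require a quorum of acks

    def read_latest(self):
        # Reads are served from local state, which is the latency win.
        return self.log[-1] if self.log else None


follower = Node()
leader = Node(followers=[follower])
acks = leader.append("event-1")
```

Reads never leave the node; the cost is that every write now depends on follower health, which is where the failover complexity above comes from.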

Scaling modes (horizontal vs vertical)

Horizontal scaling adds nodes; vertical scaling adds resources per node. Most modern systems scale horizontally because it improves resiliency and fits stateless patterns.

Horizontal. Strengths: fault tolerant, elastic growth. Risks: coordination overhead, more moving parts.
Vertical. Strengths: simpler ops, fewer dependencies. Risks: hard limits, larger blast radius.
Problem: Growth pushes single nodes beyond their limits.
Pattern: Scale out first, then up for specialized bottlenecks.
Trade-offs: Simplicity vs. resilience and operational flexibility.
Failure modes: Oversized nodes or unstable coordination layers.
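A "scale out first" policy reduces to a small sizing function. This is a simplified sketch: the 60% target utilization, node bounds, and single CPU metric are illustrative assumptions, and a real autoscaler adds cooldowns and smoothing to avoid flapping.

```python
def scale_decision(cpu_utilizations, target=0.6, min_nodes=2, max_nodes=20):
    """Pick a node count that moves average CPU toward the target.

    `cpu_utilizations` holds one recent sample per current node, as
    fractions of 1.0 (e.g. 0.9 means 90% busy).
    """
    current = len(cpu_utilizations)
    avg = sum(cpu_utilizations) / current
    # Total demand is roughly current * avg; divide by the target to size the fleet.
    desired = max(1, round(current * avg / target))
    return max(min_nodes, min(max_nodes, desired))
```

Four nodes at 90% CPU yield a desired count of six; four idle nodes at 10% shrink back to the two-node floor rather than to one, keeping redundancy.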

MapReduce and batch processing

Batch processing executes large jobs across clusters with fault-tolerant recomputation. MapReduce is the canonical model. See MapReduce (2004) for the original paper.

flowchart LR
  Input[Input Data] --> Map[Map Workers]
  Map --> Shuffle[Shuffle]
  Shuffle --> Reduce[Reduce Workers]
  Reduce --> Output[Output Data]
      
Strengths: Massive parallelism, fault recovery via retries.
Weaknesses: High latency; unsuitable for interactive workloads.
Real systems: Hadoop, Spark, Google MapReduce.
Problem: Large datasets need offline or periodic computation.
Pattern: Split data, process in parallel, aggregate results.
Trade-offs: Throughput vs. long end-to-end latency.
Failure modes: Shuffle skew, slow reducers, or cascading job retries.
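The map, shuffle, and reduce phases from the diagram can be shown in miniature with the classic word count. This is a single-process sketch; in a real framework each phase runs on separate workers and the shuffle moves data over the network.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit (word, 1) pairs for one input split."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate each key's values into a final result."""
    return {key: sum(values) for key, values in groups.items()}


splits = ["to be or", "not to be"]  # two input splits, mapped independently
pairs = [pair for chunk in splits for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(pairs))
```

Because each map task touches only its own split, a failed task can simply be re-run on its split, which is the fault-recovery property the table above credits to the model.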

Stream processing (near real-time)

Stream processing runs continuous computation on unbounded data. It is the backbone for monitoring, fraud detection, and live analytics.

flowchart LR
  Producers --> Kafka[(Event Bus)]
  Kafka --> Stream[Stream Processor]
  Stream --> Sink[(Analytics or Store)]
      
Key features: Windowing, stateful operators, exactly-once semantics.
Use cases: Fraud detection, metrics aggregation, monitoring.
Risks: Lag buildup, out-of-order events.
Problem: Insights must arrive seconds after events, not hours.
Pattern: Process streams continuously with stateful operators.
Trade-offs: Timeliness vs. operational complexity and state size.
Failure modes: Backpressure, lag spikes, or inconsistent window state.
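Windowing, the core stateful operation, can be sketched as bucketing events by event time. This toy version just assigns tumbling windows; a real processor (Flink, Kafka Streams) also handles late and out-of-order events with watermarks, which this sketch omits.

```python
from collections import defaultdict


def tumbling_counts(events, window_seconds=60):
    """Assign each (timestamp, key) event to a fixed-size window, count per key."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # e.g. 65 -> window [60, 120)
        windows[window_start][key] += 1            # per-window operator state
    return {start: dict(counts) for start, counts in windows.items()}


events = [(5, "login"), (30, "login"), (65, "login"), (70, "purchase")]
result = tumbling_counts(events)
```

The per-window dicts are the "stateful operator" state the table mentions; in production that state must be checkpointed, which is where much of the operational complexity lives.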

Data parallelism vs model parallelism

Distributed ML workloads either split the data across identical replicas of the model (data parallelism) or split the model itself across nodes (model parallelism). Both are common in training and large-scale inference.

flowchart LR
  subgraph DataParallel[Data Parallelism]
    Data[Data Split] --> ModelA[Model Replica A]
    Data --> ModelB[Model Replica B]
    ModelA --> Grad[Gradient Aggregation]
    ModelB --> Grad
  end
  subgraph ModelParallel[Model Parallelism]
    Input[Input] --> Layer1[Layer 1 on Node A]
    Layer1 --> Layer2[Layer 2 on Node B]
    Layer2 --> Output[Output]
  end
      
Data parallelism. Strengths: scales throughput, simpler coordination. Constraints: needs fast gradient aggregation.
Model parallelism. Strengths: supports very large models. Constraints: pipeline complexity and higher latency.
Problem: ML workloads exceed single-node compute or memory.
Pattern: Split the data or the model to distribute training and inference.
Trade-offs: Throughput vs. synchronization overhead.
Failure modes: Straggler nodes or gradient bottlenecks.
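The data-parallel half of the diagram reduces to: each replica computes a gradient on its own shard, the gradients are averaged (the all-reduce step), and one shared update is applied. This sketch trains a one-parameter model y ≈ w·x sequentially; in a real system the per-shard gradients run on separate devices.

```python
def gradient(w, batch):
    """Gradient of mean squared error for y ≈ w*x on one data shard."""
    g = 0.0
    for x, y in batch:
        g += 2 * (w * x - y) * x
    return g / len(batch)


def data_parallel_step(w, shards, lr=0.01):
    """One synchronous data-parallel step.

    Each replica computes a gradient on its shard (parallel in practice),
    the gradients are averaged (gradient aggregation), and every replica
    applies the same update so weights stay identical.
    """
    grads = [gradient(w, shard) for shard in shards]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad


shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]  # data where y = 2x
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
```

The averaging step is the coordination cost the table flags: if one replica is slow, every synchronous step waits for it, which is exactly the straggler failure mode.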

Event-driven and reactive systems

Event-driven systems react to events instead of polling. This supports loose coupling and scalable workflows. See networking patterns for event transport considerations.

flowchart LR
  User[User Action] --> Event[Event]
  Event --> Bus[(Event Bus)]
  Bus --> Fn[Trigger Function]
  Fn --> SideEffects[Side Effects]
      
Benefits: Loose coupling, async processing, scalable fan-out.
Trade-offs: Eventual consistency, observability complexity.
Common tools: Kafka, RabbitMQ, EventBridge.
Problem: Systems need to respond to many asynchronous triggers.
Pattern: Emit events and let independent consumers react.
Trade-offs: Flexibility vs. tracing and ordering complexity.
Failure modes: Lost events or duplicated processing.
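The coupling model can be shown with a minimal in-process event bus. This is only a sketch of the pattern: a production bus (Kafka, EventBridge) adds durability, ordering guarantees, and redelivery, none of which are modeled here, and the topic and handler names are illustrative.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process bus: producers publish, independent consumers react."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The producer knows nothing about who consumes the event.
        for handler in self._subscribers[topic]:
            handler(event)


bus = EventBus()
effects = []
bus.subscribe("user.signup", lambda e: effects.append(("audit", e["user"])))
bus.subscribe("user.signup", lambda e: effects.append(("email", e["user"])))
bus.publish("user.signup", {"user": "ada"})
```

Adding a third consumer requires no change to the producer, which is the loose coupling benefit; the cost is that no single component can see the whole workflow, hence the tracing complexity above.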

Peer-to-peer compute

Peer-to-peer networks treat each node as both client and server. They remove central bottlenecks but introduce trust and consistency challenges.

flowchart LR
  PeerA[Peer A] <--> PeerB[Peer B]
  PeerB <--> PeerC[Peer C]
  PeerC <--> PeerA
      
Strengths: No central bottleneck, resilient mesh.
Trade-offs: Complex consistency and trust management.
Examples: BitTorrent, blockchain networks, WebRTC meshes.
Problem: Coordination is needed without a central owner.
Pattern: Nodes share workloads and data directly.
Trade-offs: Resilience vs. governance and trust complexity.
Failure modes: Sybil attacks, inconsistent state, or partitioned peers.
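Peer-to-peer dissemination is often done by gossip: each round, every peer pushes what it knows to a randomly chosen peer. This is a toy push-only model with seeded randomness and a fixed number of rounds; real protocols add fanout tuning, pull exchanges, and anti-entropy repair.

```python
import random


def gossip_round(states, fanout=1, rng=None):
    """One gossip round: each peer pushes its known keys to `fanout` random peers."""
    rng = rng or random.Random(0)
    updates = []
    for i, known in enumerate(states):
        peers = [j for j in range(len(states)) if j != i]
        for j in rng.sample(peers, fanout):
            updates.append((j, set(known)))
    for j, known in updates:  # apply after the round, as if peers ran in parallel
        states[j] |= known
    return states


# Peer 0 starts with the only copy of "x"; gossip spreads it through the mesh.
states = [{"x"}, set(), set(), set()]
for r in range(10):
    gossip_round(states, rng=random.Random(r))
informed = sum("x" in s for s in states)
```

Spread is monotonic, so the informed count only grows round over round; with high probability all peers converge within a few rounds, and a network partition simply freezes the count, which is the partitioned-peers failure mode above.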

Hybrid and multi-tier architectures

Real systems blend patterns: stateless API gateways, stateful media relays, and event-driven analytics. The architecture below mirrors how modern SaaS systems combine compute tiers.

flowchart TB
  Client[Client] --> Gateway[API Gateway]
  Gateway --> Services[Stateless Services]
  Services --> Cache[(Cache)]
  Services --> DB[(Stateful Store)]
  Services --> Stream[Stream Processor]
  Stream --> Analytics[(Analytics Store)]
  Services --> Batch[Batch Jobs]
      
Stateless edge. Role: auth, routing, request shaping. Example: API gateway for Microsoft Teams.
Stateful core. Role: low-latency data access. Example: media relays or session stores.
Async analytics. Role: batch and streaming insights. Example: telemetry pipelines and dashboards.
Problem: Single patterns rarely cover every workload shape.
Pattern: Combine tiers to isolate latency-sensitive and async work.
Trade-offs: Flexibility vs. integration complexity and cost.
Failure modes: Coupled tiers causing cascading failures.

How to choose the right compute model

Low latency: stateful or edge compute.
High scale: stateless services + autoscaling.
Fault tolerance: replication + event-driven pipelines.
ML training: data or model parallelism.
Real-time analytics: stream processing.
Simplicity: client-server.

Mental model: Compute patterns are trade-offs among latency, consistency, scalability, and complexity. Optimize for your context, not for the most elaborate architecture.

Standard resources

Architecture diagram

flowchart LR
  Clients[Clients] --> Edge[Edge Gateway]
  Edge --> LB[Load Balancer]
  LB --> Stateless[Stateless Services]
  Stateless --> Cache[Distributed Cache]
  Stateless --> DB[(Stateful Store)]
  Stateless --> Stream[Stream Processor]
  Stream --> Analytics[(Analytics)]
  Stateless --> Batch[Batch Jobs]
      


Step-by-step flow

  1. A client resolves the service endpoint via DNS or anycast.
  2. The load balancer selects a healthy backend based on policy.
  3. The stateless service validates the request and fetches state from cache or storage.
  4. The service computes the response and writes state updates if needed.
  5. The response returns through the load balancer to the client.

Warning: If state is implicit in the service, horizontal scaling breaks under load.
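Step 2 of the flow, backend selection, can be sketched as round-robin over healthy instances. The backend names and health model here are illustrative; real load balancers combine active health checks with policies like least-connections.

```python
import itertools


class RoundRobinBalancer:
    """Round-robin selection that skips unhealthy backends (step 2 above)."""

    def __init__(self, backends):
        self.backends = backends
        self.health = {b: True for b in backends}   # updated by health checks
        self._cycle = itertools.cycle(backends)

    def mark_down(self, backend):
        self.health[backend] = False

    def pick(self):
        # Try at most one full rotation before giving up.
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if self.health[candidate]:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")                  # health check fails for app-2
picks = [lb.pick() for _ in range(4)]  # traffic flows only to healthy nodes
```

Because the services are stateless, it does not matter which healthy backend is chosen, which is precisely why the warning about implicit state matters: session pinning would defeat this selection logic.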

