Compute Patterns
Problem framing
Distributed systems must execute work under unpredictable traffic, tight latency budgets, and evolving data models. Compute paradigms describe how work is placed, scaled, and coordinated across machines. Strong designs choose a model based on product goals, failure modes, and trade-offs rather than starting from specific technologies.
Core idea / pattern
Modern systems blend client-server foundations with stateless services, stateful pipelines, and event-driven workflows. Each pattern below includes real-world usage, trade-offs, and failure modes to guide architecture decisions.
Client-server model
A centralized server handles computation while clients focus on presentation and input. This is the default model for web apps and SaaS dashboards, especially when requirements are clear and latency targets are modest.
```mermaid
flowchart LR
  Client --> LB[Load Balancer]
  LB --> App[Application Server]
  App --> DB[(Database)]
```
| Strengths | Weaknesses | Real-world examples |
|---|---|---|
| Simple architecture, centralized control, easy security model. | Server bottlenecks, single-region latency, scaling pressure. | Admin dashboards, enterprise tools, monolithic web apps. |
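The client-server split can be sketched in a few lines: one central server owns the state and the computation, while thin clients only format requests and present results. This is a minimal simulation, not any specific framework's API; the `Server` and `Client` names are illustrative.

```python
# Minimal sketch of the client-server pattern: the server is the single
# source of truth, clients keep no state of their own.

class Server:
    """Central authority: holds all state and performs all computation."""

    def __init__(self):
        self._db = {}  # stands in for the database tier

    def handle(self, request: dict) -> dict:
        op = request["op"]
        if op == "put":
            self._db[request["key"]] = request["value"]
            return {"status": "ok"}
        if op == "get":
            return {"status": "ok", "value": self._db.get(request["key"])}
        return {"status": "error", "reason": "unknown op"}


class Client:
    """Thin client: sends requests, renders responses, stores nothing."""

    def __init__(self, server: Server):
        self._server = server

    def save(self, key, value):
        return self._server.handle({"op": "put", "key": key, "value": value})

    def load(self, key):
        return self._server.handle({"op": "get", "key": key})["value"]


server = Server()
alice, bob = Client(server), Client(server)
alice.save("theme", "dark")
print(bob.load("theme"))  # both clients see the server's single copy of state
```

Because every client talks to the same server, consistency is trivial; the same property is what makes the server a bottleneck under load.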
Stateless compute (cloud-native services)
Stateless services treat each request as independent. All shared state lives in external systems such as caches, queues, or databases. This is the core model for most cloud-native API stacks.
```mermaid
flowchart LR
  Client --> LB[Load Balancer]
  LB --> S1[Stateless Service]
  S1 --> Cache[(Cache)]
  S1 --> DB[(Database)]
```
| Why it scales well | Trade-offs | Real systems |
|---|---|---|
| Instances scale horizontally and recover quickly. | External state increases network I/O and cache complexity. | Netflix APIs, Stripe payments, serverless backends. |
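A sketch of why statelessness enables horizontal scaling: the handler keeps nothing between calls, so a load balancer can send consecutive requests to different instances and get identical behavior. The dict standing in for the cache/database tier is an illustrative stand-in, not a real client library.

```python
# Stateless compute sketch: all shared state lives in an external store,
# so every service instance is interchangeable.

external_store = {}  # stands in for the cache/database tier shared by all instances

def stateless_instance(request: dict) -> dict:
    """Any instance can serve any request: no local state survives the call."""
    count = external_store.get(request["user"], 0) + 1
    external_store[request["user"]] = count
    return {"user": request["user"], "requests_seen": count}

# A load balancer may route each request to a different instance; because
# the count lives externally, the result is the same either way.
r1 = stateless_instance({"user": "u1"})
r2 = stateless_instance({"user": "u1"})  # "different instance", same store
print(r2["requests_seen"])  # 2
```

The cost shows up exactly where the table says: every request now pays a network hop to the external store, which is why cache design dominates stateless architectures.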
Stateful compute systems
Stateful compute co-locates data and processing to minimize hops. It is critical for databases, streaming engines, and low-latency systems where the data is hot and frequently accessed. See storage and data patterns for replication and sharding details.
```mermaid
flowchart LR
  Client --> Node[Stateful Node]
  Node --> Disk[(Local Data)]
```
| Strengths | Challenges | Examples |
|---|---|---|
| Low-latency access, efficient for hot data. | Complex failover, replication lag, careful sharding. | Kafka brokers, Redis, stateful ML inference. |
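Stateful compute implies routing: a request must reach the node that holds the key locally. A minimal sketch of deterministic key-to-node routing, using hash-mod sharding purely for illustration (real systems use consistent hashing or range sharding; see storage and data patterns).

```python
# Stateful compute sketch: data is co-located with the node that serves it,
# so a deterministic routing function must send each key to its owner.

from zlib import crc32

class StatefulNode:
    def __init__(self, name):
        self.name = name
        self.local_data = {}  # co-located with compute: no network hop to read

nodes = [StatefulNode(f"node-{i}") for i in range(3)]

def owner(key: str) -> StatefulNode:
    """Hash-mod sharding: the same key always lands on the same node."""
    return nodes[crc32(key.encode()) % len(nodes)]

# Writes and reads for a key must go through its owning node.
owner("session:42").local_data["session:42"] = {"user": "alice"}
print(owner("session:42").local_data["session:42"])
```

The fragility the table lists follows directly: if that owning node dies before replicating, the key's data is gone, and adding nodes reshuffles ownership.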
Scaling modes (horizontal vs vertical)
Horizontal scaling adds nodes; vertical scaling adds resources per node. Most modern systems scale horizontally because it improves resiliency and fits stateless patterns.
| Mode | Strengths | Risks |
|---|---|---|
| Horizontal | Fault tolerant, elastic growth. | Coordination overhead, more moving parts. |
| Vertical | Simpler ops, fewer dependencies. | Hard limits, larger blast radius. |
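A back-of-envelope sketch of the difference: horizontal scaling meets a throughput target by adding nodes, while vertical scaling must fit the whole load on one machine. All the numbers below are illustrative assumptions.

```python
# Capacity sketch: horizontal scaling adds nodes until the target is met;
# vertical scaling hits a hard ceiling on the biggest available machine.

import math

target_rps = 25_000
node_rps = 4_000              # assumed capacity of one commodity node
biggest_box_rps = 16_000      # assumed ceiling of the largest single machine

horizontal_nodes = math.ceil(target_rps / node_rps)
vertical_fits = target_rps <= biggest_box_rps

print(horizontal_nodes)  # 7 nodes cover the load (add more for failure headroom)
print(vertical_fits)     # False: past the hard limit, vertical cannot follow
```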
MapReduce and batch processing
Batch processing executes large jobs across clusters with fault-tolerant recomputation. MapReduce is the canonical model. See MapReduce (2004) for the original paper.
```mermaid
flowchart LR
  Input[Input Data] --> Map[Map Workers]
  Map --> Shuffle[Shuffle]
  Shuffle --> Reduce[Reduce Workers]
  Reduce --> Output[Output Data]
```
| Strengths | Weaknesses | Real systems |
|---|---|---|
| Massive parallelism, fault recovery via retries. | High latency, inefficient for low-latency needs. | Hadoop, Spark, Google MapReduce. |
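The classic worked example is word count. This single-process sketch mirrors the diagram's phases: map emits `(word, 1)` pairs, shuffle groups pairs by key, and reduce sums each group. In a real cluster each phase runs on many workers in parallel.

```python
# Toy MapReduce word count: a sequential simulation of the map -> shuffle
# -> reduce flow shown in the diagram above.

from collections import defaultdict

def map_phase(chunk: str):
    """Map worker: emit (word, 1) for every word in its input split."""
    for word in chunk.split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce worker: aggregate each key's values (here, sum the counts)."""
    return {key: sum(values) for key, values in groups.items()}

chunks = ["to be or", "not to be"]  # input data split across map workers
mapped = [kv for chunk in chunks for kv in map_phase(chunk)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Fault tolerance comes from the same structure: because map and reduce tasks are deterministic functions of their inputs, a failed task can simply be rerun on another worker.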
Stream processing (near real-time)
Stream processing runs continuous computation on unbounded data. It is the backbone for monitoring, fraud detection, and live analytics.
```mermaid
flowchart LR
  Producers --> Kafka[(Event Bus)]
  Kafka --> Stream[Stream Processor]
  Stream --> Sink[(Analytics or Store)]
```
| Key features | Use cases | Risks |
|---|---|---|
| Windowing, stateful operators, exactly-once semantics. | Fraud detection, metrics aggregation, monitoring. | Lag buildup, out-of-order events. |
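Windowing and out-of-order handling can be sketched with a tumbling-window count. Events carry their own timestamps, and a watermark (which trails event time to allow lateness) decides when a window is final; this is a simplified version of what engines like Flink do, with all names illustrative.

```python
# Tumbling-window count over an unbounded stream. Windows stay open until
# the watermark passes them, which is how out-of-order events still land
# in the right window.

from collections import defaultdict

WINDOW = 10  # seconds per tumbling window

windows = defaultdict(int)   # window start -> running count (operator state)
emitted = {}                 # finalized windows

def on_event(ts: int, watermark: int):
    windows[ts - ts % WINDOW] += 1
    # Close every window that the watermark has fully passed.
    for start in [w for w in windows if w + WINDOW <= watermark]:
        emitted[start] = windows.pop(start)

# The ts=8 event arrives after ts=12 (out of order), but the watermark
# still trails at 8, so the [0, 10) window is open and counts it.
for ts, wm in [(1, 1), (12, 8), (8, 8), (15, 21)]:
    on_event(ts, wm)

print(emitted)  # {0: 2, 10: 2}
```

The listed risks map onto this state: a watermark that advances too slowly builds up lag and open-window memory; one that advances too fast drops late events.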
Data parallelism vs model parallelism
Distributed ML workloads either split the data across identical model replicas (data parallelism) or split the model itself across nodes (model parallelism). Both are common in training and large-scale inference.
```mermaid
flowchart LR
  subgraph DataParallel[Data Parallelism]
    Data[Data Split] --> ModelA[Model Replica A]
    Data --> ModelB[Model Replica B]
    ModelA --> Grad[Gradient Aggregation]
    ModelB --> Grad
  end
  subgraph ModelParallel[Model Parallelism]
    Input[Input] --> Layer1[Layer 1 on Node A]
    Layer1 --> Layer2[Layer 2 on Node B]
    Layer2 --> Output[Output]
  end
```
| Approach | Strengths | Constraints |
|---|---|---|
| Data parallelism | Scales throughput, simpler coordination. | Needs fast gradient aggregation. |
| Model parallelism | Supports very large models. | Pipeline complexity and higher latency. |
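Data parallelism can be shown end to end on a toy linear model `y = w * x`: each replica computes gradients on its own data shard, and the replicas average their gradients before applying the update (the "gradient aggregation" step). This pure-Python sketch stands in for what an all-reduce does across real nodes.

```python
# Data-parallel training sketch for y = w * x with squared-error loss.
# Each "replica" sees only its shard; gradients are averaged per step.

w = 0.0    # model weight, replicated on every node
lr = 0.1   # learning rate

def replica_gradient(shard, w):
    """Mean gradient of (w*x - y)^2 on one shard: 2 * (w*x - y) * x."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

# The full dataset (generated from y = 2x) split across two replicas.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]

for _ in range(200):
    grads = [replica_gradient(shard, w) for shard in shards]  # parallel in practice
    w -= lr * sum(grads) / len(grads)                         # all-reduce average

print(round(w, 3))  # converges toward the true weight 2.0
```

The averaging step is exactly the fast-aggregation constraint in the table: every training step blocks on it, so its latency bounds throughput.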
Event-driven and reactive systems
Event-driven systems react to events instead of polling. This supports loose coupling and scalable workflows. See networking patterns for event transport considerations.
```mermaid
flowchart LR
  User[User Action] --> Event[Event]
  Event --> Bus[(Event Bus)]
  Bus --> Fn[Trigger Function]
  Fn --> SideEffects[Side Effects]
```
| Benefits | Trade-offs | Common tools |
|---|---|---|
| Loose coupling, async processing, scalable fan-out. | Eventual consistency, observability complexity. | Kafka, RabbitMQ, EventBridge. |
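A minimal in-memory sketch of the flow above: producers publish to a bus, subscribers react, and the consumer guards against redelivery with an idempotency key, since most real buses deliver at least once. Function and field names are illustrative, not any broker's API.

```python
# Event-driven sketch: publish/subscribe plus an idempotent consumer.
# Real buses (Kafka, RabbitMQ, EventBridge) redeliver on failure, so the
# consumer must tolerate seeing the same event twice.

subscribers = []
processed_ids = set()   # idempotency guard on the consumer side
notifications = []

def subscribe(handler):
    subscribers.append(handler)

def publish(event: dict):
    for handler in subscribers:   # fan-out to every subscriber
        handler(event)

def send_notification(event):
    if event["id"] in processed_ids:   # duplicate delivery: do nothing
        return
    processed_ids.add(event["id"])
    notifications.append(f"notify user {event['user']}")

subscribe(send_notification)
publish({"id": "evt-1", "user": "alice"})
publish({"id": "evt-1", "user": "alice"})  # redelivered; side effect runs once
print(notifications)  # ['notify user alice']
```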
Peer-to-peer compute
Peer-to-peer networks treat each node as both client and server. They remove central bottlenecks but introduce trust and consistency challenges.
```mermaid
flowchart LR
  PeerA[Peer A] <--> PeerB[Peer B]
  PeerB <--> PeerC[Peer C]
  PeerC <--> PeerA
```
| Strengths | Trade-offs | Examples |
|---|---|---|
| No central bottleneck, resilient mesh. | Complex consistency and trust management. | BitTorrent, blockchain, WebRTC meshes. |
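A gossip-style sketch makes the pattern concrete: every peer is both client and server, and state spreads through pairwise exchanges until all peers converge. The merge-by-highest-version rule is an illustrative last-writer-wins choice, not a full conflict-resolution scheme.

```python
# Peer-to-peer gossip sketch: no central server; a fact learned by one
# peer spreads across the mesh via symmetric pairwise exchanges.

import itertools

class Peer:
    def __init__(self, name):
        self.name = name
        self.state = {}  # key -> (version, value)

    def exchange(self, other: "Peer"):
        """Symmetric gossip round: both sides keep the highest version."""
        for key in set(self.state) | set(other.state):
            a = self.state.get(key, (0, None))
            b = other.state.get(key, (0, None))
            newest = max(a, b)  # last-writer-wins by version number
            self.state[key] = other.state[key] = newest

peers = [Peer(f"p{i}") for i in range(4)]
peers[0].state["feed"] = (1, "hello")      # one peer learns a new fact

for a, b in itertools.combinations(peers, 2):
    a.exchange(b)                           # mesh of pairwise exchanges

print([p.state["feed"][1] for p in peers])  # every peer converges to the fact
```

The trust and consistency challenges in the table live in the merge rule: a malicious peer can inject bogus versions, and concurrent writes with the same version need real conflict resolution.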
Hybrid and multi-tier architectures
Real systems blend patterns: stateless API gateways, stateful media relays, and event-driven analytics. The architecture below mirrors how modern SaaS systems combine compute tiers.
```mermaid
flowchart TB
  Client[Client] --> Gateway[API Gateway]
  Gateway --> Services[Stateless Services]
  Services --> Cache[(Cache)]
  Services --> DB[(Stateful Store)]
  Services --> Stream[Stream Processor]
  Stream --> Analytics[(Analytics Store)]
  Services --> Batch[Batch Jobs]
```
| Tier | Role | Example |
|---|---|---|
| Stateless edge | Auth, routing, request shaping. | API gateway for Microsoft Teams. |
| Stateful core | Low-latency data access. | Media relays or session stores. |
| Async analytics | Batch and streaming insights. | Telemetry pipelines and dashboards. |
How to choose the right compute model
| Requirement | Best pattern |
|---|---|
| Low latency | Stateful or edge compute |
| High scale | Stateless + autoscaling |
| Fault tolerance | Replication + event-driven pipelines |
| ML training | Data or model parallelism |
| Real-time analytics | Streaming |
| Simplicity | Client-server |
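The decision table above can be expressed as a tiny lookup helper. The mapping mirrors the table rows exactly; treat it as a starting point for discussion, not a rulebook.

```python
# The requirement -> pattern table as a lookup, with a fallback that
# reflects the mental model: most real choices are trade-off weighing.

PATTERN_FOR = {
    "low latency": "stateful or edge compute",
    "high scale": "stateless + autoscaling",
    "fault tolerance": "replication + event-driven pipelines",
    "ml training": "data or model parallelism",
    "real-time analytics": "streaming",
    "simplicity": "client-server",
}

def suggest(requirement: str) -> str:
    return PATTERN_FOR.get(requirement.lower(), "no single pattern: weigh trade-offs")

print(suggest("High scale"))    # stateless + autoscaling
print(suggest("exactly-once"))  # no single pattern: weigh trade-offs
```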
Mental model: Compute patterns are trade-offs between latency, consistency, scalability, and complexity. Optimize for your context, not for the most elaborate architecture.
Standard resources
- MapReduce: Simplified Data Processing on Large Clusters - Dean & Ghemawat (2004)
- Dynamo: Amazon's Highly Available Key-value Store - DeCandia et al. (2007)
Architecture diagram
```mermaid
flowchart LR
  Clients[Clients] --> Edge[Edge Gateway]
  Edge --> LB[Load Balancer]
  LB --> Stateless[Stateless Services]
  Stateless --> Cache[Distributed Cache]
  Stateless --> DB[(Stateful Store)]
  Stateless --> Stream[Stream Processor]
  Stream --> Analytics[(Analytics)]
  Stateless --> Batch[Batch Jobs]
```
Step-by-step flow
- A client resolves the service endpoint via DNS or anycast.
- The load balancer selects a healthy backend based on policy.
- The stateless service validates the request and fetches state from cache or storage.
- The service computes the response and writes state updates if needed.
- The response returns through the load balancer to the client.
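The backend-selection step in the flow above can be sketched as round-robin over healthy backends. The health-check and policy details are simplified for illustration; real load balancers layer in weights, connection counts, and active probing.

```python
# Load balancer sketch: round-robin selection that skips backends marked
# unhealthy, as in step 2 of the request flow.

import itertools

class LoadBalancer:
    def __init__(self, backends):
        self.backends = backends
        self.healthy = set(backends)          # updated by health checks
        self._rr = itertools.cycle(backends)  # round-robin position

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def pick(self):
        """Advance round-robin, skipping unhealthy backends."""
        for _ in range(len(self.backends)):
            candidate = next(self._rr)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")

lb = LoadBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")                    # health check failed
picks = [lb.pick() for _ in range(4)]
print(picks)  # ['app-1', 'app-3', 'app-1', 'app-3']
```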
Warning: If state lives implicitly inside service instances, horizontal scaling breaks under load.
Failure modes
- Stateless services overwhelm shared caches or databases during traffic spikes.
- Stateful nodes fail before replication, causing session loss or data gaps.
- Batch pipelines stall on shuffle skew or slow reducers.
- Stream processors fall behind when backpressure is ignored.
- Event-driven systems duplicate work due to missing idempotency.
- Peer-to-peer systems split during partitions or trust failures.
Trade-offs
- Statelessness favors elasticity but increases dependency on storage and cache tiers.
- Stateful compute reduces latency but complicates failover and scaling.
- Batch processing maximizes throughput but sacrifices real-time responsiveness.
- Streaming delivers low latency insights but requires complex state handling.
- Event-driven workflows improve decoupling but need strong observability.
Real-world usage
- Stateless APIs sit behind L7 gateways such as Envoy or NGINX.
- Stateful services include databases, Kafka, and ML inference nodes.
- Batch pipelines feed analytics warehouses and offline reporting.
- Stream processing powers fraud detection and monitoring pipelines.
- Event-driven systems drive notifications, automation, and telemetry.
- Autoscaling is often implemented with the Kubernetes Horizontal Pod Autoscaler (HPA); see Kubernetes patterns for details.