System Design: Networking Essentials

Problem framing

Networking is the connective tissue of every distributed system. Product features, reliability guarantees, and latency budgets all rise or fall on how well components communicate. The goal is not to memorize protocols, but to explain the intent, trade-offs, and failure modes behind the networking choices you make.

Tip: Tie every protocol, load balancer, or retry policy back to a requirement and a failure mode.

- **Problem:** Systems fail when communication breaks or becomes unpredictable.
- **Pattern:** Layered abstractions + clear request flow + defensive reliability primitives.
- **Trade-offs:** Latency vs. reliability, simplicity vs. control, speed vs. safety.
- **Failure modes:** Timeouts, retries, partitions, and cascading overload.

Core idea / pattern

Networking layers: conceptual map

Layering lets engineers reason at the right level: application contracts, transport guarantees, and network routing. When you explain this stack, you show you understand where problems originate.

```mermaid
flowchart TB
  App[Application Layer] --> Transport[Transport Layer]
  Transport --> Network[Network Layer]
  Network --> Physical[Physical / Data Link]
```
| Layer | Purpose | Examples |
| --- | --- | --- |
| Application | Define request/response and streaming semantics | HTTP, WebSockets, gRPC |
| Transport | Reliability, ordering, congestion control | TCP, UDP, QUIC |
| Network | Routing packets between machines | IP |
| Physical/Data Link | Actual transmission over medium | Ethernet, Wi-Fi |
- **Problem:** Bugs hide when layers are mixed or misunderstood.
- **Pattern:** Identify the layer where the reliability or latency issue starts.
- **Trade-offs:** Abstraction ease vs. precise control of the wire.
- **Failure modes:** Misplaced fixes, such as tuning HTTP for a TCP bottleneck.

How a web request actually works

A single web request touches multiple systems: DNS, transport setup, application routing, and response streaming. Each hop adds latency and failure potential.

```mermaid
sequenceDiagram
  participant Client
  participant DNS
  participant Edge
  participant Service
  participant DB
  Client->>DNS: Resolve domain
  DNS-->>Client: IP address
  Client->>Edge: TCP/TLS handshake
  Client->>Edge: HTTP request
  Edge->>Service: Route request
  Service->>DB: Query data
  DB-->>Service: Result
  Service-->>Client: HTTP response
```
- **Problem:** Latency stacks up across every hop.
- **Pattern:** Minimize hops and reuse connections when possible.
- **Trade-offs:** Connection reuse vs. long-lived resource usage.
- **Failure modes:** DNS cache misses, handshake failures, or slow downstream calls.
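The hop-by-hop stacking above can be sketched as simple arithmetic. The per-hop millisecond figures below are illustrative assumptions, not measurements, but they show why connection reuse matters:

```python
# Illustrative per-hop latency budget for one web request.
# All millisecond values are assumptions for the sketch, not measurements.
HOP_LATENCY_MS = {
    "dns_lookup": 20,       # cache miss; close to 0 on a warm cache
    "tcp_handshake": 30,    # one round trip to the edge
    "tls_handshake": 30,    # one more round trip with TLS 1.3
    "http_request": 30,     # request plus first byte of the response
    "service_to_db": 10,    # intra-datacenter query
}

def request_budget_ms(hops: dict[str, int], reuse_connection: bool = False) -> int:
    """Total latency; a reused connection skips DNS and both handshakes."""
    skipped = {"dns_lookup", "tcp_handshake", "tls_handshake"} if reuse_connection else set()
    return sum(ms for hop, ms in hops.items() if hop not in skipped)

cold = request_budget_ms(HOP_LATENCY_MS)        # cold connection: 120 ms
warm = request_budget_ms(HOP_LATENCY_MS, True)  # reused connection: 40 ms
```

Under these assumptions, keep-alive removes two thirds of the budget, which is why connection pooling is usually the first latency fix.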

Transport protocols

Transport protocols define reliability and speed. Pick the protocol that matches your latency and delivery requirements, then validate the choice against your SLA.

```mermaid
flowchart LR
  TCP[TCP: reliable] --> HTTP[HTTP/1.1, HTTP/2]
  UDP[UDP: fast] --> QUIC[QUIC: reliable over UDP]
  QUIC --> H3[HTTP/3]
```
| Protocol | Strengths | Weaknesses | Use cases |
| --- | --- | --- | --- |
| TCP | Ordered, reliable delivery | Handshake + congestion overhead | Web traffic, databases |
| UDP | Low latency, no connection setup | No delivery guarantees | Streaming, gaming, telemetry |
| QUIC | Reliable + encrypted + fast handshakes | Higher CPU cost, newer ecosystem | HTTP/3, mobile clients |
- **Problem:** Transport choices shape reliability and latency.
- **Pattern:** Match delivery guarantees to user expectations.
- **Trade-offs:** Reliability vs. speed and CPU overhead.
- **Failure modes:** Packet loss, congestion collapse, or reordering bugs.
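To make the TCP guarantees concrete, here is a minimal loopback echo using Python's standard `socket` module — a sketch, not production code. It shows the connection setup and acknowledged, in-order round trip that a UDP `sendto` would omit:

```python
import socket
import threading

def tcp_echo_once() -> bytes:
    """Round-trip one payload through a one-shot TCP echo server on loopback."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def serve() -> None:
        conn, _ = server.accept()      # TCP: the three-way handshake completes here
        conn.sendall(conn.recv(1024))  # echo the payload back, in order
        conn.close()

    threading.Thread(target=serve, daemon=True).start()
    client = socket.create_connection(("127.0.0.1", port), timeout=2)
    client.sendall(b"ping")            # delivery and ordering are guaranteed
    reply = client.recv(1024)
    client.close()
    server.close()
    return reply
```

With `SOCK_DGRAM` instead, there is no handshake and no guarantee the datagram arrives at all — which is exactly the trade the table above describes.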

Application layer protocols

Application protocols express how data flows: request/response, streaming, or bidirectional messaging. External APIs usually favor REST, while internal services favor gRPC. HTTP is stateless by default, and headers enable authentication, caching, and compression strategies. See SOLID design for API contract discipline.

```mermaid
flowchart LR
  Client --> REST[REST APIs]
  Client --> GraphQL[GraphQL]
  Client --> SSE[SSE Stream]
  Client <--> WS[WebSockets]
  ServiceA[Service A] <--> gRPC[gRPC]
```
| Protocol | Strength | Trade-offs | Best fit |
| --- | --- | --- | --- |
| REST | Simple, cache-friendly | Over-fetching, chatty calls | Public APIs |
| GraphQL | Client-controlled shape | Complex execution, caching harder | Flexible UI clients |
| gRPC | Fast, typed, streaming | Browser support limited | Internal microservices |
| SSE | Simple server push | One-way only | Notifications, feeds |
| WebSockets | Full-duplex messaging | Persistent connection cost | Chat, collaboration |
| HTTP element | Purpose | Examples |
| --- | --- | --- |
| Methods | Describe intent | GET, POST, PUT, PATCH, DELETE |
| Status codes | Signal outcome | 2xx success, 3xx redirect, 4xx client error, 5xx server error |
| Headers | Carry metadata | Auth, caching, compression, tracing |
- **Problem:** Mismatched protocols create UX and scaling problems.
- **Pattern:** Choose based on interaction style and client ecosystem.
- **Trade-offs:** Flexibility vs. operational complexity.
- **Failure modes:** Overfetching, server overload, or dropped streams.
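The status-code classes in the table map directly onto retry decisions. A tiny helper sketch, with the usual convention that server errors and 429 (rate limited) are retryable while other client errors are not:

```python
def status_class(code: int) -> str:
    """Map an HTTP status code to its outcome class (2xx, 3xx, 4xx, 5xx)."""
    classes = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}
    return classes.get(code // 100, "unknown")

def is_retryable(code: int) -> bool:
    """Retry server errors and 429; retrying other 4xx just repeats the mistake."""
    return code // 100 == 5 or code == 429
```

Retrying a 404 wastes a request; retrying a 503 is often exactly what the server's `Retry-After` header invites.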

Load balancing fundamentals

Load balancing spreads traffic, protects availability, and supports horizontal scale. Client-side load balancing removes a hop but adds complexity. Server-side load balancing centralizes control.

```mermaid
flowchart LR
  ClientA[Client-side] --> S1[Service A]
  ClientA --> S2[Service B]
  ClientB[Client-side] --> S3[Service C]
  ClientB --> S4[Service D]
  User[User] --> LB[Server-side LB]
  LB --> S5[Service E]
  LB --> S6[Service F]
```
| Approach | Strengths | Risks |
| --- | --- | --- |
| Client-side | Lower latency, no central bottleneck | Client complexity and uneven upgrades |
| Server-side | Centralized policy, easier operations | Extra hop and critical dependency |
| Algorithm | What it optimizes | Risk |
| --- | --- | --- |
| Round robin | Even distribution | Ignores instance health differences |
| Least connections | Current load | Does not predict spikes |
| Least latency | Fastest response | Overloads fastest instance |
| Hash-based | Session affinity | Hot keys and uneven load |
- **Problem:** Uneven traffic and hot spots degrade availability.
- **Pattern:** Distribute traffic and protect the hottest path.
- **Trade-offs:** Control vs. added latency and dependency.
- **Failure modes:** Load balancer bottlenecks or sticky-session skew.
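The first two algorithms from the table can be sketched in a few lines. Instance names and connection counts are illustrative:

```python
import itertools

class RoundRobin:
    """Cycle through instances in order, ignoring their current load or health."""
    def __init__(self, instances: list[str]) -> None:
        self._cycle = itertools.cycle(instances)

    def pick(self) -> str:
        return next(self._cycle)

def least_connections(active: dict[str, int]) -> str:
    """Pick the instance with the fewest in-flight connections right now."""
    return min(active, key=active.get)

balancer = RoundRobin(["a", "b", "c"])
# Successive picks walk a, b, c, a, ... regardless of load;
# least_connections({"a": 5, "b": 2, "c": 9}) instead favors "b".
```

The contrast is the table's point: round robin is trivially fair but health-blind, while least-connections reacts to load but cannot anticipate a spike that has not arrived yet.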

Layer 4 vs Layer 7 load balancers

Layer 4 balancers operate on TCP/UDP, while Layer 7 balancers inspect HTTP content. Use L4 for speed and L7 for routing intelligence.

```mermaid
flowchart LR
  Client --> L4[L4 LB]
  Client --> L7[L7 LB]
  L4 --> TCP[TCP/UDP streams]
  L7 --> HTTP[HTTP routing rules]
```
| Feature | L4 | L7 |
| --- | --- | --- |
| Operates on | TCP/UDP | HTTP content |
| Performance | Very high | High, but slightly slower |
| Use cases | Streaming, WebSockets | API routing, canary releases |
- **Problem:** Routing needs differ by protocol and payload.
- **Pattern:** Use L4 for speed, L7 for richer rules.
- **Trade-offs:** Speed vs. visibility into requests.
- **Failure modes:** Misrouting, TLS termination errors, or extra latency.
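A minimal sketch of the kind of rule only an L7 balancer can apply, since it requires reading the HTTP path. The path prefixes and backend pool names are hypothetical:

```python
# Hypothetical L7 routing table: first matching prefix wins, so order matters.
# An L4 balancer sees only TCP segments and cannot make this decision.
ROUTES = [
    ("/api/v2/", "canary-pool"),   # canary release keyed on a path prefix
    ("/api/",    "api-pool"),
    ("/static/", "cdn-origin"),
]

def route(path: str, default: str = "web-pool") -> str:
    """Return the backend pool for the first matching prefix, else the default."""
    for prefix, backend in ROUTES:
        if path.startswith(prefix):
            return backend
    return default
```

Note the ordering hazard: if `/api/` came before `/api/v2/`, the canary rule would never fire — a concrete instance of the "misrouting" failure mode above.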

High availability and fault tolerance

Availability relies on fast health checks, automated failover, and safe retry policies. See compute patterns for scaling fundamentals.

```mermaid
flowchart LR
  Client --> LB[Load Balancer]
  LB --> S1[Healthy Service]
  LB -. Health Check .-> S2[Unhealthy Service]
  LB --> S3[Healthy Service]
  S1 --> Store[(Primary Store)]
  S3 --> Store
```
| Technique | Purpose | Risk |
| --- | --- | --- |
| Health checks | Detect unhealthy nodes | False positives if probes are shallow |
| Failover | Route around failures | Overloading remaining nodes |
| Retries + timeouts | Handle transient errors | Retry storms without backoff |
| Idempotency | Safe retries for writes | Complexity in data model |
- **Problem:** Failures are inevitable in distributed systems.
- **Pattern:** Detect, isolate, and recover quickly.
- **Trade-offs:** Reliability vs. extra infrastructure and cost.
- **Failure modes:** Cascading failures, retries amplifying load.
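Health checking and failover reduce to filtering plus sequential retry. A sketch under those assumptions; probe results and instance names are illustrative:

```python
from typing import Callable

def healthy_instances(instances: dict[str, bool]) -> list[str]:
    """Keep only instances whose last health probe succeeded."""
    return [name for name, ok in instances.items() if ok]

def call_with_failover(instances: list[str], call: Callable[[str], str]) -> str:
    """Try each instance in turn, routing around connection failures."""
    last_error = None
    for name in instances:
        try:
            return call(name)
        except ConnectionError as exc:
            last_error = exc           # note the failure, fail over to the next node
    raise RuntimeError("all instances failed") from last_error
```

This also exposes the table's risks: a shallow probe can report `True` for a node that fails real calls, and failing everything over to one survivor is how the remaining node gets overloaded.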

Geographic scaling and latency

Physics sets the floor on latency. Global systems deploy closer to users and rely on CDNs for static content. See latency numbers to ground your estimates.

```mermaid
flowchart LR
  UserUS[User: US] --> EdgeUS[US Edge]
  UserEU[User: EU] --> EdgeEU[EU Edge]
  EdgeUS --> RegionUS[US Region]
  EdgeEU --> RegionEU[EU Region]
  RegionUS --> GlobalDB[(Global Data)]
  RegionEU --> GlobalDB
```
| Approach | Benefit | Risk |
| --- | --- | --- |
| Regional services | Lower user latency | Data consistency across regions |
| CDN caching | Fast static content delivery | Stale or inconsistent assets |
| Geo-partitioning | Local data compliance | Cross-region query complexity |
- **Problem:** Global users feel latency and jitter quickly.
- **Pattern:** Bring compute and content closer to users.
- **Trade-offs:** Lower latency vs. consistency complexity.
- **Failure modes:** Regional outages or split-brain data.
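The physics floor is easy to estimate. Light in fiber travels at roughly two thirds of c, about 200,000 km/s; the distance figure below is an approximate great-circle number:

```python
# ~200,000 km/s in fiber means about 200 km per millisecond, one way.
FIBER_KM_PER_MS = 200.0

def min_rtt_ms(distance_km: float) -> float:
    """Theoretical round-trip floor, ignoring routing detours and queuing."""
    return 2 * distance_km / FIBER_KM_PER_MS

# New York to London is roughly 5,600 km, so the RTT floor is about 56 ms.
# Real RTTs are higher: routes are not great circles, and switches add delay.
```

No amount of server tuning gets under that floor, which is the whole argument for edge deployment and CDNs.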

Handling failures gracefully

Resilience patterns protect systems from cascading failures. Combine circuit breakers, retries with backoff, and bulkheads to contain overload.

```mermaid
flowchart LR
  Client --> Service[Service]
  Service --> Circuit[Circuit Breaker]
  Circuit --> Queue[(Queue)]
  Queue --> Worker[Worker Pool]
  Service --> Rate[Rate Limiter]
```
| Pattern | Strength | Failure mode avoided |
| --- | --- | --- |
| Circuit breaker | Stops cascading retries | Downstream overload |
| Bulkheads | Isolates resource pools | Noisy neighbor failures |
| Rate limiting | Controls ingress load | Traffic spikes |
| Backoff + jitter | Spreads retries | Retry storms |
- **Problem:** Failure cascades can take down healthy systems.
- **Pattern:** Add guardrails to shed or slow traffic under stress.
- **Trade-offs:** Resilience vs. extra latency and complexity.
- **Failure modes:** Overly aggressive breakers blocking real traffic.
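Two of these guardrails sketched minimally: "full jitter" exponential backoff, and a circuit breaker reduced to a failure counter. Thresholds and base delays are illustrative:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """'Full jitter' backoff: sleep a random duration up to the exponential cap,
    so simultaneous retriers spread out instead of stampeding together."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

class CircuitBreaker:
    """Open after `threshold` consecutive failures; a single success resets it."""
    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold  # open: reject calls, stop the cascade

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1
```

A production breaker also needs a half-open probe state so it can recover without a thundering herd; that is deliberately omitted here to keep the counter logic visible.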

When to use: Any system with retries, shared dependencies, or bursty traffic.


Architecture diagram

This reference architecture shows how networking components fit together in a typical global system.

```mermaid
flowchart LR
  User[User] --> DNS[DNS Resolver]
  DNS --> Edge[Edge Gateway]
  Edge --> WAF[WAF + Rate Limits]
  WAF --> LB[Load Balancer]
  LB --> ServiceA[Service A]
  LB --> ServiceB[Service B]
  ServiceA --> Cache[(Cache)]
  ServiceB --> Cache
  ServiceA --> DB[(Primary DB)]
  ServiceB --> DB
  Edge --> CDN[CDN]
```

Step-by-step flow

  1. DNS resolves the domain to an IP address, often using caches and authoritative servers.
  2. The client establishes a TCP or QUIC connection (and TLS if needed).
  3. The HTTP request is sent with headers, cookies, and authentication tokens.
  4. Edge gateways route the request, apply rate limits, and forward to a backend service.
  5. The service processes data, calls downstream systems, and prepares the response.
  6. The response is returned and the connection is closed or reused with keep-alive.

Warning: If you skip DNS, connection setup, or retries, your latency estimates will be wrong.
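One way to see why skipped steps wreck estimates: a retried call pays the full timeout of the failed attempt before the second try even starts. The numbers below are illustrative assumptions:

```python
def expected_latency_ms(success_ms: float, timeout_ms: float,
                        failure_rate: float) -> float:
    """Expected latency with one retry: a failed attempt costs a whole
    timeout before the (assumed successful) second attempt runs."""
    return (1 - failure_rate) * success_ms + failure_rate * (timeout_ms + success_ms)

# A 50 ms call with a 1,000 ms timeout and a 1% failure rate averages
# about 60 ms: the rare timeout, not the happy path, dominates the tail.
```

This is why timeout and retry settings belong in the latency budget, not as an afterthought.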

Trade-offs

| Trade-off | Why it matters | Typical choice |
| --- | --- | --- |
| Latency vs. reliability | Retries improve reliability but add latency. | Retry with budget + backoff. |
| Centralized vs. distributed control | Gateways simplify policy but add dependencies. | Centralized edge, distributed services. |
| Consistency vs. availability | Geo scale increases partition risk. | See consistency models. |
| Speed vs. observability | L7 inspection adds visibility but costs latency. | Use L7 for routing, L4 for streams. |

Real-world usage

When to use what (cheat sheet)

| Scenario | Recommended approach |
| --- | --- |
| Public APIs | REST over HTTPS |
| Internal microservices | gRPC + service mesh |
| Real-time chat | WebSockets |
| Notifications | Server-Sent Events |
| Large static content | CDN + edge cache |
| Low-latency global apps | Edge + regional compute |

Summary

Strong networking intuition helps you design reliable, scalable systems. Anchor every choice in requirements, use clear diagrams, and be explicit about trade-offs.