System Design: Networking Essentials

Problem framing

Networking is the connective tissue of every distributed system. Product features, reliability guarantees, and latency budgets all rise or fall on how well components communicate. The goal is not to memorize protocols, but to explain the intent, trade-offs, and failure modes behind the networking choices you make.

Tip: Tie every protocol, load balancer, or retry policy back to a requirement and a failure mode.

- **Problem:** Systems fail when communication breaks or becomes unpredictable.
- **Pattern:** Layered abstractions + clear request flow + defensive reliability primitives.
- **Trade-offs:** Latency vs. reliability, simplicity vs. control, speed vs. safety.
- **Failure modes:** Timeouts, retries, partitions, and cascading overload.

Core idea / pattern

Networking layers: conceptual map

Layering lets engineers reason at the right level: application contracts, transport guarantees, and network routing. When you explain this stack, you show you understand where problems originate.

```mermaid
flowchart TB
  App[Application Layer] --> Transport[Transport Layer]
  Transport --> Network[Network Layer]
  Network --> Physical[Physical / Data Link]
```
| Layer | Purpose | Examples |
| --- | --- | --- |
| Application | Define request/response and streaming semantics | HTTP, WebSockets, gRPC |
| Transport | Reliability, ordering, congestion control | TCP, UDP, QUIC |
| Network | Routing packets between machines | IP |
| Physical/Data Link | Actual transmission over medium | Ethernet, Wi-Fi |
- **Problem:** Bugs hide when layers are mixed or misunderstood.
- **Pattern:** Identify the layer where the reliability or latency issue starts.
- **Trade-offs:** Abstraction ease vs. precise control of the wire.
- **Failure modes:** Misplaced fixes, such as tuning HTTP for a TCP bottleneck.

How a web request actually works

A single web request touches multiple systems: DNS, transport setup, application routing, and response streaming. Each hop adds latency and failure potential.

```mermaid
sequenceDiagram
  participant Client
  participant DNS
  participant Edge
  participant Service
  participant DB
  Client->>DNS: Resolve domain
  DNS-->>Client: IP address
  Client->>Edge: TCP/TLS handshake
  Client->>Edge: HTTP request
  Edge->>Service: Route request
  Service->>DB: Query data
  DB-->>Service: Result
  Service-->>Client: HTTP response
```
- **Problem:** Latency stacks up across every hop.
- **Pattern:** Minimize hops and reuse connections when possible.
- **Trade-offs:** Connection reuse vs. long-lived resource usage.
- **Failure modes:** DNS cache misses, handshake failures, or slow downstream calls.
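The hop-by-hop stacking above can be sketched as simple arithmetic. The per-hop millisecond figures below are illustrative assumptions, not measurements, but they show why connection reuse matters:

```python
# Illustrative per-hop latency budget for one web request.
# All millisecond values are assumptions for the sketch, not measurements.
HOP_LATENCY_MS = {
    "dns_lookup": 20,       # cache miss; close to 0 on a warm cache
    "tcp_handshake": 30,    # one round trip to the edge
    "tls_handshake": 30,    # one more round trip with TLS 1.3
    "http_request": 30,     # request plus first byte of the response
    "service_to_db": 10,    # intra-datacenter query
}

def request_budget_ms(hops: dict[str, int], reuse_connection: bool = False) -> int:
    """Total latency; a reused connection skips DNS and both handshakes."""
    skipped = {"dns_lookup", "tcp_handshake", "tls_handshake"} if reuse_connection else set()
    return sum(ms for hop, ms in hops.items() if hop not in skipped)

cold = request_budget_ms(HOP_LATENCY_MS)        # cold connection: 120 ms
warm = request_budget_ms(HOP_LATENCY_MS, True)  # reused connection: 40 ms
```

Under these assumptions, keep-alive removes two thirds of the budget, which is why connection pooling is usually the first latency fix.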

Transport protocols

Transport protocols define reliability and speed. Pick the protocol that matches your latency and delivery requirements, then validate the choice against your SLA.

```mermaid
flowchart LR
  TCP[TCP: reliable] --> HTTP[HTTP/1.1, HTTP/2]
  UDP[UDP: fast] --> QUIC[QUIC: reliable over UDP]
  QUIC --> H3[HTTP/3]
```
| Protocol | Strengths | Weaknesses | Use cases |
| --- | --- | --- | --- |
| TCP | Ordered, reliable delivery | Handshake + congestion overhead | Web traffic, databases |
| UDP | Low latency, no connection setup | No delivery guarantees | Streaming, gaming, telemetry |
| QUIC | Reliable + encrypted + fast handshakes | Higher CPU cost, newer ecosystem | HTTP/3, mobile clients |
- **Problem:** Transport choices shape reliability and latency.
- **Pattern:** Match delivery guarantees to user expectations.
- **Trade-offs:** Reliability vs. speed and CPU overhead.
- **Failure modes:** Packet loss, congestion collapse, or reordering bugs.
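To make the TCP guarantees concrete, here is a minimal loopback echo using Python's standard `socket` module — a sketch, not production code. It shows the connection setup and acknowledged, in-order round trip that a UDP `sendto` would omit:

```python
import socket
import threading

def tcp_echo_once() -> bytes:
    """Round-trip one payload through a one-shot TCP echo server on loopback."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def serve() -> None:
        conn, _ = server.accept()      # TCP: the three-way handshake completes here
        conn.sendall(conn.recv(1024))  # echo the payload back, in order
        conn.close()

    threading.Thread(target=serve, daemon=True).start()
    client = socket.create_connection(("127.0.0.1", port), timeout=2)
    client.sendall(b"ping")            # delivery and ordering are guaranteed
    reply = client.recv(1024)
    client.close()
    server.close()
    return reply
```

With `SOCK_DGRAM` instead, there is no handshake and no guarantee the datagram arrives at all — which is exactly the trade the table above describes.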

Application layer protocols

Application protocols express how data flows: request/response, streaming, or bidirectional messaging. External APIs usually favor REST, while internal services favor gRPC. HTTP is stateless by default, and headers enable authentication, caching, and compression strategies. See SOLID design for API contract discipline.

```mermaid
flowchart LR
  Client --> REST[REST APIs]
  Client --> GraphQL[GraphQL]
  Client --> SSE[SSE Stream]
  Client <--> WS[WebSockets]
  ServiceA[Service A] <--> gRPC[gRPC]
```
| Protocol | Strength | Trade-offs | Best fit |
| --- | --- | --- | --- |
| REST | Simple, cache-friendly | Over-fetching, chatty calls | Public APIs |
| GraphQL | Client-controlled shape | Complex execution, caching harder | Flexible UI clients |
| gRPC | Fast, typed, streaming | Browser support limited | Internal microservices |
| SSE | Simple server push | One-way only | Notifications, feeds |
| WebSockets | Full-duplex messaging | Persistent connection cost | Chat, collaboration |
| HTTP element | Purpose | Examples |
| --- | --- | --- |
| Methods | Describe intent | GET, POST, PUT, PATCH, DELETE |
| Status codes | Signal outcome | 2xx success, 3xx redirect, 4xx client error, 5xx server error |
| Headers | Carry metadata | Auth, caching, compression, tracing |
- **Problem:** Mismatched protocols create UX and scaling problems.
- **Pattern:** Choose based on interaction style and client ecosystem.
- **Trade-offs:** Flexibility vs. operational complexity.
- **Failure modes:** Overfetching, server overload, or dropped streams.
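The status-code classes in the table map directly onto retry decisions. A tiny helper sketch, with the usual convention that server errors and 429 (rate limited) are retryable while other client errors are not:

```python
def status_class(code: int) -> str:
    """Map an HTTP status code to its outcome class (2xx, 3xx, 4xx, 5xx)."""
    classes = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}
    return classes.get(code // 100, "unknown")

def is_retryable(code: int) -> bool:
    """Retry server errors and 429; retrying other 4xx just repeats the mistake."""
    return code // 100 == 5 or code == 429
```

Retrying a 404 wastes a request; retrying a 503 is often exactly what the server's `Retry-After` header invites.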

Load balancing fundamentals

Load balancing spreads traffic, protects availability, and supports horizontal scale. Client-side load balancing removes a hop but adds complexity. Server-side load balancing centralizes control.

```mermaid
flowchart LR
  ClientA[Client-side] --> S1[Service A]
  ClientA --> S2[Service B]
  ClientB[Client-side] --> S3[Service C]
  ClientB --> S4[Service D]
  User[User] --> LB[Server-side LB]
  LB --> S5[Service E]
  LB --> S6[Service F]
```
| Approach | Strengths | Risks |
| --- | --- | --- |
| Client-side | Lower latency, no central bottleneck | Client complexity and uneven upgrades |
| Server-side | Centralized policy, easier operations | Extra hop and critical dependency |
| Algorithm | What it optimizes | Risk |
| --- | --- | --- |
| Round robin | Even distribution | Ignores instance health differences |
| Least connections | Current load | Does not predict spikes |
| Least latency | Fastest response | Overloads fastest instance |
| Hash-based | Session affinity | Hot keys and uneven load |
- **Problem:** Uneven traffic and hot spots degrade availability.
- **Pattern:** Distribute traffic and protect the hottest path.
- **Trade-offs:** Control vs. added latency and dependency.
- **Failure modes:** Load balancer bottlenecks or sticky-session skew.
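The first two algorithms from the table can be sketched in a few lines. Instance names and connection counts are illustrative:

```python
import itertools

class RoundRobin:
    """Cycle through instances in order, ignoring their current load or health."""
    def __init__(self, instances: list[str]) -> None:
        self._cycle = itertools.cycle(instances)

    def pick(self) -> str:
        return next(self._cycle)

def least_connections(active: dict[str, int]) -> str:
    """Pick the instance with the fewest in-flight connections right now."""
    return min(active, key=active.get)

balancer = RoundRobin(["a", "b", "c"])
# Successive picks walk a, b, c, a, ... regardless of load;
# least_connections({"a": 5, "b": 2, "c": 9}) instead favors "b".
```

The contrast is the table's point: round robin is trivially fair but health-blind, while least-connections reacts to load but cannot anticipate a spike that has not arrived yet.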

Layer 4 vs Layer 7 load balancers

Layer 4 balancers operate on TCP/UDP, while Layer 7 balancers inspect HTTP content. Use L4 for speed and L7 for routing intelligence.

```mermaid
flowchart LR
  Client --> L4[L4 LB]
  Client --> L7[L7 LB]
  L4 --> TCP[TCP/UDP streams]
  L7 --> HTTP[HTTP routing rules]
```
| Feature | L4 | L7 |
| --- | --- | --- |
| Operates on | TCP/UDP | HTTP content |
| Performance | Very high | High, but slightly slower |
| Use cases | Streaming, WebSockets | API routing, canary releases |
- **Problem:** Routing needs differ by protocol and payload.
- **Pattern:** Use L4 for speed, L7 for richer rules.
- **Trade-offs:** Speed vs. visibility into requests.
- **Failure modes:** Misrouting, TLS termination errors, or extra latency.
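A minimal sketch of the kind of rule only an L7 balancer can apply, since it requires reading the HTTP path. The path prefixes and backend pool names are hypothetical:

```python
# Hypothetical L7 routing table: first matching prefix wins, so order matters.
# An L4 balancer sees only TCP segments and cannot make this decision.
ROUTES = [
    ("/api/v2/", "canary-pool"),   # canary release keyed on a path prefix
    ("/api/",    "api-pool"),
    ("/static/", "cdn-origin"),
]

def route(path: str, default: str = "web-pool") -> str:
    """Return the backend pool for the first matching prefix, else the default."""
    for prefix, backend in ROUTES:
        if path.startswith(prefix):
            return backend
    return default
```

Note the ordering hazard: if `/api/` came before `/api/v2/`, the canary rule would never fire — a concrete instance of the "misrouting" failure mode above.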

High availability and fault tolerance

Availability relies on fast health checks, automated failover, and safe retry policies. See compute patterns for scaling fundamentals.

```mermaid
flowchart LR
  Client --> LB[Load Balancer]
  LB --> S1[Healthy Service]
  LB -. Health Check .-> S2[Unhealthy Service]
  LB --> S3[Healthy Service]
  S1 --> Store[(Primary Store)]
  S3 --> Store
```
| Technique | Purpose | Risk |
| --- | --- | --- |
| Health checks | Detect unhealthy nodes | False positives if probes are shallow |
| Failover | Route around failures | Overloading remaining nodes |
| Retries + timeouts | Handle transient errors | Retry storms without backoff |
| Idempotency | Safe retries for writes | Complexity in data model |
- **Problem:** Failures are inevitable in distributed systems.
- **Pattern:** Detect, isolate, and recover quickly.
- **Trade-offs:** Reliability vs. extra infrastructure and cost.
- **Failure modes:** Cascading failures, retries amplifying load.
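Health checking and failover reduce to filtering plus sequential retry. A sketch under those assumptions; probe results and instance names are illustrative:

```python
from typing import Callable

def healthy_instances(instances: dict[str, bool]) -> list[str]:
    """Keep only instances whose last health probe succeeded."""
    return [name for name, ok in instances.items() if ok]

def call_with_failover(instances: list[str], call: Callable[[str], str]) -> str:
    """Try each instance in turn, routing around connection failures."""
    last_error = None
    for name in instances:
        try:
            return call(name)
        except ConnectionError as exc:
            last_error = exc           # note the failure, fail over to the next node
    raise RuntimeError("all instances failed") from last_error
```

This also exposes the table's risks: a shallow probe can report `True` for a node that fails real calls, and failing everything over to one survivor is how the remaining node gets overloaded.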

Geographic scaling and latency

Physics sets the floor on latency. Global systems deploy closer to users and rely on CDNs for static content. See latency numbers to ground your estimates.

```mermaid
flowchart LR
  UserUS[User: US] --> EdgeUS[US Edge]
  UserEU[User: EU] --> EdgeEU[EU Edge]
  EdgeUS --> RegionUS[US Region]
  EdgeEU --> RegionEU[EU Region]
  RegionUS --> GlobalDB[(Global Data)]
  RegionEU --> GlobalDB
```
| Approach | Benefit | Risk |
| --- | --- | --- |
| Regional services | Lower user latency | Data consistency across regions |
| CDN caching | Fast static content delivery | Stale or inconsistent assets |
| Geo-partitioning | Local data compliance | Cross-region query complexity |
- **Problem:** Global users feel latency and jitter quickly.
- **Pattern:** Bring compute and content closer to users.
- **Trade-offs:** Lower latency vs. consistency complexity.
- **Failure modes:** Regional outages or split-brain data.
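The physics floor is easy to estimate. Light in fiber travels at roughly two thirds of c, about 200,000 km/s; the distance figure below is an approximate great-circle number:

```python
# ~200,000 km/s in fiber means about 200 km per millisecond, one way.
FIBER_KM_PER_MS = 200.0

def min_rtt_ms(distance_km: float) -> float:
    """Theoretical round-trip floor, ignoring routing detours and queuing."""
    return 2 * distance_km / FIBER_KM_PER_MS

# New York to London is roughly 5,600 km, so the RTT floor is about 56 ms.
# Real RTTs are higher: routes are not great circles, and switches add delay.
```

No amount of server tuning gets under that floor, which is the whole argument for edge deployment and CDNs.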

Handling failures gracefully

Resilience patterns protect systems from cascading failures. Combine circuit breakers, retries with backoff, and bulkheads to contain overload.

```mermaid
flowchart LR
  Client --> Service[Service]
  Service --> Circuit[Circuit Breaker]
  Circuit --> Queue[(Queue)]
  Queue --> Worker[Worker Pool]
  Service --> Rate[Rate Limiter]
```
| Pattern | Strength | Failure mode avoided |
| --- | --- | --- |
| Circuit breaker | Stops cascading retries | Downstream overload |
| Bulkheads | Isolates resource pools | Noisy neighbor failures |
| Rate limiting | Controls ingress load | Traffic spikes |
| Backoff + jitter | Spreads retries | Retry storms |
- **Problem:** Failure cascades can take down healthy systems.
- **Pattern:** Add guardrails to shed or slow traffic under stress.
- **Trade-offs:** Resilience vs. extra latency and complexity.
- **Failure modes:** Overly aggressive breakers blocking real traffic.
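Two of these guardrails sketched minimally: "full jitter" exponential backoff, and a circuit breaker reduced to a failure counter. Thresholds and base delays are illustrative:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """'Full jitter' backoff: sleep a random duration up to the exponential cap,
    so simultaneous retriers spread out instead of stampeding together."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

class CircuitBreaker:
    """Open after `threshold` consecutive failures; a single success resets it."""
    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold  # open: reject calls, stop the cascade

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1
```

A production breaker also needs a half-open probe state so it can recover without a thundering herd; that is deliberately omitted here to keep the counter logic visible.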

When to use: Any system with retries, shared dependencies, or bursty traffic.


Architecture diagram

This reference architecture shows how networking components fit together in a typical global system.

```mermaid
flowchart LR
  User[User] --> DNS[DNS Resolver]
  DNS --> Edge[Edge Gateway]
  Edge --> WAF[WAF + Rate Limits]
  WAF --> LB[Load Balancer]
  LB --> ServiceA[Service A]
  LB --> ServiceB[Service B]
  ServiceA --> Cache[(Cache)]
  ServiceB --> Cache
  ServiceA --> DB[(Primary DB)]
  ServiceB --> DB
  Edge --> CDN[CDN]
```

Step-by-step flow

  1. DNS resolves the domain to an IP address, often using caches and authoritative servers.
  2. The client establishes a TCP or QUIC connection (and TLS if needed).
  3. The HTTP request is sent with headers, cookies, and authentication tokens.
  4. Edge gateways route the request, apply rate limits, and forward to a backend service.
  5. The service processes data, calls downstream systems, and prepares the response.
  6. The response is returned and the connection is closed or reused with keep-alive.

Warning: If you skip DNS, connection setup, or retries, your latency estimates will be wrong.
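One way to see why skipped steps wreck estimates: a retried call pays the full timeout of the failed attempt before the second try even starts. The numbers below are illustrative assumptions:

```python
def expected_latency_ms(success_ms: float, timeout_ms: float,
                        failure_rate: float) -> float:
    """Expected latency with one retry: a failed attempt costs a whole
    timeout before the (assumed successful) second attempt runs."""
    return (1 - failure_rate) * success_ms + failure_rate * (timeout_ms + success_ms)

# A 50 ms call with a 1,000 ms timeout and a 1% failure rate averages
# about 60 ms: the rare timeout, not the happy path, dominates the tail.
```

This is why timeout and retry settings belong in the latency budget, not as an afterthought.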

Trade-offs

| Trade-off | Why it matters | Typical choice |
| --- | --- | --- |
| Latency vs. reliability | Retries improve reliability but add latency. | Retry with budget + backoff. |
| Centralized vs. distributed control | Gateways simplify policy but add dependencies. | Centralized edge, distributed services. |
| Consistency vs. availability | Geo scale increases partition risk. | See consistency models. |
| Speed vs. observability | L7 inspection adds visibility but costs latency. | Use L7 for routing, L4 for streams. |

Real-world usage

When to use what (cheat sheet)

| Scenario | Recommended approach |
| --- | --- |
| Public APIs | REST over HTTPS |
| Internal microservices | gRPC + service mesh |
| Real-time chat | WebSockets |
| Notifications | Server-Sent Events |
| Large static content | CDN + edge cache |
| Low-latency global apps | Edge + regional compute |

Summary

Strong networking intuition helps you design reliable, scalable systems. Anchor every choice in requirements, use clear diagrams, and be explicit about trade-offs.