System Design: Networking Essentials
Problem framing
Networking is the connective tissue of every distributed system. Product features, reliability guarantees, and latency budgets all rise or fall on how well components communicate. The goal is not to memorize protocols, but to explain the intent, trade-offs, and failure modes behind the networking choices you make.
Tip: Tie every protocol, load balancer, or retry policy back to a requirement and a failure mode.
Core idea / pattern
Networking layers: conceptual map
Layering lets engineers reason at the right level: application contracts, transport guarantees, and network routing. When you explain this stack, you show you understand where problems originate.
```mermaid
flowchart TB
    App[Application Layer] --> Transport[Transport Layer]
    Transport --> Network[Network Layer]
    Network --> Physical[Physical / Data Link]
```
| Layer | Purpose | Examples |
|---|---|---|
| Application | Define request/response and streaming semantics | HTTP, WebSockets, gRPC |
| Transport | Reliability, ordering, congestion control | TCP, UDP, QUIC |
| Network | Routing packets between machines | IP |
| Physical/Data Link | Actual transmission over medium | Ethernet, Wi-Fi |
How a web request actually works
A single web request touches multiple systems: DNS, transport setup, application routing, and response streaming. Each hop adds latency and failure potential.
```mermaid
sequenceDiagram
    participant Client
    participant DNS
    participant Edge
    participant Service
    participant DB
    Client->>DNS: Resolve domain
    DNS-->>Client: IP address
    Client->>Edge: TCP/TLS handshake
    Client->>Edge: HTTP request
    Edge->>Service: Route request
    Service->>DB: Query data
    DB-->>Service: Result
    Service-->>Edge: Response
    Edge-->>Client: HTTP response
```
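The phases above can be timed individually. A minimal sketch using only Python's standard library against a throwaway local server (the handler and server here are stand-ins for a real edge and service; resolution of `127.0.0.1` is trivially local, where a real DNS lookup would add a network round trip):

```python
import http.client
import http.server
import socket
import threading
import time

class _Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"hello")

    def log_message(self, *args):
        pass  # silence per-request logging

# Stand-in for a real service: a local HTTP server on an ephemeral port.
server = http.server.HTTPServer(("127.0.0.1", 0), _Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# Phase 1: name resolution (normally a DNS round trip, possibly cached).
t0 = time.perf_counter()
socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
dns_ms = (time.perf_counter() - t0) * 1000

# Phase 2: transport setup (the TCP three-way handshake happens here).
t1 = time.perf_counter()
conn = http.client.HTTPConnection(host, port)
conn.connect()
connect_ms = (time.perf_counter() - t1) * 1000

# Phase 3: request/response over the established connection.
t2 = time.perf_counter()
conn.request("GET", "/")
body = conn.getresponse().read()
request_ms = (time.perf_counter() - t2) * 1000

conn.close()
server.shutdown()
```

Each of `dns_ms`, `connect_ms`, and `request_ms` maps to one arrow group in the diagram; in production you would also see a TLS handshake between phases 2 and 3.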
Transport protocols
Transport protocols determine reliability, ordering, and latency characteristics. Pick the protocol that matches your delivery guarantees and latency budget, then validate the choice against your SLA.
```mermaid
flowchart LR
    TCP[TCP: reliable] --> HTTP["HTTP/1.1, HTTP/2"]
    UDP[UDP: fast] --> QUIC[QUIC: reliable over UDP]
    QUIC --> H3[HTTP/3]
```
| Protocol | Strengths | Weaknesses | Use cases |
|---|---|---|---|
| TCP | Ordered, reliable delivery | Handshake + congestion overhead | Web traffic, databases |
| UDP | Low latency, no connection setup | No delivery guarantees | Streaming, gaming, telemetry |
| QUIC | Reliable + encrypted + fast handshakes | Higher CPU cost, newer ecosystem | HTTP/3, mobile clients |
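The TCP/UDP contrast in the table is visible even in a few lines of socket code. A loopback sketch: the TCP path performs a handshake before any data moves, while the UDP path just fires a datagram with no connection state (addresses are ephemeral localhost ports, purely illustrative):

```python
import socket

# TCP: connection-oriented. create_connection() performs the 3-way handshake,
# and bytes arrive in order or the stack retransmits until they do.
tcp_srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_srv.bind(("127.0.0.1", 0))
tcp_srv.listen(1)
tcp_cli = socket.create_connection(tcp_srv.getsockname())
conn, _ = tcp_srv.accept()
tcp_cli.sendall(b"ordered bytes")
tcp_data = conn.recv(1024)
for s in (tcp_cli, conn, tcp_srv):
    s.close()

# UDP: connectionless. sendto() just emits a datagram; nothing confirms
# delivery, which is exactly why it is cheap. (Loopback happens to be
# reliable; a real network may drop or reorder these.)
udp_srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_srv.bind(("127.0.0.1", 0))
udp_cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_cli.sendto(b"fire and forget", udp_srv.getsockname())
udp_data, _ = udp_srv.recvfrom(1024)
udp_srv.close()
udp_cli.close()
```

QUIC rebuilds TCP-style reliability and encryption on top of the UDP path shown here, which is why it can merge the transport and TLS handshakes.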
Application layer protocols
Application protocols express how data flows: request/response, streaming, or bidirectional messaging. External APIs usually favor REST, while internal services favor gRPC. HTTP is stateless by default, and headers enable authentication, caching, and compression strategies. See SOLID design for API contract discipline.
```mermaid
flowchart LR
    Client --> REST[REST APIs]
    Client --> GraphQL[GraphQL]
    Client --> SSE[SSE Stream]
    Client <--> WS[WebSockets]
    ServiceA[Service A] <--> gRPC[gRPC]
```
| Protocol | Strength | Trade-offs | Best fit |
|---|---|---|---|
| REST | Simple, cache-friendly | Over-fetching, chatty calls | Public APIs |
| GraphQL | Client-controlled shape | Complex execution, caching harder | Flexible UI clients |
| gRPC | Fast, typed, streaming | Browser support limited | Internal microservices |
| SSE | Simple server push | One-way only | Notifications, feeds |
| WebSockets | Full-duplex messaging | Persistent connection cost | Chat, collaboration |
| HTTP element | Purpose | Examples |
|---|---|---|
| Methods | Describe intent | GET, POST, PUT, PATCH, DELETE |
| Status codes | Signal outcome | 2xx success, 3xx redirect, 4xx client error, 5xx server error |
| Headers | Carry metadata | Auth, caching, compression, tracing |
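Status codes are not just labels; they drive client behavior, especially retry decisions. A hypothetical helper (the function name and action strings are illustrative) that encodes the common convention of retrying 429 and transient 5xx codes but never other 4xx:

```python
# Hypothetical classifier mapping an HTTP status code to a client action.
# Convention used: retry 429 and the transient 5xx codes; treat other 4xx
# as caller bugs that a retry cannot fix.
def action_for_status(code: int) -> str:
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "follow-redirect"
    if code == 429:
        return "retry-after-backoff"   # rate limited: honor Retry-After
    if 400 <= code < 500:
        return "client-error"          # fix the request, do not retry
    if code in (502, 503, 504):
        return "retry-after-backoff"   # transient upstream failure
    return "server-error"
```

Pairing this with idempotent methods (GET, PUT, DELETE) keeps retries safe; retrying a POST needs an idempotency key.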
Load balancing fundamentals
Load balancing spreads traffic, protects availability, and supports horizontal scale. Client-side load balancing removes a hop but adds complexity. Server-side load balancing centralizes control.
```mermaid
flowchart LR
    ClientA[Client-side] --> S1[Service A]
    ClientA --> S2[Service B]
    ClientB[Client-side] --> S3[Service C]
    ClientB --> S4[Service D]
    User[User] --> LB[Server-side LB]
    LB --> S5[Service E]
    LB --> S6[Service F]
```
| Approach | Strengths | Risks |
|---|---|---|
| Client-side | Lower latency, no central bottleneck | Client complexity and uneven upgrades |
| Server-side | Centralized policy, easier operations | Extra hop and critical dependency |
| Algorithm | What it optimizes | Risk |
|---|---|---|
| Round robin | Even distribution | Ignores instance health differences |
| Least connections | Current load | Does not predict spikes |
| Least latency | Fastest response | Overloads fastest instance |
| Hash-based | Session affinity | Hot keys and uneven load |
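Three of the algorithms above fit in a few lines each, which makes the trade-offs concrete (backend addresses and in-flight counts here are illustrative):

```python
import hashlib
from itertools import cycle

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # illustrative addresses

# Round robin: even distribution, blind to per-instance load or health.
_rr = cycle(backends)
def pick_round_robin() -> str:
    return next(_rr)

# Hash-based: the same key always maps to the same backend (session
# affinity), at the cost of hot keys concentrating load on one node.
def pick_by_key(key: str) -> str:
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return backends[digest % len(backends)]

# Least connections: route to the backend with the fewest in-flight
# requests; reacts to current load but cannot predict a spike.
def pick_least_connections(in_flight: dict[str, int]) -> str:
    return min(backends, key=lambda b: in_flight.get(b, 0))
```

A production hash-based balancer would use consistent hashing so that adding or removing a backend remaps only a fraction of keys, rather than the naive modulo shown here.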
Layer 4 vs Layer 7 load balancers
Layer 4 balancers operate on TCP/UDP, while Layer 7 balancers inspect HTTP content. Use L4 for speed and L7 for routing intelligence.
```mermaid
flowchart LR
    Client --> L4[L4 LB]
    Client --> L7[L7 LB]
    L4 --> TCP[TCP/UDP streams]
    L7 --> HTTP[HTTP routing rules]
```
| Feature | L4 | L7 |
|---|---|---|
| Operates on | TCP/UDP | HTTP content |
| Performance | Very high | High, but slightly slower |
| Use cases | Streaming, WebSockets | API routing, canary releases |
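What "routing intelligence" means in practice: an L7 balancer can branch on paths and headers, which an L4 balancer never sees. A sketch of a hypothetical routing table (rule set and pool names are illustrative):

```python
# Hypothetical L7 routing rules; an L4 balancer cannot express any of
# these because they require parsing the HTTP request.
def route(path: str, headers: dict[str, str]) -> str:
    # Canary release: a header-based rule, checked before path rules.
    if headers.get("X-Canary") == "true":
        return "canary-pool"
    if path.startswith("/api/"):
        return "api-pool"
    if path.startswith("/static/"):
        return "cdn-pool"
    return "web-pool"
```

The cost is that every request must be parsed and buffered at layer 7, which is why raw streams (video, WebSockets at scale) often stay on L4.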
High availability and fault tolerance
Availability relies on fast health checks, automated failover, and safe retry policies. See compute patterns for scaling fundamentals.
```mermaid
flowchart LR
    Client --> LB[Load Balancer]
    LB --> S1[Healthy Service]
    LB -. Health Check .-> S2[Unhealthy Service]
    LB --> S3[Healthy Service]
    S1 --> Store[(Primary Store)]
    S3 --> Store
```
| Technique | Purpose | Risk |
|---|---|---|
| Health checks | Detect unhealthy nodes | False positives if probes are shallow |
| Failover | Route around failures | Overloading remaining nodes |
| Retries + timeouts | Handle transient errors | Retry storms without backoff |
| Idempotency | Safe retries for writes | Complexity in data model |
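The health-check idea reduces to "route only to nodes with a fresh successful probe." A minimal in-memory sketch (the class and TTL value are illustrative; real systems probe an endpoint like `/healthz` and often require several consecutive passes before readmitting a node):

```python
import time

# Hypothetical health registry: a node is routable only while its last
# successful probe is younger than the TTL.
class HealthRegistry:
    def __init__(self, ttl_seconds: float = 10.0):
        self.ttl = ttl_seconds
        self.last_ok: dict[str, float] = {}

    def report_ok(self, node: str) -> None:
        self.last_ok[node] = time.monotonic()

    def healthy(self, nodes: list[str]) -> list[str]:
        now = time.monotonic()
        # Nodes never probed (or probed too long ago) are excluded.
        return [n for n in nodes if now - self.last_ok.get(n, float("-inf")) < self.ttl]

reg = HealthRegistry(ttl_seconds=10.0)
reg.report_ok("10.0.0.1")
routable = reg.healthy(["10.0.0.1", "10.0.0.2"])  # only the probed node
```

The "shallow probe" risk from the table lives in `report_ok`: if the probe only checks that the process answers, a node with a dead database connection still looks healthy.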
Geographic scaling and latency
Physics sets the floor on latency. Global systems deploy closer to users and rely on CDNs for static content. See latency numbers to ground your estimates.
```mermaid
flowchart LR
    UserUS[User: US] --> EdgeUS[US Edge]
    UserEU[User: EU] --> EdgeEU[EU Edge]
    EdgeUS --> RegionUS[US Region]
    EdgeEU --> RegionEU[EU Region]
    RegionUS --> GlobalDB[(Global Data)]
    RegionEU --> GlobalDB
```
| Approach | Benefit | Risk |
|---|---|---|
| Regional services | Lower user latency | Data consistency across regions |
| CDN caching | Fast static content delivery | Stale or inconsistent assets |
| Geo-partitioning | Local data compliance | Cross-region query complexity |
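"Physics sets the floor" can be quantified: light in fiber travels at roughly 200,000 km/s (about two thirds of c), so distance alone bounds the best-case round trip. A back-of-envelope helper (the distance figure is an approximate great-circle estimate, not a measurement):

```python
# Light in fiber covers roughly 200 km per millisecond (~2/3 of c).
FIBER_KM_PER_MS = 200.0

def min_rtt_ms(distance_km: float) -> float:
    # Round trip: there and back, ignoring routing detours and queuing.
    return 2 * distance_km / FIBER_KM_PER_MS

# New York -> Amsterdam is roughly 5,900 km great-circle, giving a
# physical floor of ~59 ms RTT before any server time is spent.
transatlantic_floor = min_rtt_ms(5900)
```

Real paths are longer than great-circle distance and add queuing, so observed RTTs typically run 1.5 to 2 times this floor, which is the whole argument for regional deployment and edge caching.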
Handling failures gracefully
Resilience patterns protect systems from cascading failures. Combine circuit breakers, retries with backoff, and bulkheads to contain overload.
```mermaid
flowchart LR
    Client --> Service[Service]
    Service --> Circuit[Circuit Breaker]
    Circuit --> Queue[(Queue)]
    Queue --> Worker[Worker Pool]
    Service --> Rate[Rate Limiter]
```
| Pattern | Strength | Failure mode avoided |
|---|---|---|
| Circuit breaker | Stops cascading retries | Downstream overload |
| Bulkheads | Isolates resource pools | Noisy neighbor failures |
| Rate limiting | Controls ingress load | Traffic spikes |
| Backoff + jitter | Spreads retries | Retry storms |
When to use: Any system with retries, shared dependencies, or bursty traffic.
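Two of these patterns are small enough to sketch directly. A minimal circuit breaker and a full-jitter backoff function, under simplifying assumptions (a real breaker also moves to a half-open state on a timer to probe recovery; class and parameter names here are illustrative):

```python
import random

# Exponential backoff with full jitter: each retry sleeps a random time in
# [0, min(cap, base * 2^attempt)), so failing clients do not retry in
# synchronized waves.
def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Minimal circuit breaker: trips open after N consecutive failures and
# rejects calls until reset; a success closes it again.
class CircuitBreaker:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1

breaker = CircuitBreaker(threshold=3)
for _ in range(3):
    breaker.record(success=False)
tripped = breaker.open  # breaker now rejects calls instead of retrying
```

The breaker converts "hammer a dying dependency" into "fail fast and give it room to recover," which is exactly the cascading-failure mode the table names.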
Architecture diagram
This reference architecture shows how networking components fit together in a typical global system.
```mermaid
flowchart LR
    User[User] --> DNS[DNS Resolver]
    DNS --> Edge[Edge Gateway]
    Edge --> WAF[WAF + Rate Limits]
    WAF --> LB[Load Balancer]
    LB --> ServiceA[Service A]
    LB --> ServiceB[Service B]
    ServiceA --> Cache[(Cache)]
    ServiceB --> Cache
    ServiceA --> DB[(Primary DB)]
    ServiceB --> DB
    Edge --> CDN[CDN]
```
Step-by-step flow
- DNS resolves the domain to an IP address, often using caches and authoritative servers.
- The client establishes a TCP or QUIC connection (and TLS if needed).
- The HTTP request is sent with headers, cookies, and authentication tokens.
- Edge gateways route the request, apply rate limits, and forward to a backend service.
- The service processes data, calls downstream systems, and prepares the response.
- The response is returned and the connection is closed or reused with keep-alive.
Warning: If you skip DNS, connection setup, or retries, your latency estimates will be wrong.
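One way to avoid that mistake is to write the budget down as a sum of phases. The numbers below are illustrative order-of-magnitude estimates for a cold request, not measurements:

```python
# Rough end-to-end budget for a cold (uncached, new-connection) request.
budget_ms = {
    "dns_lookup": 20,        # often ~0 when cached at the resolver
    "tcp_handshake": 30,     # one RTT to the edge
    "tls_handshake": 30,     # one more RTT with TLS 1.3
    "request_response": 60,  # edge -> service -> DB and back
}
total_ms = sum(budget_ms.values())  # 140 ms before any retry
```

A warm request over a kept-alive connection skips the first three rows, which is the quantitative case for connection reuse and for QUIC's combined handshake.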
Failure modes
- Network partitions split regions and cause inconsistent reads or writes.
- Server crashes or overload lead to timeouts and retry amplification.
- Cascading failures overwhelm healthy dependencies.
- DNS propagation delays send traffic to retired instances.
- Misconfigured load balancers create hot spots or failed health checks.
Trade-offs
| Trade-off | Why it matters | Typical choice |
|---|---|---|
| Latency vs. reliability | Retries improve reliability but add latency. | Retry with budget + backoff. |
| Centralized vs. distributed control | Gateways simplify policy but add dependencies. | Centralized edge, distributed services. |
| Consistency vs. availability | Geo scale increases partition risk. | See consistency models. |
| Speed vs. observability | L7 inspection adds visibility but costs latency. | Use L7 for routing, L4 for streams. |
Real-world usage
When to use what (cheat sheet)
| Scenario | Recommended approach |
|---|---|
| Public APIs | REST over HTTPS |
| Internal microservices | gRPC + service mesh |
| Real-time chat | WebSockets |
| Notifications | Server-Sent Events |
| Large static content | CDN + edge cache |
| Low-latency global apps | Edge + regional compute |
Key design takeaways
- Clarify requirements before you pick protocols or load balancers.
- Explain trade-offs instead of listing components.
- Keep the design simple until scale or reliability demands complexity.
- Always call out failure modes and how you detect them.
- Justify latency, retries, and connection reuse with numbers.
Summary
Strong networking intuition helps you design reliable, scalable systems. Anchor every choice in requirements, use clear diagrams, and be explicit about trade-offs.