Networking Fundamentals · Part 8 of 12 · 20 min read · intermediate

TCP congestion control

Why congestion control exists, how slow start and AIMD actually behave, what CUBIC and BBR change, and how bufferbloat ruins everything if you let it.

The internet is a shared resource that nobody owns and everybody uses, and the only thing keeping it from grinding to a halt during peak hours is that every TCP sender voluntarily slows down when the network is busy. There is no central traffic cop. The whole stability of TCP-based traffic on the public internet rests on a social contract implemented inside billions of operating-system kernels: probe for capacity, back off when you see signs of congestion, share what's left fairly with other senders. This module is about that contract. By the end of it you should understand why one packet getting dropped can make a global TCP flow back off by half, what changed between Reno and CUBIC and BBR, and why bufferbloat — the most common modern variant of "the network is slow" — is actually a misconfiguration, not a capacity problem.

Prerequisites

  • Module 1.7 — TCP at the wire level. You need fluency with sequence numbers, ACK semantics, retransmission timers, and the distinction between rwnd and cwnd.

Learning objectives

By the end of this module you should be able to:

  1. Explain why congestion control exists as a network-stability mechanism, not a per-flow performance feature.
  2. Trace slow start, congestion avoidance, fast retransmit, and multiplicative decrease on a cwnd timeline.
  3. Compare Reno/NewReno, CUBIC, and BBR at the design-principle level, and predict which is more appropriate for a given path.
  4. Diagnose bufferbloat, unfairness between flows, and application-limited flow behavior in practical terms.
  5. Inspect the congestion-control algorithm and per-socket telemetry on a live Linux box.

Why one dropped packet can be a global problem

In October 1986 the internet — all of it, which at the time meant a few hundred sites — collapsed. Throughput between Lawrence Berkeley Laboratory and UC Berkeley, networks separated by 400 yards, dropped from 32 kbps to 40 bps. Three orders of magnitude. Nothing was broken physically. The cause was a positive feedback loop: when packets were dropped due to congestion, hosts retransmitted aggressively, generating more traffic, increasing congestion, generating more drops, generating more retransmissions. The network was choking itself.

Van Jacobson and Mike Karels diagnosed it and wrote the paper that introduced TCP congestion control. The fix was conceptually simple: each sender should treat packet loss as a signal that the network is congested, and slow down. If everyone slows down a little when they see loss, total traffic stays roughly at the network's capacity rather than spiraling above it.

This is what makes TCP congestion control different from flow control or reliability:

  • Flow control (the rwnd field) is "don't send faster than I can receive." It's a contract between two endpoints.
  • Reliability (retransmissions, ACKs) is "make sure every byte arrives." Also between two endpoints.
  • Congestion control is "don't send faster than the network can carry." It's a contract between every TCP sender and every other TCP sender on the same path.

The third one only works if everyone implements it. It's a coordination problem with no central enforcer. Every TCP sender backs off voluntarily, on the same algorithm, because the alternative is congestion collapse for everyone — including themselves.

The cost of being a bad citizen is paid by the commons. The cost of being a good citizen is paid by you. That tension is what every congestion-control algorithm is trying to balance.

The congestion window and the ACK clock

The mechanism is a single sender-side variable, cwnd (the congestion window), measured in bytes. The sender will not have more than cwnd bytes of unacknowledged data in flight at any moment.

Combined with the receiver's window (rwnd), the actual cap is min(rwnd, cwnd). On a healthy modern path with TCP window scaling enabled (Module 1.7), cwnd is usually the binding constraint, not rwnd.

The pacing of new data is driven by the ACK clock: every time an acknowledgment comes back, it implicitly says "the network just delivered some bytes; you can send some new ones." The sender doesn't blast at line rate; it spaces transmissions in proportion to the rate at which ACKs return. On a stable path this self-clocks at the actual bottleneck rate, because that's the rate at which ACKs come back.

When cwnd grows, the sender allows more bytes in flight. When cwnd shrinks, the sender allows fewer. The whole game is the algorithm that controls cwnd's growth and shrinkage, and the signals it uses to decide when to do which.
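
A toy sketch of the sender-side bookkeeping, with made-up numbers (not a real stack): the cap is min(rwnd, cwnd), and each returning ACK frees window space for new data.

# Toy model of the sender-side window check.
MSS = 1460                      # bytes per segment
rwnd = 4_000_000                # receiver's advertised window, in bytes
cwnd = 40 * MSS                 # sender's congestion window, in bytes
bytes_in_flight = 50_000        # sent but not yet acknowledged

effective_window = min(rwnd, cwnd)
print(f"may send {max(0, effective_window - bytes_in_flight)} more bytes")

# The ACK clock: every returning ACK frees window space and permits new data.
bytes_in_flight -= 3 * MSS      # an ACK arrives covering three segments
print(f"after the ACK, may send {max(0, effective_window - bytes_in_flight)} more bytes")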

Slow start is not slow

When a connection begins, the sender has no information about the path's capacity. The algorithm starts small — cwnd = 10 * MSS on modern stacks (RFC 6928 raised the initial window from the original single segment, via intermediate values of two to four) — and grows aggressively until it sees a sign of congestion.

Specifically: cwnd doubles every RTT.

This is slow start, named in 1988 in contrast to algorithms that started at full speed. By modern standards, exponential per-RTT growth is the opposite of slow. After 10 RTTs you're at 1024 * 10 * MSS ≈ 15 MB (with a 1460-byte MSS). After 20 RTTs you're at roughly 15 GB worth of cwnd, which is more than any rwnd can ever offer. On a 100 ms RTT path, slow start can saturate a 10 Gbps link in around two seconds — fast enough that for short flows like HTTP requests, slow start is the entire transfer.
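
A quick sanity check of that two-second claim, assuming a 1460-byte MSS and a clean doubling every RTT (real stacks are messier, but the order of magnitude holds):

# How many RTTs does slow start need to cover a 10 Gbps x 100 ms path?
MSS = 1460                          # bytes, a common Ethernet-path value
rtt = 0.100                         # seconds
link_bps = 10e9                     # 10 Gbps
bdp_bytes = link_bps / 8 * rtt      # bandwidth-delay product = 125 MB

cwnd = 10 * MSS                     # RFC 6928 initial window
rtts = 0
while cwnd < bdp_bytes:
    cwnd *= 2                       # slow start doubles every RTT
    rtts += 1

print(f"BDP = {bdp_bytes/1e6:.0f} MB, reached after {rtts} RTTs "
      f"= {rtts * rtt:.1f} s")      # ~14 RTTs, ~1.4 s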

Slow start ends in one of two ways:

  1. Loss is detected. A packet is missing; the sender treats this as evidence of congestion. ssthresh is set to half the current window, cwnd drops to it (plus a few segments during fast recovery), and the algorithm transitions to congestion avoidance.
  2. cwnd reaches ssthresh. A previously-cached estimate of the safe window. The algorithm transitions to congestion avoidance without seeing loss.

Either way, the exponential growth of slow start gives way to the much gentler linear growth of congestion avoidance.

Additive increase, multiplicative decrease

Once in congestion avoidance, the algorithm probes carefully for slightly more capacity. Per RFC 5681:

  • Additive increase. Every RTT in which no loss is observed, cwnd grows by approximately one MSS. Linear, not exponential.
  • Multiplicative decrease. When loss is detected, cwnd is cut in half (cwnd /= 2). Geometric.

This pair is AIMD: additive increase, multiplicative decrease. The reason AIMD is the design of choice — and not, say, "additive increase, additive decrease," which sounds more symmetric — is fairness.

Two TCP flows sharing a bottleneck link will, under AIMD, converge toward equal shares of the bottleneck capacity. The intuition: when both flows are sharing fairly, both grow linearly. When the link is full and both back off multiplicatively, the one with more cwnd loses more in absolute terms. Over many rounds, the relative gap between the two flows shrinks. AIMD is provably fair in the long run for flows on the same path with the same RTT.

(Same-RTT is a real qualifier — flows with longer RTTs grow more slowly per unit time and end up with smaller fair shares, a fact known as RTT unfairness. It's why downloads from a far-away server can struggle when competing on the same bottleneck against a nearby one.)
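
The convergence is easy to see in a toy simulation: two synchronized AIMD flows with the same RTT, both halving whenever their combined demand exceeds a made-up bottleneck capacity. The numbers are illustrative only.

# Two AIMD flows sharing a bottleneck: they converge toward equal shares.
capacity = 100                      # bottleneck capacity, in MSS-sized units
a, b = 80, 10                       # deliberately unequal starting windows

for rnd in range(60):
    a += 1                          # additive increase, 1 MSS per RTT each
    b += 1
    if a + b > capacity:            # shared loss event when the link overflows
        a //= 2                     # multiplicative decrease for both flows
        b //= 2
    if rnd % 10 == 0:
        print(f"round {rnd:2d}: flow A = {a:3d}, flow B = {b:3d}")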

The geometric nature of multiplicative decrease is also why losing a single packet matters so much. The sender doesn't gently reduce; it halves. A flow that was running at 100 Mbps and sees one drop is now running at 50 Mbps. Recovering takes hundreds to thousands of RTTs of additive increase. This is why on long-RTT paths with small but nonzero loss rates, classical TCP throughput is dramatically lower than the path's actual capacity: the loss-based throughput bound scales like MSS / (RTT * sqrt(p)) for loss probability p, and that math is harsh.
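
One common back-of-the-envelope form of that bound is the Mathis formula, throughput ≈ 1.22 * MSS / (RTT * sqrt(p)); treat the constant as approximate, but the scaling is the point:

# Loss-based TCP throughput bound: ~1.22 * MSS / (RTT * sqrt(p)).
import math

MSS = 1460          # bytes
rtt = 0.100         # 100 ms path

for p in (1e-6, 1e-4, 1e-2):        # packet loss probability
    bps = 1.22 * MSS * 8 / (rtt * math.sqrt(p))
    print(f"loss {p:>7.0e}: ~{bps/1e6:8.1f} Mbps achievable")

Even a one-in-ten-thousand loss rate caps a loss-based flow far below a multi-gigabit link on this path.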

How "loss" is detected

The sender notices congestion through loss, but loss has multiple flavors:

  • Triple duplicate ACK. The receiver sends an ACK whenever a segment arrives. If a segment is lost but later ones arrive, the receiver keeps ACKing the last in-order byte, so the sender sees a stream of duplicate ACKs. Three of them trigger fast retransmit: the sender resends the missing segment without waiting for RTO, usually within ~1 RTT of the loss.
  • Retransmission timeout (RTO). If no ACKs come back at all, the sender's RTO timer fires and it retransmits the oldest unacknowledged segment. RTO is conservative — typically 2–3× the smoothed RTT plus a margin — and represents the "we lost the path entirely" case rather than a single drop.
  • Selective ACK (SACK). The receiver explicitly tells the sender which segments arrived and which are missing. This lets the sender retransmit only what's needed, not everything from the loss point onward. SACK is an extension to classic ACK semantics, present on essentially every modern stack.
  • Explicit congestion notification (ECN). Routers along the path mark packets to signal congestion before they have to drop them. The receiver echoes the marks back to the sender as a flag in the ACK. The sender treats the mark as equivalent to a loss for cwnd purposes, but no actual packet was lost. ECN deployment has been incomplete for decades; it works when both endpoints and the routers between them support it, which is increasingly common in 2026.

The key insight: classical TCP infers congestion from loss. A real lost packet (single drop, fast retransmit) and a router-signaled congestion mark (ECN) and a complete path failure (RTO) all reduce cwnd, though the magnitude of the reduction varies. RTO triggers a return to slow start with a small cwnd, because something significant has changed; fast retransmit just halves cwnd.
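
A toy sketch of the two reactions under classical Reno rules (RFC 5681); fast-recovery window inflation is omitted for brevity:

# How a Reno sender reacts to the two loss signals (toy state transitions).
def on_fast_retransmit(cwnd_mss):
    """Three duplicate ACKs: one segment was lost, but the path still works."""
    ssthresh = max(cwnd_mss // 2, 2)
    return ssthresh, ssthresh            # (new ssthresh, new cwnd): halve, keep going

def on_rto(cwnd_mss):
    """Retransmission timeout: no ACKs at all; assume something bigger changed."""
    ssthresh = max(cwnd_mss // 2, 2)
    return ssthresh, 1                   # (new ssthresh, new cwnd): back to slow start

print(on_fast_retransmit(80))            # (40, 40) -> congestion avoidance from 40 MSS
print(on_rto(80))                        # (40, 1)  -> slow start from 1 MSS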

This is also where loss-based and delay-based controllers diverge. We'll get to BBR.

Reno, NewReno, and the classical mental model

Putting all the pieces together: classical Reno TCP (and its small refinement, NewReno) goes through phases:

  1. Slow start from cwnd = 10 MSS. Exponential growth per RTT until either loss is detected or cwnd hits ssthresh.
  2. Congestion avoidance. Linear growth (one MSS per RTT) until loss is detected.
  3. Fast retransmit / fast recovery. On three duplicate ACKs, retransmit the missing segment, set ssthresh = cwnd/2, set cwnd = ssthresh, return to congestion avoidance.
  4. Timeout. On RTO, set ssthresh = cwnd/2, set cwnd = 1 MSS (the original "go all the way back to slow start"), retransmit, return to slow start.

The cwnd timeline for a single loss event under Reno looks like a sawtooth:

       ↑
  cwnd |       ╱╲      ╱╲      ╱╲
       |      ╱  ╲    ╱  ╲    ╱  ╲
       |     ╱    ╲  ╱    ╲  ╱    ╲
       |    ╱      ╲╱      ╲╱      ╲
       |   ╱
       |  ╱   slow         ssthresh
       | ╱    start        threshold
       |╱_________________________________→
                                   time

Linear growth, multiplicative cut, repeat. Reno was simple, provably fair, and worked well for the bandwidth-delay products of the late 1990s. It worked poorly for the BDPs of the 2010s and beyond, because linear cwnd growth at one MSS per RTT recovers from a halving in O(half the previous cwnd) RTTs — and on a 1 Gbps link with 100 ms RTT, that's several minutes. Modern cloud-to-cloud transfers couldn't afford to spend that long crawling back to capacity after every loss.
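
The arithmetic behind that claim, with a 1460-byte MSS assumed:

# Time for Reno to climb back after halving on a 1 Gbps, 100 ms path.
MSS = 1460
rtt = 0.100
bdp_segments = int(1e9 / 8 * rtt / MSS)     # ~8561 segments in flight at capacity

recovery_rtts = bdp_segments // 2           # +1 MSS per RTT, starting from half the window
print(f"BDP ≈ {bdp_segments} segments; recovery ≈ {recovery_rtts} RTTs "
      f"≈ {recovery_rtts * rtt / 60:.1f} minutes")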

CUBIC

CUBIC, published in 2008, the default on Linux since kernel 2.6.19, and now standardized in RFC 9438, addresses Reno's slow recovery. The growth function is no longer linear — it's a cubic curve centered on the cwnd value at the time of the last loss event.

The behavior in three phases:

  • Concave growth. Right after a loss, cwnd grows quickly back toward the previous saturation point.
  • Plateau. Near the previous saturation point, growth flattens. This is the "I think I'm at capacity" zone.
  • Convex growth. Past the previous saturation, cwnd grows quickly again, probing for new capacity.

The cubic curve specifically: cwnd(t) = C(t - K)^3 + W_max, where W_max is the cwnd at the time of the last loss, K is the time at which the curve climbs back to W_max, and C is a scaling constant. The details are in RFC 9438. The intuition: CUBIC remembers where it was when it last got cut and rushes to get back there, then gets cautious near the remembered ceiling, then probes upward if the ceiling doesn't reappear.
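
A sketch of that curve with the constants RFC 9438 recommends (C = 0.4, β = 0.7); W_max and the time points are illustrative:

# CUBIC window growth after a loss, per W(t) = C*(t - K)^3 + W_max.
C = 0.4                     # scaling constant from RFC 9438
BETA = 0.7                  # multiplicative decrease factor

def cubic_window(t, w_max):
    """cwnd (in segments) t seconds after a loss that happened at w_max."""
    k = ((w_max * (1 - BETA)) / C) ** (1 / 3)   # time at which the curve regains w_max
    return C * (t - k) ** 3 + w_max

w_max = 1000                # cwnd in segments when the last loss happened
for t in (0, 2, 4, 6, 8, 10):
    print(f"t = {t:2d} s: cwnd ≈ {cubic_window(t, w_max):7.0f} segments")

The printout shows the three phases: the window restarts at 0.7 * W_max, climbs steeply, flattens near 1000 segments, then pushes past it.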

The two practical advantages over Reno:

  • Fast recovery on high-BDP paths. CUBIC reaches its previous cwnd in seconds rather than minutes, even on long-RTT high-bandwidth paths.
  • RTT-friendliness. CUBIC's growth doesn't depend on RTT to first order. Two CUBIC flows with different RTTs share more equally than two Reno flows would, mitigating Reno's RTT unfairness.

CUBIC has a real downside: it remains a loss-based controller. If the path has steady packet loss for reasons unrelated to congestion (radio interference, dirty fiber connector, lossy mesh link), CUBIC interprets every drop as congestion and backs off, even though the link isn't actually full. On wireless and noisy paths, this leads to underperformance. BBR addresses that.

BBR

BBR (Bottleneck Bandwidth and Round-trip propagation time) was developed at Google in 2016 and represents a fundamentally different control philosophy. Instead of inferring congestion from loss, BBR builds a model of the path:

  • BtlBw — the bottleneck bandwidth, estimated from delivered packet rate during recent ACKs.
  • RTprop — the round-trip propagation delay, estimated as the minimum RTT observed recently.

Given those two values, BBR computes a target sending rate and uses it to pace packets directly. The bandwidth-delay product BtlBw * RTprop becomes a soft estimate of how much data should be in flight. BBR doesn't halve cwnd on packet loss; it adjusts its estimates and continues.
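
The core arithmetic is small enough to sketch. The gain values here mirror the ones described for BBRv1's probing phases, but treat all the numbers as illustrative:

# BBR's core computation: pace at the estimated bottleneck rate,
# keep roughly one bandwidth-delay product in flight.
btl_bw = 50e6 / 8           # estimated bottleneck bandwidth: 50 Mbps, in bytes/s
rt_prop = 0.040             # estimated propagation RTT: 40 ms

bdp = btl_bw * rt_prop      # bytes that fit "in the pipe" without queueing
pacing_gain = 1.25          # >1 while probing for more bandwidth, <1 while draining
cwnd_gain = 2.0             # in-flight cap, a small multiple of the BDP

pacing_rate = pacing_gain * btl_bw
inflight_cap = cwnd_gain * bdp
print(f"BDP ≈ {bdp/1e3:.0f} kB, pace at {pacing_rate*8/1e6:.1f} Mbps, "
      f"cap in-flight at {inflight_cap/1e3:.0f} kB")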

The result is dramatically better performance on:

  • Wireless and noisy paths where loss isn't a congestion signal.
  • Long-RTT high-BW paths (common in cloud-to-cloud and CDN traffic) where Reno's recovery is too slow.
  • Bufferbloated paths where loss-based controllers overflow buffers before they back off (more on that next).

BBR also has known fairness issues: a BBR flow competing with a CUBIC flow on the same link doesn't always share fairly. The original BBR (BBRv1) was particularly aggressive against CUBIC, leading to industry concerns. BBRv2 and now BBRv3 (the current draft) include fairness mechanisms that improve coexistence, though the analysis is still evolving. As with all post-classical congestion controllers, deployment has been measured and contentious.

Bufferbloat and queue management

The standard congestion-control story assumes that routers drop packets when their queues are full. Bufferbloat is the modern reality: routers (and consumer modems, and home Wi-Fi APs) ship with enormously oversized buffers. A queue that holds 5 seconds of traffic at the link rate doesn't drop packets — it absorbs them all and lets latency balloon.

To a loss-based controller like CUBIC, no loss means no congestion, so cwnd keeps growing. The queue keeps growing. Latency keeps growing. Eventually the queue fills and packets drop, but only after the RTT has gone from 20 ms to several seconds. Throughput is fine; latency is a disaster. Every other flow on the link suffers the latency hit.
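
The latency damage is just queue size divided by drain rate. A hypothetical consumer uplink makes the point:

# Queueing delay added by an oversized, full buffer (illustrative numbers).
uplink_bps = 20e6               # 20 Mbps uplink
buffer_bytes = 8 * 1024 * 1024  # an 8 MB device buffer, not unusual in consumer gear

queue_delay = buffer_bytes * 8 / uplink_bps
print(f"a full buffer adds {queue_delay:.1f} s of latency "
      f"on top of a ~0.02 s path RTT")   # ~3.4 s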

This is the bufferbloat pattern. You can recognize it in practice when:

  • Throughput is fine but ping times balloon during a transfer.
  • Concurrent interactive traffic (SSH, voice, gaming) becomes unusable while a single file download is running.
  • The pattern resolves the moment the heavy flow finishes.

The fix is active queue management (AQM): instead of letting queues grow unbounded and then dropping at the tail, the router watches queue size and signals congestion (drop or ECN-mark) before the queue fills. The two algorithms you'll see in the wild:

  • CoDel ("controlled delay") tracks the time each packet spent in the queue. If the minimum sojourn time over a sliding window exceeds a threshold, CoDel drops or marks packets. CoDel is what Linux's default fq_codel qdisc implements at the host level. (A toy version of that check is sketched after this list.)
  • CAKE is a more comprehensive scheduler that combines CoDel with fairness queueing and rate shaping. It's the recommended setting on most home/edge routers running OpenWRT or pfSense.
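
Here is the promised toy version of CoDel's core check: has the per-packet queueing delay stayed above a small target for a whole interval? Real CoDel also escalates its drop rate on an inverse-square-root schedule, which this sketch omits; the 5 ms / 100 ms constants are the usual defaults.

# Toy CoDel-style check: has queue delay stayed above TARGET for a whole INTERVAL?
TARGET = 0.005       # 5 ms acceptable standing queue delay
INTERVAL = 0.100     # 100 ms observation window

def should_drop(sojourn_times, now, first_above_time):
    """sojourn_times: recent per-packet queue delays; returns (drop?, new first_above_time)."""
    if min(sojourn_times) < TARGET:
        return False, None                      # queue drained below target recently
    if first_above_time is None:
        return False, now                       # start the clock
    return now - first_above_time >= INTERVAL, first_above_time

# A queue whose delay never dips below 5 ms for 100 ms straight triggers drops.
print(should_drop([0.009, 0.012, 0.011], now=0.25, first_above_time=0.10))  # (True, 0.1)
print(should_drop([0.002, 0.012, 0.011], now=0.25, first_above_time=0.10))  # (False, None)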

Enabling AQM on the bottleneck link of a network removes bufferbloat. Throughput stays the same, but latency-under-load drops dramatically. If you've ever set up a home network with fq_codel enabled on the WAN-facing interface and noticed your video calls stop stuttering, that's bufferbloat being defeated.

The deeper architectural lesson: congestion control is a partnership between sender algorithms and queue management at the bottleneck. Even the best sender-side algorithm fails if the bottleneck is hiding congestion behind massive buffers. AQM is what closes the loop.

What applications get wrong

Three patterns engineers reach for that don't work as well as they expect:

Multiple parallel TCP connections. A common "speed up the download" hack is to open N parallel connections. Each connection runs its own AIMD; the aggregate gets N shares of the bottleneck. This works if you're competing against a single flow that's getting one fair share — you'll get N/(N+1) of the link instead of 1/2. It does not magically increase the link's capacity, and it introduces head-of-line blocking and complexity. HTTP/2 abandoned this in favor of multiplexing over a single connection. HTTP/3 over QUIC keeps a single congestion-controlled connection while removing TCP's head-of-line blocking between streams.

Naive retry storms. Application-layer retries that fire aggressively after transport-layer failures can recreate the 1986 collapse pattern in miniature. If 10,000 clients all see a transient backend error and immediately retry, the backend is now under 10,000+ extra requests at the worst possible moment. Exponential backoff, jitter, and circuit breakers exist for exactly this reason.
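
The standard antidote is exponential backoff with jitter, so retries slow down and decorrelate. A minimal sketch in the "full jitter" style; the base, cap, and the flaky_backend_call placeholder are all made up:

# Exponential backoff with full jitter: retries slow down and spread out in time.
import random
import time

def retry_with_backoff(operation, attempts=6, base=0.1, cap=10.0):
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                    # out of attempts; surface the error
            # Sleep a random amount up to an exponentially growing ceiling.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Usage (flaky_backend_call is a hypothetical placeholder):
# retry_with_backoff(lambda: flaky_backend_call())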

UDP-based protocols ignoring congestion control. A protocol on UDP that doesn't implement equivalent backoff is a free rider on the network. It gets unfair throughput compared to TCP flows because they back off and it doesn't, but only as long as nobody else is doing the same thing. RFC 8085 is the responsible-UDP guidance. QUIC implements full congestion control. Naive game protocols and ad-hoc telemetry shippers often don't, and they break when the network is stressed.

The unifying theme: TCP congestion control isn't optional politeness; it's the load-bearing structure of the public internet's stability. Anything you build on top of UDP — anywhere you can break the implicit contract — should match TCP's politeness, or it becomes the kind of network citizen the rest of us don't appreciate.

Hands-on exercise

Exercise 1 — Simulate slow start and AIMD in Python

Save and run:

"""Toy Reno cwnd simulation across multiple RTTs and one loss event."""

INITIAL_CWND = 10        # MSS
SSTHRESH      = 64       # MSS — initial threshold (will be reset on loss)
LOSS_RTT      = 30       # RTT at which we model a single loss event
TOTAL_RTTS    = 80

cwnd = INITIAL_CWND
ssthresh = SSTHRESH
phase = "slow_start"

trace = []
for rtt in range(TOTAL_RTTS):
    trace.append((rtt, cwnd, phase))

    # Loss event
    if rtt == LOSS_RTT:
        ssthresh = cwnd // 2
        cwnd = ssthresh
        phase = "congestion_avoidance"
        continue

    # Growth
    if phase == "slow_start":
        cwnd *= 2          # exponential per-RTT doubling
        if cwnd >= ssthresh:
            cwnd = ssthresh
            phase = "congestion_avoidance"
    else:  # congestion_avoidance
        cwnd += 1          # additive increase, ~1 MSS per RTT

# Print the trace
for rtt, c, p in trace:
    print(f"RTT {rtt:3d}  cwnd = {c:6d} MSS  ({p})")

# Optional: graph it (skipped cleanly if matplotlib isn't installed).
try:
    import matplotlib.pyplot as plt

    rtts = [r for r, _, _ in trace]
    cwnds = [c for _, c, _ in trace]
    plt.step(rtts, cwnds, where="post")
    plt.xlabel("RTT"); plt.ylabel("cwnd (MSS)"); plt.title("Toy Reno cwnd")
    plt.savefig("/tmp/cwnd.png")
    print("graph saved to /tmp/cwnd.png")
except ImportError:
    pass

The output is the classic sawtooth: exponential growth to ssthresh, linear growth from there, then a sharp halving at the loss event, then linear growth again.

Stretch: add a CUBIC growth function — concave near the previous W_max, plateau, then convex past it. Plot CUBIC and Reno on the same axes; you'll see CUBIC recovers cwnd much faster after the loss event.

Exercise 2 — Inspect your host's congestion control

# Linux: which algorithm is the kernel default?
sysctl net.ipv4.tcp_congestion_control

# Which algorithms are available?
sysctl net.ipv4.tcp_available_congestion_control

# Live socket telemetry (Linux only):
ss -ti | head -40

The third command is the operationally interesting one. ss -ti (TCP info) shows per-socket congestion-control state. Look for:

  • cwnd: — current congestion window in segments.
  • ssthresh: — current slow-start threshold.
  • bbr_* fields — if BBR is in use, the model state.
  • rcv_rtt: — receiver-side RTT estimate.

For a busy long-running connection (a download or an SSH session that's transferring a lot of data), you can watch cwnd evolve with watch -n 1 'ss -ti dst <peer-ip>'. You'll see the sawtooth in real time.

To switch the default congestion control on Linux:

sudo sysctl -w net.ipv4.tcp_congestion_control=bbr

(BBR is built into mainline Linux kernels; CUBIC is the typical default. sysctl -w changes the setting only until reboot; to make it persistent, put net.ipv4.tcp_congestion_control=bbr in a file under /etc/sysctl.d/.)

Common misconceptions

"Slow start is the low-speed safe mode." Slow start doubles cwnd every RTT — exponential growth. It's the most aggressive phase of the connection's life. The name is a 1988 historical artifact; the only thing slow about it is that it starts at a small initial value.

"TCP only slows down because one host is being cautious." Congestion control is a coordination protocol between every TCP sender on the same path. If only some senders backed off, the rest would dominate the bottleneck. Universal voluntary backoff is what keeps the shared network stable for everyone.

"More parallel TCP connections is always better." Parallel connections steal queue share rather than grow it. They speed up your transfer at the cost of your neighbors' transfers. HTTP/2's multiplexing and HTTP/3's streams exist precisely to avoid this hack.

"Bufferbloat means the link is too slow." Bufferbloat means the queue at the bottleneck is too large. Throughput is fine; latency is the disaster. The fix is active queue management (CoDel, CAKE, fq_codel), not a faster link.

"BBR is just a faster Reno." BBR uses a fundamentally different control philosophy: model bottleneck bandwidth and RTT explicitly, pace at that rate, ignore loss as a primary signal. It's not loss-based at all. It performs much better on noisy paths and high-BDP paths, but it competes differently against loss-based flows on shared bottlenecks.

Further reading

  1. RFC 5681 — TCP Congestion Control. The canonical Reno specification. Read it once; the algorithms have barely changed in 25 years.
  2. RFC 2914 — Congestion Control Principles. The protocol-social contract behind why congestion control exists at all.
  3. RFC 9438 — CUBIC for Fast and Long-Distance Networks. The current standard for the algorithm that's the default on every major OS.
  4. RFC 6298 — Computing TCP's Retransmission Timer. Loss recovery timing as part of congestion response, not separate trivia.
  5. RFC 9002 — QUIC Loss Detection and Congestion Control. A modern transport's translation of classic congestion ideas into a new framing model.
  6. BBR Congestion Control IETF Draft. The authoritative current description of BBR's model-based control logic.

The next module — DNS: name resolution end to end — moves up to the application layer and follows a single hostname lookup from your browser through the recursive resolver, root, TLD, and authoritative servers, and back to your application. DNS is the protocol that ties the rest of the internet together.