Encrypted Transport · Part 6 of 7 · Corporate Networks · 30 min read · intermediate

Tailscale and WireGuard mesh

How WireGuard mesh VPNs actually work: coordination planes, node keys, NAT traversal, relays, subnet routers, and identity-based policy.

WireGuard the protocol gives you a tunnel between two peers who already know each other's public keys and endpoints. It does not give you a network. The leap from "I have a working WireGuard tunnel between two laptops" to "my whole company has a flat encrypted network where every device can talk to every other device, with NAT traversal, identity-based access control, and zero manual key management" is large — and it is the leap that mesh-VPN systems like Tailscale, Headscale, Netbird, and Innernet were built to bridge.

This module is the architectural treatment of mesh VPNs. We're going to look at what's actually inside a Tailscale-style system: the control plane that distributes identity and keys, the data plane that still flows over WireGuard, the NAT-traversal machinery, the DERP-style relay fallback, the identity-centric policy model, and the trust boundaries you accept when you adopt this architecture. We're not going to repeat product-comparison writeups — those already live at tailscale-vs-headscale-comparison and netbird-vs-headscale-for-teams. Here we'll focus on the structure that all these systems share, and the real tradeoffs they all make.

The headline thesis: mesh VPNs work by separating the control plane from the data plane. Encrypted user packets still flow peer-to-peer over WireGuard (or via relays when direct paths fail). What's centralized is identity, key distribution, peer discovery, NAT-traversal coordination, and policy. This separation is the source of the operational simplicity these systems are loved for, and also the source of the trust you place in the coordination service when you run one of them.


Learning objectives

  1. Explain how a WireGuard-based mesh differs structurally from a point-to-point tunnel or a routed subnet VPN.
  2. Describe the role of the coordination plane, DERP-style relay infrastructure, node keys, NAT traversal, and policy evaluation in Tailscale-like systems.
  3. Distinguish transport encryption (data plane) from identity and policy (control plane) in mesh VPNs.
  4. Explain why mesh VPNs feel operationally simpler than raw WireGuard while introducing different trust boundaries.

Raw WireGuard versus managed mesh

Start with a thought experiment. You have 50 laptops, 30 phones, and 20 servers. You want every device to be able to reach every other device over an encrypted network, with users able to come and go, devices able to roam between networks, and access controlled by identity rather than IP.

The raw WireGuard approach:

  • Generate a keypair for each of the 100 devices. That's 100 keypairs to create, track, and keep secret.
  • On each device, configure WireGuard with its own private key plus the public keys, endpoints, and AllowedIPs of every other peer. That's 99 peer entries per device.
  • When a new device joins, generate a keypair for it, distribute its public key to every other device's config, and distribute every other device's information to the new device. Every existing device's config needs editing.
  • When a device leaves, remove its peer entry from every other device's config.
  • When a device's IP changes (laptop moves to a new WiFi network), every peer's Endpoint for that device is now stale. WireGuard's built-in roaming updates the endpoint once the moved device sends authenticated traffic from its new address, but a moved device that sits quietly behind NAT, or two devices that both moved, may take a long time to rediscover each other, if they ever do.
  • Access control: it's all-or-nothing. Once a peer is in your config with AllowedIPs = 10.0.0.0/24, that peer can talk to anything in that range. You can use host firewalls to limit per-port, but the WireGuard layer doesn't know about identities.

This is unworkable at scale. Even at 10 devices, the configuration burden is annoying. At 100, it's untenable. At 1000, you'd quit your job before finishing the spreadsheet.
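
To make the burden concrete, here's a sketch of one device's hand-maintained config under the raw approach. Keys, addresses, and endpoints are placeholders, not working values:

# /etc/wireguard/wg0.conf on ONE of the 100 devices (illustrative)
[Interface]
PrivateKey = <this device's private key>
Address    = 10.0.0.1/24
ListenPort = 51820

# One [Peer] block per other device: 99 of these, on every device,
# all edited by hand whenever anything changes.
[Peer]
PublicKey           = <device 2's public key>
Endpoint            = 203.0.113.7:51820
AllowedIPs          = 10.0.0.2/32
PersistentKeepalive = 25

[Peer]
PublicKey           = <device 3's public key>
Endpoint            = 198.51.100.4:51820
AllowedIPs          = 10.0.0.3/32
# ... 97 more [Peer] blocks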

The mesh-VPN approach:

  • Each device runs a small client that registers with a coordination service when it boots up.
  • The coordination service authenticates the user (typically via OIDC: Google, Microsoft, GitHub, etc.) and binds device-identity to user-identity.
  • Each device generates its own WireGuard keypair locally; only the public key ever leaves the device.
  • The coordination service maintains the global view: which devices exist, which keys they use, what IP each device has on the mesh, what NAT they're behind, what policy applies to them.
  • When device A wants to talk to device B, the coordination service tells device A: "B's WireGuard public key is X, B is currently reachable at endpoint Y, and yes you're allowed to talk to B based on policy P." Device A configures a WireGuard peer for B and traffic flows.
  • When B's network changes, B notifies the coordination service of its new endpoint, which pushes the update to anyone who has a path to B.
  • When a device leaves, the coordination service revokes its keys and updates everyone else's view.
  • Access control is policy-centric: "users in group engineers can SSH to devices tagged prod-db," expressed as a policy rule, evaluated against the actual identities of devices and users at connection time.

The data plane is still WireGuard. The bytes still flow peer-to-peer (mostly). What changed is that all the configuration coordination — keys, endpoints, peer lists, NAT traversal, policy distribution — moved to a centralized service that automates it. The WireGuard protocol itself is unchanged; the surrounding orchestration is what makes the mesh experience feel categorically different from raw WireGuard.

Control plane versus data plane

This is the most important architectural distinction in mesh VPNs, and the one most likely to be misunderstood. Let me state it bluntly:

Control plane: The coordination service. Runs in the cloud (Tailscale's hosted coordination service, Netbird's cloud offering) or on infrastructure you operate (Headscale, self-hosted Netbird, Innernet). Manages identity, key registration, policy distribution, NAT-traversal coordination, and node state. Sees metadata: who exists, who is online, who is talking to whom (in the limited sense of "device A asked for device B's connection info"), what policies are in force.

Data plane: The actual encrypted user traffic. Flows directly between peer devices over WireGuard (or via DERP-style relays as fallback). The coordination service does not see this traffic. Decrypting it would require WireGuard private keys that exist only on the endpoint devices.

Here's the architecture in ASCII:

        ┌─────────────────────┐
        │ coordination plane  │
        │  ─ identity & auth  │
        │  ─ keys (pubkeys)   │
        │  ─ NAT info         │
        │  ─ policy           │
        └─────────────────────┘
              ▲           ▲
              │ control   │ control
              │           │
        ┌─────┴─────┐ ┌──┴──────────┐
        │  device A │ │  device B   │
        └─────┬─────┘ └──────┬──────┘
              │              │
              └──── direct ──┘
              encrypted WireGuard
              data path (or via DERP relay
              if direct path is blocked)

The control-plane connection is HTTPS-based: each device maintains a long-lived connection to the coordination service to receive push updates. When something changes (new peer joined, an existing peer's endpoint changed, a policy was updated), the coordination service pushes the update over this control connection. Devices then locally reconfigure their WireGuard peer entries.
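
To make the push concrete, the sketch below shows the kind of information such an update carries. This is not Tailscale's actual wire format; the field names are invented for illustration, but every mesh client receives this information in some form:

{
  "_note": "illustrative only; invented field names, not a real protocol message",
  "peer_update": {
    "node_name": "prod-web-1",
    "mesh_ip": "100.64.0.10",
    "wireguard_public_key": "<base64 public key>",
    "candidate_endpoints": ["203.0.113.7:41641", "192.168.1.23:41641"],
    "preferred_relay": "derp-nyc",
    "allowed_by_policy": true
  }
}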

The data-plane connection is WireGuard-based: device A's WireGuard implementation (kernel module or userspace) has a peer entry for B with B's public key and endpoint. Encrypted UDP packets flow directly between A and B. The coordination service does not act as a relay for this traffic.
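
Mesh clients embed their own WireGuard stack rather than shelling out to the wg tool, but the effect of applying an update like the one above is equivalent to something like this (the key and endpoint are placeholders):

# Roughly what the client does internally when the control plane pushes peer B's info
wg set tailscale0 \
    peer '<base64 public key of B>' \
    endpoint 203.0.113.7:41641 \
    allowed-ips 100.64.0.10/32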

The trust implications matter. The coordination service:

  • Can: Block devices from joining. Withhold or distribute false peer information so devices can't reach each other. See the metadata graph of who-is-trying-to-talk-to-whom (when devices ask for connection info). Push policy updates that change access permissions. And, with a hostile insider or a compromised service, substitute attacker-controlled public keys and endpoints during key distribution. That substitution is the one path toward an active man-in-the-middle on newly established connections, and it is exactly the attack that Tailnet-Lock-style node-key signing (covered below) is designed to close.
  • Cannot: Passively decrypt the data plane. Recover any device's private key (it never sees them). Read the contents of WireGuard sessions between peers whose keys were distributed honestly. Impersonate a device without either stealing its private key or pulling off the key-substitution attack above; presenting a new endpoint alone fails the WireGuard handshake, because authentication is bound to the peer's public key, not its address.

This is genuinely important. People hear "centralized service" and assume "the operator can read your traffic." For Tailscale-style mesh VPNs, that's not true at the transport level. The operator can mess with you in many ways — see the trust-boundary section later — but they can't read what flows between two correctly-configured peers.

Node keys, login identities, and authorization

A device on a mesh has several distinct identities, and conflating them is a common source of confusion.

Node key (WireGuard private key). Generated locally on the device when it first joins the mesh. Never leaves the device. Used by the device's WireGuard implementation to authenticate handshakes and derive session keys. If this leaks, an attacker with the leaked key can impersonate the device on the mesh until you revoke and rotate it.

Machine key (control-plane authentication key). A separate keypair used to authenticate the device to the coordination service over the control-plane HTTPS connection. Also generated locally; the public component is registered with the coordination service when the device first logs in. If this leaks, an attacker can impersonate the device to the coordination service (but still needs the node key to actually use the WireGuard mesh).

Login identity (user/OIDC account). The human-or-service-account identity that authorized this device to join the mesh. Typically tied to a Google/Microsoft/GitHub/Okta account via OIDC. The coordination service binds the device's machine key to this identity, recording "device X belongs to user U." Policy rules evaluate against this identity: "user U is in group G, group G can access these resources."

Auth key (one-time provisioning token). For unattended provisioning (a server that needs to join the mesh without interactive login), the coordination service can issue a long-lived or one-time auth key. The device presents the auth key on first connection, and the coordination service treats it as equivalent to an interactive login. Auth keys are scoped (which user identity they represent, what tags the device gets, expiry).

Tags (machine-not-user identity labels). A device tagged tag:prod-db is treated by policy as having that role rather than belonging to a specific user. Tags are used for servers, CI runners, kiosks, IoT — anything where the "owner" is the organization rather than a specific human. ACL policy can reference tags (e.g., group:engineers can access tag:prod-db). Tag assignment is controlled by the organization's policy (only certain users can apply certain tags), preventing self-elevation.

When a new device joins:

  1. The local mesh client generates a fresh WireGuard keypair (the node key) and a fresh machine keypair.
  2. The client opens a browser to the coordination service's login URL.
  3. The user authenticates via OIDC. The coordination service learns the user's identity.
  4. The user (or auth-key flow) approves the device. The coordination service registers: "device with this machine pubkey, this WireGuard pubkey, belongs to this user, gets these tags."
  5. The coordination service assigns the device a mesh-internal IP from the configured pool (Tailscale's 100.64.0.0/10 is conventional, drawn from the CGNAT range to avoid conflicts with real internal networks).
  6. The device starts receiving control-plane updates: the public keys, endpoints, and IPs of other peers it's allowed to communicate with per current policy.
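
For the unattended auth-key path mentioned above, joining a server looks roughly like this with a Tailscale-style client. The key value is a placeholder and exact flag spellings vary across clients and versions:

# Unattended join of a server using a pre-issued auth key
sudo tailscale up \
    --authkey=tskey-auth-XXXXXXXXXXXX \
    --advertise-tags=tag:prod-web \
    --hostname=prod-web-1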

The lifecycle:

  • Re-key. Periodic node-key rotation can be configured. The new key is generated locally and registered; old key is revoked.
  • Re-auth. Periodic re-authentication of the user identity. If a user is removed from OIDC (they leave the company), their devices stop being able to refresh authorization and eventually drop off the mesh.
  • Revoke. A device or user can be revoked immediately by the admin via the coordination service; revocation propagates to all peers within seconds.

The model neatly separates "what is this device" (node key + machine key, both device-local) from "who does it belong to" (login identity, in the OIDC system) from "what is it allowed to do" (policy, evaluated against tags and identities). Each layer has clean rotation semantics.

NAT traversal and connection types

The NAT problem is what makes raw WireGuard hard for mesh deployments. Most devices in 2026 live behind NAT — laptops on home WiFi, phones on cellular data, even servers on corporate networks behind layers of corporate-managed NAT. Two devices behind separate NATs cannot directly send each other UDP without help.

Mesh VPNs implement what is effectively STUN/ICE-style NAT-traversal coordination on top of WireGuard's data path. The flow:

  1. Discover own NAT type. Each device, on startup, contacts the coordination service (or DERP servers) and observes how its own outbound packets get NAT-translated. Common NAT classifications:

    • Open Internet: the device has a public IP, no NAT. Easy.
    • Cone NAT (full, restricted, port-restricted): the NAT preserves source-port mappings consistently. Other devices can punch through if they know the external (IP, port) tuple.
    • Symmetric NAT: the NAT picks a different external port for each destination. Hole-punching is much harder; often impossible without a relay.
    • Carrier-grade NAT (CGNAT): common on cellular networks; behaves like nested NAT, often symmetric.
  2. Register endpoints with coordination service. The device tells the coordination service its observed external (IP, port) and its local LAN address. The coordination service stores this and shares it with peers when they ask.

  3. Hole-punching attempt. When device A wants to connect to device B, the coordination service tells both A and B about each other simultaneously. Both devices send WireGuard handshake initiations to each other's external (IP, port). If the NAT permits, the simultaneous outbound flows create state in both NATs that allows the inbound side of the other's flow through. This is "hole-punching."

  4. Direct-connection success. If hole-punching works, the WireGuard handshake completes and traffic flows directly. The coordination service is no longer in the data path.

  5. Direct-connection failure → DERP relay fallback. If the NATs are too restrictive (symmetric NAT on both sides, or aggressive firewalling), hole-punching fails. The mesh client falls back to DERP — a Tailscale-style relay network that forwards the WireGuard traffic. The relay sees only encrypted WireGuard packets (the relay can't decrypt them — it doesn't have the WireGuard keys), but it does see metadata: which two endpoints are talking, when, and roughly how much.

DERP (Designated Encrypted Relay for Packets, Tailscale's name) is a key piece of architecture. Tailscale operates DERP servers globally; Headscale and other self-hosted variants need you to run your own DERP relays, point clients at an existing DERP map, or accept that connections which fail hole-punching will degrade. Connection types in the Tailscale UI:

  • Direct (UDP): hole-punching succeeded; WireGuard flows peer-to-peer.
  • Direct (LAN): both devices on the same LAN; WireGuard flows over the LAN directly without going out and back through NAT.
  • DERP (relay): hole-punching failed; WireGuard flows through a DERP relay.

The transition is seamless. A connection might start as DERP (when both peers come online), then upgrade to direct UDP once hole-punching completes (typically within seconds). If a network change later breaks the direct path, the connection falls back to DERP and tries to re-establish direct.

The relay sees encrypted bytes, not plaintext, so confidentiality is not lost. What is lost is some performance (relay adds latency) and some metadata privacy (the relay operator can see who-is-talking-to-whom, when, and how much). This is one of the trust considerations of using a hosted DERP versus self-hosting.
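
On Tailscale-style clients you can observe all of this directly. Output formats vary by version, and the hostname below is a placeholder:

tailscale netcheck        # reports your NAT type and nearby DERP regions/latencies
tailscale status          # lists peers and whether each is direct, LAN, or relayed
tailscale ping prod-web-1 # shows whether packets reach the peer via DERP or directly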

Subnet routers, exit nodes, and mesh extensions

A mesh of N devices where every device only talks to mesh-internal IPs is the simplest configuration. Real deployments need to extend the mesh in two ways:

Subnet router (a.k.a. relay node, gateway): A device on the mesh that advertises a non-mesh subnet (e.g., your office LAN's 192.168.1.0/24) and forwards mesh traffic into that subnet. Other mesh devices can then reach 192.168.1.x addresses via the subnet router, even though those addresses aren't part of the mesh. This is how you bring legacy non-mesh devices (printers, IoT, servers that can't run the mesh client) into the network.

The subnet router runs a normal mesh client plus enables IP forwarding in the kernel. The coordination service is told "this device handles 192.168.1.0/24"; mesh clients then add a route entry pointing at the subnet router. From the perspective of a mesh client, the legacy subnet looks like just another reachable mesh range.

Configuration is a single line on the subnet router (tailscale up --advertise-routes=192.168.1.0/24 for Tailscale-style clients) and admin approval in the coordination service to actually accept the advertised route (otherwise rogue devices could advertise overlapping subnets).
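
Concretely, on a Linux subnet router the whole setup amounts to enabling forwarding and advertising the route. The subnet and file path are examples, and the advertised route still needs admin approval:

# Enable IP forwarding persistently, then advertise the LAN subnet
echo 'net.ipv4.ip_forward = 1'          | sudo tee -a /etc/sysctl.d/99-forwarding.conf
echo 'net.ipv6.conf.all.forwarding = 1' | sudo tee -a /etc/sysctl.d/99-forwarding.conf
sudo sysctl -p /etc/sysctl.d/99-forwarding.conf
sudo tailscale up --advertise-routes=192.168.1.0/24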

Exit node: A device on the mesh that other mesh devices can use as a default-gateway egress for non-mesh traffic. Set one device's mesh client to "exit node" mode; other clients can then optionally route their default route through that exit. Use cases: appearing to come from a particular IP for geo-restricted services, routing all traffic through a trusted home server, or providing controlled egress for a remote-worker device.
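
With a Tailscale-style client, the two halves look roughly like this. The exit-node name is a placeholder, the exit node must be approved by an admin first, and flag support varies by client version:

# On the device offering egress:
sudo tailscale up --advertise-exit-node

# On a device that wants to send its default route through it:
sudo tailscale set --exit-node=home-server
sudo tailscale set --exit-node=          # stop using the exit node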

This is where mesh VPNs start to overlap with conventional VPNs. An exit node behaves like a VPN concentrator — all your traffic egresses from there. The difference is that the exit choice is per-device and dynamic, not a config-file commitment, and the exit doesn't see your other mesh traffic (only the traffic you explicitly route through it).

Magic DNS / mesh DNS: The mesh provides an internal DNS layer: the coordination service distributes name-to-address mappings, and the local mesh client answers queries for them. A device named alice-laptop is reachable as alice-laptop (or alice-laptop.tail-net-name.ts.net in Tailscale's case) from any other device on the mesh. This eliminates the need to remember IPs.
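
In practice that means mesh names resolve from anywhere on the mesh. The names below are placeholders; 100.100.100.100 is the conventional Tailscale-style address of the local mesh resolver:

ssh alice-laptop                      # resolves via mesh DNS, no /etc/hosts entry needed
dig +short prod-web-1.tail-net-name.ts.net @100.100.100.100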

The mesh DNS resolver also handles split-horizon DNS for organizations: internal corporate DNS can be served via the mesh (resolve wiki.internal.company.com to a mesh-routable IP), while external DNS goes to the regular internet. See split-dns-for-internal-services for the broader split-DNS pattern.

Policy as identity rather than subnet trust

Traditional VPN access control is subnet-based: "users connecting via VPN can reach 10.10.0.0/16, but not 10.20.0.0/16." This works but is brittle — moving a service to a different subnet changes the access calculation, and it's hard to express user-or-role-based distinctions when the only handle you have is "is your packet from this CIDR?"

Mesh VPNs use identity-centric ACL policy. A simple Tailscale-style ACL example:

{
  "groups": {
    "group:engineers": ["alice@example.com", "bob@example.com"],
    "group:on-call":   ["alice@example.com", "carol@example.com"]
  },
  "tagOwners": {
    "tag:prod-db":   ["group:engineers"],
    "tag:prod-web":  ["group:engineers"]
  },
  "acls": [
    { "action": "accept", "src": ["group:engineers"], "dst": ["tag:prod-web:80,443,22"] },
    { "action": "accept", "src": ["group:on-call"],   "dst": ["tag:prod-db:5432,22"] },
    { "action": "accept", "src": ["tag:prod-web"],    "dst": ["tag:prod-db:5432"] }
  ],
  "ssh": [
    { "action": "accept",
      "src": ["group:engineers"],
      "dst": ["tag:prod-web", "tag:prod-db"],
      "users": ["root", "ubuntu"] }
  ]
}

Reading this:

  • Two groups defined: engineers (alice, bob) and on-call (alice, carol).
  • Two tags can be applied (only by members of the engineers group): prod-db and prod-web.
  • Engineers can connect to prod-web devices on ports 80, 443, and 22.
  • On-call rotation can connect to prod-db devices on Postgres port 5432 and SSH port 22.
  • Prod-web devices can connect to prod-db devices on Postgres port 5432 only.
  • Engineers can SSH to prod-web and prod-db as user root or ubuntu.

What's notable: nowhere in this policy do CIDRs appear. The policy reasons in terms of who and what (users, groups, tags), not where (subnets, IPs). This is a cleaner mental model for most access-control intentions — "engineers can access prod" is what you actually mean, and the policy expresses it directly.

The mesh evaluates policy at connection time. When alice's laptop tries to open a TCP connection to a tag:prod-db device, the local mesh client checks: "is alice in a group that can access prod-db on this port?" If yes, the connection is allowed (and the WireGuard peer entry was distributed in the first place because policy allowed it; if policy didn't allow it, alice's laptop wouldn't even know prod-db's mesh IP). If no, the local client refuses the connection without the destination ever seeing it.

This is enforced on both sides: the source rejects the connection if policy forbids it; the destination also checks and rejects (defense in depth). Policy is pushed from the coordination service to all relevant endpoints; both endpoints have the same view of policy when evaluating a given connection.

The downsides:

  • Centralized policy. All devices must agree on policy state. A coordination-service outage that prevents policy distribution can degrade the mesh (existing connections stay up; new connections may fail to authorize).
  • Latency on connection. First connection between two peers requires a quick policy check; usually fast (cached locally) but can add tens of milliseconds in degenerate cases.
  • Policy expressiveness limits. ACL languages are simpler than full programming languages; complex conditional logic ("alice can access prod-db only between 9am and 5pm Monday through Friday") may not be expressible in basic ACL syntax. Some mesh systems extend with attribute-based access control or external policy evaluators (OPA-style integration); this adds complexity.

For most teams, the tradeoff favors the identity model. Subnet-based policy in a 50-person engineering org is unmanageable; identity-based policy in the same org is straightforward.

Trust boundaries and control-plane power

The control plane has substantial power, even though it doesn't see plaintext data. Worth being explicit about what it can and cannot do, because "Tailscale can't read your traffic" is a true but incomplete statement.

What the control plane can do:

  • Block any device from joining the mesh.
  • Block any pair of devices from establishing connections (by withholding peer info).
  • Push policy updates that change who can talk to whom.
  • See the mesh topology graph: which devices exist, which user identities own them, what tags they have, when they last checked in, what their NAT type is, what their public-internet endpoints are.
  • See connection-establishment metadata: when device A asked for device B's connection info, when DERP relays carried traffic between A and B (if relay is used).
  • For DERP relays specifically: see source and destination of relayed packets, see packet sizes and timing, see total volumes. Cannot decrypt the WireGuard payload.

What the control plane cannot do:

  • Passively decrypt the data plane. WireGuard private keys exist only on endpoint devices.
  • Recover any device's WireGuard private key. Keys are generated locally and never leave the device.
  • Read the contents of any direct (non-DERP) connection between two peers. The control plane is not in the data path for direct connections.
  • Forge new identities without the OIDC provider's cooperation. Logging in requires authenticating to the configured OIDC provider, which the control plane does not control.
  • Selectively decrypt some sessions while leaving others intact. There is no escrowed or master key material; reaching any session's plaintext would require an endpoint's private key.

Mitigations for control-plane trust:

  • Tailnet Lock (Tailscale's name) requires that any new node-key registration be signed by an existing trusted node before peers will accept it. This means even a compromised coordination service cannot inject a malicious device into the mesh or substitute key material for an existing one, because peers refuse to trust node keys that aren't co-signed by existing trusted nodes. Support for an analogous mechanism varies across self-hosted alternatives.
  • Self-hosting the control plane moves the trust to your own infrastructure. You're now responsible for the security of the control plane, but you control it. Headscale, Netbird's self-hosted variant, Innernet, Netmaker, and others all support this.
  • Audit logging of admin actions in the coordination service. Most systems log who changed what policy when. This is usually accessible only to admins, but valuable for forensic analysis.
  • End-to-end application-layer crypto doesn't depend on the mesh trust at all — your TLS and SSH still authenticate at the application layer. The mesh is one layer of defense; application-layer auth is another. A compromised mesh control plane doesn't break TLS that goes over the mesh.

For most threat models — small to medium teams, infrastructure access for engineers, replacing legacy site-to-site VPNs — the control-plane trust assumption is acceptable and the operational benefits are large. For threat models where the control plane being compromised would be catastrophic (e.g., national-security-relevant operations, organized adversaries with the resources to compromise hosted services), self-hosting and Tailnet-Lock-style co-signing become important. Either way, understanding what the control plane can and cannot do is a prerequisite for accurately reasoning about the mesh's security properties.

Operational ergonomics

The reason teams adopt mesh VPNs comes down to a few specific operational properties:

No manual key distribution. New device joins via OIDC login; key exchange happens automatically. Compare to raw WireGuard's "edit the config on every existing device when a new one joins."

Roaming works. A laptop moves from home WiFi to coffee shop WiFi to office WiFi without losing mesh access. The endpoint update goes through the control plane; peers learn the new endpoint and re-handshake. With raw WireGuard, peers only learn the new endpoint when the moved device happens to send them traffic, and two NATed devices that both moved may never rediscover each other without manual endpoint edits.

Onboarding is fast. New employee joins, gets added to the OIDC group, installs the mesh client, logs in, gets access to the resources their group can see. No spreadsheet of WireGuard configs, no manual ACL edits per-device.

Offboarding is fast. Remove the user from OIDC; their devices lose mesh access on next refresh (within minutes, not days).

Default-deny policy. Once an ACL policy is in place, a mesh connection is refused unless an accept rule explicitly allows it; a new device can reach only what policy grants it. Compare to a flat VPN where joining the VPN often grants broad access by default.

SSH integration. Tailscale SSH and similar features let the mesh handle SSH authentication, replacing per-host SSH key management with mesh-identity-based SSH. Engineers SSH to tag:prod-web and the mesh authenticates them as the right user without per-server key distribution. Some teams find this transformative; others find it adds another vendor dependency to a previously-simple system.
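
A sketch of what that looks like with a Tailscale-style client; the hostname and user are placeholders, and the ssh block in the earlier ACL example is the policy that authorizes the login:

# On the server, let the mesh client handle SSH authentication:
sudo tailscale up --ssh

# From any authorized device, plain ssh now authenticates via mesh identity,
# with no per-server authorized_keys management:
ssh ubuntu@prod-web-1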

Magic DNS. Devices reachable by name without manual /etc/hosts editing.

The trade-offs:

  • Vendor dependency. If you use the hosted variant (Tailscale's coordination service), you depend on the vendor's uptime. Outages have happened; existing connections stay up but new connections and policy updates wait. Mitigation: self-host (Headscale) if uptime is critical.
  • Coordination-plane attack surface. As discussed, the coordination service has substantial power. Compromise has serious implications.
  • Metadata exposure. The coordination service knows the mesh topology, which is sensitive in some threat models. Self-hosting addresses this; using hosted requires accepting it.
  • DERP usage. Performance degrades when direct connections fail and DERP is used. For most networks this is rare; for highly-restrictive corporate networks or symmetric-NAT cellular, it can be common.
  • OIDC dependency. The mesh is only as secure as your OIDC. A compromised OIDC account compromises the device. (Same as any modern cloud-tools setup, but worth being explicit about.)

For zero-trust-for-small-teams-style deployments — where access is identity-based, network position is irrelevant, and every connection is authenticated and authorized — mesh VPNs are a natural foundation.

Hands-on exercise

Map the control plane and data plane of one mesh event.

Tools: text analysis. Runtime: 10 minutes.

Narrate the following scenario:

Alice's laptop joins the mesh for the first time. After joining, she opens an SSH session to a server tagged prod-web from her home WiFi.

Expected narration:

[control plane events]
1. Alice runs the mesh client. It generates a fresh WireGuard keypair (node key)
   and a fresh machine keypair locally.
2. Mesh client opens a browser to the coordination service login URL.
3. Alice logs in via OIDC (e.g., Google). The coordination service learns
   "this person is alice@example.com, member of group:engineers."
4. The coordination service registers the device:
   - assigns a mesh IP (e.g., 100.64.0.42)
   - records the WireGuard public key
   - records the machine public key
   - associates the device with alice@example.com
   - applies the policy that alice's group is allowed to access tag:prod-web on port 22
5. The coordination service pushes the new policy / peer info to all existing devices that
   should be aware of alice's laptop (devices tagged prod-web, in this case).
6. The coordination service pushes to alice's laptop the list of peers she's
   allowed to see (the prod-web devices, with their public keys, mesh IPs, and
   current network endpoints).
7. The coordination service also handles NAT-type discovery: alice's laptop
   contacts the coordination service from her home WiFi external IP; the service
   notes the external (IP, port) for hole-punching purposes.

[data plane events for the SSH connection]
8. Alice runs `ssh prod-web-1`.
9. The local DNS subsystem resolves `prod-web-1` via mesh DNS to its mesh IP
   (e.g., 100.64.0.10).
10. The kernel routes the TCP connection to the mesh interface (TUN device managed
    by the mesh client).
11. The mesh client recognizes this is a peer it has WireGuard config for, and
    sees from the policy cache that the connection is allowed.
12. WireGuard handshake initiation packet is sent to prod-web-1's external endpoint
    (which alice's laptop learned from the coordination service).
13. Simultaneously, prod-web-1 attempts to send to alice's laptop's external endpoint
    (which was pushed to it during the policy update). Hole-punching succeeds
    if both NATs cooperate.
14. WireGuard handshake completes. Encrypted data plane is now direct between
    alice's laptop and prod-web-1.
15. SSH bytes flow through the WireGuard tunnel. Coordination service sees
    none of this traffic.

[fallback case]
If hole-punching had failed (e.g., alice was on cellular with symmetric NAT):
14b. WireGuard handshake initiation packets get blocked by NAT in both directions.
15b. After a few seconds of failure, mesh client falls back to DERP.
16b. Both alice's laptop and prod-web-1 maintain TLS connections to a DERP
     relay server.
17b. Encrypted WireGuard packets get sent via the DERP relay; the relay forwards
     them. The relay sees encrypted bytes plus metadata (who-talks-to-whom),
     not plaintext.
18b. Mesh client continues to attempt direct hole-punching periodically; if
     it eventually succeeds, the connection upgrades to direct.

The point: the control plane orchestrates everything that prepares the connection; the data plane is what carries the connection. Step 7 (NAT discovery) and step 13 (hole-punching) are the machinery that makes mesh VPNs feel like they "just work" in environments where raw WireGuard would silently fail.

Read one ACL/tag policy example.

Tools: text editor. Runtime: 10 minutes.

Take the policy from the "policy as identity" section above. Answer:

  • Who can access tag:prod-db on port 5432?
  • Can tag:prod-web access tag:prod-web on any port? (Hint: the ACL model is default-deny — connections are refused unless explicitly allowed. Look for a rule allowing it. Spoiler: there's no such rule, so prod-web devices can't talk to each other through the mesh.)
  • If alice's laptop is unhealthy and you want to revoke her access immediately, what's the smallest change to this policy that achieves it?
  • If you want carol to be able to SSH to prod-web devices, what change is needed?

Stretch: explain why subnet-based VPN ACLs ("CIDR X can reach CIDR Y") would struggle to express the same intent cleanly, especially as the team grows and as service deployments shift between subnets.

Common misconceptions and traps

"Tailscale is just WireGuard with a UI." No. The coordination plane (identity, policy, NAT-traversal coordination, DERP relays, SSH integration, magic DNS) is most of the product. WireGuard is the data-path substrate; the rest is what makes the mesh experience.

"If packets are end-to-end encrypted, the control plane doesn't matter." The control plane still governs who joins, who learns about whom, what policies apply, and who can connect to what. A compromised control plane can block legitimate access, push misleading peer info, or expose mesh topology metadata — even though it can't decrypt user traffic.

"A relay means the encryption is broken." DERP relays carry encrypted WireGuard packets; the relay cannot decrypt them. What the relay sees is metadata (who-to-whom, when, how much). Confidentiality of payloads is preserved; metadata privacy is reduced.

"Mesh policy is equivalent to routing policy." They overlap but aren't identical. Routing policy decides which IPs go where; identity policy decides which identities can connect to which identities (regardless of where they're physically located). A mesh policy that permits group:engineers → tag:prod-db doesn't care that engineers are roaming from coffee shop to coffee shop or that prod-db is in AWS us-east-1 — the identity-based decision is location-independent.

"Self-hosting the coordination plane removes all trust concerns." It moves the trust boundary and the operational burden. You now trust your own ops team to run the coordination plane securely; you accept the responsibility for keeping it patched, monitored, and recoverable from compromise. Self-hosting is not a magic solution — it's a different set of tradeoffs.

"The mesh client running on every device is a security risk." It is software running with elevated privileges, so any code-quality bug is potentially serious. Mitigation is the same as for any kernel-adjacent software: rely on the project's security track record, keep it updated, monitor for CVEs. Mesh-VPN clients have had vulnerabilities (as has every comparable software project); both Tailscale and major open-source alternatives have functioning security-disclosure processes.

"Mesh VPNs replace firewalls." They complement firewalls, they don't replace them. Defense in depth still matters. The mesh adds an authentication layer to network connections; host firewalls add a second authorization layer on the destination side. Most mature deployments use both.

"My mesh is slow because it's a mesh." Probably not. Mesh VPNs over direct WireGuard are fast — usually within 5-10% of bare WireGuard performance. If your mesh is slow, the most common causes are (1) you're going through DERP because hole-punching failed (check connection type), (2) MTU mismatches causing fragmentation, (3) one of the peers has limited CPU and is the bottleneck, or (4) the underlying network is the bottleneck (a mesh over a 100 Mbps WAN is limited to 100 Mbps).

Wrapping up

Mesh VPNs separate WireGuard's data-plane transport from a centralized control plane that handles identity, key distribution, NAT traversal, peer discovery, and policy. The result is the operational simplicity that raw WireGuard lacks at scale, with end-to-end encryption preserved (the control plane never sees plaintext), at the cost of trusting the coordination service for everything except actual data confidentiality.

Understanding the architecture — what the control plane is, what it can and cannot do, what policy looks like, how NAT traversal and DERP fallback fit together — is what lets you reason about whether a mesh VPN fits your threat model and your deployment shape. For most modern teams (engineers needing flexible access to infrastructure, distributed organizations replacing legacy site-to-site VPNs, small teams adopting zero-trust patterns without a budget for enterprise tooling), mesh VPNs are the right answer. For deployments where the control-plane trust assumption is unacceptable, self-hosted variants (Headscale, Netbird self-hosted, Netmaker) preserve the architectural pattern while moving the trust to your own infrastructure.

The next module (mtls-and-zero-trust-transport — coming soon) goes one layer up: how mTLS-based zero-trust deployments use mutual-TLS authentication to do at the application/transport layer what mesh VPNs do at the network layer, and where the two patterns complement each other versus compete.
