Self-Hosted Infrastructure · 17 min read

AWS Lightsail Data Transfer Quotas: A Postmortem on Cross-Instance Pool Accounting

Five gotchas that turn an innocent Lightsail VPN deployment into a surprise bill — pool is per-region+per-bundle (not per-instance), delete+recreate inherits usage, stopped instances still bill bundle, per-instance metrics lie after delete, IPv6 sysctl doesn't catch already-up interfaces. With the kill switches, budgets, and CloudWatch alarms that would have prevented all of it.

A multi-region Lightsail fleet of cheap $12/mo instances ran into a wall of "exceeded quota" warnings, an unrecoverable instance delete, and a bill that climbed for 18 hours straight even after the offending box was off. Final damage: about $30 in overage, $24 in bundle hours, plus a lost weekend. None of it was unavoidable — every line item maps to a documented AWS behavior most engineers don't internalize until they get burned. This post is the writeup.

TL;DR

If you're running Lightsail at any scale, internalize these five rules:

  1. Pool is per-region + per-bundleId, not per-instance. Two small_3_0 instances in the same region share one combined 6 TB pool (2 × 3 TB). The same small_3_0 in two different regions gets two independent 3 TB pools.
  2. Both inbound AND outbound count toward the pool. But only outbound is billed for overage. So your dashboard meter can be at 100% while your wallet says "fine" — until OUT also crosses the line.
  3. Delete-and-recreate within the same month inherits pool usage. A fresh instance does NOT get a fresh quota.
  4. Stopped instances still bill the bundle. Only deletion stops that meter.
  5. Per-instance GetInstanceMetricData is blind to deleted predecessors. A panel reading from this API will say "0 GB" while the regional pool is at 105%. Cost Explorer is the authoritative source.

Plus three bonus footguns:

  • IPv6 sysctl is per-interface. Setting net.ipv6.conf.all.disable_ipv6 = 1 does NOT retroactively disable v6 on an interface that's already up. After a snapshot restore, dual-stack instances come up with public IPv6 enabled even though sysctl says it should be off.
  • Cost Explorer API costs $0.01 per call. A naive monitoring panel that polls every minute makes 1,440 calls a day: about $14.40/day, or roughly $432/month, in CE charges.
  • Billing lag is real. The Bills page is 6-12h behind, Cost Explorer is up to 24h behind, CloudWatch is 5-15 min behind. The bill rises long after you've stopped the instance because of post-deletion byte reporting.

The setup

A four-node VPN fleet on Lightsail:

Role                                     AWS region       Bundle      Pool
US primary                               us-east-2        small_3_0   3 TB/mo
Asia (Tokyo)                             ap-northeast-1   small_3_0   3 TB/mo
US secondary (Oregon, added mid-month)   us-west-2        small_3_0   3 TB/mo
UK-labeled (physically in us-east-2)     us-east-2        nano_3_0    1 TB/mo

Each instance runs xray-core with VLESS+Reality, egresses via a residential-proxy provider. Real client IPs only exist inside the encrypted tunnel between client and instance. Standard self-hosted setup; nothing exotic.

The incident started around day 20 of the month, when a control panel showed Tokyo at "114.96% of 3 TB" and projected end-of-month at "177.8% of 3 TB."

What I thought was happening

I read "114.96% of 3 TB" and assumed:

  • AWS was about to charge overage on every byte past 3 TB
  • At Lightsail's Tokyo rate of $0.14/GB, projected damage was $300+
  • I needed to slam the brakes immediately

I stopped xray on the box, opened the AWS Billing page in panic, and found... $4.37 in overage charges. Not $300. Less than the cost of lunch.

This is the moment I learned Rule #2: both IN and OUT count toward the pool, but only OUT-overage is billed.

Gotcha #1: IN+OUT count toward the pool; only OUT past quota costs money

Right from the official AWS Lightsail FAQ:

Both data transfer IN and data transfer OUT of your instance count toward your data transfer allowance. If you exceed your data transfer allowance, you will only be charged for the excess data transfer OUT.

So the actual billing model is:

quota = 3072 GB combined (IN + OUT) per instance-month per bundle per region
                                              ↑ note all four scopes

before_quota_crossed:   any traffic            $0
after_quota_crossed:    OUT bytes only         $0.14/GB
                        IN bytes still         $0

The control panel I'd built was reading the AWS CloudWatch NetworkIn and NetworkOut per-instance, summing them, and comparing to 3 TB. The summary number was technically correct for the pool meter — but it was not the number that determined the bill. A panel that puts a single "GB used" figure next to a quota implies "this is what you'll be billed against." That's wrong for Lightsail.

What you actually want to display is two numbers:

Pool used:  3,233 GB / 3,072 GB  (105.2%, yes, over)
$ billed:   $21.20                 (151 GB of OUT past quota × $0.14)

Those numbers tell two completely different stories. The first says "you crossed the pool line." The second says "here's the actual money."

In my case, of 161 GB of combined overage at the time, only 31 GB was outbound. The other 130 GB was inbound (from the residential proxy back through the VPS to the client), which stays free even past quota. The actual damage was $4.37, not the roughly $22.50 that billing all 161 GB at $0.14/GB would have cost.
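The billing rule above can be sketched as a small function. This is an illustrative model, not AWS's actual metering: the name `bill_for_month`, the chronological `(in_gb, out_gb)` slices, and the proportional split of the slice that crosses the quota are all assumptions made for the example. The figures are the rounded ones from this post, so the result lands a few cents off the real $4.37 line item.

```python
def bill_for_month(samples, quota_gb=3072.0, overage_rate=0.14):
    """Return (pool_used_gb, billed_out_gb, overage_usd).

    samples: chronological (in_gb, out_gb) slices of traffic.
    Assumption: inside one slice, IN and OUT bytes are spread evenly, so the
    slice that crosses the quota is split proportionally.
    """
    used = 0.0
    billed_out = 0.0
    for in_gb, out_gb in samples:
        total = in_gb + out_gb
        remaining = max(quota_gb - used, 0.0)
        if total > remaining:
            # Fraction of this slice that landed after the pool was exhausted.
            over_fraction = (total - remaining) / total
            # Only the outbound part of the excess is billable.
            billed_out += out_gb * over_fraction
        used += total
    return used, billed_out, round(billed_out * overage_rate, 2)


# Symmetric traffic fills the pool exactly, then 130 GB in / 31 GB out on top.
used, billed_out, usd = bill_for_month([(1536, 1536), (130, 31)])
print(f"pool used: {used:.0f} GB ({used / 3072 * 100:.1f}%)")
print(f"billed:    {billed_out:.0f} GB out -> ${usd:.2f}")
```

The two printed lines are the two numbers a panel should show side by side: the pool meter counts all 161 GB of excess, while the bill only sees the 31 GB that left the instance.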

Gotcha #2: The pool is per-region AND per-bundleId, not per-instance

This is the gotcha that cost real money.

From the AWS FAQ (Example 2):

If you have two instance bundles (bundleId nano_3_0) for a full month in a region, each with 1 TB per month data transfer allowance, you get 2 TB data transfer allowance in aggregate.

And Example 3 clarifies the per-bundle part:

You create two sets of instance bundles: set A with two instance bundles (bundleId nano_3_0) and set B with three instance bundles (bundleId micro_3_0), both in the same region. In aggregate, this gives you 2 TB of data transfer allowance for set A, and 6 TB for set B.

The rule, distilled:

pool_GB = sum_over_instances(instance.bundle.transferGB)
          for all instances where
            instance.region   == this_region
            instance.bundleId == this_bundleId

Implications most engineers miss:

  • Same bundle, same region → pooled. Two small_3_0 in us-east-2 share one 6 TB pool.
  • Same bundle, different region → independent. small_3_0 in us-east-2 and small_3_0 in us-west-2 each have their own 3 TB.
  • Different bundles, same region → independent. small_3_0 and nano_3_0 in us-east-2 are completely separate pools.

I'd been treating the four-node fleet as one shared allowance: three 3 TB bundles plus one 1 TB bundle, about 10 TB of total quota I could spread however I liked. Wrong. The correct mental model is:

us-east-2  small_3_0:  3 TB (US-primary alone in its pool)
us-east-2  nano_3_0:   1 TB (UK-labeled alone in its pool)
ap-northeast-1 small_3_0: 3 TB (Tokyo)
us-west-2  small_3_0:  3 TB (Oregon)

Four independent pools. Four separate quota meters. The UK-labeled instance and US-primary live in the same region but never share a pool because they're different bundles.
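The grouping rule can be checked mechanically. The sketch below is a minimal illustration: the instance names are hypothetical labels for the four nodes in this post, and the per-bundle allowances are hard-coded from the figures quoted here rather than fetched from lightsail:GetBundles.

```python
from collections import defaultdict

# Per-instance monthly allowance in GB as quoted in this post (3 TB = 3072 GB).
BUNDLE_TRANSFER_GB = {"small_3_0": 3072, "nano_3_0": 1024}

# Hypothetical names; regions and bundles match the fleet table.
FLEET = [
    {"name": "us-primary", "region": "us-east-2", "bundle_id": "small_3_0"},
    {"name": "tokyo", "region": "ap-northeast-1", "bundle_id": "small_3_0"},
    {"name": "oregon", "region": "us-west-2", "bundle_id": "small_3_0"},
    {"name": "uk-labeled", "region": "us-east-2", "bundle_id": "nano_3_0"},
]


def pools(fleet):
    """Group instances by (region, bundle_id) and sum their allowances."""
    result = defaultdict(lambda: {"quota_gb": 0, "members": []})
    for inst in fleet:
        key = (inst["region"], inst["bundle_id"])
        result[key]["quota_gb"] += BUNDLE_TRANSFER_GB[inst["bundle_id"]]
        result[key]["members"].append(inst["name"])
    return dict(result)


for (region, bundle_id), pool in sorted(pools(FLEET).items()):
    print(f"{region:15} {bundle_id:10} {pool['quota_gb']:5} GB  {pool['members']}")
```

Adding a second small_3_0 in us-east-2 to `FLEET` would not create a fifth pool; it would raise the existing us-east-2 small_3_0 pool to 6144 GB with two members.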

Gotcha #3: Delete + recreate within the same month INHERITS pool usage

This is the gotcha that turned a minor incident into a real bleed. Here's the FAQ verbatim (Example 5):

If you delete all your instances, and create new instances of the same bundle in the same Region within the same billing month, your data transfer utilization will still be [the prior amount] and remaining data transfer allowance will still be [whatever was left].

And Example 6:

After using your monthly data transfer allowance and accruing overage, if you now create another new instance of the same bundle, you will still be charged the data transfer OUT overage fee previously accrued. Further data transfer OUT through these instances will continue to accrue overage fees.

In other words: the pool counter is per-region-per-bundleId and it does NOT reset when you delete and recreate. The new instance starts at the OLD instance's accumulated usage. Every byte it sends past the (already crossed) quota is immediately billed at the overage rate.

My incident: I deleted Tokyo at 95% of pool to "stop the bleed", then restored from snapshot ~1 hour later because of an unrelated operational need. The restored instance immediately occupied a pool that was already past quota. Every byte it sent was overage-rate.

The CloudWatch GetInstanceMetricData API for the new instance showed NetworkOut = 0.08 GB total (the new instance's actual lifetime). The bill at the same moment showed APN1-DataXfer-Out-Overage-Bytes = 151 GB at $21.20. The new instance "did" 0.08 GB; the regional pool charged for 151 GB. The 151 GB came from the deleted instance, still being reported by AWS billing 12-24h after the delete.

This is the bug that made my control panel useless: it polled per-instance metrics, found near-zero on the new instance, and confidently displayed "0% of 3 TB used." Meanwhile the pool meter (which Cost Explorer can show, but I wasn't yet querying it) showed 105%.

The fix is to read the pool number, not the per-instance number. More on that below.
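To make the mismatch concrete, here is a toy ledger that keeps the two views side by side. It is a mental model only: `PoolLedger`, the instance names, and the `"this-month"` key are invented for the illustration and do not correspond to any AWS API.

```python
from collections import defaultdict


class PoolLedger:
    """Toy model: pool usage is keyed by (region, bundle_id, month)."""

    def __init__(self):
        self.pool_gb = defaultdict(float)  # survives instance deletion
        self.instance_gb = {}              # per-instance view, dies with the instance

    def create(self, name):
        self.instance_gb[name] = 0.0

    def delete(self, name):
        # The per-instance history disappears; the pool total does not.
        del self.instance_gb[name]

    def transfer(self, name, pool_key, gb):
        self.instance_gb[name] += gb
        self.pool_gb[pool_key] += gb


ledger = PoolLedger()
tokyo_pool = ("ap-northeast-1", "small_3_0", "this-month")

ledger.create("tokyo-old")
ledger.transfer("tokyo-old", tokyo_pool, 3233.0)
ledger.delete("tokyo-old")

ledger.create("tokyo-new")  # restored from snapshot in the same month
ledger.transfer("tokyo-new", tokyo_pool, 0.08)

print("per-instance view:", ledger.instance_gb)
print("pool view: %.1f%% of 3072 GB" % (ledger.pool_gb[tokyo_pool] / 3072 * 100))
```

A panel built on the first dictionary reports 0.08 GB; the second one is what the bill follows.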

Gotcha #4: Stopped Lightsail instances still bill the bundle

RUNNING instance  → bundle bills at $X/hour, NIC bills for data transfer overage if any
STOPPED instance  → bundle STILL bills at $X/hour, no data transfer happens
DELETED instance  → bundle stops billing immediately, no more charges accrue

If you stop a Lightsail instance to "pause" it for a month, you continue paying the bundle. Stopping only saves the data transfer side of the bill (which is usually small for an idle box). For a $12/mo small_3_0 that you stop on day 20 and intend to start again on the 1st, you'll still pay $4 for those 10 days.

If you genuinely want $0/month, delete the instance and keep the snapshot. Lightsail snapshots cost $0.05/GB-month — for a 60 GB SSD instance with say 5 GB of actual state, that's $0.25/month. Restore takes about 3 minutes when you need the box back.
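The stop-versus-delete arithmetic is easy to sanity-check. A minimal sketch, assuming a 30-day month and simple day-based prorating (the real bundle meter is hourly); the function names are made up for the example, and the prices are the ones quoted above.

```python
def stopped_cost(bundle_usd_per_month, days_stopped, days_in_month=30):
    """Bundle charge that keeps accruing while the instance is merely stopped."""
    return round(bundle_usd_per_month * days_stopped / days_in_month, 2)


def snapshot_cost(snapshot_gb, days_kept, usd_per_gb_month=0.05, days_in_month=30):
    """Storage charge for keeping only a snapshot of the deleted instance."""
    return round(snapshot_gb * usd_per_gb_month * days_kept / days_in_month, 2)


print(stopped_cost(12, 10))   # $12/mo bundle stopped for 10 days
print(snapshot_cost(5, 30))   # 5 GB snapshot kept for a full month
print(snapshot_cost(5, 10))   # same snapshot kept for the same 10 days
```

For the 10-day pause in the example, stopping costs $4.00 while delete-plus-snapshot costs under ten cents.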

Gotcha #5: Per-instance metrics lie after delete + recreate. Cost Explorer is authoritative.

The Lightsail API has two relevant data-transfer surfaces:

lightsail:GetInstanceMetricData --metric-name NetworkOut
  → per-instance, real-time-ish (5-15 min lag), DIES with the instance
  → useless for "regional pool consumed this month"

ce:GetCostAndUsage --filter Service=Lightsail --group-by USAGE_TYPE
  → per-account, per-region, per-usage-type
  → has up to 24h lag but SURVIVES instance lifecycle
  → returns line items like APN1-TotalDataXfer-Out-Bytes, APN1-DataXfer-Out-Overage-Bytes

If your monitoring code reads only per-instance metrics, it's blind to:

  • Deleted instances' contributions to this month's pool
  • Other instances of the same bundleId in the same region

Both of those still count toward your bill. Cost Explorer is the only API that tells the truth.

A correct pool query, expanded:

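# `date -v+1m` is BSD/macOS syntax. On GNU/Linux, use instead:
#   End=$(date -u -d "$(date -u +%Y-%m-01) +1 month" +%Y-%m-01)
aws ce get-cost-and-usage \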
  --region us-east-1 \
  --time-period Start=$(date -u +%Y-%m-01),End=$(date -u -v+1m +%Y-%m-01) \
  --granularity MONTHLY \
  --metrics UsageQuantity UnblendedCost \
  --filter '{
    "And":[
      {"Dimensions":{"Key":"SERVICE","Values":["Amazon Lightsail"]}},
      {"Dimensions":{"Key":"USAGE_TYPE_GROUP","Values":["Lightsail: Data Transfer"]}}
    ]
  }' \
  --group-by Type=DIMENSION,Key=USAGE_TYPE

The output groups by usage type. The prefix tells you the region:

APN1-TotalDataXfer-In-Bytes      1,677 GB      $0.00     ← Tokyo inbound (free)
APN1-TotalDataXfer-Out-Bytes     1,709 GB      $0.00     ← Tokyo outbound in-quota
APN1-DataXfer-Out-Overage-Bytes    151 GB     $21.20     ← Tokyo outbound over
USE2-TotalDataXfer-In-Bytes        372 GB      $0.00     ← Ohio inbound
USE2-TotalDataXfer-Out-Bytes       371 GB      $0.00     ← Ohio outbound (in-quota)
USW2-BundleUsage:2GB                14 hrs     $0.23     ← Oregon (just started)

To compute pool consumed for a single-bundle region:

pool_used_gb = in_bytes_gb + out_in_quota_gb + out_overage_gb
pool_quota_gb = transferPerMonthInGb  # from lightsail:GetBundles
pool_pct = pool_used_gb / pool_quota_gb * 100

For regions with multiple bundles of different types, you can't separate them from Cost Explorer alone — fall back to per-instance metrics for each bundle's slice and accept the limitation that deleted instances' contributions are merged into the regional total.
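As a worked example of the formula, the sketch below sums the data-transfer line items per region prefix, using the quantities from the sample output earlier in this section. The helper names are hypothetical, and it assumes, as that output suggests, that Cost Explorer reports these usage quantities in GB.

```python
from collections import defaultdict

# Usage quantities in GB, copied from the sample Cost Explorer output.
LINE_ITEMS = {
    "APN1-TotalDataXfer-In-Bytes": 1677,
    "APN1-TotalDataXfer-Out-Bytes": 1709,
    "APN1-DataXfer-Out-Overage-Bytes": 151,
    "USE2-TotalDataXfer-In-Bytes": 372,
    "USE2-TotalDataXfer-Out-Bytes": 371,
    "USW2-BundleUsage:2GB": 14,  # bundle hours, not transfer: must be skipped
}


def transfer_by_region(line_items):
    """Sum every DataXfer line item per region prefix (APN1, USE2, ...)."""
    totals = defaultdict(float)
    for usage_type, gb in line_items.items():
        prefix, _, rest = usage_type.partition("-")
        if "DataXfer" in rest:
            totals[prefix] += gb
    return dict(totals)


def pool_pct(used_gb, quota_gb):
    return used_gb / quota_gb * 100


used_by_region = transfer_by_region(LINE_ITEMS)
print(used_by_region)
print(f"APN1: {pool_pct(used_by_region['APN1'], 3072):.1f}% of 3072 GB")
```

The USE2 total (743 GB) covers both the small_3_0 and the nano_3_0 pool in us-east-2, which is exactly the multi-bundle limitation described above: the regional sum cannot be split per bundle from this data alone.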

One important caveat: Cost Explorer charges $0.01 per request. Cache aggressively. My recommendation is a one-hour TTL on the panel: that caps spend at about $7.20/month (24 × 30 calls) even if you stare at the panel constantly. A daily TTL brings it down to about $0.30/month.
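A cache for this can be very small. The sketch below is one possible shape, not the panel's actual code: `TTLCache` and `max_monthly_cost` are names invented here, the clock is injectable so the behavior can be tested without waiting, and the $0.01 price is the per-request figure quoted above.

```python
import time


class TTLCache:
    """Hold one value for ttl_seconds; call fetch() again only after expiry."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None
        self.fetch_count = 0  # number of paid upstream calls so far

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
            self.fetch_count += 1
        return self._value


def max_monthly_cost(ttl_seconds, usd_per_call=0.01, days=30):
    """Worst-case spend if the cache is hit continuously for a whole month."""
    calls = days * 24 * 3600 / ttl_seconds
    return round(calls * usd_per_call, 2)


print(max_monthly_cost(60))     # polling every minute
print(max_monthly_cost(3600))   # one-hour TTL
print(max_monthly_cost(86400))  # daily TTL
```

Wrapping the Cost Explorer call as `TTLCache(fetch_ce, 3600)` means every panel refresh inside the hour is served from memory; only the first refresh after expiry costs a cent.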

The bonus gotcha: IPv6 sysctl is per-interface, not retroactive

The Lightsail "Dual-stack" networking option is enabled by default for most bundles. That means every instance comes up with both a public IPv4 and a public IPv6 address. Even when your /etc/sysctl.d/99-disable-ipv6.conf says:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

The default and all settings only apply to interfaces that come up after sysctl is loaded. The instance's primary network interface (ens5 on Lightsail) is up before sysctl runs, so it keeps its IPv6 address and route. Your firewall rules (UFW) may have v6 ALLOW rules for SSH and your application port. The result: your "v6-disabled" box has a public v6 address with v6 SSH and v6 :443 open to the world.

The fix is per-interface:

# Live (until reboot):
sudo sysctl -w net.ipv6.conf.ens5.disable_ipv6=1
sudo ip -6 route flush dev ens5

# Persist:
echo "net.ipv6.conf.ens5.disable_ipv6 = 1" | sudo tee -a /etc/sysctl.d/99-disable-ipv6.conf

# Belt-and-suspenders: drop all v6 inbound at the firewall
sudo ip6tables -P INPUT DROP
sudo ip6tables -P FORWARD DROP

After this, ip -6 addr show dev ens5 shows no inet6 addresses and ip -6 route show default is empty. The instance is genuinely v4-only.

This gotcha bites HARDEST when restoring from a snapshot. The snapshot contains your hardened sysctl config, but Lightsail's boot sequence still brings up ens5 with v6 before running sysctl. So a "hardened" snapshot restored to a "dual-stack" network type ends up v6-reachable until you re-apply the per-interface kill.
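A post-restore check can be automated instead of eyeballing `ip -6 addr`. The sketch below parses the kernel's /proc/net/if_inet6 table, which lists one address per line as 32 hex digits, interface index, prefix length, scope, flags, and interface name. The function names and the sample rows are illustrative; 2001:db8::/32 is the documentation prefix, not a real Lightsail address.

```python
import ipaddress
from pathlib import Path

SCOPES = {"00": "global", "10": "host", "20": "link", "40": "site"}


def v6_addresses(if_inet6_text, iface):
    """List (address, scope) pairs for iface from /proc/net/if_inet6 content."""
    found = []
    for line in if_inet6_text.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[5] != iface:
            continue
        address = ipaddress.IPv6Address(int(fields[0], 16))
        found.append((str(address), SCOPES.get(fields[3], fields[3])))
    return found


def live_v6_addresses(iface, path="/proc/net/if_inet6"):
    """Read the live table; a missing file means the IPv6 stack is not loaded."""
    table = Path(path)
    return v6_addresses(table.read_text(), iface) if table.exists() else []


def _row(addr, scope, iface):
    # Build one table row in the kernel's format (index and prefix are dummies).
    return f"{int(ipaddress.IPv6Address(addr)):032x} 02 40 {scope} 80 {iface}\n"


# Sample table for a dual-stack box that was NOT hardened per-interface.
SAMPLE = (
    _row("::1", "10", "lo")
    + _row("fe80::43a:a1ff:fe2b:3c4d", "20", "ens5")
    + _row("2001:db8::abc", "00", "ens5")
)
print(v6_addresses(SAMPLE, "ens5"))
```

On a box where the per-interface kill has taken effect, `live_v6_addresses("ens5")` should return an empty list; any entry with scope `global` means the instance is still reachable over v6.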

How to monitor correctly

The defensive architecture I ended up with — after the incident — has four layers:

Layer 1: AWS Budget with email alerts (low-cost, last line of defense)

aws budgets create-budget --account-id <YOUR_ACCT> --budget '{
  "BudgetName": "lightsail-monthly",
  "BudgetLimit": {"Amount": "25", "Unit": "USD"},
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST",
  "CostFilters": {"Service": ["Amazon Lightsail"]}
}' --notifications-with-subscribers '[{
  "Notification": {
    "NotificationType":"ACTUAL",
    "ComparisonOperator":"GREATER_THAN",
    "Threshold":80,"ThresholdType":"PERCENTAGE"
  },
  "Subscribers": [{"SubscriptionType":"EMAIL","Address":"<YOU>@example.com"}]
}]'

Two free budgets per account. Set thresholds at 80%, 100%, 120%. The 80% one fires before any real damage and gives you time to react.
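The CLI call above registers only the 80% notification. One way to get all three thresholds is to generate the notification list programmatically. The helper below is a sketch: `budget_notifications` is a made-up name, and the email address is a placeholder. Its output has the shape expected by the `NotificationsWithSubscribers` parameter of boto3's `budgets.create_budget`.

```python
import json


def budget_notifications(email, thresholds=(80, 100, 120)):
    """Build one ACTUAL-spend notification per percentage threshold."""
    return [
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": float(pct),
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
        }
        for pct in thresholds
    ]


notifications = budget_notifications("you@example.com")
print(json.dumps(notifications[0], indent=2))

# Not executed here (needs credentials); the budget dict mirrors the CLI example:
# boto3.client("budgets").create_budget(
#     AccountId="<YOUR_ACCT>",
#     Budget={...},
#     NotificationsWithSubscribers=notifications,
# )
```

With the $25 limit from the example, the three notifications correspond to $20, $25, and $30 of actual spend.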

Layer 2: Lightsail per-instance spike alarms (catches anomalies fast)

aws lightsail put-alarm \
  --region <region> \
  --alarm-name <instance>-networkout-spike \
  --metric-name NetworkOut \
  --monitored-resource-name <instance> \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --threshold 2147483648 \
  --evaluation-periods 6 --datapoints-to-alarm 6 \
  --contact-protocols Email \
  --notification-triggers ALARM

This fires if the instance sends ≥ 2 GiB outbound in each of 6 consecutive 5-minute windows (30 minutes sustained). Catches: compromised instance, runaway scraping job, badly-configured backup target. Alarms are configured per region. Free.

Note: the email contact has to be set up per region first (lightsail:CreateContactMethod) and verified before alarms can use it.
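It helps to know what the chosen threshold means in pool terms. The arithmetic below is a rough sketch under two stated assumptions: the 3 TB pool is treated as 3072 GiB, and outbound is about half of all traffic (a relay where IN ≈ OUT). Both helper names are invented for the example.

```python
GIB = 1024 ** 3


def threshold_bytes(gib_per_period):
    """Alarm threshold in bytes for a given outbound volume per period."""
    return int(gib_per_period * GIB)


def hours_to_burn(pool_gib, gib_out_per_period, period_minutes=5, out_share=0.5):
    """Hours until the pool is exhausted if outbound sits at the alarm threshold.

    out_share: assumed fraction of pool traffic that is outbound.
    """
    total_per_period = gib_out_per_period / out_share
    periods = pool_gib / total_per_period
    return periods * period_minutes / 60


print(threshold_bytes(2))         # value passed to --threshold
print(hours_to_burn(3072, 2))     # hours to drain a 3 TB pool at that rate
```

Under these assumptions, traffic that just barely trips the alarm would drain the whole monthly pool in under three days, which is why this layer is a spike detector and not a budget.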

Layer 3: On-instance bandwidth kill switch (hard ceiling)

The kill switch only sees per-instance counters, so it's not perfect for the cross-instance pool gotcha. But for the single-bundle-per-region case (the common case), it works. A small cron script polls the local NIC counter every 5 minutes and stops the VPN daemon if the month-to-date NIC output crosses a threshold:

#!/bin/bash
# /usr/local/bandwidth-guard.sh — runs every 5 min from /etc/cron.d/bandwidth-guard
QUOTA_GB="${1:-3000}"
THRESHOLD="${2:-0.95}"
STATE=/var/lib/iface-monthly.json
MONTH=$(date +%Y-%m)
# Keep the lock on persistent storage: /var/run is tmpfs and is wiped on reboot.
LOCK="/var/lib/bandwidth-guard.lock-${MONTH}"

[[ -f "$LOCK" ]] && exit 0
[[ -f "$STATE" ]] || exit 0

month_tx=$(python3 -c "import json; print(json.load(open('$STATE')).get('monthly_tx',0))")
tx_gb=$(awk -v b="$month_tx" 'BEGIN{printf "%.2f", b/1e9}')
threshold_gb=$(awk -v q="$QUOTA_GB" -v t="$THRESHOLD" 'BEGIN{printf "%.2f", q*t}')

if awk -v tx="$tx_gb" -v th="$threshold_gb" 'BEGIN{exit !(tx > th)}'; then
  systemctl stop xray
  systemctl disable xray
  touch "$LOCK"
  logger -t bandwidth-guard "Killed xray at ${tx_gb}GB / ${QUOTA_GB}GB quota"
fi

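Pair with a companion iface-monthly.json updater that reads /proc/net/dev every 5 min and accumulates the monthly transmit-bytes counter. Because the guard counts transmit bytes only, while the pool counts IN + OUT, set QUOTA_GB to the outbound share you expect (roughly half the pool for a relay where IN ≈ OUT) rather than the full pool size. The lock file prevents the script from oscillating: if you manually restart xray after the cap fires, the lock keeps the guard from re-killing it within the same month.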

This script does NOT survive the delete-and-recreate gotcha — the new instance has a zeroed iface-monthly.json and won't know the regional pool is already at 95%. For that, see Layer 4.
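The companion updater is not shown in this post, so here is a minimal sketch of what it could look like. Everything about it is an assumption except the state path, the `monthly_tx` key (both read by the guard script), and the ens5 interface name: the `month` and `last_tx` fields, the function names, and the reset handling are choices made for this example.

```python
import json
import time
from pathlib import Path

STATE = Path("/var/lib/iface-monthly.json")  # same file the guard script reads
IFACE = "ens5"


def read_tx_bytes(proc_net_dev_text, iface):
    """Return the transmit-bytes counter for iface from /proc/net/dev content."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            # After the colon: 8 receive columns, then 8 transmit columns.
            return int(rest.split()[8])
    raise KeyError(f"interface {iface!r} not found")


def accumulate(state, current_tx, month):
    """Fold the latest counter reading into the month-to-date total."""
    if state.get("month") != month:
        # First run or new month: start at zero and just remember the counter.
        return {"month": month, "monthly_tx": 0, "last_tx": current_tx}
    last = state.get("last_tx", 0)
    # The kernel counter restarts at zero on reboot; then the whole reading is new.
    delta = current_tx - last if current_tx >= last else current_tx
    return {"month": month, "monthly_tx": state["monthly_tx"] + delta,
            "last_tx": current_tx}


def main():
    state = json.loads(STATE.read_text()) if STATE.exists() else {}
    tx = read_tx_bytes(Path("/proc/net/dev").read_text(), IFACE)
    new_state = accumulate(state, tx, time.strftime("%Y-%m"))
    tmp = STATE.with_suffix(".tmp")
    tmp.write_text(json.dumps(new_state))
    tmp.replace(STATE)  # atomic rename: the guard never sees a half-written file
```

A cron entry would call `main()` every 5 minutes, just before the guard runs. The first reading of each month only sets the baseline, so up to one polling interval of traffic goes uncounted at rollover; for a hard ceiling with a 5% margin that error is negligible.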

Layer 4: Mac-side panel using Cost Explorer for pool truth

A control-station script that runs anywhere (your laptop, a small VPS, wherever you'd already check status) queries Cost Explorer once an hour and computes:

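import datetime
import boto3

REGION_PREFIX = {"ap-northeast-1": "APN1", "us-east-2": "USE2", "us-west-2": "USW2"}

def regional_pool_used(region_code, bundle_id):
    # Month-to-date window; Cost Explorer treats End as exclusive.
    month_start = datetime.date.today().replace(day=1)
    month_end = (month_start + datetime.timedelta(days=32)).replace(day=1)

    # Pull MTD usage-type breakdown from CE (the endpoint lives in us-east-1)
    ce = boto3.client("ce", region_name="us-east-1")
    ce_data = ce.get_cost_and_usage(
        TimePeriod={"Start": month_start.isoformat(), "End": month_end.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UsageQuantity"],
        Filter={"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Lightsail"]}},
        GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    )
    # Flatten to {usage_type: quantity}; quantities are assumed to be in GB.
    usage = {g["Keys"][0]: float(g["Metrics"]["UsageQuantity"]["Amount"])
             for period in ce_data["ResultsByTime"] for g in period["Groups"]}

    prefix = REGION_PREFIX[region_code]  # APN1, USE2, USW2, etc.
    in_gb = usage.get(f"{prefix}-TotalDataXfer-In-Bytes", 0.0)
    out_gb = usage.get(f"{prefix}-TotalDataXfer-Out-Bytes", 0.0)
    over_gb = usage.get(f"{prefix}-DataXfer-Out-Overage-Bytes", 0.0)

    # Per-instance allowance from lightsail:GetBundles (single-instance pool assumed;
    # multiply by the instance count when several instances share the pool).
    bundles = boto3.client("lightsail", region_name=region_code).get_bundles()["bundles"]
    pool_gb = next(b["transferPerMonthInGb"] for b in bundles if b["bundleId"] == bundle_id)
    used_gb = in_gb + out_gb + over_gb
    return used_gb, pool_gb, used_gb / pool_gb * 100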

Cache the CE response for 1 hour. Each query costs $0.01, so an hourly cache tops out at ~720 queries/month = $7.20, and only if you open the panel every hour around the clock. A daily cache is $0.30/month.

This is the layer that catches the delete-and-recreate gotcha: it sees the regional total regardless of instance lifecycle.

Bonus: AWS Cost Anomaly Detection

Free, ML-based. Enable it in the Cost Management console (one click, doesn't need additional IAM if you do it via console). It learns your baseline and emails when something deviates, so it may catch the next incident before the budget alarm does.

Pre-deployment checklist

Before every new Lightsail node — or every modification to an existing one — run through this checklist:

☐ Region + bundle inventory: which existing instances does this share a pool with?
   (If same region + same bundleId as an existing instance, you're pooling.)

☐ Calculate pool ceiling: sum of bundle.transferPerMonthInGb for that pool.
   For new bundle in new region: it's just this instance's bundle quota.

☐ Estimate monthly traffic: in_GB + out_GB combined.
   If your estimate is > 80% of the pool ceiling, size up (different bundle or different region).

☐ Set up alarms BEFORE the instance accumulates traffic:
   - AWS Budget at $X/mo (cost-based, blunt)
   - Lightsail spike alarm (catches sudden bursts)
   - On-instance bandwidth-guard script

☐ Verify networking:
   - sysctl: net.ipv6.conf.<iface>.disable_ipv6 = 1 (per-interface, persist in /etc/sysctl.d/)
   - ip -6 addr show dev <iface> returns no inet6 addresses
   - ip6tables -P INPUT DROP (defense in depth)
   - chrony or systemd-timesyncd active (TLS-fingerprint protocols need accurate time)

☐ If restoring from a snapshot to a region+bundle where an instance was deleted this month:
   - Confirm you accept that the pool is shared with the deleted-instance usage
   - OR wait until the 1st of next month for a clean start
   - OR use a DIFFERENT bundle/region for the restoration

☐ Document the egress: which Webshare / iproxy / direct exit does this instance use?
   - Update the routing table so traffic categorization stays accurate

☐ Subscription / client config update: make sure your VPN client knows about the new node

Closing

Final May damage was about $30 in Lightsail data transfer overage plus the bundle hours on the now-deleted Tokyo instance — call it $54 total across all 4 nodes. None of it was strictly unavoidable: every gotcha is documented somewhere on docs.aws.amazon.com. The actual lesson is that infrastructure pricing rules don't sit in one document, and the per-account quirks (CE lag, billing-page lag, snapshot-restore not re-applying sysctl) only show up when you hit them.

Build the layered defense first. Lightsail is cheap, but cheap doesn't mean cheap-to-misconfigure.
