Encrypted traffic classification with ML
How feature engineering, deep learning, dataset design, and concept drift shape machine-learning-based classification of encrypted traffic.
The previous modules established that DPI moves from payload pattern matching to metadata classification when encryption hides content. This module digs into the ML side of that move: how machine-learning classifiers actually learn to identify encrypted traffic, what features they consume, where they get their training data, what deep-learning architectures changed the game, and why the operational gap between benchmark accuracy and production deployment is large.
The thesis: ML-based traffic classification works, sometimes well, but treating it as a solved problem ignores the dataset, drift, and adversarial-evasion realities that production systems face every day. The right framing is detection engineering: a classifier is one operational tool that requires continuous data pipelines, evaluation harnesses, retraining cadence, and adversary-aware design — not a one-shot model that you train and deploy.
Prerequisites
- traffic-analysis-fundamentals — for the attack-side knowledge motivating classifier design.
- tls-fingerprinting-in-production — TLS fingerprints are common features in modern classifiers.
- os-and-tcpip-stack-fingerprinting — layered fingerprinting feeds into ML.
Learning objectives
- Describe the feature spaces ML classifiers use for encrypted traffic — flow statistics, sequence features, fingerprints — and what each captures.
- Explain why deep learning attracted attention for this problem and what architectures (CNN, LSTM, transformer) are commonly used.
- Identify the dataset and labeling problems that undermine published benchmark accuracy when classifiers move to production.
- Describe concept drift, evaluate why it's the central operational challenge, and identify mitigation patterns.
Feature spaces for encrypted classification
Encrypted traffic classifiers work from observable features at multiple layers. The major feature families:
Flow-statistical features. Per-flow aggregate measurements:
- Total bytes per direction
- Total packets per direction
- Mean / std / min / max / percentiles of packet size distribution
- Mean / std of inter-arrival times
- Flow duration
- Burst structure (bursts per flow, sizes, gaps)
- Direction ratios
These survive encryption — they don't care what the payload says, just how much data moved and when. Classical (pre-2018) classifiers relied heavily on hand-engineered flow-statistical features feeding random forests, SVMs, and gradient-boosting models.
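As a concrete illustration, here is a minimal sketch (plain Python; the packet-tuple layout and output field names are assumptions, not a standard) of computing a few of these aggregates for one flow:

    import statistics

    def flow_stats(packets):
        # packets: list of (timestamp, size, is_outbound) tuples for one flow,
        # sorted by timestamp. Layout is illustrative.
        sizes = [size for _, size, _ in packets]
        times = [ts for ts, _, _ in packets]
        gaps = [b - a for a, b in zip(times, times[1:])]
        return {
            "bytes_out": sum(s for _, s, out in packets if out),
            "bytes_in": sum(s for _, s, out in packets if not out),
            "pkt_count": len(packets),
            "duration": times[-1] - times[0],
            "size_mean": statistics.mean(sizes),
            "size_std": statistics.pstdev(sizes),
            "iat_mean": statistics.mean(gaps) if gaps else 0.0,
        }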
Sequence features. Per-packet sequences for the first N packets of a flow:
- Sequence of packet sizes (e.g., [+517, -1418, -1418, -803, +98, ...])
- Sequence of inter-arrival deltas
- Direction sequence
Sequence features feed into deep models that can learn temporal patterns. CNNs treat the sequence as a 1D signal; RNNs/LSTMs process sequentially; transformers attend across the sequence.
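A minimal sketch of turning the first N packets into fixed-length model inputs: signed sizes plus inter-arrival deltas, truncated or zero-padded to N. The (timestamp, size, outbound) tuple layout is an assumption carried over from the sketch above.

    def sequence_features(packets, n=100):
        # Encode the first n packets: signed size (+ = client->server)
        # and inter-arrival delta, zero-padded to fixed length n.
        sizes, deltas, prev_ts = [], [], None
        for ts, size, outbound in packets[:n]:
            sizes.append(size if outbound else -size)
            deltas.append(0.0 if prev_ts is None else ts - prev_ts)
            prev_ts = ts
        pad = n - len(sizes)
        return sizes + [0] * pad, deltas + [0.0] * pad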
Fingerprint features. Per-handshake metadata:
- TLS JA3/JA4 hash
- HTTP/2 SETTINGS values
- TCP option order (p0f-style)
- SNI when visible
- Server certificate metadata
Fingerprint features identify the client implementation; combined with flow features, they help distinguish "what kind of client doing what kind of activity."
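Categorical fingerprint features need encoding before they can sit next to numeric flow features in a model. One common option is the hashing trick; a minimal sketch, where the field names and bucket count are illustrative:

    import hashlib

    def hash_bucket(value, buckets=64):
        # Stable hash of a categorical value into a fixed bucket index.
        digest = hashlib.sha256(value.encode()).digest()
        return int.from_bytes(digest[:4], "big") % buckets

    def encode_fingerprints(flow, buckets=64):
        # One-hot-via-hashing over a few categorical fingerprint fields;
        # the resulting fixed-width vector can be concatenated with
        # numeric flow-statistical features.
        vec = [0.0] * buckets
        for field in ("ja4", "alpn", "tls_version"):
            vec[hash_bucket(f"{field}={flow.get(field, '')}", buckets)] = 1.0
        return vec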
Behavioral features. Cross-flow temporal patterns:
- Number of flows per source per time window
- Periodicity of flows
- Correlation between flows on different ports
- Connection retry patterns
Behavioral features catch things single-flow features miss: beaconing malware, scanning, DNS tunneling.
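As one concrete example of a behavioral signal, beaconing shows up as near-constant gaps between flow start times from a single source. A minimal sketch, with illustrative (untuned) thresholds and assuming the timestamps are already sorted:

    import statistics

    def looks_periodic(flow_start_times, max_cv=0.1, min_flows=5):
        # Near-constant gaps yield a low coefficient of variation
        # (stdev / mean). Thresholds here are illustrative, not tuned.
        if len(flow_start_times) < min_flows:
            return False
        gaps = [b - a for a, b in zip(flow_start_times, flow_start_times[1:])]
        mean_gap = statistics.mean(gaps)
        return mean_gap > 0 and statistics.pstdev(gaps) / mean_gap < max_cv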
The right feature set depends on the classification task. Application-class identification (web vs. video vs. VPN) lives mostly in flow-statistical features. Specific-app identification (Netflix vs. YouTube) needs more discrimination, often combining flow-statistical with fingerprint features. Tunnel-protocol identification (WireGuard vs. OpenVPN vs. naïveproxy) often depends heavily on handshake metadata. Bot detection looks at fingerprints plus behavioral.
Why deep learning got attention
Around 2018, papers like Lotfollahi et al.'s "Deep Packet" demonstrated that deep neural networks could classify encrypted traffic with substantially higher accuracy than classical methods, often without manual feature engineering. Deep Packet fed raw packet bytes into convolutional and autoencoder models; follow-on work fed sequences of packet sizes and inter-arrival times. In both cases the features were learned rather than hand-crafted.
The advantages over classical methods:
Less feature engineering. Hand-crafting features required expert knowledge of what carries signal. Deep models learn discriminative features automatically from labeled data. Reduces the engineering investment.
Better with high-dimensional inputs. Sequence features with N=100 packets give 200+ values per flow (size + timing). Classical models struggle with this dimensionality without aggressive feature reduction; deep models handle it natively.
Better at temporal patterns. RNNs and transformers explicitly model sequence; classical models that flattened sequences into bags-of-features lost temporal information.
Somewhat adversary-robust. Some deep models, especially those trained with adversarial data augmentation, are more resilient to small input perturbations than rule-based classifiers.
The architectures that became common:
1D CNNs. Treat the packet-size sequence as a 1D signal; convolutional filters learn local burst patterns; max-pooling aggregates. The Sirinam et al. "Deep Fingerprinting" paper (2018) used a 1D CNN to push Tor website fingerprinting well past classical attacks.
LSTMs / GRUs. Process the sequence packet-by-packet, maintaining hidden state. Good at sequential dependencies but slower to train.
Transformers. Self-attention over the sequence. State-of-the-art in many domains, increasingly used for traffic classification. Can attend to long-range dependencies that CNNs miss.
Hybrid models. Stack a CNN feature extractor with an LSTM or transformer head; combine flow-statistical features (fed to a separate dense network) with sequence features (CNN/RNN). Many production systems are hybrid.
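A minimal sketch of the 1D-CNN pattern in PyTorch, treating the signed packet-size sequence as a one-channel signal. The sequence length, class count, and normalization constant are illustrative assumptions:

    import torch
    import torch.nn as nn

    class PacketSizeCNN(nn.Module):
        def __init__(self, num_classes=5):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(1, 32, kernel_size=5, padding=2),  # local burst patterns
                nn.ReLU(),
                nn.MaxPool1d(2),
                nn.Conv1d(32, 64, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),                     # aggregate over time
            )
            self.classifier = nn.Linear(64, num_classes)

        def forward(self, x):
            # x: (batch, seq_len) of signed packet sizes, e.g. +517, -1418, ...
            x = x.unsqueeze(1).float() / 1500.0  # crude MTU-scale normalization
            return self.classifier(self.features(x).squeeze(-1))

    model = PacketSizeCNN()
    logits = model(torch.randint(-1500, 1500, (8, 100)))  # 8 flows, 100 packets each

A production hybrid would concatenate this feature extractor's output with a dense encoding of flow-statistical features before the final classifier layer.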
The DeepCorr paper (Nasr et al., 2018) demonstrated that deep models could perform end-to-end flow correlation against Tor with substantially higher accuracy than classical statistical correlators.
Datasets and the labeling problem
ML accuracy is bounded by data quality. For traffic classification, this is harder than it sounds.
Published datasets. Several public datasets exist: the ISCX (UNB) datasets for VPN and Tor classification, CIRA-CIC datasets (e.g., DoH traffic), CESNET datasets for TLS/QUIC traffic, and USTC datasets for malware traffic. Most are 2-10 years old and reflect older protocols.
Label quality. The classes you train on define what the model learns. "Tor traffic" is well-defined; "social media traffic" is fuzzy (Twitter? Facebook? TikTok? News with social-media APIs?). Ambiguous class definitions produce ambiguous classifiers.
Class balance. Real-world traffic distributions are wildly imbalanced. 90%+ of traffic is normal HTTPS; specific-application traffic might be 1% or 0.01%. Training on balanced datasets produces classifiers that overweight rare classes; deploying them on real traffic produces high false-positive rates.
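The arithmetic behind that false-positive problem is worth making explicit. A worked example with made-up numbers:

    # Illustrative base-rate arithmetic: even a seemingly strong classifier
    # yields mostly false positives on a rare class. Numbers are made up.
    tpr, fpr, prevalence = 0.99, 0.01, 0.001  # 0.1% of flows are the target class

    true_pos = tpr * prevalence
    false_pos = fpr * (1 - prevalence)
    precision = true_pos / (true_pos + false_pos)
    print(f"precision = {precision:.1%}")  # ~9%: ten false alarms per real hit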
Collection environment. Datasets collected in lab environments don't match real traffic. Stable connections, single-tab browsing, fresh caches, no background traffic — none of these match real users.
Drift over time. A dataset from 2020 has TLS 1.2 ClientHellos; a 2024 deployment has TLS 1.3 with ECH. Models trained on the old data may misclassify the new.
Synthetic vs. real labels. Some datasets generate "labels" by browsing the labeled site at known times; the labels are accurate. Others use heuristics (this IP belongs to Netflix, so traffic to it is video); the labels are noisy.
The right operational practice: collect your own dataset in the environment where the classifier will deploy, with labels validated by ground truth. Public datasets are useful for early exploration but rarely sufficient for production.
Deployment caveats
Benchmark accuracy reported in papers regularly fails to transfer to production. Common reasons:
Single-flow assumption. Papers often assume one flow per evaluation; production traffic has many simultaneous flows from each user. Classifiers trained on isolated flows misbehave in multi-flow contexts.
Stable network assumption. Lab environments have stable RTTs, no packet loss, predictable timing. Production traffic has variable conditions; flow-statistical features like inter-arrival times shift accordingly.
Closed-world evaluation. Papers often classify among a fixed set of known classes (closed world). Production sees the open world — most traffic is "none of the above." Open-world evaluation requires the classifier to reject unknown classes confidently, which is harder.
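The simplest open-world mitigation is rejecting low-confidence predictions, sketched below for an sklearn-style model exposing predict_proba and classes_. Softmax confidence is a weak detector of unknowns, so treat this as the mechanism, not a solution:

    UNKNOWN = "unknown"

    def open_world_predict(model, flow_features, min_confidence=0.9):
        # Reject to UNKNOWN when no class is predicted confidently.
        probs = model.predict_proba([flow_features])[0]
        best = max(range(len(probs)), key=lambda i: probs[i])
        if probs[best] < min_confidence:
            return UNKNOWN
        return model.classes_[best]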
Concept drift. As discussed, protocols and behaviors change over time. A classifier trained at time T degrades in accuracy by time T + 6 months unless retrained.
Adversarial evasion. Classifiers can be defeated by adversaries who deliberately shape their traffic to avoid the classifier's decision boundary. Deep models are vulnerable to specific perturbations.
The production reality: classifiers need continuous evaluation against fresh ground-truth data, retraining as needed, monitoring for drift, and adversary-aware design. Deploying a one-shot trained model and forgetting about it produces unmaintainable accuracy degradation.
Concept drift in practice
Common causes of concept drift:
- Browser/library updates. Each Chrome release potentially shifts JA4 fingerprints; flow patterns may shift if new features change request behavior.
- Protocol changes. HTTP/3 deployment moved a substantial fraction of traffic from TCP to UDP, breaking TCP-flow-based classifiers.
- Network changes. New CDNs, IXP changes, ISP routing all shift flow timing.
- Application changes. Sites add/remove features; their traffic patterns evolve.
- User behavior changes. New apps become popular; old ones decline.
- Adversarial drift. Tools designed to evade classifiers shift their patterns.
Mitigation patterns:
Continuous evaluation. Run the classifier against a periodically-refreshed labeled dataset. Track accuracy over time. When accuracy drops below threshold, trigger retraining.
Online learning. Some classifiers update their parameters incrementally as new labeled data arrives. Suitable for some architectures (linear models, tree ensembles); harder for deep models.
Periodic full retraining. Every quarter (or month, or week, depending on drift rate) retrain from scratch on the latest data.
Model versioning. Maintain multiple model versions; deploy new versions gradually with shadow traffic; roll back if production accuracy is worse than expected.
Feature monitoring. Track distributions of input features. Sudden distribution shifts may indicate adversarial behavior or genuine drift; both require investigation.
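The feature-monitoring pattern can start small: a two-sample test comparing a training-time reference distribution against recent production values. A minimal sketch using scipy, with an illustrative alert threshold:

    from scipy.stats import ks_2samp

    def drifted(reference_values, recent_values, p_threshold=0.01):
        # Two-sample Kolmogorov-Smirnov test: a small p-value means the
        # recent distribution differs significantly from the reference.
        stat, p_value = ks_2samp(reference_values, recent_values)
        return p_value < p_threshold

    # e.g. drifted(train_set["mean_packet_size_in"], last_24h["mean_packet_size_in"])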
Layered models. Multiple classifiers with different feature sets; if one drifts, others can carry the load while it's retrained.
Hands-on exercise
Feature table for a toy classifier.
Tools: notes. Runtime: 10 minutes.
Define features for a 3-class classifier (Web browsing, Video streaming, VPN tunnel) using only encrypted-traffic-visible signals:
Flow-statistical features:
- bytes_in_total
- bytes_out_total
- bytes_in_out_ratio
- packet_count
- duration_seconds
- mean_packet_size_in
- mean_packet_size_out
- std_packet_size
- mean_inter_arrival_in
- max_idle_gap
Fingerprint features:
- ja4_first10_chars (categorical)
- alpn_protocol (categorical: h2/h3/http1.1)
- tls_version
Behavioral features:
- flows_to_same_ip_in_window
- consecutive_handshakes_per_minute
Then for each class, predict feature values:
| Feature | Web browsing | Video stream | VPN tunnel |
|---|---|---|---|
| bytes_in_out_ratio | 5-50 | 100-1000 | ~1 |
| duration_seconds | 60-300 | 1800-7200 | 86400+ |
| mean_packet_size_in | mixed | near-MTU (1418) | uniform (encrypted pad) |
| max_idle_gap | several seconds | under 1 second | several seconds |
| ja4 starts with | t13d... (browser) | t13d... (browser) | varies (or absent) |
| flows_to_same_ip | many (one CDN) | 1-2 | 1 |
Identify which features alone discriminate which pairs of classes. Identify pairs that need combined features.
Train/validate/deploy pseudocode with drift checks.
    import time

    ACCURACY_THRESHOLD = 0.90      # illustrative operating thresholds
    IMPROVEMENT_THRESHOLD = 0.02

    def production_classifier_lifecycle():
        # Initial training on a fixed historical window.
        initial_data = load_labeled_traffic("2026-01-01..2026-04-01")
        model = train_model(initial_data)
        deploy(model, version="v1")

        # Continuous monitoring loop.
        while True:
            # Sample real production traffic.
            recent_traffic = sample_production(window="last 24 hours")

            # Evaluate against ground truth (where labels are available).
            accuracy = evaluate(model, get_labeled_subset(recent_traffic))
            log_metric("classifier.accuracy", accuracy)

            if accuracy < ACCURACY_THRESHOLD:
                alert("Classifier accuracy below threshold")

                # Trigger retraining on fresher data.
                new_labeled_data = collect_recent_labeled("last 90 days")
                new_model = train_model(new_labeled_data)

                # Shadow-deploy on a slice of traffic first.
                deploy(new_model, version="v2", traffic_percent=10)
                new_accuracy = evaluate_shadow(new_model)

                if new_accuracy > accuracy + IMPROVEMENT_THRESHOLD:
                    # Promote the new model to all traffic.
                    deploy(new_model, version="v2", traffic_percent=100)
                    model = new_model
                else:
                    # Keep the old model serving.
                    rollback(version="v1")

            time.sleep(24 * 60 * 60)  # re-check daily
The exercise: notice that the deployment loop is more complex than "train once, ship." Production ML requires evaluation, drift detection, retraining, shadow deployment, and rollback infrastructure.
Common misconceptions and traps
"Deep learning solves traffic classification." It improves on classical methods; it doesn't solve the deployment, drift, and adversarial issues. Deployment is harder than the model.
"Benchmark accuracy = production accuracy." Benchmarks under-estimate the deployment gap. A 95% benchmark classifier may produce 70% production accuracy due to dataset/environment mismatch.
"More features always help." More features can increase fragility, training-data needs, and drift sensitivity. The right feature set is "as simple as captures the signal."
"Train once, ship forever." Concept drift makes this impossible for traffic classification. Continuous evaluation and retraining are operational requirements.
"Adversarial evasion is theoretical." Real evasion tools exist (uTLS for TLS impersonation, traffic-shaping for flow obfuscation). Production classifiers need adversary-aware design.
Wrapping up
ML-based traffic classification combines flow-statistical, sequence, fingerprint, and behavioral features into models that can identify application classes, specific applications, tunnel protocols, or anomalous behaviors in encrypted traffic. Deep learning improved over classical methods by reducing manual feature engineering and capturing temporal patterns.
Production deployment is fundamentally a detection-engineering problem: dataset quality, label clarity, concept drift, adversarial evasion, and deployment infrastructure all matter as much as model architecture. Benchmark accuracy is a starting point; production accuracy requires continuous evaluation and retraining.
The next module (side-channels-in-encrypted-protocols — coming soon) covers a related but distinct topic: side channels — leakage through observable structure of encrypted protocols (TLS record lengths, compression oracles, QUIC behavior) — that complement what ML classifiers exploit.
Further reading
- Deep Packet: A Novel Approach for Encrypted Traffic Classification Using Deep Learning — Lotfollahi et al., 2019 — the canonical deep-learning treatment of this problem, discussed above.
- Encrypted Network Traffic Analysis and Classification with ML — Ibrahim et al., 2024 — current survey of methods and challenges.
- DeepCorr — Nasr et al., 2018 — deep-learning end-to-end correlation against Tor.
- Robust Encrypted Traffic Classification — recent papers — USENIX security venues regularly publish work on robustness and evasion.