Detecting Low-and-Slow Data Exfiltration with Isolation Forest + LSTM


From the hunt desk. The “outbound bytes > threshold” exfil rule catches noisy malware and zero serious operators. Real APTs leak 50 KB/hour disguised as HTTPS, spread across weeks. By the time your volumetric rule fires, if it ever does, the data is gone. This post is the detection-engineering replacement for that rule, covering MITRE ATT&CK TA0010 (Exfiltration): an Isolation Forest + LSTM autoencoder ensemble that catches both point-in-time and gradual-drift exfiltration. The full sub-technique coverage table, channel classification (DNS-tunnel, ICMP-tunnel, HTTPS-covert, staged-nightly), and purple-team validation paths are all below.

APT operators do not exfiltrate at 200 Mbps. They exfiltrate at 50 KB per hour, disguised as ordinary HTTPS traffic, spread across days or weeks. They use DNS tunnelling (data encoded into TXT-record queries), ICMP tunnelling (payloads stuffed into echo-request packets), HTTPS with steganographic payloads, or protocol-compliant traffic on port 443 that walks through every DPI rule a network security team has ever written. Any volumetric detection rule sitting in your SIEM today — “alert when host X uploads more than 5 GB in an hour” — was designed for a different decade.

This playbook walks through the detection pipeline that does see those exfiltration channels. It combines Isolation Forest (an unsupervised tree-based anomaly detector) with an LSTM autoencoder over rolling 7-day feature sequences. The Isolation Forest catches point-in-time anomalies that look unusual right now; the LSTM autoencoder catches gradual deviations — the slow ramp-up that defines successful APT exfiltration. Together they cover the full temporal range from sudden bursts to multi-week drift.

This is post #3 of our five-part VPC Flow Log detection-engineering series. The companion posts cover adaptive C2 beacons with FFT and lateral movement with graph analysis, with two more posts to come on botnet clustering and living-off-the-land kill chains.

Why Single-Dimension Exfiltration Detection Fails

The classic exfiltration rule looks for outbound byte spikes. It catches naïve attackers and noisy malware. It does not catch APTs because real APTs deliberately stay under whatever volumetric threshold you set. The 50-KB-per-hour rate is a real number, and the arithmetic matters: it adds up to about 1.2 MB per day, or roughly 42 MB over five weeks. Moving a one-million-row database (~2 GB) in five weeks takes about 2.4 MB per hour, which is still orders of magnitude below a gigabytes-per-hour threshold. From a volumetric perspective, either rate is indistinguishable from a software update, a sync client, or a busy user pulling email.

A single dimension is insufficient. But combined dimensions — byte volume, destination reputation, protocol mix, temporal pattern, packet-size distribution — produce a multi-dimensional signature that legitimate traffic almost never matches. The pipeline below builds that ten-dimensional feature vector per host per hour, baselines it, and then uses ML methods that thrive on high-dimensional anomaly detection.

Specifically, the four covert-channel patterns we want to surface:

DNS tunnelling — payloads encoded into long subdomain names or TXT-record queries. Signature: anomalously high dns_volume_ratio and elevated bytes_per_dns_packet.
ICMP tunnelling — data smuggled into echo-request payloads. Signature: protocol = 1 with high bytes per packet (legitimate ICMP is < 100 bytes; tunnels are 1,000+).
HTTPS covert channels — consistent small transfers to never-before-seen destinations. Signature: high new_dst_ratio + stereotyped flow sizes.
Staged nightly exfiltration — bytes pile up between 22:00 and 06:00 local time to the same destination. Signature: high night_traffic_ratio + low destination entropy.

The Ten-Dimensional Per-Host Traffic Profile

For each srcAddr we build a 24-hour rolling feature vector containing:

Total outbound bytes (24h)
Unique destination count (24h)
Destination entropy — Shannon entropy of the destination IP distribution. High entropy = many destinations; low entropy = repeat callers
Protocol ratio vector — proportions of TCP, UDP, ICMP, other
Average packet size
Flow rate — flows per minute
New-destination ratio — destinations not seen in the prior 14-day baseline
DNS volume ratio — bytes destined to port 53 divided by total bytes
Night-traffic ratio — bytes between 22:00 and 06:00 divided by total bytes
Bytes-per-destination standard deviation — uniformity of per-destination transfer sizes

Each feature is z-score-normalised against a per-host 14-day baseline before going into the models. Per-host baselines are critical — a developer workstation, a backup server, and a CI runner all have radically different normal profiles. Global baselines are useless.
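Why per-host baselines matter is easiest to see with numbers. The sketch below scores the same 50 KB hour against two hosts' own baselines. The byte counts and host addresses are made up for illustration, and the sample standard deviation is used on the assumption that it matches the STDDEV aggregate in the Athena query later in this post.

```python
from statistics import mean, stdev

def zscore_against_baseline(value, baseline_values, min_std=1e-9):
    """Z-score of one observation against a single host's own baseline samples."""
    mu = mean(baseline_values)
    sigma = stdev(baseline_values)  # sample standard deviation
    if sigma < min_std:
        # A flat baseline gives no scale to measure against; return 0 instead of dividing by ~0.
        return 0.0
    return (value - mu) / sigma

# Hypothetical hourly outbound-byte baselines (illustrative values only).
baselines = {
    "10.0.1.15": [1000, 1200, 900, 1100, 1000, 800],                              # workstation
    "10.0.9.40": [900_000, 1_100_000, 1_000_000, 950_000, 1_050_000, 1_000_000],  # backup server
}

observed_bytes = 50_000  # the same 50 KB hour, seen on both hosts
z_workstation = zscore_against_baseline(observed_bytes, baselines["10.0.1.15"])
z_backup = zscore_against_baseline(observed_bytes, baselines["10.0.9.40"])

print(f"workstation z = {z_workstation:.1f}, backup server z = {z_backup:.1f}")
```

The same hour is a massive positive outlier on the workstation and below normal on the backup server, which is the behaviour a single global baseline would blur away.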

Isolation Forest for Point-in-Time Anomaly Detection

Isolation Forest (Liu, Ting & Zhou, 2008) is the workhorse of point-in-time anomaly detection. It builds an ensemble of random decision trees on the feature vectors, splitting on random features at random thresholds. Anomalous points reach a leaf in fewer splits than normal points — they are “isolated” earlier — and the anomaly score is derived from the average path length across the ensemble.

Formally, the anomaly score for a point x is:

score(x) = 2 ^ ( −E(h(x)) / c(n) )

where  h(x)    = path length of x in a single tree
       E(h(x)) = average of h(x) across all trees in the ensemble
       c(n)    = average path length of an unsuccessful search in a binary search tree of n points

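Scores closer to 1 indicate stronger anomalies; scores well below 0.5 indicate normal points. The pipeline uses contamination = 0.01 (expecting roughly 1% of points to be anomalous) and trains on the prior 14 days of feature vectors. Note that scikit-learn flips the sign: score_samples returns the negated paper score, so more negative means more anomalous, and decision_function additionally subtracts a contamination-derived offset. Any new vector with a paper score above 0.5, which is a score_samples output below −0.5, goes into the investigation queue.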
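The score formula can be evaluated by hand to build intuition. This is a minimal sketch of the normalisation term and the score from the Isolation Forest paper, not scikit-learn's internal code. The sub-sample size of 256 and the example path lengths are illustrative assumptions.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, used to approximate harmonic numbers

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # H(n-1) is approximately ln(n-1) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """score(x) = 2 ** (-E(h(x)) / c(n)); values near 1 are anomalous, near 0.5 unremarkable."""
    return 2.0 ** (-mean_path_length / c(n))

n = 256  # illustrative sub-sample size used to build each tree
isolated_early = anomaly_score(4.0, n)   # point isolated after ~4 splits on average
isolated_late = anomaly_score(12.0, n)   # point that needs ~12 splits on average

print(f"c({n}) = {c(n):.3f}")
print(f"short path score = {isolated_early:.3f}, long path score = {isolated_late:.3f}")
```

A point whose average path length equals c(n) scores exactly 0.5, which is why 0.5 is the natural dividing line between "isolates faster than average" and "isolates slower than average".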

The strength of Isolation Forest is that it is unsupervised, fast, and explainable. The weakness is that it is fundamentally point-in-time — it does not see the temporal trajectory. That gap is exactly what the LSTM autoencoder fills.

LSTM Autoencoder for Gradual-Drift Detection

Long Short-Term Memory (LSTM) autoencoders learn to reconstruct a time series — in our case, a 7-day sequence of hourly 10-dimensional feature vectors per host. Trained on the baseline 14 days of behaviour, the autoencoder gets very good at reproducing normal trajectories and very bad at reproducing trajectories it has never seen.
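To make the input shape concrete, here is one way the 7-day sequences could be cut from a host's hourly feature series. The 24-hour stride and the placeholder feature values are assumptions for illustration; the post does not specify how windows are sampled.

```python
HOURS_PER_DAY = 24
WINDOW_HOURS = 7 * HOURS_PER_DAY  # 168 hourly steps per sequence

def sliding_windows(series, window=WINDOW_HOURS, step=HOURS_PER_DAY):
    """Return overlapping windows (lists of hourly feature vectors) from one host's series."""
    return [series[i:i + window] for i in range(0, len(series) - window + 1, step)]

# Placeholder series: 14 days of hourly 10-dimensional vectors for one host.
baseline_hours = 14 * HOURS_PER_DAY
series = [[float(hour)] * 10 for hour in range(baseline_hours)]

windows = sliding_windows(series)
print(len(windows), "windows of", len(windows[0]), "steps x", len(windows[0][0]), "features")
```

With a 14-day baseline and a daily stride, each host contributes only a handful of training sequences, so in practice windows are usually pooled across hosts or cut with a smaller stride.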

The reconstruction error is the anomaly signal:

L = (1 / T) · Σ |x_t − x̂_t|²    over 24-hour rolling window
alert: L > μ_baseline + 3·σ_baseline

Critically, this catches the shape of slow exfiltration. A host whose outbound bytes drift upward by 5% per day for two weeks produces a trajectory the autoencoder has not seen during training, and the reconstruction error climbs steadily across those two weeks. Isolation Forest is unlikely to fire on any single day of that drift because each day is only marginally above the one before, even though the compounded rate ends up at nearly double the baseline (1.05^14 ≈ 1.98). The autoencoder does fire, because the accumulated deviation bends the whole sequence into a shape it never saw in training.
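The error and alert rule above can be sketched in a few lines. The reconstructed sequence would come from the trained autoencoder, which is not shown here; the baseline error values and the toy sequences are placeholders.

```python
from statistics import mean, stdev

def reconstruction_error(actual, reconstructed):
    """Mean over time steps of the squared L2 distance between x_t and its reconstruction."""
    total = 0.0
    for x_t, xhat_t in zip(actual, reconstructed):
        total += sum((a - b) ** 2 for a, b in zip(x_t, xhat_t))
    return total / len(actual)

def alert_threshold(baseline_errors, k=3.0):
    """mu_baseline + k * sigma_baseline, computed from errors on clean baseline windows."""
    return mean(baseline_errors) + k * stdev(baseline_errors)

# Placeholder errors measured on baseline windows for one host.
baseline_errors = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.08]
threshold = alert_threshold(baseline_errors)

# Toy 2-step, 2-feature sequence and a poor reconstruction of it.
actual = [[0.0, 0.0], [1.0, 1.0]]
reconstructed = [[0.0, 0.0], [0.0, 0.0]]
error = reconstruction_error(actual, reconstructed)

is_alert = error > threshold
print(f"error={error:.3f} threshold={threshold:.3f} alert={is_alert}")
```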

The architecture we use is intentionally modest: a 2-layer LSTM encoder with hidden size 64, a bottleneck of 16 latent dimensions, and a symmetric 2-layer LSTM decoder. Trained with mean squared error loss over the 14-day baseline, then locked. Re-training happens weekly to follow legitimate drift.

Ensemble Scoring and Channel Classification

The final exfiltration-confidence score combines both models with rule-based features:

Exfil Confidence = 0.4 · IF_score
                + 0.3 · LSTM_error
                + 0.2 · z_bytes
                + 0.1 · new_dst_ratio
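The four terms live on different scales, so a weighted sum only makes sense once each is mapped to a common range. The sketch below uses the weights from the formula; the scaling choices (clamping to [0, 1], mapping the LSTM threshold to 0.5, capping the z-score at 6) are assumptions for illustration, not the pipeline's documented behaviour.

```python
WEIGHTS = {"if_score": 0.4, "lstm_error": 0.3, "z_bytes": 0.2, "new_dst_ratio": 0.1}

def clamp01(value):
    """Clamp a number into the [0, 1] range."""
    return max(0.0, min(1.0, value))

def exfil_confidence(if_score, lstm_error, lstm_threshold, z_bytes, new_dst_ratio, z_cap=6.0):
    """Weighted sum of the four signals after mapping each into [0, 1]."""
    components = {
        "if_score": clamp01(if_score),                              # paper-style score, already in [0, 1]
        "lstm_error": clamp01(lstm_error / (2.0 * lstm_threshold)), # assumed scaling: threshold maps to 0.5
        "z_bytes": clamp01(z_bytes / z_cap),                        # assumed cap: z >= 6 saturates at 1
        "new_dst_ratio": clamp01(new_dst_ratio),                    # a ratio, already in [0, 1]
    }
    return sum(WEIGHTS[name] * value for name, value in components.items())

score = exfil_confidence(
    if_score=0.8, lstm_error=0.30, lstm_threshold=0.15, z_bytes=3.0, new_dst_ratio=0.5
)
print(f"exfil confidence = {score:.2f}")
```

Because the weights sum to 1, the combined score also stays in [0, 1], which keeps the alerting threshold interpretable across hosts.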

Once a host crosses the alerting threshold, the pipeline classifies the likely channel based on which features dominate the score:

DNS tunnel: high dns_volume_ratio + elevated bytes_per_dns_packet.
ICMP tunnel: protocol = 1 traffic with payloads > 100 bytes.
HTTPS covert: high new_dst_ratio with stereotyped transfer sizes.
Staged exfil: bytes concentrated in 22:00–06:00 to a single destination.

Classification is helpful for analyst triage because each channel has a different response playbook. DNS tunnelling can be killed at the resolver. ICMP tunnelling can be killed at the firewall. HTTPS covert channels require destination-side analysis. Staged exfiltration requires inspection of the destination’s reputation and the data itself.
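The channel rules above translate into a small rule table. In this sketch only the 100-byte ICMP figure comes from the text; every other threshold, and the feature names icmp_bytes_per_packet and flow_size_cv (coefficient of variation as a stand-in for "stereotyped" sizes), are hypothetical and would need tuning per environment.

```python
def classify_channel(features):
    """Return every channel label whose signature matches the feature dict."""
    labels = []
    if features["dns_volume_ratio"] > 0.3 and features["bytes_per_dns_packet"] > 200:
        labels.append("DNS_TUNNEL")
    if features["icmp_bytes"] > 0 and features["icmp_bytes_per_packet"] > 100:
        labels.append("ICMP_TUNNEL")
    if features["new_dst_ratio"] > 0.5 and features["flow_size_cv"] < 0.1:
        labels.append("HTTPS_COVERT")
    if features["night_traffic_ratio"] > 0.8 and features["dst_entropy"] < 0.5:
        labels.append("STAGED_NIGHTLY")
    return labels or ["UNCLASSIFIED"]

# Hypothetical host-hour that looks like a DNS tunnel.
example = {
    "dns_volume_ratio": 0.62, "bytes_per_dns_packet": 410.0,
    "icmp_bytes": 0, "icmp_bytes_per_packet": 0.0,
    "new_dst_ratio": 0.05, "flow_size_cv": 0.9,
    "night_traffic_ratio": 0.2, "dst_entropy": 3.1,
}
print(classify_channel(example))
```

Returning a list rather than a single label lets one alert carry two channels when the signatures overlap, for example a nightly staged transfer to a never-before-seen destination.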

Feature Engineering from VPC Flow Logs

Feature	Source attributes	Formula	What it captures
Destination entropy	dstAddr	−Σ p(dst) · log₂(p(dst))	Diversity of external targets
New destination ratio	dstAddr, start	new_dsts_today / total_dsts_today	Never-before-seen targets
Protocol distribution	protocol, dstPort	[%TCP, %UDP, %ICMP, %other]	Protocol shift detection
Bytes-per-packet histogram	bytes, packets	histogram(bytes / packets, 10 bins)	Tunnel / covert-channel signature
Night traffic ratio	start, bytes	bytes_22h–06h / bytes_total	Off-hours exfil detection
DNS volume anomaly	dstPort = 53, bytes	dns_bytes_today / avg_dns_bytes_14d	DNS tunnelling indicator
Cumulative drift	bytes, start	Σ (daily_bytes) rolling 7 days	Gradual exfil ramp detection
Flow symmetry	bytes (both directions)	bytes_out / bytes_in per destination	Exfil = high asymmetry
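Three of the formulas in the table are short enough to show directly. The sketch below assumes flow records have already been parsed into dicts with dstaddr, start (epoch seconds), and bytes; it weights destination entropy by flow count and uses UTC for the night window, both of which are illustrative choices. The addresses are documentation-range placeholders.

```python
import math
from collections import Counter
from datetime import datetime, timezone

def destination_entropy(flows):
    """Shannon entropy (bits) of the destination distribution, weighted by flow count."""
    counts = Counter(f["dstaddr"] for f in flows)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def new_destination_ratio(flows, baseline_dsts):
    """Share of today's distinct destinations that are absent from the baseline set."""
    today = {f["dstaddr"] for f in flows}
    return len(today - baseline_dsts) / len(today) if today else 0.0

def night_traffic_ratio(flows, tz=timezone.utc):
    """Share of bytes sent between 22:00 and 06:00 in the given timezone."""
    def is_night(epoch_seconds):
        hour = datetime.fromtimestamp(epoch_seconds, tz).hour
        return hour >= 22 or hour < 6
    total = sum(f["bytes"] for f in flows)
    night = sum(f["bytes"] for f in flows if is_night(f["start"]))
    return night / total if total else 0.0

def ts(hour):
    """Epoch seconds for the given hour on an arbitrary example day (UTC)."""
    return int(datetime(2026, 3, 19, hour, tzinfo=timezone.utc).timestamp())

flows = [
    {"dstaddr": "203.0.113.10", "start": ts(23), "bytes": 3000},
    {"dstaddr": "203.0.113.10", "start": ts(2), "bytes": 1000},
    {"dstaddr": "198.51.100.7", "start": ts(12), "bytes": 2000},
    {"dstaddr": "198.51.100.7", "start": ts(15), "bytes": 2000},
]
baseline_dsts = {"198.51.100.7"}

print(destination_entropy(flows), new_destination_ratio(flows, baseline_dsts), night_traffic_ratio(flows))
```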

Athena SQL — Multi-Dimensional Host Profiling

The query builds the hourly feature vectors and computes per-host z-scores against the prior 14-day baseline in a single Athena run:

WITH hourly_profile AS (
    SELECT srcaddr,
           DATE_FORMAT(from_unixtime(start), '%Y-%m-%d %H') AS hour,
           SUM(bytes)                                    AS bytes_out,
           COUNT(DISTINCT dstaddr)                       AS unique_dsts,
           COUNT(*)                                      AS flow_count,
           AVG(bytes)                                    AS avg_bytes_per_flow,
           AVG(CAST(bytes AS DOUBLE) / NULLIF(CAST(packets AS DOUBLE), 0)) AS avg_pkt_size,
           SUM(CASE WHEN dstport = 53 THEN bytes ELSE 0 END)              AS dns_bytes,
           SUM(CASE WHEN protocol = 1 THEN bytes ELSE 0 END)              AS icmp_bytes,
           SUM(CASE WHEN dstport IN (443, 80) THEN bytes ELSE 0 END)      AS web_bytes,
           COUNT(DISTINCT dstport)                       AS port_diversity
    FROM central_vpc_flow_logs
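    WHERE action = 'ACCEPT' AND srcaddr LIKE '10.%'
      -- keep only external destinations; in 172.0.0.0/8 only 172.16.0.0/12 is private
      AND dstaddr NOT LIKE '10.%'
      AND NOT regexp_like(dstaddr, '^172\.(1[6-9]|2[0-9]|3[01])\.')
      AND dstaddr NOT LIKE '192.168.%'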
      AND day BETWEEN '2026/03/05' AND '2026/03/23'  -- 03/05 to 03/18 is the 14-day baseline; 03/19 onward is scored
    GROUP BY srcaddr, DATE_FORMAT(from_unixtime(start), '%Y-%m-%d %H')
),
baseline AS (
    SELECT srcaddr,
           AVG(bytes_out)    AS avg_bytes,   STDDEV(bytes_out)   AS std_bytes,
           AVG(unique_dsts)  AS avg_dsts,    STDDEV(unique_dsts) AS std_dsts,
           AVG(dns_bytes)    AS avg_dns,     STDDEV(dns_bytes)   AS std_dns,
           AVG(flow_count)   AS avg_flows
    FROM hourly_profile
    WHERE hour < '2026-03-19'
    GROUP BY srcaddr
)
SELECT h.srcaddr, h.hour,
       (h.bytes_out  - b.avg_bytes) / NULLIF(b.std_bytes, 0) AS z_bytes,
       (h.unique_dsts - b.avg_dsts) / NULLIF(b.std_dsts, 0)  AS z_dsts,
       (h.dns_bytes  - b.avg_dns)  / NULLIF(b.std_dns, 0)    AS z_dns,
       h.icmp_bytes, h.avg_pkt_size, h.port_diversity,
       h.dns_bytes * 1.0 / NULLIF(h.bytes_out, 0)            AS dns_ratio
FROM hourly_profile h
JOIN baseline b ON h.srcaddr = b.srcaddr
WHERE h.hour >= '2026-03-19'
  -- candidate filter: this SELECT does not aggregate, so the conditions belong in WHERE, not HAVING
  AND (   (h.bytes_out - b.avg_bytes) / NULLIF(b.std_bytes, 0) > 2
       OR (h.dns_bytes - b.avg_dns)  / NULLIF(b.std_dns, 0)  > 3
       OR h.icmp_bytes > 100000)
ORDER BY z_bytes DESC;

Tuning notes:

The hour < '2026-03-19' cutoff defines the baseline window. Roll it forward daily.
The three OR-ed conditions at the end of the query are the candidate filter: a row passes if its byte z-score is above 2, its DNS z-score is above 3, or its ICMP volume exceeds 100 KB in the hour. Anything that passes goes downstream to the ML layer.
Output volume is small — typically 50–500 candidate (host, hour) pairs per day. The ML layer evaluates each one in milliseconds.
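If candidates are post-filtered in Python, the internal-versus-external split can be checked with the standard library instead of string prefixes. This is a small sketch using the three RFC 1918 ranges; note that only 172.16.0.0/12 is private, so a plain "172." prefix match would also discard public destinations.

```python
import ipaddress

RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def is_internal(addr):
    """True if addr falls inside one of the RFC 1918 private ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in network for network in RFC1918_NETWORKS)

for addr in ("10.1.2.3", "172.16.5.4", "172.32.0.1", "192.168.1.9", "8.8.8.8"):
    print(addr, "internal" if is_internal(addr) else "external")
```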

Putting It Into Production

The full pipeline runs on commodity AWS infrastructure:

VPC Flow Logs → S3 (Parquet partitioned by day) — already in place if you have implemented our VPC Flow Logs hunting primer.
EventBridge → daily Athena query at 04:00 local time.
Lambda or SageMaker job loads the candidate vectors and scores them through both models. Isolation Forest runs in milliseconds; LSTM autoencoder inference takes 5–50 ms per host.
Combined scores → SNS / Kinesis → SIEM, with the channel classification and supporting features attached.
Weekly retraining of both models on the most recent 14 days of clean baseline data.

The Python is short. The Isolation Forest part is six lines with scikit-learn. The LSTM autoencoder is roughly 80 lines of PyTorch or TensorFlow. Existing open-source implementations work fine — there is no need to reinvent the wheel.

Detection Coverage Matrix

Exfiltration technique	Primary signal	Detection
Volumetric burst (> 1 GB / hour)	z_bytes spike	Trivial — both models fire instantly
Steady leak (50 KB / hour, 24×7)	LSTM reconstruction drift	Caught after 3–5 days of accumulated error
DNS tunnelling (iodine, dnscat2)	dns_volume_ratio + avg_pkt_size	Strong detection — DNS legitimately doesn’t carry kilobytes
ICMP tunnelling (ptunnel)	protocol=1 + bytes > 100	Trivial — legitimate ICMP is tiny
HTTPS covert (new destination)	new_dst_ratio + IF score	Strong, depending on the noise of new SaaS traffic
HTTPS covert (existing destination)	flow symmetry + LSTM	Moderate — requires longer observation
Cloud-storage exfil to attacker-owned S3	new_dst_ratio + flow_count + bytes	Strong — assuming destination ASN allow-list is maintained
Steganographic exfil over image-hosting CDN	flow symmetry + new_dst_ratio	Partial — needs reputation enrichment
Email-based exfil (SMTP / IMAP relay)	port-specific filter	Out of scope — use email security stack

Limits and False-Positive Sources

Cloud backup agents produce exactly the high-byte / single-destination signature we alert on. Maintain a destination allow-list.
Large software updates (Windows Update, macOS Software Update, package mirrors) spike bytes to a small set of well-known endpoints — allow-list by FQDN or ASN.
Video conferencing (Zoom, Teams, Meet, Webex) produces sustained high-volume traffic to provider CDNs. Allow-list and time-bound by working hours.
CDN log shipping (CloudFront → S3, application logs → SaaS observability) creates legitimate high-volume periodic transfers. Tag at source.
Genuinely new SaaS adoption creates new-destination signals until the baseline catches up. Use a service-onboarding notification channel to pre-suppress.

The cleanest operational pattern is a layered allow-list of approved enterprise destinations (corporate SaaS endpoints, approved CDN ASNs, backup providers), supplemented with role-based suppression for newly approved services during their first 14 days.
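That layered pattern could look like the sketch below: a standing allow-list plus a time-boxed grace period for newly approved services. The hostnames, the onboarding date, and the data structures are hypothetical, and the code assumes destinations have already been enriched from IP to FQDN upstream, since flow logs carry only addresses.

```python
from datetime import date, timedelta

# Hypothetical standing allow-list of approved enterprise destinations.
ALLOWED_DESTINATIONS = {"backup.example.com", "updates.example.net"}

# Hypothetical onboarding register: service FQDN -> approval date.
ONBOARDED_SERVICES = {"newsaas.example.org": date(2026, 3, 10)}

GRACE_PERIOD = timedelta(days=14)  # suppression window for newly approved services

def should_suppress(destination, alert_date):
    """True if the alert targets an allow-listed or recently onboarded destination."""
    if destination in ALLOWED_DESTINATIONS:
        return True
    onboarded = ONBOARDED_SERVICES.get(destination)
    if onboarded is None:
        return False
    return onboarded <= alert_date < onboarded + GRACE_PERIOD

print(should_suppress("newsaas.example.org", date(2026, 3, 20)))
print(should_suppress("newsaas.example.org", date(2026, 3, 24)))
```

Once the grace period ends, the service either earns a place on the standing allow-list or starts alerting again, so suppressions cannot silently become permanent.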

MITRE ATT&CK Techniques Covered by This Detection

This pipeline maps to the Exfiltration (TA0010) tactic — the back end of every full intrusion. The Isolation Forest catches point-in-time anomalies; the LSTM autoencoder catches the slow drift. Together they cover the full temporal range of how operators get data out. The table is your hunt-coverage worksheet.

ATT&CK ID	Technique / sub-technique	Coverage	Hunter notes
T1041	Exfiltration Over C2 Channel	Full	Volume spike on the C2 destination — Isolation Forest fires immediately
T1048	Exfiltration Over Alternative Protocol (parent)	Full	Protocol-distribution shift is a primary feature
T1048.001	Over Symmetric Encrypted Non-C2	Full	HTTPS to new destination + flow asymmetry
T1048.002	Over Asymmetric Encrypted Non-C2	Full	—
T1048.003	Over Unencrypted Non-C2 (incl. DNS)	Full	DNS-tunnel signature is the cleanest detection in the pipeline
T1567	Exfiltration Over Web Service (parent)	Full	—
T1567.001	Exfiltration to Code Repository	Partial	GitHub.com and similar source-control platforms look legitimate; needs destination reputation enrichment
T1567.002	Exfiltration to Cloud Storage	Full	S3 / Azure Blob / GCS new-destination signal is strong
T1567.003	Exfiltration to Text Storage Sites (paste.ee, etc.)	Partial	Specific FQDN allow-lists make this trivial; without them, harder
T1029	Scheduled Transfer (nightly cron exfil)	Full	night_traffic_ratio feature is purpose-built for this
T1030	Data Transfer Size Limits (chunked)	Full	Cumulative drift feature catches the slow ramp
T1011	Exfiltration Over Other Network Medium	Out of scope	Bluetooth / cellular traffic never crosses the VPC, so flow logs cannot see it
T1011.001	Exfiltration Over Bluetooth	Out of scope	Endpoint detection only
T1052	Exfiltration Over Physical Medium	Out of scope	Endpoint / DLP only
T1071.004	Application Layer Protocol: DNS	Full	Same DNS volume signature as T1048.003
T1132	Data Encoding (base64, hex, custom)	Partial	Encoding doesn’t change byte volume — caught indirectly
T1140	Deobfuscate/Decode Files or Information	Out of scope	Host-side; pair with EDR
T1020	Automated Exfiltration	Full	Bot-driven exfil shows on per-host LSTM error

Adversary emulation / purple-team validation. Run public adversary-emulation atomics T1041 (Exfiltration Over C2) and T1048.003 (DNS exfil via iodine or dnscat2) against a lab host. For the slow-drift variant — the LSTM is the only model that catches this — emulate gradual exfil by running a Python script that uploads 500 KB to an external endpoint every hour for seven days. The Isolation Forest stays quiet; the LSTM reconstruction error climbs linearly. That’s exactly the test that distinguishes mature pipelines from naïve volumetric rules.

Sigma / detection-as-code. Once your pipeline emits a structured exfil_confidence field, the SIEM Sigma rule is trivial. Tag every alert with x-mitre-tactic-id = TA0010 for clean dashboarding, and route channel-classification labels (DNS_TUNNEL, ICMP_TUNNEL, HTTPS_COVERT, STAGED_NIGHTLY) into the alert metadata so analysts skip the “what is this?” step.
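As a sketch of what that structured output might look like, the function below builds one alert record. The exfil_confidence and x-mitre-tactic-id fields and the four channel labels come from the text; the remaining field names and the example values are assumptions.

```python
import json

CHANNEL_LABELS = {"DNS_TUNNEL", "ICMP_TUNNEL", "HTTPS_COVERT", "STAGED_NIGHTLY"}

def build_alert(srcaddr, hour, exfil_confidence, channel):
    """Build a SIEM-ready alert dict tagged with the Exfiltration tactic."""
    if channel not in CHANNEL_LABELS:
        raise ValueError(f"unknown channel label: {channel}")
    return {
        "srcaddr": srcaddr,                            # hypothetical field name
        "hour": hour,                                  # hypothetical field name
        "exfil_confidence": round(exfil_confidence, 3),
        "channel": channel,                            # hypothetical field name
        "x-mitre-tactic-id": "TA0010",
    }

alert = build_alert("10.0.1.15", "2026-03-19 03", 0.7712, "DNS_TUNNEL")
print(json.dumps(alert, indent=2))
```

Validating the channel label at build time keeps free-text values out of the SIEM, so dashboards and Sigma conditions can match on an exact set of strings.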

D3FEND mappings. The pipeline implements D3-OTF (Outbound Traffic Filtering) at the detection layer and D3-NTA as the parent.

Where This Sits in a Mature Threat Hunting Programme

Exfiltration detection is the back end of the kill chain — by the time it fires, the adversary already has data. Pair it with the upstream detections in the same series:

Adaptive C2 beacon detection (FFT + DBSCAN) — initial access / command channel.
Lateral movement graph detection — internal pivot.
VPC Flow Log attack hunting.
Outbound network threat hunting.
Cloud attack threat hunting.
Hunting AWS identity attacks.
AWS Bedrock CloudTrail playbook.
Authentication-event threat hunting.

Closing Thoughts

Low-and-slow exfiltration is the hardest network detection problem in the modern enterprise. A two-model ensemble — Isolation Forest for the instant, LSTM autoencoder for the slow drift — covers both extremes with telemetry you already collect. The investment is a few engineer-weeks for the first deployment, and the marginal cost of running it daily is negligible.

Tune to your environment. Maintain the allow-list. Re-baseline weekly. If you have an exfiltration story we should cover in a follow-up, get in touch via the contact page. Happy threat hunting.
