
From the hunt desk. If your environment has compromised internal hosts coordinated by an operator, single-host detections will not see it. The bots look normal individually because they are normal individually. The signal lives in cross-host similarity. This post is the unsupervised-clustering playbook for MITRE ATT&CK T1584.005 (Compromise Infrastructure: Botnet) and TA0040 (Impact) coverage — including pre-DDoS staging detection, the canonical signal that distinguishes a planning operator from background noise.
If your environment has been compromised by a coordinated bot operator (IoT botnet, Emotet-style spambot, residential proxy network, or a pre-DDoS staging cluster), the individual bots will not look suspicious. Each one will produce traffic that falls cleanly inside the host’s own baseline. The attack is only visible when you stop looking at hosts in isolation and start asking: are any groups of hosts on my network behaving suspiciously alike?
That question — “show me hosts whose network behaviour has converged” — is exactly what unsupervised clustering algorithms answer. This playbook walks through a pipeline that builds a normalised behavioural fingerprint per host per hour, computes pairwise cosine similarity across all hosts, runs K-means and hierarchical (Ward-linkage) clustering, then validates the clusters by computing temporal cross-correlation of traffic time-series across cluster members. Clusters of three or more hosts that share external destinations and have cross-correlation above 0.8 with near-zero lag are botnet candidates.
This is post #4 of our five-post VPC Flow Log detection-engineering series. The companions are adaptive C2 beacon detection (FFT + DBSCAN), lateral movement graph detection, and low-and-slow data exfiltration with Isolation Forest + LSTM. Post #5 closes the series on living-off-the-land kill chains.
Why Per-Host Botnet Detection Fails
The standard botnet detection signature — “a host has been observed talking to a known C2 IP” — depends on someone, somewhere, having already burned the C2 IP onto a threat-intel feed. For mature operators, by the time the IP is on a feed, the infrastructure has rotated. For commodity botnets, the volume of C2 IPs is too large to maintain. And for novel campaigns, the IPs are never on a feed at all.
The structural insight is that botnet members behave like each other more than they behave like themselves. A normal host’s traffic looks like its own historical traffic — the developer workstation today resembles the developer workstation yesterday. A botnet member’s traffic looks like the other botnet members today — which is a different beast entirely from itself yesterday. Standard per-host baselines miss this; cross-host similarity does not.
Three patterns in particular justify clustering-based detection:
- Synchronised C2 callbacks. Botnets often poll a shared C2 at the same moment. Even with per-bot jitter, the cross-correlation of traffic time-series across members reveals the synchronisation.
- Shared external destinations. All members of a botnet eventually hit the same operator infrastructure. Jaccard similarity over destination sets across host pairs reveals the overlap, even when no single destination is on a threat-intel feed.
- Pre-DDoS staging. Before a DDoS attack, bots run DNS lookups and TCP probes against the target. Individual lookups are unremarkable; clusters of dozens of internal hosts running the same DNS queries within minutes are not.
Building the Per-Host Behavioural Fingerprint
For each srcAddr and each 1-hour window, we build a normalised feature vector. The 14 dimensions we use:
- Total outbound bytes
- Flow count
- Unique destination IP count
- Top-destination hash (a stable hash of the most-talked-to destination)
- Destination port distribution (vector across known service ports)
- Protocol ratio (TCP / UDP / ICMP / other)
- Average inter-arrival time
- Average packet size
- DNS query rate
- New-destination percentage
- HTTPS proportion of traffic
- HTTP proportion of traffic
- DNS proportion of traffic
- UDP proportion of traffic
Every feature is min-max normalised to [0, 1] so that no single feature dominates the similarity metric. Normalisation is per-feature, per-day — the bytes feature is rescaled against the daily population, not the host’s own history.
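A minimal sketch of the population-level min-max step, assuming the hourly features have already been loaded into a hosts × features NumPy array. The function name and the three example columns are illustrative, not part of the pipeline described above.

```python
import numpy as np

def min_max_normalise(features: np.ndarray) -> np.ndarray:
    """Rescale each column of a (hosts x features) matrix to [0, 1].

    Columns with no spread (max == min) are mapped to 0 to avoid
    dividing by zero.
    """
    features = np.asarray(features, dtype=float)
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (features - col_min) / safe_range

# Hypothetical raw values for three hosts: bytes_out, flow_count, unique_dsts
raw = np.array([
    [1_000.0, 10.0, 3.0],
    [5_000.0, 50.0, 3.0],
    [9_000.0, 30.0, 3.0],
])
normalised = min_max_normalise(raw)
```

Because the minimum and maximum come from the whole population, one extreme host can compress everyone else towards zero; clipping at a high percentile before rescaling is a common mitigation if that becomes a problem.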
Cosine Similarity and Pairwise Behavioural Comparison
Once each host has a 14-dimensional vector for a given hour, we compute the pairwise cosine similarity between every host pair in the population. The metric is:
cos(A, B) = (A · B) / (‖A‖ · ‖B‖)
Cosine similarity ranges from −1 (opposite directions) to +1 (identical direction); because every feature here is normalised to [0, 1], the vectors have no negative components and the score stays within [0, 1]. For high-dimensional behavioural fingerprints, the operationally interesting threshold sits around 0.85: host pairs with similarity above 0.85 in the same time window are operating in nearly identical behavioural states. In healthy environments, very few pairs cross this threshold; in a botnet-infected environment, you see dozens or hundreds.
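The pairwise comparison can be sketched in a few lines of NumPy. This is an illustration under assumptions: the helper names and the four-dimensional example vectors are hypothetical, and the 0.85 threshold is the starting point quoted above.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Full pairwise cosine similarity for a (hosts x features) matrix."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero hosts end up with similarity 0 to everyone
    unit = vectors / norms
    return unit @ unit.T

def similar_pairs(vectors: np.ndarray, threshold: float = 0.85):
    """Return (i, j, similarity) for every host pair above the threshold."""
    sim = cosine_similarity_matrix(vectors)
    i_idx, j_idx = np.triu_indices(len(sim), k=1)  # upper triangle, no self-pairs
    return [
        (int(i), int(j), float(sim[i, j]))
        for i, j in zip(i_idx, j_idx)
        if sim[i, j] > threshold
    ]

# Hypothetical normalised fingerprints: hosts 0 and 1 behave alike, host 2 does not
hosts = np.array([
    [0.9, 0.1, 0.8, 0.0],
    [0.8, 0.1, 0.9, 0.0],
    [0.0, 0.9, 0.1, 0.7],
])
pairs = similar_pairs(hosts)
```

Here only the pair (0, 1) crosses the threshold, with a similarity of roughly 0.993.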
For very large environments (10,000+ internal hosts), computing the full pairwise matrix is O(n²) and gets expensive. Two mitigations:
- Locality-sensitive hashing (LSH) approximates the nearest-neighbour search in sub-quadratic time. Worth the complexity at > 5,000 hosts.
- Sliding-window restriction — only compute similarities within hosts that have non-zero outbound activity in the same hour. The off-hours subset is small.
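For the LSH option, one well-known scheme for cosine similarity is random-hyperplane hashing: hosts whose vectors fall on the same side of a set of random hyperplanes share a bucket, and exact similarity is then computed only inside buckets. The sketch below is one possible shape for it; the function name, bit count, and table count are assumptions to tune, not values from this pipeline.

```python
from collections import defaultdict

import numpy as np

def lsh_candidate_pairs(vectors, n_bits=12, n_tables=8, seed=0):
    """Return host index pairs that share a bucket in at least one hash table."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    candidates = set()
    for _ in range(n_tables):
        # One random hyperplane per bit; the sign of the projection is the bit.
        planes = rng.standard_normal((vectors.shape[1], n_bits))
        bits = (vectors @ planes) > 0
        buckets = defaultdict(list)
        for host_idx, signature in enumerate(bits):
            buckets[signature.tobytes()].append(host_idx)
        # Only hosts in the same bucket become candidate pairs.
        for members in buckets.values():
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    candidates.add((members[a], members[b]))
    return candidates

# Hosts 0 and 1 point in the same direction, so they always share a bucket.
example = np.array([
    [0.2, 0.4, 0.6],
    [0.4, 0.8, 1.2],
    [0.9, 0.0, 0.1],
])
candidate_pairs = lsh_candidate_pairs(example)
```

More bits per table means fewer false candidates but more missed pairs; more tables recovers recall at the cost of extra hashing. The exact cosine check still runs on every candidate pair, so LSH only decides which pairs are worth scoring.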
K-Means and Hierarchical Clustering
From the behavioural vectors and the pairwise similarities, two clustering algorithms run in parallel:
K-means with k selected automatically via the silhouette score. The silhouette score for a point i is:
s(i) = (b(i) − a(i)) / max(a(i), b(i))
where a(i) = mean distance from i to other points in its cluster
b(i) = mean distance from i to points in the nearest other cluster
We sweep k from 2 to 30 and pick the k that maximises the average silhouette. Silhouettes above 0.7 indicate strong, well-separated clusters; values below 0.5 suggest the partition is weak. A botnet-infected environment typically produces one or more very high-silhouette clusters embedded in an otherwise low-silhouette population.
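The sweep can be sketched with scikit-learn's KMeans and silhouette_score. The helper name, the synthetic blobs, and the reduced k range in the demo are assumptions for illustration; real fingerprints will be far less tidy.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_kmeans(vectors, k_min=2, k_max=30, seed=0):
    """Sweep k and return (k, mean silhouette, labels) for the best partition."""
    vectors = np.asarray(vectors, dtype=float)
    k_max = min(k_max, len(vectors) - 1)  # silhouette needs fewer clusters than samples
    best = None
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(vectors)
        n_labels = len(set(labels))
        if n_labels < 2 or n_labels > len(vectors) - 1:
            continue  # silhouette is undefined for this partition
        score = silhouette_score(vectors, labels)
        if best is None or score > best[1]:
            best = (k, score, labels)
    return best

# Synthetic demo: three tight groups of ten hosts each in a 4-feature space
rng = np.random.default_rng(42)
centres = np.array([
    [0.1, 0.1, 0.1, 0.1],
    [0.9, 0.1, 0.9, 0.1],
    [0.1, 0.9, 0.1, 0.9],
])
demo = np.vstack([c + rng.normal(0.0, 0.01, size=(10, 4)) for c in centres])
best_k, best_score, best_labels = best_kmeans(demo, k_max=6)
```

The population-wide average is only half the picture: per-cluster silhouettes (sklearn.metrics.silhouette_samples grouped by label) are what surface a single tight cluster inside an otherwise diffuse population.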
Hierarchical clustering with Ward linkage builds a dendrogram showing which hosts merge into clusters first (most similar) and last (most distinct). The dendrogram is visual gold for analyst review — a tight branch of 8 hosts that fuses at very low distance is a strong botnet candidate even before any temporal validation.
The two algorithms catch different cases. K-means is fast, scales well, and produces clear cluster assignments. Hierarchical clustering captures nested structure (sub-clusters within larger clusters) and does not need k fixed in advance, although Ward linkage, like K-means, favours compact, roughly spherical clusters. Run both, take the intersection.
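A sketch of the hierarchical side using SciPy, plus one possible reading of "take the intersection": keep only groups of hosts that both algorithms place together. The helper names, the distance threshold, the example vectors, and the hard-coded K-means labels are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def ward_clusters(vectors, distance_threshold):
    """Cut a Ward-linkage dendrogram at the given merge distance."""
    tree = linkage(np.asarray(vectors, dtype=float), method="ward")
    return fcluster(tree, t=distance_threshold, criterion="distance")

def agreed_groups(labels_a, labels_b, min_size=3):
    """Groups of host indices that share a cluster under both labelings."""
    groups = {}
    for host_idx, key in enumerate(zip(labels_a, labels_b)):
        groups.setdefault(key, []).append(host_idx)
    return [members for members in groups.values() if len(members) >= min_size]

# Hosts 0-3 are near-identical; hosts 4-6 are unrelated to each other.
vectors = np.array([
    [0.80, 0.10, 0.90],
    [0.81, 0.11, 0.90],
    [0.80, 0.10, 0.89],
    [0.79, 0.10, 0.91],
    [0.10, 0.90, 0.20],
    [0.50, 0.50, 0.50],
    [0.95, 0.95, 0.05],
])
ward_labels = ward_clusters(vectors, distance_threshold=0.1)
kmeans_labels = [0, 0, 0, 0, 1, 1, 2]  # hypothetical output of the K-means stage
candidates = agreed_groups(kmeans_labels, ward_labels)
```

In this example K-means lumps hosts 4 and 5 together but Ward keeps them apart, so only the tight group of hosts 0 to 3 survives the intersection.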
Temporal Validation and Shared-Destination Analysis
A cluster of behaviourally similar hosts is necessary but not sufficient. The final validation steps are:
Temporal cross-correlation. For each pair of hosts in a cluster, compute the cross-correlation of their flow-rate time series:
corr(A, B) = Σ_t (flow_rate_A(t) · flow_rate_B(t))
/ √(Σ_t flow_rate_A(t)² · Σ_t flow_rate_B(t)²)
Cross-correlation above 0.8 with a near-zero lag (≤ 60 seconds) is strong evidence of synchronisation. Real bots receiving the same C2 command produce exactly this signature.
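The formula above is the zero-lag case. A small NumPy sketch that evaluates the same normalised sum at several lags and reports the best one is shown below; the function names and the 10-second bin width are assumptions, so choose whatever binning your flow-log aggregation interval supports.

```python
import numpy as np

def normalised_xcorr(a, b, lag=0):
    """Uncentred normalised correlation of a(t) with b(t + lag), lag in bins."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if lag > 0:
        a, b = a[:-lag], b[lag:]
    elif lag < 0:
        a, b = a[-lag:], b[:lag]
    denom = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0

def best_lag(a, b, max_lag):
    """Lag (in bins) with the highest correlation; ties prefer the smaller |lag|."""
    scores = {lag: normalised_xcorr(a, b, lag) for lag in range(-max_lag, max_lag + 1)}
    lag = max(scores, key=lambda candidate: (scores[candidate], -abs(candidate)))
    return lag, scores[lag]

# Hypothetical flow counts per 10-second bin; host B trails host A by one bin.
host_a = [0, 3, 0, 0, 7, 0, 1, 0, 0, 4, 0, 0]
host_b = [0, 0, 3, 0, 0, 7, 0, 1, 0, 0, 4, 0]
lag_bins, score = best_lag(host_a, host_b, max_lag=6)  # 6 bins = 60 seconds
```

A positive lag means the second host trails the first. With sparse, bursty series the zero-lag value alone can be 0 even for perfectly synchronised hosts that are offset by one bin, which is why the lag search matters.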
Shared external destinations (Jaccard similarity). The destination-set Jaccard similarity between host pairs is:
J(A, B) = |dst_set_A ∩ dst_set_B| / |dst_set_A ∪ dst_set_B|
A Jaccard score above 0.5 for three or more host pairs within the same cluster, combined with at least one shared external destination, fires the high-confidence botnet alert. The shared destination is extracted automatically and pushed to the threat-intel layer as an emerging C2 candidate.
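The Jaccard check needs nothing beyond Python sets. In this sketch the function names are illustrative and the addresses come from the documentation-reserved ranges; the 0.5 threshold and the three-pair rule are the ones stated above.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def high_overlap_pairs(dst_sets: dict, threshold: float = 0.5):
    """Return (host_a, host_b, score, shared destinations) for pairs above threshold."""
    hits = []
    for host_a, host_b in combinations(sorted(dst_sets), 2):
        score = jaccard(dst_sets[host_a], dst_sets[host_b])
        if score > threshold:
            hits.append((host_a, host_b, score, dst_sets[host_a] & dst_sets[host_b]))
    return hits

# Hypothetical external destination sets for one candidate cluster
cluster = {
    "10.0.1.5": {"203.0.113.10", "203.0.113.11", "198.51.100.7"},
    "10.0.2.9": {"203.0.113.10", "203.0.113.11", "198.51.100.7", "192.0.2.50"},
    "10.0.3.14": {"203.0.113.10", "203.0.113.11", "198.51.100.7", "192.0.2.99"},
}
hits = high_overlap_pairs(cluster)
fires = len(hits) >= 3
# Destinations common to every high-overlap pair are the emerging C2 candidates.
c2_candidates = set.intersection(*(shared for _, _, _, shared in hits)) if hits else set()
```

Popular SaaS and CDN destinations inflate Jaccard scores for unrelated hosts, so it may be worth removing destinations contacted by a large share of the population before computing the sets.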
Feature Engineering from VPC Flow Logs
| Feature | Source attributes | Formula | What it captures |
|---|---|---|---|
| Behaviour vector | all outbound attributes | 14-dim normalised vector per host / hour | Host traffic fingerprint |
| Destination overlap | srcAddr, dstAddr | \|dst_A ∩ dst_B\| / \|dst_A ∪ dst_B\| | Shared C2 infrastructure (Jaccard) |
| Temporal sync score | start per srcAddr | cross_correlation(flow_rate_A, flow_rate_B) | Synchronised bot activity |
| Port distribution vector | dstPort | [%p80, %p443, %p53, %p_other] per host | Service-access fingerprint |
| Volume anomaly | bytes per host | (bytes_current − μ) / σ per host | Individual host deviation |
| DNS burst correlation | dstPort = 53, start | correlation(dns_rate_A, dns_rate_B) | Pre-DDoS DNS staging detection |
| REJECT correlation | action = REJECT | correlation(reject_rate_A, reject_rate_B) | Synchronised scanning / probing |
Athena SQL — Host Behaviour Vectorisation
WITH host_hourly AS (
    -- Per-host, per-hour outbound features (internal source, external destination).
    -- Not referenced by the final SELECT below; select from it in a separate
    -- query to export the host vectors.
    SELECT srcaddr,
           DATE_FORMAT(from_unixtime(start), '%Y-%m-%d %H') AS hour,
           SUM(bytes) AS bytes_out,
           COUNT(*) AS flow_count,
           COUNT(DISTINCT dstaddr) AS unique_dsts,
           COUNT(DISTINCT dstport) AS unique_ports,
           AVG(bytes) AS avg_flow_bytes,
           SUM(CASE WHEN dstport = 443 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_https,
           SUM(CASE WHEN dstport = 53 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_dns,
           SUM(CASE WHEN dstport = 80 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_http,
           SUM(CASE WHEN protocol = 17 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_udp
    FROM central_vpc_flow_logs
    WHERE action = 'ACCEPT' AND srcaddr LIKE '10.%'
      -- Exclude RFC 1918 destinations only; a bare '172.%' would also drop public 172.x addresses.
      AND NOT regexp_like(dstaddr, '^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)')
      AND day BETWEEN '2026/03/19' AND '2026/03/23'
    GROUP BY srcaddr, DATE_FORMAT(from_unixtime(start), '%Y-%m-%d %H')
),
shared_dests AS (
    -- Host pairs that contacted the same external destinations in the same hour.
    SELECT a.srcaddr AS host_a, b.srcaddr AS host_b, a.hour,
           COUNT(DISTINCT a_dst) AS shared_destinations
    FROM (
        -- DISTINCT keeps one row per (host, hour, destination) so the join
        -- does not multiply per-flow rows.
        SELECT DISTINCT srcaddr,
               DATE_FORMAT(from_unixtime(start), '%Y-%m-%d %H') AS hour,
               dstaddr AS a_dst
        FROM central_vpc_flow_logs
        WHERE action = 'ACCEPT' AND srcaddr LIKE '10.%'
          AND NOT regexp_like(dstaddr, '^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)')
          AND day BETWEEN '2026/03/19' AND '2026/03/23'
    ) a
    JOIN (
        SELECT DISTINCT srcaddr,
               DATE_FORMAT(from_unixtime(start), '%Y-%m-%d %H') AS hour,
               dstaddr AS b_dst
        FROM central_vpc_flow_logs
        WHERE action = 'ACCEPT' AND srcaddr LIKE '10.%'
          AND NOT regexp_like(dstaddr, '^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)')
          AND day BETWEEN '2026/03/19' AND '2026/03/23'
    ) b
      -- a.srcaddr < b.srcaddr emits each unordered host pair once.
      ON a.hour = b.hour AND a.a_dst = b.b_dst AND a.srcaddr < b.srcaddr
    GROUP BY a.srcaddr, b.srcaddr, a.hour
    HAVING COUNT(DISTINCT a_dst) >= 5
)
SELECT * FROM shared_dests ORDER BY shared_destinations DESC;
Tuning notes:
- The HAVING COUNT(DISTINCT a_dst) >= 5 threshold sets the minimum destination overlap to qualify as suspicious. Five is a reasonable starting point; raise to 10 if false positives dominate.
- The self-join is O(n²) over destinations in the worst case. Partition by hour as shown; the per-hour subsets stay small.
- Output volume is typically a few thousand suspicious host pairs per day. The downstream ML layer narrows that to a handful of candidate clusters.
The Botnet Confidence Score
The final confidence score for a candidate cluster combines five signals:
Botnet Confidence = w₁ · jaccard_avg
+ w₂ · cosine_avg
+ w₃ · temporal_xcorr_avg
+ w₄ · cluster_silhouette
+ w₅ · shared_dst_count
initial weights: w₁=0.30, w₂=0.20, w₃=0.30, w₄=0.10, w₅=0.10
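Weights can be tuned with logistic regression once you have a few dozen labelled clusters from past incidents or red-team exercises. The thresholds assume every input is scaled to [0, 1]; shared_dst_count is a raw count, so cap and rescale it before weighting. Scores above 0.7 fire as confirmed-botnet alerts; scores between 0.5 and 0.7 go to the investigation queue.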
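As a worked example, the weighted sum and the two thresholds can be written out as below. The function names, the example signal values, and the cap of 10 used to bring shared_dst_count into [0, 1] are assumptions for illustration; the weights and thresholds are the initial values quoted above.

```python
WEIGHTS = {
    "jaccard_avg": 0.30,
    "cosine_avg": 0.20,
    "temporal_xcorr_avg": 0.30,
    "cluster_silhouette": 0.10,
    "shared_dst_count": 0.10,
}

def botnet_confidence(signals: dict, weights: dict = WEIGHTS, dst_count_cap: int = 10) -> float:
    """Weighted sum of the five signals, each clipped to [0, 1]."""
    scaled = dict(signals)
    # Assumption: the raw destination count is capped and divided by the cap.
    scaled["shared_dst_count"] = min(signals["shared_dst_count"], dst_count_cap) / dst_count_cap
    return sum(weights[name] * min(max(scaled[name], 0.0), 1.0) for name in weights)

def triage(score: float) -> str:
    """Map a confidence score to the routing described in the text."""
    if score > 0.7:
        return "confirmed-botnet alert"
    if score >= 0.5:
        return "investigation queue"
    return "no action"

# Hypothetical averages for one candidate cluster
example_signals = {
    "jaccard_avg": 0.75,
    "cosine_avg": 0.95,
    "temporal_xcorr_avg": 0.90,
    "cluster_silhouette": 0.80,
    "shared_dst_count": 5,
}
score = botnet_confidence(example_signals)
verdict = triage(score)
```

For these inputs the score works out to 0.225 + 0.19 + 0.27 + 0.08 + 0.05 = 0.815, which lands in the confirmed-botnet band.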
Putting It Into Production
- VPC Flow Logs → S3 (Parquet partitioned by day).
- EventBridge → daily Athena query at 03:30 local produces the hourly host vectors and the shared-destination pairs.
- SageMaker Processing job loads the vectors and runs scikit-learn KMeans (with silhouette sweep) and SciPy hierarchical clustering. The temporal cross-correlation runs in NumPy.
- Candidate clusters → SNS / Kinesis → SIEM, with the destination set, member hosts, and confidence score attached.
- Weekly re-baseline of the population-level normalisation parameters.
For 5,000+ host environments, the SageMaker job needs roughly 4 GB of RAM and 10–20 minutes of wall time. Costs are negligible on spot pricing.
Detection Coverage Matrix
| Botnet pattern | Signature | Detection |
|---|---|---|
| IoT botnet (Mirai-family) | High flow count to non-RFC1918 destinations, shared C2 IPs | Strong — large clusters, high Jaccard, easy |
| Spam botnet (Emotet-style) | SMTP / web requests to shared list-host infrastructure | Strong |
| Residential proxy malware | Outbound connections to shared proxy gateways | Strong — proxy gateways aggregate flows clearly |
| Pre-DDoS staging | Synchronised DNS lookups to target FQDN | Trivial — DNS burst correlation fires |
| Sophisticated APT botnet (3–5 hosts) | Low-volume coordinated behaviour | Partial — needs Jaccard threshold relaxed; high analyst overhead |
| Living-off-the-land coordination | Distributed scripted activity through legitimate tooling | Uncovered — see post #5 (Markov kill chains) |
| Single bot in isolation | Behaviour matches only itself | Uncovered — needs per-host detection (posts #1, #3) |
Limits and False-Positive Sources
- Identical workload classes — Kubernetes pods running the same image will produce nearly identical behavioural fingerprints. Cluster by host role or pod label before alerting.
- Backup / monitoring fleets — every node in a monitoring cluster hits the same endpoints simultaneously. Tag at source.
- CI / CD runners — GitHub Actions runners, Jenkins agents, your source-control platform runners all show shared destinations during a build. Allow-list by source.
- SaaS app clusters behind a shared egress NAT — multiple internal hosts appearing as the same source IP to the SaaS provider, with shared destinations on the way out. Source-IP transparency at the NAT layer fixes this.
- Container start-up surges — autoscaling events create temporary clusters of nearly identical hosts. Time-bound suppression for the first 10 minutes of any new container.
MITRE ATT&CK Techniques Covered by This Detection
This pipeline targets the Resource Development (TA0042) and Impact (TA0040) tactics on the offensive side, with cross-references into Command and Control (TA0011). Botnet detection is a multi-host hunt — single-host telemetry will never surface it. The table below maps techniques to coverage so you can plan complementary hunts for the gaps.
| ATT&CK ID | Technique / sub-technique | Coverage | Hunter notes |
|---|---|---|---|
| T1583.005 | Acquire Infrastructure: Botnet (operator side) | Out of scope | Adversary-side technique — surfaces as shared C2 in your environment |
| T1584.005 | Compromise Infrastructure: Botnet (your hosts become bots) | Full | This is the actual target of the detection — internal bots |
| T1498 | Network Denial of Service (parent) | Full | Pre-staging behaviour visible in clustering |
| T1498.001 | Direct Network Flood | Full | Synchronised SYN / UDP / ICMP bursts cluster trivially |
| T1498.002 | Reflection Amplification | Partial | Outbound spoof-source not in VPC Flow Logs; pair with provider-side telemetry |
| T1499 | Endpoint Denial of Service (parent) | Partial | — |
| T1499.001 | OS Exhaustion Flood | Partial | — |
| T1499.002 | Service Exhaustion Flood | Partial | — |
| T1499.004 | Application or System Exploitation | Partial | — |
| T1071.001 | App Layer Protocol: Web Protocols (shared C2) | Full | Jaccard overlap on web destinations is the canonical signal |
| T1071.004 | App Layer Protocol: DNS (pre-DDoS staging) | Full | DNS-burst correlation feature is purpose-built |
| T1090 | Proxy (parent) | Full | Residential proxy gateways aggregate clearly |
| T1090.001 | Internal Proxy | Partial | — |
| T1090.002 | External Proxy | Full | — |
| T1090.003 | Multi-hop Proxy | Full | — |
| T1090.004 | Domain Fronting | Partial | Domain-level signal needs DNS+TLS metadata |
| T1095 | Non-Application Layer Protocol | Partial | Raw-socket bot coordination — covered for ICMP and UDP variants |
| T1568 | Dynamic Resolution (parent) | Partial | — |
| T1568.001 | Fast Flux DNS | Partial | Shared rapidly-rotating destinations surface in Jaccard |
| T1505 | Server Software Component (webshell coordination) | Out of scope | Host-side; pair with file-integrity monitoring |
Adversary emulation / purple-team validation. Botnet emulation is harder than single-host emulation because it inherently needs multiple lab hosts. Two practical options: (1) Spin up 5–10 EC2 instances in a sandbox VPC and run a synchronised Python script that polls a common C2 endpoint every 30 seconds with jitter; the cosine similarity and Jaccard scores should peak within one hour. (2) Use an open-source adversary-emulation framework’s “command-and-control” adversary with multiple agents to emulate the synchronised callback pattern. For DDoS staging specifically, the public adversary-emulation atomics for T1498 provide safe building blocks.
Sigma / detection-as-code. The cluster-level alert is unusual because most Sigma rules are per-host. Your SIEM needs to support multi-event correlation (for example tstats-style aggregate searches, Elastic’s threshold rule, or Sentinel’s NRT detection). Once the pipeline emits a botnet_cluster_id field tagging member hosts, downstream correlation rules become straightforward.
D3FEND mappings. This pipeline implements D3-NTCD (Network Traffic Community Deviation) across hosts rather than across flows — same defensive technique, different unit of analysis. Pair with D3-HSDN (Host Shutdown) for active response when bot containment is needed.
Where This Sits in a Mature Threat Hunting Programme
- Adaptive C2 beacon detection (FFT + DBSCAN) — per-host periodicity.
- Lateral movement graph detection — internal pivot graph.
- Low-and-slow data exfiltration detection — exfiltration end of the chain.
- VPC Flow Log attack hunting.
- Cloud attack threat hunting.
- Outbound network threat hunting.
- Hunting AWS identity attacks.
- AWS Bedrock CloudTrail playbook.
- Authentication-event threat hunting.
Closing Thoughts
Botnet detection by cross-host similarity catches what every per-host rule misses, and the maths is mature enough that the entire pipeline can be built on free libraries. The hardest part is the false-positive curation in the first month; after that, the pipeline runs essentially unattended.
If you find a fresh botnet cluster in your environment using the techniques in this post, share the IOCs with the community — they help everyone. Happy threat hunting.










