Hunting Botnet Coordination and DDoS Staging with Clustering

From the hunt desk. If your environment has compromised internal hosts coordinated by an operator, single-host detections will not see it. The bots look normal individually because they are normal individually. The signal lives in cross-host similarity. This post is the unsupervised-clustering playbook for MITRE ATT&CK T1584.005 (Compromise Infrastructure: Botnet) and TA0040 (Impact) coverage — including pre-DDoS staging detection, the canonical signal that distinguishes a planning operator from background noise.

If your environment has been infected by a coordinated bot operator (IoT botnet, Emotet-style spambot, residential proxy network, or a pre-DDoS staging cluster), the individual bots will not look suspicious. Each one will produce traffic that falls cleanly inside the host’s own baseline. The attack is only visible when you stop looking at hosts in isolation and start asking: are any groups of hosts on my network behaving suspiciously like each other?

That question — “show me hosts whose network behaviour has converged” — is exactly what unsupervised clustering algorithms answer. This playbook walks through a pipeline that builds a normalised behavioural fingerprint per host per hour, computes pairwise cosine similarity across all hosts, runs K-means and hierarchical (Ward-linkage) clustering, then validates the clusters by computing temporal cross-correlation of traffic time-series across cluster members. Clusters of three or more hosts that share external destinations and have cross-correlation above 0.8 with near-zero lag are botnet candidates.

This is post #4 of our five-post VPC Flow Log detection-engineering series. The companions are adaptive C2 beacon detection (FFT + DBSCAN), lateral movement graph detection, and low-and-slow data exfiltration with Isolation Forest + LSTM. Post #5 closes the series on living-off-the-land kill chains.

Why Per-Host Botnet Detection Fails

The standard botnet detection signature — “a host has been observed talking to a known C2 IP” — depends on someone, somewhere, having already burned the C2 IP onto a threat-intel feed. For mature operators, by the time the IP is on a feed, the infrastructure has rotated. For commodity botnets, the volume of C2 IPs is too large to maintain. And for novel campaigns, the IPs are never on a feed at all.

The structural insight is that botnet members behave like each other more than they behave like themselves. A normal host’s traffic looks like its own historical traffic — the developer workstation today resembles the developer workstation yesterday. A botnet member’s traffic looks like the other botnet members today — which is a different beast entirely from itself yesterday. Standard per-host baselines miss this; cross-host similarity does not.

Three patterns in particular justify clustering-based detection:

Synchronised C2 callbacks. Botnets often poll a shared C2 at the same moment. Even with per-bot jitter, the cross-correlation of traffic time-series across members reveals the synchronisation.
Shared external destinations. All members of a botnet eventually hit the same operator infrastructure. Jaccard similarity over destination sets across host pairs reveals the overlap, even when no single destination is on a threat-intel feed.
Pre-DDoS staging. Before a DDoS attack, bots run DNS lookups and TCP probes against the target. Individual lookups are unremarkable; clusters of dozens of internal hosts running the same DNS queries within minutes are not.

Building the Per-Host Behavioural Fingerprint

For each srcAddr and each 1-hour window, we build a normalised feature vector. The 14 dimensions we use:

Total outbound bytes
Flow count
Unique destination IP count
Top-destination hash (a stable hash of the most-talked-to destination)
Destination port distribution (vector across known service ports)
Protocol ratio (TCP / UDP / ICMP / other)
Average inter-arrival time
Average packet size
DNS query rate
New-destination percentage
HTTPS proportion of traffic
HTTP proportion of traffic
DNS proportion of traffic
UDP proportion of traffic

Every feature is min-max normalised to [0, 1] so that no single feature dominates the similarity metric. Normalisation is per-feature, per-day — the bytes feature is rescaled against the daily population, not the host’s own history.
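As a rough illustration of the population-level rescaling described above, here is a minimal sketch in plain Python. The function name `min_max_normalise`, the three-feature toy vectors, and the choice to map constant columns to 0.0 are assumptions made for the example, not part of the original pipeline.

```python
from typing import Dict, List


def min_max_normalise(vectors: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Rescale every feature column to [0, 1] across the whole host population."""
    hosts = list(vectors)
    if not hosts:
        return {}
    n_features = len(vectors[hosts[0]])
    # Column-wise minimum and maximum over all hosts (not per-host history).
    lows = [min(vectors[h][j] for h in hosts) for j in range(n_features)]
    highs = [max(vectors[h][j] for h in hosts) for j in range(n_features)]
    normalised = {}
    for h in hosts:
        row = []
        for j in range(n_features):
            span = highs[j] - lows[j]
            # Assumption: a constant column carries no signal, so map it to 0.0.
            row.append((vectors[h][j] - lows[j]) / span if span > 0 else 0.0)
        normalised[h] = row
    return normalised


# Toy population: [bytes_out, flow_count, unique_dsts] for one hour.
raw = {
    "10.0.1.10": [1_000_000.0, 120.0, 40.0],
    "10.0.1.11": [5_000_000.0, 300.0, 10.0],
    "10.0.1.12": [3_000_000.0, 210.0, 25.0],
}
normalised = min_max_normalise(raw)
```

Because the minimum and maximum come from the population, a single very noisy host compresses everyone else towards zero; clipping at a high percentile before rescaling is one possible mitigation.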

Cosine Similarity and Pairwise Behavioural Comparison

Once each host has a 14-dimensional vector for a given hour, we compute the pairwise cosine similarity between every host pair in the population. The metric is:

cos(A, B) = (A · B) / (‖A‖ · ‖B‖)

Cosine similarity ranges from −1 (opposite directions) to +1 (identical direction) in general; because every feature here is min-max normalised to [0, 1], the vectors are non-negative and the score falls between 0 and +1. For high-dimensional behavioural fingerprints, the operationally interesting threshold sits around 0.85: host pairs with similarity above 0.85 in the same time window are operating in nearly identical behavioural states. In healthy environments, very few pairs cross this threshold; in a botnet-infected environment, you see dozens or hundreds.
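The pairwise comparison can be sketched in a few lines of standard-library Python. The helper names `cosine` and `similar_pairs` and the four-feature toy fingerprints are hypothetical; a real run would use the full per-host vectors and most likely a vectorised library.

```python
import math
from itertools import combinations


def cosine(a, b):
    """cos(A, B) = (A . B) / (|A| * |B|); returns 0.0 for an all-zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(vectors, threshold=0.85):
    """Return (host_a, host_b, similarity) for every pair above the threshold."""
    flagged = []
    for (host_a, vec_a), (host_b, vec_b) in combinations(sorted(vectors.items()), 2):
        score = cosine(vec_a, vec_b)
        if score > threshold:
            flagged.append((host_a, host_b, score))
    return flagged


# Toy fingerprints for one hour; two hosts behave almost identically.
fingerprints = {
    "10.0.1.10": [0.90, 0.10, 0.80, 0.20],
    "10.0.1.11": [0.88, 0.12, 0.79, 0.22],
    "10.0.2.50": [0.10, 0.90, 0.05, 0.70],
}
flagged = similar_pairs(fingerprints)
```

In this toy population only the first two hosts exceed 0.85; the third host scores below 0.3 against both.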

For very large environments (10,000+ internal hosts), computing the full pairwise matrix is O(n²) and gets expensive. Two mitigations:

Locality-sensitive hashing (LSH) approximates the nearest-neighbour search in sub-quadratic time. Worth the complexity at > 5,000 hosts.
Sliding-window restriction — only compute similarities within hosts that have non-zero outbound activity in the same hour. The off-hours subset is small.

K-Means and Hierarchical Clustering

From the behavioural vectors and the pairwise similarities, two clustering algorithms run in parallel:

K-means with k selected automatically via the silhouette score. The silhouette score for a point i is:

s(i) = (b(i) − a(i)) / max(a(i), b(i))

where  a(i) = mean distance from i to other points in its cluster
       b(i) = mean distance from i to points in the nearest other cluster

We sweep k from 2 to 30 and pick the k that maximises the average silhouette. Silhouettes above 0.7 indicate strong, well-separated clusters; values below 0.5 suggest the partition is weak. A botnet-infected environment typically produces one or more very high-silhouette clusters embedded in an otherwise low-silhouette population.
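To make the silhouette definition concrete, the following sketch computes s(i) for a given labelling using Euclidean distance and the standard library only. The function name `silhouette_samples`, the 2-D toy points, and the convention of scoring singleton clusters as 0.0 are assumptions for illustration; in the sweep itself, a library routine such as scikit-learn's `silhouette_score` would normally be applied to each candidate k.

```python
import math
from collections import defaultdict


def silhouette_samples(points, labels):
    """Per-point silhouette s(i) = (b - a) / max(a, b). Needs at least two clusters."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # assumption: singleton clusters score 0
            continue
        # a(i): mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_samples(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_samples(points, [0, 1, 0, 1, 0, 1])
good_mean = sum(good) / len(good)
bad_mean = sum(bad) / len(bad)
```

The well-separated labelling averages above 0.7, while the interleaved labelling scores far lower, which is the contrast the sweep relies on when it picks k.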

Hierarchical clustering with Ward linkage builds a dendrogram showing which hosts merge into clusters first (most similar) and last (most distinct). The dendrogram is visual gold for analyst review — a tight branch of 8 hosts that fuses at very low distance is a strong botnet candidate even before any temporal validation.

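The two algorithms catch different cases. K-means is fast, scales well, and produces clear cluster assignments. Hierarchical clustering captures nested structure (sub-clusters within larger clusters) and does not need k fixed in advance. Note that Ward linkage, like K-means, minimises within-cluster variance and so favours compact, roughly spherical clusters; if elongated clusters matter, average or single linkage is the more robust choice. Run both, take the intersection.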
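One way to read "take the intersection" is to keep only the groups of hosts that both algorithms place together. The sketch below assumes each algorithm has already produced a host-to-label mapping; the function name `consensus_clusters` and the minimum size of three (borrowed from the three-host rule in the introduction) are illustrative choices rather than the pipeline's confirmed logic.

```python
from collections import defaultdict


def consensus_clusters(kmeans_labels, ward_labels, min_size=3):
    """Groups of hosts that share a cluster under BOTH labelings."""
    groups = defaultdict(set)
    for host, k_label in kmeans_labels.items():
        if host in ward_labels:
            # Hosts land in the same bucket only if both labels agree.
            groups[(k_label, ward_labels[host])].add(host)
    return [sorted(members) for members in groups.values() if len(members) >= min_size]


kmeans_labels = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1, "f": 1}
ward_labels = {"a": 5, "b": 5, "c": 5, "d": 7, "e": 7, "f": 7}
candidates = consensus_clusters(kmeans_labels, ward_labels)
```

Here host "d" is dropped because the two algorithms disagree about it, and the pair "e", "f" falls below the minimum size, leaving a single candidate cluster of three hosts.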

Temporal Validation and Shared-Destination Analysis

A cluster of behaviourally similar hosts is necessary but not sufficient. The final validation steps are:

Temporal cross-correlation. For each pair of hosts in a cluster, compute the cross-correlation of their flow-rate time series:

corr(A, B, τ) = Σ_t (flow_rate_A(t) · flow_rate_B(t + τ))
              / √(Σ_t flow_rate_A(t)² · Σ_t flow_rate_B(t + τ)²)

Here τ is the lag between the two series; τ = 0 compares simultaneous activity. A peak cross-correlation above 0.8 at near-zero lag (|τ| ≤ 60 seconds) is strong evidence of synchronisation. Real bots receiving the same C2 command produce exactly this signature.
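A small sketch of the lag search, assuming flow rates have already been binned into fixed 10-second buckets. The bin width, the helper names `xcorr_at_lag` and `peak_xcorr`, and the toy series are assumptions; the normalisation is the uncentred form given in the formula above.

```python
import math


def xcorr_at_lag(a, b, lag):
    """Normalised (uncentred) correlation between a[t] and b[t + lag]."""
    pairs = [(a[t], b[t + lag]) for t in range(len(a)) if 0 <= t + lag < len(b)]
    num = sum(x * y for x, y in pairs)
    den = math.sqrt(sum(x * x for x, _ in pairs) * sum(y * y for _, y in pairs))
    return num / den if den > 0 else 0.0


def peak_xcorr(a, b, max_lag):
    """Best (score, lag) over lags in [-max_lag, +max_lag] bins."""
    return max(
        ((xcorr_at_lag(a, b, lag), lag) for lag in range(-max_lag, max_lag + 1)),
        key=lambda pair: pair[0],
    )


BIN_SECONDS = 10  # assumed bin width
host_a = [0.0] * 60
host_b = [0.0] * 60
for burst in (5, 20, 35, 50):
    host_a[burst] = 40.0      # flows per bin during a callback
    host_b[burst + 1] = 38.0  # same callbacks, one bin (10 s) later

score, lag = peak_xcorr(host_a, host_b, max_lag=6)  # search +/- 60 seconds
synchronised = score > 0.8 and abs(lag) * BIN_SECONDS <= 60
```

Because the formula is uncentred, two hosts with steady, non-zero traffic also score highly; subtracting each series' mean first is a common variation if that becomes a false-positive source.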

Shared external destinations (Jaccard similarity). The destination-set Jaccard similarity between host pairs is:

J(A, B) = |dst_set_A ∩ dst_set_B| / |dst_set_A ∪ dst_set_B|

A Jaccard score above 0.5 for three or more host pairs within the same cluster, combined with at least one shared external destination, fires the high-confidence botnet alert. The shared destination is extracted automatically and pushed to the threat-intel layer as an emerging C2 candidate.
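The alert rule can be sketched as follows. The helper names `jaccard` and `high_confidence_botnet` are hypothetical, the addresses are documentation-range examples, and treating "shared external destination" as a destination common to every host in the qualifying pairs is an interpretation, not something the rule spells out.

```python
from itertools import combinations


def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def high_confidence_botnet(dst_sets, threshold=0.5, min_pairs=3):
    """Return (fires, shared_destinations) for one candidate cluster."""
    strong = [
        (x, y)
        for x, y in combinations(sorted(dst_sets), 2)
        if jaccard(dst_sets[x], dst_sets[y]) > threshold
    ]
    if len(strong) < min_pairs:
        return False, set()
    members = {host for pair in strong for host in pair}
    # Assumption: "shared" means contacted by every host in the strong pairs.
    shared = set.intersection(*(dst_sets[h] for h in members))
    return bool(shared), shared


c2 = {"203.0.113.10", "203.0.113.11", "203.0.113.12"}
cluster = {
    "10.0.1.10": c2 | {"198.51.100.1"},
    "10.0.1.11": c2 | {"198.51.100.2"},
    "10.0.1.12": c2 | {"198.51.100.3"},
    "10.0.1.13": {"192.0.2.50", "192.0.2.51"},
}
fires, shared = high_confidence_botnet(cluster)
```

Each of the three similar hosts scores 3/5 = 0.6 against the other two, so three pairs qualify and the three common addresses come back as emerging C2 candidates.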

Feature Engineering from VPC Flow Logs

Feature	Source attributes	Formula	What it captures
Behaviour vector	all outbound attributes	14-dim normalised vector per host / hour	Host traffic fingerprint
Destination overlap	srcAddr, dstAddr	|dst_A ∩ dst_B| / |dst_A ∪ dst_B|	Shared C2 infrastructure (Jaccard)
Temporal sync score	start per srcAddr	cross_correlation(flow_rate_A, flow_rate_B)	Synchronised bot activity
Port distribution vector	dstPort	[%p80, %p443, %p53, %p_other] per host	Service-access fingerprint
Volume anomaly	bytes per host	(bytes_current − μ) / σ per host	Individual host deviation
DNS burst correlation	dstPort = 53, start	correlation(dns_rate_A, dns_rate_B)	Pre-DDoS DNS staging detection
REJECT correlation	action = REJECT	correlation(reject_rate_A, reject_rate_B)	Synchronised scanning / probing

Athena SQL — Host Behaviour Vectorisation

WITH host_hourly AS (
    SELECT srcaddr,
           DATE_FORMAT(from_unixtime(start), '%Y-%m-%d %H') AS hour,
           SUM(bytes)                                       AS bytes_out,
           COUNT(*)                                         AS flow_count,
           COUNT(DISTINCT dstaddr)                          AS unique_dsts,
           COUNT(DISTINCT dstport)                          AS unique_ports,
           AVG(bytes)                                       AS avg_flow_bytes,
           SUM(CASE WHEN dstport = 443 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_https,
           SUM(CASE WHEN dstport = 53  THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_dns,
           SUM(CASE WHEN dstport = 80  THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_http,
           SUM(CASE WHEN protocol = 17 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_udp
    FROM central_vpc_flow_logs
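    WHERE action = 'ACCEPT' AND srcaddr LIKE '10.%'
      -- keep only destinations outside RFC 1918 space (10/8, 172.16/12, 192.168/16);
      -- a bare '172.%' filter would also drop public 172.x addresses
      AND NOT regexp_like(dstaddr, '^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)')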
      AND day BETWEEN '2026/03/19' AND '2026/03/23'
    GROUP BY srcaddr, DATE_FORMAT(from_unixtime(start), '%Y-%m-%d %H')
),
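-- Host pairs that reached the same external destinations in the same hour.
-- Note: the final SELECT reads only shared_dests; host_hourly above has to be
-- exported by its own statement if the per-host vectors are needed downstream.
shared_dests AS (
    SELECT a.srcaddr AS host_a, b.srcaddr AS host_b, a.hour,
           COUNT(DISTINCT a_dst) AS shared_destinations
    FROM (
        -- DISTINCT reduces raw flows to (host, hour, destination) triples,
        -- so the self-join does not multiply repeated flows together
        SELECT DISTINCT srcaddr,
               DATE_FORMAT(from_unixtime(start), '%Y-%m-%d %H') AS hour,
               dstaddr AS a_dst
        FROM central_vpc_flow_logs
        WHERE action = 'ACCEPT' AND srcaddr LIKE '10.%'
          AND NOT regexp_like(dstaddr, '^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)')
          AND day BETWEEN '2026/03/19' AND '2026/03/23'
    ) a
    JOIN (
        SELECT DISTINCT srcaddr,
               DATE_FORMAT(from_unixtime(start), '%Y-%m-%d %H') AS hour,
               dstaddr AS b_dst
        FROM central_vpc_flow_logs
        WHERE action = 'ACCEPT' AND srcaddr LIKE '10.%'
          AND NOT regexp_like(dstaddr, '^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)')
          AND day BETWEEN '2026/03/19' AND '2026/03/23'
    ) b
    -- a.srcaddr < b.srcaddr keeps each unordered host pair exactly once
    ON a.hour = b.hour AND a.a_dst = b.b_dst AND a.srcaddr < b.srcaddr
    GROUP BY a.srcaddr, b.srcaddr, a.hour
    HAVING COUNT(DISTINCT a_dst) >= 5
)
SELECT * FROM shared_dests ORDER BY shared_destinations DESC;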

Tuning notes:

The HAVING COUNT(DISTINCT a_dst) >= 5 threshold sets the minimum destination overlap to qualify as suspicious. Five is a reasonable starting point; raise to 10 if false positives dominate.
The self-join is O(n²) over destinations in the worst case. Partition by hour as shown — the per-hour subsets stay small.
Output volume is typically a few thousand suspicious host pairs per day. The downstream ML layer narrows that to a handful of candidate clusters.

The Botnet Confidence Score

The final confidence score for a candidate cluster combines five signals:

Botnet Confidence = w₁ · jaccard_avg
                  + w₂ · cosine_avg
                  + w₃ · temporal_xcorr_avg
                  + w₄ · cluster_silhouette
                  + w₅ · shared_dst_count

initial weights:  w₁=0.30, w₂=0.20, w₃=0.30, w₄=0.10, w₅=0.10

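Each signal has to be on a [0, 1] scale before weighting, otherwise the thresholds below are not meaningful: shared_dst_count is a raw count and needs capping and rescaling, and a negative silhouette should be clipped to 0. Weights can be tuned with logistic regression once you have a few dozen labelled clusters from past incidents or red-team exercises. Scores above 0.7 fire as confirmed-botnet alerts; scores between 0.5 and 0.7 go to the investigation queue.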
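A minimal sketch of the weighted score and the two thresholds follows. The weights are the initial values listed above; the cap of 10 shared destinations, the clipping of negative silhouettes, and the function names `botnet_confidence` and `triage` are assumptions introduced so that every input sits in [0, 1].

```python
WEIGHTS = {
    "jaccard_avg": 0.30,
    "cosine_avg": 0.20,
    "temporal_xcorr_avg": 0.30,
    "cluster_silhouette": 0.10,
    "shared_dst_count": 0.10,
}


def botnet_confidence(signals, dst_count_cap=10):
    """Weighted sum of the five signals after scaling each into [0, 1]."""
    scaled = dict(signals)
    # Assumption: cap the raw destination count, then rescale to [0, 1].
    scaled["shared_dst_count"] = min(signals["shared_dst_count"], dst_count_cap) / dst_count_cap
    # Assumption: a negative silhouette contributes nothing rather than subtracting.
    scaled["cluster_silhouette"] = max(signals["cluster_silhouette"], 0.0)
    return sum(weight * scaled[name] for name, weight in WEIGHTS.items())


def triage(score):
    """Map a score onto the two alerting tiers described in the text."""
    if score > 0.7:
        return "confirmed-botnet alert"
    if score >= 0.5:
        return "investigation queue"
    return "no action"


cluster_signals = {
    "jaccard_avg": 0.80,
    "cosine_avg": 0.95,
    "temporal_xcorr_avg": 0.90,
    "cluster_silhouette": 0.75,
    "shared_dst_count": 6,
}
score = botnet_confidence(cluster_signals)
decision = triage(score)
```

For this example the terms are 0.24 + 0.19 + 0.27 + 0.075 + 0.06, giving a score of about 0.835 and a confirmed-botnet alert.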

Putting It Into Production

VPC Flow Logs → S3 (Parquet partitioned by day).
EventBridge → daily Athena query at 03:30 local produces the hourly host vectors and the shared-destination pairs.
SageMaker Processing job loads the vectors and runs scikit-learn KMeans (with silhouette sweep) and SciPy hierarchical clustering. The temporal cross-correlation runs in NumPy.
Candidate clusters → SNS / Kinesis → SIEM, with the destination set, member hosts, and confidence score attached.
Weekly re-baseline of the population-level normalisation parameters.

For 5,000+ host environments, the SageMaker job needs roughly 4 GB of RAM and 10–20 minutes of wall time. Costs are negligible on spot pricing.

Detection Coverage Matrix

Botnet pattern	Signature	Detection
IoT botnet (Mirai-family)	High flow count to non-RFC1918 destinations, shared C2 IPs	Strong — large clusters, high Jaccard, easy
Spam botnet (Emotet-style)	SMTP / web requests to shared list-host infrastructure	Strong
Residential proxy malware	Outbound connections to shared proxy gateways	Strong — proxy gateways aggregate flows clearly
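Pre-DDoS staging	Synchronised DNS lookups for the target’s names	Strong where DNS traffic is visible: DNS burst correlation fires, but VPC Flow Logs omit queries to the Amazon-provided resolver and never record the queried name, so pair with resolver query logs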
Sophisticated APT botnet (3–5 hosts)	Low-volume coordinated behaviour	Partial — needs Jaccard threshold relaxed; high analyst overhead
Living-off-the-land coordination	Distributed scripted activity through legitimate tooling	Uncovered — see post #5 (Markov kill chains)
Single bot in isolation	Behaviour matches only itself	Uncovered — needs per-host detection (posts #1, #3)

Limits and False-Positive Sources

Identical workload classes — Kubernetes pods running the same image will produce nearly identical behavioural fingerprints. Cluster by host role or pod label before alerting.
Backup / monitoring fleets — every node in a monitoring cluster hits the same endpoints simultaneously. Tag at source.
CI / CD runners: GitHub Actions runners, Jenkins agents, and other source-control platform runners all show shared destinations during a build. Allow-list by source.
SaaS app clusters behind a shared egress NAT — multiple internal hosts appearing as the same source IP to the SaaS provider, with shared destinations on the way out. Source-IP transparency at the NAT layer fixes this.
Container start-up surges — autoscaling events create temporary clusters of nearly identical hosts. Time-bound suppression for the first 10 minutes of any new container.
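Several of the suppressions above reduce to "do not alert when every member of a cluster belongs to the same known workload class". A hedged sketch of that filter, assuming a host-to-role mapping is available from an asset inventory or pod labels (the `host_roles` dictionary, the role names, and the function name are all hypothetical):

```python
def suppress_homogeneous(clusters, host_roles):
    """Drop candidate clusters whose members all carry the same known role tag."""
    kept = []
    for members in clusters:
        roles = {host_roles.get(host) for host in members}
        # One shared, known role means the similarity is expected behaviour.
        if len(roles) == 1 and None not in roles:
            continue
        kept.append(members)
    return kept


host_roles = {
    "10.0.3.1": "ci-runner",
    "10.0.3.2": "ci-runner",
    "10.0.3.3": "ci-runner",
    "10.0.1.10": "dev-workstation",
    "10.0.4.7": "database",
    "10.0.9.2": "print-server",
}
candidate_clusters = [
    ["10.0.3.1", "10.0.3.2", "10.0.3.3"],    # identical CI runners
    ["10.0.1.10", "10.0.4.7", "10.0.9.2"],   # unrelated roles behaving alike
]
remaining = suppress_homogeneous(candidate_clusters, host_roles)
```

Untagged hosts are deliberately never suppressed. Note that a fleet-wide compromise of one workload class would also be hidden by this filter, so it is better treated as a priority downgrade than a hard drop where that risk matters.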

MITRE ATT&CK Techniques Covered by This Detection

This pipeline targets the Resource Development (TA0042) and Impact (TA0040) tactics on the offensive side, with cross-references into Command and Control (TA0011). Botnet detection is a multi-host hunt — single-host telemetry will never surface it. The table below maps techniques to coverage so you can plan complementary hunts for the gaps.

ATT&CK ID	Technique / sub-technique	Coverage	Hunter notes
T1583.005	Acquire Infrastructure: Botnet (operator side)	Out of scope	Adversary-side technique — surfaces as shared C2 in your environment
T1584.005	Compromise Infrastructure: Botnet (your hosts become bots)	Full	This is the actual target of the detection — internal bots
T1498	Network Denial of Service (parent)	Full	Pre-staging behaviour visible in clustering
T1498.001	Direct Network Flood	Full	Synchronised SYN / UDP / ICMP bursts cluster trivially
T1498.002	Reflection Amplification	Partial	Outbound spoof-source not in VPC Flow Logs; pair with provider-side telemetry
T1499	Endpoint Denial of Service (parent)	Partial	—
T1499.001	OS Exhaustion Flood	Partial	—
T1499.002	Service Exhaustion Flood	Partial	—
T1499.004	Application or System Exploitation	Partial	—
T1071.001	App Layer Protocol: Web Protocols (shared C2)	Full	Jaccard overlap on web destinations is the canonical signal
T1071.004	App Layer Protocol: DNS (pre-DDoS staging)	Full	DNS-burst correlation feature is purpose-built
T1090	Proxy (parent)	Full	Residential proxy gateways aggregate clearly
T1090.001	Internal Proxy	Partial	—
T1090.002	External Proxy	Full	—
T1090.003	Multi-hop Proxy	Full	—
T1090.004	Domain Fronting	Partial	Domain-level signal needs DNS+TLS metadata
T1095	Non-Application Layer Protocol	Partial	Raw-socket bot coordination — covered for ICMP and UDP variants
T1568	Dynamic Resolution (parent)	Partial	—
T1568.001	Fast Flux DNS	Partial	Shared rapidly-rotating destinations surface in Jaccard
T1505	Server Software Component (webshell coordination)	Out of scope	Host-side; pair with file-integrity monitoring

Adversary emulation / purple-team validation. Botnet emulation is harder than single-host emulation because it inherently needs multiple lab hosts. Two practical options: (1) Spin up 5–10 EC2 instances in a sandbox VPC and run a synchronised Python script that polls a common C2 endpoint every 30 seconds with jitter; the cosine similarity and Jaccard scores should peak within one hour. (2) Use an open-source adversary-emulation framework’s “command-and-control” adversary profile with multiple agents to emulate the synchronised callback pattern. For DDoS staging specifically, the public adversary-emulation atomics for T1498 provide safe building blocks.

Sigma / detection-as-code. The cluster-level alert is unusual: most Sigma rules are per-host. Your SIEM needs to support multi-event correlation (Splunk’s tstats, Elastic’s threshold rule, Sentinel’s NRT detection). Once the pipeline emits a botnet_cluster_id field tagging member hosts, downstream correlation rules become straightforward.

D3FEND mappings. This pipeline implements D3-NTCD (Network Traffic Community Deviation) across hosts rather than across flows — same defensive technique, different unit of analysis. Pair with D3-HSDN (Host Shutdown) for active response when bot containment is needed.

Where This Sits in a Mature Threat Hunting Programme

Adaptive C2 beacon detection (FFT + DBSCAN) — per-host periodicity.
Lateral movement graph detection — internal pivot graph.
Low-and-slow data exfiltration detection — exfiltration end of the chain.
VPC Flow Log attack hunting.
Cloud attack threat hunting.
Outbound network threat hunting.
Hunting AWS identity attacks.
AWS Bedrock CloudTrail playbook.
Authentication-event threat hunting.

Closing Thoughts

Botnet detection by cross-host similarity catches what every per-host rule misses, and the maths is mature enough that the entire pipeline can be built on free libraries. The hardest part is the false-positive curation in the first month; after that, the pipeline runs essentially unattended.

If you find a fresh botnet cluster in your environment using the techniques in this post, share the IOCs with the community — they help everyone. Happy threat hunting.
