Botnet Coordination & DDoS Staging Hunt — K-means + hierarchical clustering on VPC Flow Logs — HACKFORLAB cover image

Hunting Botnet Coordination and DDoS Staging with Clustering

From the hunt desk. If your environment has compromised internal hosts coordinated by an operator, single-host detections will not see it. The bots look normal individually because they are normal individually. The signal lives in cross-host similarity. This post is the unsupervised-clustering playbook for MITRE ATT&CK T1584.005 (Compromise Infrastructure: Botnet) and TA0040 (Impact) coverage — including pre-DDoS staging detection, the canonical signal that distinguishes a planning operator from background noise.

If your environment has been infected by a coordinated bot operator (an IoT botnet, an Emotet-style spambot, a residential proxy network, or a pre-DDoS staging cluster), the individual bots will not look suspicious. Each one will produce traffic that falls cleanly inside the host’s own baseline. The attack is only visible when you stop looking at hosts in isolation and start asking: are any groups of hosts on my network behaving suspiciously alike?

That question — “show me hosts whose network behaviour has converged” — is exactly what unsupervised clustering algorithms answer. This playbook walks through a pipeline that builds a normalised behavioural fingerprint per host per hour, computes pairwise cosine similarity across all hosts, runs K-means and hierarchical (Ward-linkage) clustering, then validates the clusters by computing temporal cross-correlation of traffic time-series across cluster members. Clusters of three or more hosts that share external destinations and have cross-correlation above 0.8 with near-zero lag are botnet candidates.

This is post #4 of our five-post VPC Flow Log detection-engineering series. The companions are adaptive C2 beacon detection (FFT + DBSCAN), lateral movement graph detection, and low-and-slow data exfiltration with Isolation Forest + LSTM. Post #5 closes the series on living-off-the-land kill chains.

Why Per-Host Botnet Detection Fails

The standard botnet detection signature — “a host has been observed talking to a known C2 IP” — depends on someone, somewhere, having already burned the C2 IP onto a threat-intel feed. For mature operators, by the time the IP is on a feed, the infrastructure has rotated. For commodity botnets, the volume of C2 IPs is too large to maintain. And for novel campaigns, the IPs are never on a feed at all.

The structural insight is that botnet members behave like each other more than they behave like themselves. A normal host’s traffic looks like its own historical traffic — the developer workstation today resembles the developer workstation yesterday. A botnet member’s traffic looks like the other botnet members today — which is a different beast entirely from itself yesterday. Standard per-host baselines miss this; cross-host similarity does not.

Three patterns in particular justify clustering-based detection:

  • Synchronised C2 callbacks. Botnets often poll a shared C2 at the same moment. Even with per-bot jitter, the cross-correlation of traffic time-series across members reveals the synchronisation.
  • Shared external destinations. All members of a botnet eventually hit the same operator infrastructure. Jaccard similarity over destination sets across host pairs reveals the overlap, even when no single destination is on a threat-intel feed.
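  • Pre-DDoS staging. Before a DDoS attack, bots run DNS lookups and TCP probes against the target. Individual lookups are unremarkable; clusters of dozens of internal hosts running the same DNS queries within minutes are not. Note that VPC Flow Logs record port-53 flows but not query names, and they omit traffic to the Amazon-provided DNS resolver, so this signal needs a self-managed resolver in the path or resolver query logs alongside the flow data.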

Building the Per-Host Behavioural Fingerprint

For each srcAddr and each 1-hour window, we build a normalised feature vector. The 14 dimensions we use:

  1. Total outbound bytes
  2. Flow count
  3. Unique destination IP count
  4. Top-destination hash (a stable hash of the most-talked-to destination)
  5. Destination port distribution (vector across known service ports)
  6. Protocol ratio (TCP / UDP / ICMP / other)
  7. Average inter-arrival time
  8. Average packet size
  9. DNS query rate
  10. New-destination percentage
  11. HTTPS proportion of traffic
  12. HTTP proportion of traffic
  13. DNS proportion of traffic
  14. UDP proportion of traffic

Every feature is min-max normalised to [0, 1] so that no single feature dominates the similarity metric. Normalisation is per-feature, per-day — the bytes feature is rescaled against the daily population, not the host’s own history.
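The normalisation step can be sketched in a few lines of NumPy. This is a minimal illustration, not the production code: the function name `min_max_normalise` and the three example columns are assumptions made for the example, and constant columns are mapped to 0 as one reasonable convention.

```python
import numpy as np

def min_max_normalise(features: np.ndarray) -> np.ndarray:
    """Rescale each column of a (hosts x features) matrix to [0, 1].

    Columns are scaled against the whole host population for the day,
    not against any single host's history. Constant columns map to 0.
    """
    features = np.asarray(features, dtype=float)
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    # Avoid division by zero for features that are identical on every host.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (features - col_min) / safe_range

# Hypothetical example: 3 hosts, columns = [bytes_out, flow_count, pct_dns]
raw = np.array([
    [1_000.0, 10.0, 5.0],
    [5_000.0, 30.0, 5.0],
    [9_000.0, 20.0, 5.0],
])
normalised = min_max_normalise(raw)
```

Because the minimum and maximum come from the population, a single extreme host compresses everyone else towards 0; clipping at a high percentile before scaling is a common variation if that becomes a problem.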

Cosine Similarity and Pairwise Behavioural Comparison

Once each host has a 14-dimensional vector for a given hour, we compute the pairwise cosine similarity between every host pair in the population. The metric is:

cos(A, B) = (A · B) / (‖A‖ · ‖B‖)

Cosine similarity ranges from −1 (opposite directions) to +1 (identical direction) in general; because every feature here is min-max normalised to [0, 1], the vectors have no negative components and the similarity between any two hosts falls in [0, 1]. For these behavioural fingerprints, the operationally interesting threshold sits around 0.85: host pairs with similarity above 0.85 in the same time window are operating in nearly identical behavioural states. In healthy environments, very few pairs cross this threshold; in a botnet-infected environment, you see dozens or hundreds.
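As a rough sketch of the pairwise step, the full similarity matrix is one matrix product once the rows are scaled to unit length. The helper names and the toy three-dimensional vectors below are illustrative assumptions; only the 0.85 threshold comes from the text.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return the (n_hosts x n_hosts) matrix of pairwise cosine similarities."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # All-zero rows have no direction; keep them as zeros so they match nothing.
    unit = np.divide(vectors, norms, out=np.zeros_like(vectors), where=norms > 0)
    return unit @ unit.T

def similar_pairs(vectors, host_ids, threshold=0.85):
    """List (host_a, host_b, similarity) for every pair above the threshold."""
    sim = cosine_similarity_matrix(vectors)
    rows, cols = np.triu_indices(len(host_ids), k=1)  # each unordered pair once
    return [
        (host_ids[i], host_ids[j], float(sim[i, j]))
        for i, j in zip(rows, cols)
        if sim[i, j] > threshold
    ]

# Hypothetical fingerprints (3 dimensions instead of 14 to keep the example short).
hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
fingerprints = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
pairs = similar_pairs(fingerprints, hosts)
```

In this toy input only the first two hosts point in nearly the same direction, so they are the only pair returned.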

For very large environments (10,000+ internal hosts), computing the full pairwise matrix is O(n²) and gets expensive. Two mitigations:

  • Locality-sensitive hashing (LSH) approximates the nearest-neighbour search in sub-quadratic time. Worth the complexity at > 5,000 hosts.
  • Sliding-window restriction: only compute similarities among hosts that have non-zero outbound activity in the same hour. The off-hours subset is small.
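One way to approximate the neighbour search is random-hyperplane LSH, sketched below. This is an illustration of the general technique rather than the exact method used in the pipeline: the function name, the number of tables and the number of bits are all assumptions, and the candidate pairs it returns still need an exact cosine check afterwards.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np

def lsh_candidate_pairs(vectors, n_tables=8, n_bits=12, seed=0):
    """Return index pairs (i, j), i < j, that share a bucket in any hash table.

    Each table hashes a vector to the sign pattern of its projection onto
    n_bits random hyperplanes; vectors separated by a small angle tend to
    land in the same bucket.
    """
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    candidates = set()
    for _ in range(n_tables):
        planes = rng.standard_normal((vectors.shape[1], n_bits))
        bits = (vectors @ planes) > 0            # shape: (n_hosts, n_bits)
        buckets = defaultdict(list)
        for idx, signature in enumerate(bits):
            buckets[signature.tobytes()].append(idx)
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    return candidates

# Toy check: rows 0 and 1 point the same way, row 2 points the opposite way.
toy = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]])
toy_candidates = lsh_candidate_pairs(toy)
```

More bits per table make buckets stricter and more tables recover pairs that a single table misses. Because the fingerprints here are all non-negative, centring each feature on its population mean before hashing may spread the hosts over more buckets; treat that as something to test rather than a given.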

K-Means and Hierarchical Clustering

From the behavioural vectors and the pairwise similarities, two clustering algorithms run in parallel:

K-means with k selected automatically via the silhouette score. The silhouette score for a point i is:

s(i) = (b(i) − a(i)) / max(a(i), b(i))

where  a(i) = mean distance from i to other points in its cluster
       b(i) = mean distance from i to points in the nearest other cluster

We sweep k from 2 to 30 and pick the k that maximises the average silhouette. Silhouettes above 0.7 indicate strong, well-separated clusters; values below 0.5 suggest the partition is weak. A botnet-infected environment typically produces one or more very high-silhouette clusters embedded in an otherwise low-silhouette population.
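The sweep can be sketched with scikit-learn's `KMeans` and `silhouette_score`. The function name, the fixed random seed and the synthetic two-dimensional demo data are assumptions for illustration; real input would be the normalised host vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(vectors, k_min=2, k_max=30, seed=0):
    """Fit K-means for each k and return (k, mean_silhouette, labels) for the best k."""
    vectors = np.asarray(vectors, dtype=float)
    # The silhouette is only defined for 2 <= k <= n_samples - 1.
    k_max = min(k_max, len(vectors) - 1)
    best = None
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(vectors)
        score = silhouette_score(vectors, labels)
        if best is None or score > best[1]:
            best = (k, score, labels)
    return best

# Synthetic demo: three tight groups of 20 hypothetical hosts each.
rng = np.random.default_rng(42)
centres = np.array([[0.1, 0.1], [0.9, 0.1], [0.5, 0.9]])
demo = np.vstack([c + rng.normal(0, 0.02, size=(20, 2)) for c in centres])
best_k, best_score, best_labels = best_k_by_silhouette(demo, k_max=8)
```

The average silhouette selects k, but the per-cluster view matters for hunting: `sklearn.metrics.silhouette_samples` gives per-host values that can be averaged by label to find the individual tight clusters the paragraph above describes.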

Hierarchical clustering with Ward linkage builds a dendrogram showing which hosts merge into clusters first (most similar) and last (most distinct). The dendrogram is visual gold for analyst review — a tight branch of 8 hosts that fuses at very low distance is a strong botnet candidate even before any temporal validation.

The two algorithms catch different cases. K-means is fast, scales well, and produces clear cluster assignments. Hierarchical clustering captures nested structure (sub-clusters within larger clusters) and does not need k fixed in advance, although Ward linkage, like K-means, favours compact, roughly spherical clusters. Run both, take the intersection.
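A minimal sketch of the hierarchical side uses SciPy's `linkage` and `fcluster`. The distance threshold of 0.5, the helper names and the reading of "take the intersection" as "keep groups of hosts that both algorithms place together" are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def ward_clusters(vectors, distance_threshold=0.5):
    """Build a Ward dendrogram and cut it at a fixed merge distance."""
    tree = linkage(np.asarray(vectors, dtype=float), method="ward")
    labels = fcluster(tree, t=distance_threshold, criterion="distance")
    return tree, labels

def agreed_groups(kmeans_labels, ward_labels, min_size=3):
    """Return groups of host indices that share both a K-means and a Ward label."""
    groups = {}
    for idx, (k_label, w_label) in enumerate(zip(kmeans_labels, ward_labels)):
        groups.setdefault((int(k_label), int(w_label)), []).append(idx)
    return [members for members in groups.values() if len(members) >= min_size]

# Toy data: two tight groups of four hypothetical hosts.
points = np.array([
    [0.00, 0.00], [0.01, 0.00], [0.00, 0.01], [0.01, 0.01],
    [1.00, 1.00], [1.01, 1.00], [1.00, 1.01], [1.01, 1.01],
])
tree, ward_labels = ward_clusters(points)
kmeans_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # stand-in for real K-means output
groups = agreed_groups(kmeans_labels, ward_labels)
```

The `tree` array can be passed to `scipy.cluster.hierarchy.dendrogram` for the analyst-review plot.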

Temporal Validation and Shared-Destination Analysis

A cluster of behaviourally similar hosts is necessary but not sufficient. The final validation steps are:

Temporal cross-correlation. For each pair of hosts in a cluster, compute the cross-correlation of their flow-rate time series (written here at zero lag; shift one series by τ bins to evaluate other lags):

corr(A, B) = Σ_t (flow_rate_A(t) · flow_rate_B(t))
           / √(Σ_t flow_rate_A(t)² · Σ_t flow_rate_B(t)²)

Cross-correlation above 0.8 with a near-zero lag (≤ 60 seconds) is strong evidence of synchronisation. Real bots receiving the same C2 command produce exactly this signature.
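The lag search can be sketched in NumPy as follows. The function names are illustrative, and the example assumes the flow-rate series are already binned at a fixed width (for instance one bin per minute, in which case a lag of one bin corresponds to the 60-second limit above). The normalisation follows the uncentred formula given earlier, applied to the overlapping part of the two series.

```python
import numpy as np

def normalised_xcorr(a, b, lag=0):
    """Correlate a(t) with b(t + lag) over the overlapping part of the series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if lag > 0:
        a, b = a[:-lag], b[lag:]
    elif lag < 0:
        a, b = a[-lag:], b[:lag]
    denom = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0

def best_lag(a, b, max_lag):
    """Return (lag, score) for the lag in [-max_lag, max_lag] with the highest score."""
    scores = {lag: normalised_xcorr(a, b, lag) for lag in range(-max_lag, max_lag + 1)}
    lag = max(scores, key=scores.get)
    return lag, scores[lag]

# Hypothetical flow counts per bin: host B repeats host A's bursts one bin later.
host_a = [0, 0, 5, 0, 0, 5, 0, 0, 5, 0, 0, 0]
host_b = [0, 0, 0, 5, 0, 0, 5, 0, 0, 5, 0, 0]
lag, score = best_lag(host_a, host_b, max_lag=3)
```

Sparse, bursty series like these score near zero at lag 0 even when the hosts are clearly coordinated, which is why searching a small lag window is worth the extra work.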

Shared external destinations (Jaccard similarity). The destination-set Jaccard similarity between host pairs is:

J(A, B) = |dst_set_A ∩ dst_set_B| / |dst_set_A ∪ dst_set_B|

A Jaccard score above 0.5 for three or more host pairs within the same cluster, combined with at least one external destination common to all of those hosts, fires the high-confidence botnet alert. The shared destination is extracted automatically and pushed to the threat-intel layer as an emerging C2 candidate.
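The destination-overlap check is small enough to sketch with plain Python sets. The function names and the example addresses (taken from the documentation ranges) are assumptions; the 0.5 threshold and the three-pair minimum come from the text, and "shared destination" is interpreted here as a destination common to every host in the qualifying pairs.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two destination sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_destination_overlap(dst_sets, threshold=0.5, min_pairs=3):
    """Return (fires, strong_pairs, shared) for one candidate cluster.

    dst_sets maps each member host to the set of external destinations it
    contacted in the window.
    """
    strong_pairs = []
    for h1, h2 in combinations(sorted(dst_sets), 2):
        score = jaccard(dst_sets[h1], dst_sets[h2])
        if score > threshold:
            strong_pairs.append((h1, h2, score))
    hosts = {h for pair in strong_pairs for h in pair[:2]}
    # Destinations contacted by every host that appears in a strong pair.
    shared = set.intersection(*(dst_sets[h] for h in hosts)) if hosts else set()
    fires = len(strong_pairs) >= min_pairs and len(shared) >= 1
    return fires, strong_pairs, shared

# Hypothetical cluster of three hosts.
cluster = {
    "10.0.1.5": {"203.0.113.10", "203.0.113.11", "203.0.113.12", "198.51.100.7"},
    "10.0.2.9": {"203.0.113.10", "203.0.113.11", "203.0.113.12"},
    "10.0.3.4": {"203.0.113.10", "203.0.113.11", "203.0.113.12"},
}
fires, strong_pairs, shared = cluster_destination_overlap(cluster)
```

The `shared` set is what would be forwarded to the threat-intel layer as C2 candidates.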

Feature Engineering from VPC Flow Logs

| Feature | Source attributes | Formula | What it captures |
| --- | --- | --- | --- |
| Behaviour vector | all outbound attributes | 14-dim normalised vector per host / hour | Host traffic fingerprint |
| Destination overlap | srcAddr, dstAddr | size(dst_A ∩ dst_B) / size(dst_A ∪ dst_B) | Shared C2 infrastructure (Jaccard) |
| Temporal sync score | start per srcAddr | cross_correlation(flow_rate_A, flow_rate_B) | Synchronised bot activity |
| Port distribution vector | dstPort | [%p80, %p443, %p53, %p_other] per host | Service-access fingerprint |
| Volume anomaly | bytes per host | (bytes_current − μ) / σ per host | Individual host deviation |
| DNS burst correlation | dstPort = 53, start | correlation(dns_rate_A, dns_rate_B) | Pre-DDoS DNS staging detection |
| REJECT correlation | action = REJECT | correlation(reject_rate_A, reject_rate_B) | Synchronised scanning / probing |

Athena SQL — Host Behaviour Vectorisation

-- Hourly per-host feature vectors and shared-destination host pairs.
-- Internal hosts are assumed to live in 10.0.0.0/8 (srcaddr LIKE '10.%').
WITH host_hourly AS (
    SELECT srcaddr,
           DATE_FORMAT(from_unixtime(start), '%Y-%m-%d %H') AS hour,
           SUM(bytes)                                       AS bytes_out,
           COUNT(*)                                         AS flow_count,
           COUNT(DISTINCT dstaddr)                          AS unique_dsts,
           COUNT(DISTINCT dstport)                          AS unique_ports,
           AVG(bytes)                                       AS avg_flow_bytes,
           SUM(CASE WHEN dstport = 443 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_https,
           SUM(CASE WHEN dstport = 53  THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_dns,
           SUM(CASE WHEN dstport = 80  THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_http,
           SUM(CASE WHEN protocol = 17 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS pct_udp
    FROM central_vpc_flow_logs
    WHERE action = 'ACCEPT' AND srcaddr LIKE '10.%'
      -- Exclude every RFC 1918 destination. A bare '172.%' would also drop
      -- public 172.x addresses outside 172.16.0.0/12.
      AND NOT regexp_like(dstaddr, '^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)')
      AND day BETWEEN '2026/03/19' AND '2026/03/23'
    GROUP BY srcaddr, DATE_FORMAT(from_unixtime(start), '%Y-%m-%d %H')
),
shared_dests AS (
    SELECT a.srcaddr AS host_a, b.srcaddr AS host_b, a.hour,
           COUNT(DISTINCT a_dst) AS shared_destinations
    FROM (
        -- DISTINCT keeps one row per host / hour / destination so the
        -- self-join does not multiply repeated flows.
        SELECT DISTINCT srcaddr, DATE_FORMAT(from_unixtime(start), '%Y-%m-%d %H') AS hour,
               dstaddr AS a_dst
        FROM central_vpc_flow_logs
        WHERE action = 'ACCEPT' AND srcaddr LIKE '10.%'
          AND NOT regexp_like(dstaddr, '^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)')
          AND day BETWEEN '2026/03/19' AND '2026/03/23'
    ) a
    JOIN (
        SELECT DISTINCT srcaddr, DATE_FORMAT(from_unixtime(start), '%Y-%m-%d %H') AS hour,
               dstaddr AS b_dst
        FROM central_vpc_flow_logs
        WHERE action = 'ACCEPT' AND srcaddr LIKE '10.%'
          AND NOT regexp_like(dstaddr, '^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)')
          AND day BETWEEN '2026/03/19' AND '2026/03/23'
    ) b
    -- a.srcaddr < b.srcaddr returns each unordered host pair once.
    ON a.hour = b.hour AND a.a_dst = b.b_dst AND a.srcaddr < b.srcaddr
    GROUP BY a.srcaddr, b.srcaddr, a.hour
    HAVING COUNT(DISTINCT a_dst) >= 5
)
-- host_hourly is not read by this final SELECT; select from it instead of
-- shared_dests to export the hourly vectors.
SELECT * FROM shared_dests ORDER BY shared_destinations DESC;

Tuning notes:

  • The HAVING COUNT(DISTINCT a_dst) >= 5 threshold sets the minimum destination overlap to qualify as suspicious. Five is a reasonable starting point; raise to 10 if false positives dominate.
  • The self-join is O(n²) over destinations in the worst case. Partition by hour as shown — the per-hour subsets stay small.
  • Output volume is typically a few thousand suspicious host pairs per day. The downstream ML layer narrows that to a handful of candidate clusters.

The Botnet Confidence Score

The final confidence score for a candidate cluster combines five signals:

Botnet Confidence = w₁ · jaccard_avg
                  + w₂ · cosine_avg
                  + w₃ · temporal_xcorr_avg
                  + w₄ · cluster_silhouette
                  + w₅ · shared_dst_count

initial weights:  w₁=0.30, w₂=0.20, w₃=0.30, w₄=0.10, w₅=0.10

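Each signal needs to be on a [0, 1] scale before weighting: shared_dst_count is a raw count and the silhouette can be negative, so cap or rescale both, otherwise the 0.5 and 0.7 thresholds are not meaningful. Weights can be tuned with logistic regression once you have a few dozen labelled clusters from past incidents or red-team exercises. Scores above 0.7 fire as confirmed-botnet alerts; scores between 0.5 and 0.7 go to the investigation queue.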
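A small sketch of the scoring step is shown below. The weights and the 0.5 / 0.7 thresholds are the ones stated above; the function names, the cap of 10 on `shared_dst_count` and the clipping of every signal to [0, 1] are assumptions added so that the weighted sum stays bounded.

```python
WEIGHTS = {
    "jaccard_avg": 0.30,
    "cosine_avg": 0.20,
    "temporal_xcorr_avg": 0.30,
    "cluster_silhouette": 0.10,
    "shared_dst_count": 0.10,
}

def botnet_confidence(jaccard_avg, cosine_avg, temporal_xcorr_avg,
                      cluster_silhouette, shared_dst_count, shared_dst_cap=10):
    """Weighted sum of the five cluster signals, each clipped to [0, 1]."""
    signals = {
        "jaccard_avg": jaccard_avg,
        "cosine_avg": cosine_avg,
        "temporal_xcorr_avg": temporal_xcorr_avg,
        "cluster_silhouette": cluster_silhouette,
        # Assumption: scale the raw count by a cap so it cannot dominate.
        "shared_dst_count": shared_dst_count / shared_dst_cap,
    }
    return sum(WEIGHTS[name] * min(max(value, 0.0), 1.0)
               for name, value in signals.items())

def triage(score):
    """Map a confidence score to the routing described in the text."""
    if score > 0.7:
        return "confirmed-botnet alert"
    if score >= 0.5:
        return "investigation queue"
    return "no alert"

# Hypothetical candidate cluster.
example_score = botnet_confidence(
    jaccard_avg=0.8, cosine_avg=0.9, temporal_xcorr_avg=0.85,
    cluster_silhouette=0.75, shared_dst_count=6,
)
example_route = triage(example_score)
```

With these inputs the weighted sum is 0.24 + 0.18 + 0.255 + 0.075 + 0.06 = 0.81, which routes to the confirmed-botnet alert.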

Putting It Into Production

  1. VPC Flow Logs → S3 (Parquet partitioned by day).
  2. EventBridge → daily Athena query at 03:30 local produces the hourly host vectors and the shared-destination pairs.
  3. SageMaker Processing job loads the vectors and runs scikit-learn KMeans (with silhouette sweep) and SciPy hierarchical clustering. The temporal cross-correlation runs in NumPy.
  4. Candidate clusters → SNS / Kinesis → SIEM, with the destination set, member hosts, and confidence score attached.
  5. Weekly re-baseline of the population-level normalisation parameters.

For 5,000+ host environments, the SageMaker job needs roughly 4 GB of RAM and 10–20 minutes of wall time. Costs are negligible on spot pricing.

Detection Coverage Matrix

| Botnet pattern | Signature | Detection |
| --- | --- | --- |
| IoT botnet (Mirai-family) | High flow count to non-RFC1918 destinations, shared C2 IPs | Strong: large clusters, high Jaccard, easy |
| Spam botnet (Emotet-style) | SMTP / web requests to shared list-host infrastructure | Strong |
| Residential proxy malware | Outbound connections to shared proxy gateways | Strong: proxy gateways aggregate flows clearly |
| Pre-DDoS staging | Synchronised DNS lookups to target FQDN | Trivial: DNS burst correlation fires |
| Sophisticated APT botnet (3–5 hosts) | Low-volume coordinated behaviour | Partial: needs Jaccard threshold relaxed; high analyst overhead |
| Living-off-the-land coordination | Distributed scripted activity through legitimate tooling | Uncovered: see post #5 (Markov kill chains) |
| Single bot in isolation | Behaviour matches only itself | Uncovered: needs per-host detection (posts #1, #3) |

Limits and False-Positive Sources

  • Identical workload classes — Kubernetes pods running the same image will produce nearly identical behavioural fingerprints. Cluster by host role or pod label before alerting.
  • Backup / monitoring fleets — every node in a monitoring cluster hits the same endpoints simultaneously. Tag at source.
  • CI / CD runners — GitHub Actions runners, Jenkins agents, your source-control platform runners all show shared destinations during a build. Allow-list by source.
  • SaaS app clusters behind a shared egress NAT — multiple internal hosts appearing as the same source IP to the SaaS provider, with shared destinations on the way out. Source-IP transparency at the NAT layer fixes this.
  • Container start-up surges — autoscaling events create temporary clusters of nearly identical hosts. Time-bound suppression for the first 10 minutes of any new container.

MITRE ATT&CK Techniques Covered by This Detection

This pipeline targets the Resource Development (TA0042) and Impact (TA0040) tactics on the offensive side, with cross-references into Command and Control (TA0011). Botnet detection is a multi-host hunt — single-host telemetry will never surface it. The table below maps techniques to coverage so you can plan complementary hunts for the gaps.

| ATT&CK ID | Technique / sub-technique | Coverage | Hunter notes |
| --- | --- | --- | --- |
| T1583.005 | Acquire Infrastructure: Botnet (operator side) | Out of scope | Adversary-side technique; surfaces as shared C2 in your environment |
| T1584.005 | Compromise Infrastructure: Botnet (your hosts become bots) | Full | This is the actual target of the detection: internal bots |
| T1498 | Network Denial of Service (parent) | Full | Pre-staging behaviour visible in clustering |
| T1498.001 | Direct Network Flood | Full | Synchronised SYN / UDP / ICMP bursts cluster trivially |
| T1498.002 | Reflection Amplification | Partial | Outbound spoof-source not in VPC Flow Logs; pair with provider-side telemetry |
| T1499 | Endpoint Denial of Service (parent) | Partial | |
| T1499.001 | OS Exhaustion Flood | Partial | |
| T1499.002 | Service Exhaustion Flood | Partial | |
| T1499.004 | Application or System Exploitation | Partial | |
| T1071.001 | App Layer Protocol: Web Protocols (shared C2) | Full | Jaccard overlap on web destinations is the canonical signal |
| T1071.004 | App Layer Protocol: DNS (pre-DDoS staging) | Full | DNS-burst correlation feature is purpose-built |
| T1090 | Proxy (parent) | Full | Residential proxy gateways aggregate clearly |
| T1090.001 | Internal Proxy | Partial | |
| T1090.002 | External Proxy | Full | |
| T1090.003 | Multi-hop Proxy | Full | |
| T1090.004 | Domain Fronting | Partial | Domain-level signal needs DNS+TLS metadata |
| T1095 | Non-Application Layer Protocol | Partial | Raw-socket bot coordination; covered for ICMP and UDP variants |
| T1568 | Dynamic Resolution (parent) | Partial | |
| T1568.001 | Fast Flux DNS | Partial | Shared rapidly-rotating destinations surface in Jaccard |
| T1505 | Server Software Component (webshell coordination) | Out of scope | Host-side; pair with file-integrity monitoring |

Adversary emulation / purple-team validation. Botnet emulation is harder than single-host emulation because it inherently needs multiple lab hosts. Two practical options: (1) Spin up 5–10 EC2 instances in a sandbox VPC and run a synchronised Python script that polls a common C2 endpoint every 30 seconds with jitter; the cosine similarity and Jaccard scores should peak within one hour. (2) Use an open-source adversary-emulation framework’s “command-and-control” adversary with multiple agents to emulate the synchronised callback pattern. For DDoS staging specifically, the public adversary-emulation atomics for T1498 provide safe building blocks.

Sigma / detection-as-code. The cluster-level alert is unusual — most Sigma rules are per-host. Your SIEM needs to support multi-event correlation (your SIEM’s tstats, Elastic’s threshold rule, Sentinel’s NRT detection). Once the pipeline emits a botnet_cluster_id field tagging member hosts, downstream correlation rules become straightforward.

D3FEND mappings. This pipeline implements D3-NTCD (Network Traffic Community Deviation) across hosts rather than across flows — same defensive technique, different unit of analysis. Pair with D3-HSDN (Host Shutdown) for active response when bot containment is needed.

Where This Sits in a Mature Threat Hunting Programme

Closing Thoughts

Botnet detection by cross-host similarity catches what every per-host rule misses, and the maths is mature enough that the entire pipeline can be built on free libraries. The hardest part is the false-positive curation in the first month; after that, the pipeline runs essentially unattended.

If you find a fresh botnet cluster in your environment using the techniques in this post, share the IOCs with the community — they help everyone. Happy threat hunting.

Core Working Areas :- Threat Intelligence, Digital Forensics, Incident Response, Fraud Investigation, Web Application Security Technical Certifications :- Computer Hacking Forensics Investigator | Certified Ethical Hacker | Certified Cyber crime investigator | Certified Professional Hacker | Certified Professional Forensics Analyst | Redhat certified Engineer | Cisco Certified Network Associates | Certified Firewall Solutions | Certified Network Monitoring Solution | Certified Proxy Solutions
