CLOUD THREAT HUNTING / LOG STRATEGY

What Cloud Logs You Actually Need for Threat Hunting (And Why Most Teams Fail)

Q: What is the minimum viable log set for cloud threat hunting?

Management-plane audit logs (CloudTrail / Azure Activity / GCP Audit), identity sign-in logs (IAM, Entra ID, Cloud Identity), and DNS query logs. These three sources cover roughly 70-80 percent of MITRE Cloud Matrix techniques at moderate cost. VPC Flow Logs and storage data events are second-tier additions justified by specific TTPs.

Q: How long should cloud logs be retained for hunting?

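Plan against a dwell time of 84-110 days. Published medians have dropped well below that in recent years, but retention has to cover the long tail of multi-month intrusions, not the median. Hot retention of 90 days for management and identity logs, plus cold (queryable) retention of 365 days, is the practical floor. Less than that, and you cannot reconstruct multi-month persistence campaigns.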

Cloud telemetry is not a checkbox. It is a portfolio decision with cost, coverage, and retention as the three axes. Get the portfolio wrong and your hunts will produce empty result sets against real intrusions. This is the operator’s guide to the log set that actually works.

Operator note. The dollar figures in this article are order-of-magnitude estimates aggregated from published cloud provider pricing pages and observed customer ingest patterns; treat them as planning guidance, not invoices. The coverage percentages are derived from mapping logged event types to MITRE ATT&CK Cloud Matrix techniques. For your own tenant, model both before committing to a retention contract. Run a 30-day measured ingest pilot through the HackForLab platform to get your actual numbers.

Why Most Cloud Threat Hunting Programs Fail

The pattern is so consistent we have a shorthand for it internally: logs-rich, hunt-poor. The customer is paying seven figures a year for log ingest. Their SIEM contains terabytes a day. And yet, when an incident hits, the hunter cannot answer the questions that matter: which identity assumed the role at 02:14 UTC, what role chain did they walk, which object did they read.

Three failure modes recur:

Wrong logs. Teams ingest VPC Flow Logs (cheap-per-flow, expensive-per-TB) and skip identity sign-in events (the actual epicentre of cloud attack chains). The result is network-heavy, identity-blind telemetry.
Right logs, wrong fields. The team ingests CloudTrail but discards requestParameters, userAgent, or sessionContext to save space. The hunt query that needed those fields returns nothing — not because the activity did not happen, but because the evidence was thrown away upstream.
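Right logs, wrong window. Mandiant’s M-Trends reports from the late 2010s put global median dwell time at roughly 100 days; recent medians are far shorter, but multi-month intrusions still make up the long tail, which is why this article plans against 84-110 days. If your hot retention is 30 days and your cold retention is searchable only via an offline restore, your hunt cannot span the actual lifespan of such an intrusion.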

Each of those failure modes costs nothing to fix at architecture time and costs a forensic engagement to discover after the fact.
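The "right logs, wrong fields" failure can be checked before an incident rather than after. The sketch below is a minimal illustration, not a platform feature: it takes a sample of ingested CloudTrail records (as parsed JSON) and counts how many lack the hunt-critical fields named in this article. The field paths follow the CloudTrail record layout; the helper names `has_path` and `missing_field_counts` and the sample records are hypothetical. Note that `sessionContext` only appears for temporary credentials, so a non-zero count there is a prompt to investigate, not proof of truncation.

```python
# Hunt-critical CloudTrail fields named in the article, as dotted paths into
# the event record. Helper names and sample data are hypothetical.
REQUIRED_FIELDS = (
    "requestParameters",
    "userAgent",
    "sourceIPAddress",
    "userIdentity.sessionContext",
)


def has_path(event, dotted_path):
    """Return True if every key along dotted_path exists in the nested dict."""
    node = event
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


def missing_field_counts(events, required=REQUIRED_FIELDS):
    """Count, per required field, how many sampled events lack the key."""
    counts = {path: 0 for path in required}
    for event in events:
        for path in required:
            if not has_path(event, path):
                counts[path] += 1
    return counts


sample = [
    {   # full-fidelity record
        "eventName": "AssumeRole",
        "sourceIPAddress": "198.51.100.7",
        "userAgent": "aws-cli/2.x",
        "requestParameters": {"roleArn": "arn:aws:iam::111122223333:role/demo"},
        "userIdentity": {"type": "AssumedRole", "sessionContext": {}},
    },
    {   # record after an upstream pipeline dropped fields to save space
        "eventName": "AssumeRole",
        "sourceIPAddress": "198.51.100.7",
        "userIdentity": {"type": "AssumedRole"},
    },
]

print(missing_field_counts(sample))
```

Run against a few thousand records pulled from the SIEM rather than from the source bucket, this kind of check shows what the pipeline actually preserved, which is what the hunt query will see.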

See your own log coverage gaps

The platform maps your ingested log set against the MITRE Cloud Matrix and shows you which techniques you cannot detect today.

Launch coverage scan →

The Log Dependency Map: Technique → Log → Cost

Every detection or hunt query has a log dependency. If the log is not ingested with the right fields and retention, the detection does not exist — it is a stub. The map below shows the relationship between a sample of cloud-relevant ATT&CK techniques and the log source required to hunt them.

Figure 1: Log dependency map, from ATT&CK technique to required cloud log source to hunt viability. Hunt viability collapses when any link in the chain is missing.

Three observations matter for budgeting. First, identity logs (sign-in, IAM) and the management-plane audit log (CloudTrail / Activity Log / GCP Audit) carry the heaviest load — they are required for most persistence, privilege escalation, and lateral-movement techniques. Second, VPC Flow Logs cover surprisingly few cloud-native techniques on their own. They are essential for east-west crypto-mining or beacon detection, but for cloud-control-plane attacks they are largely orthogonal. Third, storage data events (S3, Blob, GCS) are the most expensive single source — enabling them globally is almost never the right call. Enable per-bucket, scoped to crown-jewel data.

AWS / Azure / GCP Coverage Matrix

The cross-cloud coverage matrix below maps a selection of MITRE ATT&CK Cloud Matrix techniques to the equivalent log sources in AWS, Azure, and GCP. Cell values indicate practical coverage: FULL means the log routinely contains the field needed to detect the technique; PART means coverage requires combining multiple log streams; NONE means the technique cannot be detected from this source.

Technique (ID)	AWS	Azure	GCP
T1078.004 — Cloud Accounts (valid creds)	CloudTrail + IAM — FULL	Sign-In Logs + AAD Audit — FULL	Cloud Audit (Admin Activity) — FULL
T1098.001 — Additional Cloud Credentials	CloudTrail (CreateAccessKey) — FULL	AAD Audit (Application credential) — FULL	Cloud Audit (CreateServiceAccountKey) — FULL
T1098.003 — Additional Cloud Roles	CloudTrail (AttachRolePolicy) — FULL	AAD Audit (Add member to role) — FULL	Cloud Audit (SetIamPolicy) — FULL
T1199 — Trusted Relationship (cross-account)	CloudTrail (AssumeRole, sts:ExternalId) — FULL	Activity Log (delegated permissions) — PART	Cloud Audit (Workload Identity Federation) — PART
T1530 — Data from Cloud Storage Object	S3 Data Events (paid) — PART	Storage Diagnostics Logs — PART	Cloud Audit (Data Access, opt-in) — PART
T1537 — Transfer Data to Cloud Account	CloudTrail + VPC Flow — PART	NSG Flow + Storage Diag — PART	VPC Flow + Cloud Audit — PART
T1580 — Cloud Infrastructure Discovery	CloudTrail Read Events — FULL	Activity Log (read ops) — FULL	Cloud Audit (Data Access) — PART
T1538 — Cloud Service Dashboard	CloudTrail Console events — FULL	Sign-In + Activity Log — FULL	Cloud Audit + Sign-In — FULL
T1578 — Modify Cloud Compute Infrastructure	CloudTrail (RunInstances, ModifyImageAttribute) — FULL	Activity Log (VM Write) — FULL	Cloud Audit (compute.instances) — FULL
T1496 — Resource Hijacking (crypto-mining)	VPC Flow + GuardDuty — PART	NSG Flow + Defender — PART	VPC Flow + SCC — PART
T1525 — Implant Internal Image	CloudTrail (PutImage, ECR) — FULL	ACR audit + Activity Log — PART	Artifact Registry audit — FULL
T1190 — Exploit Public-Facing Application	WAF + ALB access + CloudTrail — PART	App Gateway WAF + Front Door — PART	Cloud Armor + LB logs — PART
T1556.007 — Hybrid Identity (MFA bypass)	SSO + CloudTrail — PART	AAD Sign-In + Conditional Access — FULL	Cloud Identity + Audit — PART
T1606.002 — SAML Token Forgery (Golden SAML)	CloudTrail + IdP logs — PART	AAD Sign-In (token issuer) — FULL	Workspace + Audit — PART
T1535 — Unused / Unsupported Cloud Regions	CloudTrail (region field) — FULL	Activity Log (location) — FULL	Cloud Audit (location) — FULL
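The matrix can be turned into a small dependency check. The sketch below is one possible shape, assuming a handful of rows transcribed from the AWS column above; the source labels (`cloudtrail`, `s3_data_events`, and so on) and function names are hypothetical, and the check covers only whether a source is ingested, not whether its fields and retention are adequate.

```python
# Required AWS log sources per technique, transcribed from a few rows of the
# AWS column of the matrix. The source labels are hypothetical shorthand.
AWS_DEPENDENCIES = {
    "T1098.001": {"cloudtrail"},
    "T1098.003": {"cloudtrail"},
    "T1530": {"s3_data_events"},
    "T1537": {"cloudtrail", "vpc_flow"},
    "T1496": {"vpc_flow", "guardduty"},
    "T1535": {"cloudtrail"},
}


def hunt_viability(ingested, dependencies=AWS_DEPENDENCIES):
    """Map each technique to the log sources it still lacks (empty = huntable)."""
    return {tech: sorted(needed - ingested) for tech, needed in dependencies.items()}


def coverage_ratio(ingested, dependencies=AWS_DEPENDENCIES):
    """Fraction of listed techniques whose every required source is ingested."""
    gaps = hunt_viability(ingested, dependencies)
    return sum(1 for missing in gaps.values() if not missing) / len(gaps)


ingested_today = {"cloudtrail", "vpc_flow"}
for technique, missing in hunt_viability(ingested_today).items():
    status = "huntable" if not missing else "stub, missing " + ", ".join(missing)
    print(f"{technique}: {status}")
print(f"coverage: {coverage_ratio(ingested_today):.0%}")
```

Extending the dictionary to the full matrix and adding per-source field and retention requirements would bring the check closer to the "stub" definition used earlier in the article.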

Downloadable asset — CSV

Full Cloud Log Coverage Matrix

The expanded matrix includes all 60+ MITRE Cloud Matrix sub-techniques, the specific event name per cloud, and the field-level dependency — the version we walk customers through during platform onboarding.

What Happens When the Logs Are Missing

The abstract argument for full log coverage is unconvincing until you walk through what blind spots actually let through. Three concrete attack scenarios — each based on patterns observed across recent cloud incident response engagements:

Scenario 01 — CloudTrail data events disabled

The silent S3 exfiltration

An adversary obtains long-lived access keys from a compromised developer laptop. They use the keys to enumerate S3 buckets via ListBuckets (visible in CloudTrail management events) and then read 240GB of intellectual property via GetObject across 18,000 objects. Because S3 data events are not enabled on the bucket (default state, and the per-request pricing is the deterrent), none of the 18,000 reads appear in the SIEM. The ListBuckets call gets flagged by a generic recon rule but is dismissed by triage as routine developer activity. The exfiltration is detected six months later via a third-party data leak notification.

Impact — 240GB IP exfiltration · 6-month detection delay · root-cause is one disabled log toggle

Scenario 02 — Sign-in logs not aggregated to SIEM

Token replay across tenants

An adversary phishes a contractor’s session cookie for a corporate Microsoft 365 tenant and replays the token from a residential proxy IP across 14 different cloud applications. Each individual sign-in falls below the Conditional Access risk threshold because the captured session already satisfies MFA. Azure AD Sign-In Logs would show the impossible-velocity pattern (one country at 14:02, another at 14:09), but the customer routes only “interactive” sign-ins to the SIEM, and the captured-session reuse is categorised as “non-interactive.” The token stays valid for 11 days before re-authentication is forced. The blind spot was a single sign-in event type filter.

Impact — 11-day persistent session · 14 SaaS applications reached · filter on log ingestion ruleset
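The effect of that ingestion filter can be shown with a minimal impossible-velocity check. This is a sketch under stated assumptions, not the detection any vendor ships: the record layout (`user`, `time`, `country`, `interactive`), the 30-minute threshold, the countries, and the date are all hypothetical; only the 14:02 and 14:09 timestamps come from the scenario.

```python
from datetime import datetime, timedelta


def impossible_velocity(sign_ins, max_gap=timedelta(minutes=30), include_non_interactive=True):
    """Flag consecutive sign-ins by one user from different countries within max_gap."""
    if not include_non_interactive:
        # Mimics an ingest rule that forwards only interactive sign-ins.
        sign_ins = [s for s in sign_ins if s["interactive"]]
    hits = []
    last_seen = {}  # user -> previous sign-in record
    for record in sorted(sign_ins, key=lambda s: s["time"]):
        previous = last_seen.get(record["user"])
        if (
            previous is not None
            and previous["country"] != record["country"]
            and record["time"] - previous["time"] <= max_gap
        ):
            hits.append((record["user"], previous["country"], record["country"]))
        last_seen[record["user"]] = record
    return hits


events = [
    {"user": "contractor", "time": datetime(2024, 1, 9, 14, 2), "country": "DE", "interactive": True},
    {"user": "contractor", "time": datetime(2024, 1, 9, 14, 9), "country": "BR", "interactive": False},
]

print(impossible_velocity(events))                                 # both streams ingested
print(impossible_velocity(events, include_non_interactive=False))  # interactive-only filter
```

With both streams present the pair is flagged; with the interactive-only filter the second sign-in never reaches the check and the result is empty, which is the blind spot the scenario describes.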

Scenario 03 — VPC Flow Logs sampled at 1:1000

The east-west crypto-miner that never showed up

An attacker exploits a public-facing application (T1190), pivots to an internal Kubernetes node, and deploys a Monero miner that connects east-west to a coordinator pod and outbound to a mining pool over port 443 with TLS. VPC Flow Logs are enabled but configured at 1:1000 sampling to control cost. The miner generates roughly 600 connection records per hour to the pool, which at 1:1000 leaves an expected 0.6 sampled records per hour, far too few to trip a volumetric alert. Bills go up 11% over six weeks before finance escalates.

Impact — 11% compute bill inflation · 6-week dwell · sampling rate chosen for cost, not detection
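The sampling arithmetic behind this scenario is worth making explicit. The sketch below assumes each record is sampled independently at the stated rate, which is a simplification of how any real sampler behaves; the function names are hypothetical, and the 600 records per hour and 1:1000 rate come from the scenario.

```python
def expected_sampled(records_per_hour, sampling_rate):
    """Expected number of records that survive sampling in one hour."""
    return records_per_hour * sampling_rate


def probability_unseen(records_per_hour, sampling_rate, hours=1):
    """Probability that no record at all is captured over the given hours,
    assuming each record is sampled independently."""
    return (1.0 - sampling_rate) ** (records_per_hour * hours)


RATE = 1 / 1000          # 1:1000 sampling from the scenario
MINER_RECORDS = 600      # connection records per hour from the scenario

print(f"expected sampled records per hour: {expected_sampled(MINER_RECORDS, RATE):.1f}")
print(f"chance a given hour shows nothing: {probability_unseen(MINER_RECORDS, RATE):.0%}")
```

Under these assumptions more than half of all hours contain no miner record at all, so any rule that needs a sustained count per hour has nothing to count.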

Cost vs Security Trade-off: Order of Magnitude

The argument “we cannot afford comprehensive logs” almost always comes from a misallocation rather than a true budget ceiling. The scatter plot below shows the order-of-magnitude cost per terabyte ingested for the major cloud log sources, alongside the coverage they deliver against the MITRE Cloud Matrix.

Figure 2: Cost vs coverage scatter plot for cloud log sources. The efficient frontier favours identity + management-plane audit over network telemetry.

Order-of-magnitude pricing (USD, planning estimates):

CloudTrail management events: Approximately $80 per TB stored, free for the first copy to S3. The cheapest high-coverage source in cloud.
Identity sign-in logs: Approximately $220 per TB for AAD diagnostic export or AWS SSO via CloudTrail. Highest coverage-per-dollar in the portfolio.
VPC Flow Logs: Approximately $3,100 per TB when shipped to a SIEM at full fidelity, depending on per-GB ingest rates of the destination platform. Coverage is narrow.
CloudTrail data events (S3 / Lambda): Approximately $2,800 per TB and they scale linearly with object operations. Selective enablement only.
Full packet mirroring (VPC Traffic Mirroring / Azure vTAP): Approximately $18,000 per TB at hyperscale. Justified only for high-value workloads or regulatory mandates.

The implication is direct. If you have $500,000 a year for cloud telemetry, you do not buy roughly 161 TB of VPC Flow Logs; you buy roughly 6.25 PB of CloudTrail or 2.27 PB of identity logs. Network telemetry is a tactical addition for specific TTPs, not a baseline.
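The budget comparison is plain division, and it is easy to rerun with your own rates. The sketch below uses the planning estimates listed above at a flat per-TB rate; the dictionary keys and the `affordable_tb` helper are hypothetical names, and real pricing is tiered, so treat the output as the same order-of-magnitude guidance as the inputs.

```python
# Per-TB planning estimates (USD) from the list above.
COST_PER_TB_USD = {
    "cloudtrail_management": 80,
    "identity_sign_in": 220,
    "vpc_flow_logs": 3_100,
    "cloudtrail_data_events": 2_800,
    "packet_mirroring": 18_000,
}


def affordable_tb(budget_usd, cost_per_tb):
    """Terabytes a budget buys at a flat per-TB planning rate."""
    return budget_usd / cost_per_tb


BUDGET = 500_000  # annual telemetry budget used in the article's example
for source, rate in COST_PER_TB_USD.items():
    print(f"{source:<24} {affordable_tb(BUDGET, rate):>10,.0f} TB")
```

Swapping in measured per-TB rates from a 30-day ingest pilot turns the same loop into a first-pass allocation model.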

Stop paying for logs you cannot hunt

The platform’s coverage analyzer models your current ingest spend against detection coverage and recommends which logs to add, which to trim, and which to push to cold storage.

The Minimum Viable Log Set for Cloud Hunting

If you start a cloud hunting program from zero, this is the order in which to onboard log sources. Each tier delivers coverage that the next tier compounds.

Tier 1 — The non-negotiables (week one)

Management-plane audit: CloudTrail (all regions, including the ones you do not use), Azure Activity Log (subscription level), GCP Cloud Audit Logs (Admin Activity, all projects). Retain hot for 90 days, cold-searchable for 12 months.
Identity sign-in: AWS Identity Center sign-in events via CloudTrail, Azure AD / Entra ID Sign-In Logs (both interactive and non-interactive), GCP Cloud Identity audit. Pair with the identity provider in front (Okta, Ping, Auth0) — same retention.
DNS query logs: Route 53 Resolver query logs, Azure DNS analytics, GCP Cloud DNS logging. Cheap, ubiquitously useful for C2 detection.

Tier 2 — The targeted additions (month one)

CloudTrail data events on crown-jewel S3 buckets and Lambda functions only. Tag the buckets, route only tagged events to ingest.
WAF and load-balancer access logs for any internet-facing application. Tier-2 retention (30 days hot, 90 cold).
Kubernetes audit logs at the API-server level if you run EKS / AKS / GKE.

Tier 3 — The justified extensions (quarter one)

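VPC Flow Logs with custom fields enabled (TCP flags, packet size, pkt-src-aws-service) for the subnets running production workloads. Sample only on pre-prod; AWS VPC Flow Logs have no native sampling option, so on AWS the sampling has to happen in the ingest pipeline.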
GuardDuty / Defender / SCC findings as enrichment, not as primary detection logic.
Container runtime logs (Falco, Sysdig) for high-trust workloads.
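The retention windows in the tiers above determine how far back a hunt can reach and at what query cost. The sketch below encodes only the windows the text states and reports which storage tier still holds events of a given age. It assumes the cold figure is total event age rather than time after leaving hot storage, and the source labels and function name are hypothetical.

```python
# Retention windows in days, taken from the tiers above where the text states
# them; sources without a stated window are left out rather than guessed.
RETENTION_DAYS = {
    "management_audit": {"hot": 90, "cold": 365},
    "identity_sign_in": {"hot": 90, "cold": 365},
    "waf_lb_access": {"hot": 30, "cold": 90},
}


def storage_tier_for(source, days_back, retention=RETENTION_DAYS):
    """Return which tier still holds events from days_back days ago."""
    window = retention[source]
    if days_back <= window["hot"]:
        return "hot"
    if days_back <= window["cold"]:
        return "cold"
    return "expired"


# A hunt that has to reach back 110 days, the upper end of the dwell-time range.
for source in RETENTION_DAYS:
    print(source, storage_tier_for(source, days_back=110))
```

For a 110-day look-back, the Tier 1 sources are still reachable from cold storage while the Tier 2 access logs have already aged out, which is the trade-off the tiering accepts.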

Example: East-West Hunt via Athena on VPC Flow Logs

Once VPC Flow Logs are in place (whether shipped to a SIEM or, more economically, kept in S3 and queried via Athena), the workhorse hunt is east-west anomaly detection. The query below identifies instances initiating connections to internal IPs they have never connected to before within a 14-day baseline — a high-signal indicator of lateral movement or compromised-instance reconnaissance.

Athena SQL — East-West Anomaly Hunt (VPC Flow Logs)

-- Hypothesis: lateral movement from a compromised EC2 instance manifests
-- as connections to internal (RFC 1918) destinations the source has never reached.
-- Baseline window: 14 days. Detection window: most recent 6 hours.

WITH baseline AS (
  SELECT
    srcaddr,
    dstaddr
  FROM   vpc_flow_logs
  WHERE  start_time BETWEEN DATE_ADD('day', -14, CURRENT_TIMESTAMP)
                        AND DATE_ADD('hour',  -6, CURRENT_TIMESTAMP)
    AND  action  = 'ACCEPT'
    AND  regexp_like(dstaddr, '^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)')
  GROUP  BY 1, 2
),
recent AS (
  SELECT
    srcaddr,
    dstaddr,
    dstport,
    SUM(bytes)     AS bytes_out,
    SUM(packets)   AS packet_count,
    COUNT(*)       AS flow_count,
    MIN(start_time) AS first_seen,
    MAX(end_time)   AS last_seen
  FROM   vpc_flow_logs
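  WHERE  start_time >= DATE_ADD('hour', -6, CURRENT_TIMESTAMP)
    AND  action  = 'ACCEPT'
    -- east-west only: both source and destination must be private addresses
    AND  regexp_like(srcaddr, '^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)')
    AND  regexp_like(dstaddr, '^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)')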
  GROUP  BY 1, 2, 3
)
SELECT
  r.srcaddr           AS source_ip,
  r.dstaddr           AS new_internal_dest,
  r.dstport,
  r.flow_count,
  r.bytes_out,
  r.first_seen,
  r.last_seen,
  -- enrichment: number of novel (destination, port) rows from the same source
  COUNT(*) OVER (PARTITION BY r.srcaddr) AS fanout_score
FROM   recent r
LEFT   JOIN baseline b
  ON   r.srcaddr = b.srcaddr AND r.dstaddr = b.dstaddr
WHERE  b.srcaddr IS NULL                -- destination never seen in baseline
  AND  r.dstport NOT IN (53, 123)       -- exclude DNS, NTP noise
  AND  r.flow_count >= 3                -- ignore one-shot probes
ORDER  BY fanout_score DESC, r.bytes_out DESC
LIMIT  500;

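The query has three properties that matter. It is cost-bounded only if the table is partitioned by date and the time filter also constrains the partition column, so that Athena scans just the 14-day window; a predicate on a non-partition column such as start_time does not prune partitions by itself. It is precision-tuned by excluding DNS/NTP and requiring at least three flows. And it includes a fanout score that surfaces compromised hosts touching many novel internal IPs, the canonical lateral-movement footprint.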
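Before running the hunt against production data, it can help to exercise the same logic on a handful of synthetic flows. The sketch below is a small in-memory model of the query's anti-join, noise filters, and fanout score, written to make the behaviour easy to unit test; it is not a replacement for the Athena query, it omits the private-address filter, and the function name and sample addresses are hypothetical.

```python
from collections import Counter


def novel_destinations(baseline_flows, recent_flows, min_flows=3, noisy_ports=(53, 123)):
    """Recent (src, dst, port) groups whose (src, dst) pair is absent from the baseline."""
    known_pairs = {(src, dst) for src, dst, _port in baseline_flows}
    flow_counts = Counter(recent_flows)  # one tuple per flow record
    rows = [
        {"src": src, "dst": dst, "port": port, "flow_count": count}
        for (src, dst, port), count in flow_counts.items()
        if (src, dst) not in known_pairs
        and port not in noisy_ports
        and count >= min_flows
    ]
    # Rows per source after filtering, like COUNT(*) OVER (PARTITION BY srcaddr).
    fanout = Counter(row["src"] for row in rows)
    for row in rows:
        row["fanout_score"] = fanout[row["src"]]
    return sorted(rows, key=lambda r: -r["fanout_score"])


baseline = [("10.0.1.5", "10.0.2.9", 443)]
recent = (
    [("10.0.1.5", "10.0.2.9", 443)] * 5      # known pair, ignored
    + [("10.0.1.5", "10.0.3.20", 22)] * 4    # novel, kept
    + [("10.0.1.5", "10.0.3.21", 445)] * 3   # novel, kept
    + [("10.0.1.5", "10.0.3.22", 53)] * 9    # DNS port, excluded
    + [("10.0.4.8", "10.0.3.30", 3389)]      # single flow, below threshold
)

for row in novel_destinations(baseline, recent):
    print(row)
```

Because the fanout score is computed after the filters, a source that only talks to new destinations over DNS or NTP, or only once, never contributes to it.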

This article is part of the Cloud Threat Hunting series. The log set described here is the input to the hunting workflow; the techniques you hunt for come from threat intelligence; the detection logic you author from those hunts becomes part of your detection engineering library; and the coverage you achieve across the MITRE Cloud Matrix is tracked in your MITRE coverage dashboard. The platform architecture page describes how those pipelines connect end-to-end.

FAQ

What is the minimum viable log set for cloud threat hunting?

Management-plane audit logs (CloudTrail, Azure Activity, GCP Audit), identity sign-in logs (IAM, Entra ID, Cloud Identity), and DNS query logs. These three sources cover roughly 70-80% of MITRE Cloud Matrix techniques at moderate cost. VPC Flow Logs and storage data events are second-tier additions justified by specific TTPs you are hunting.

Why do most cloud threat hunting programs fail?

Three reasons: they ingest the wrong logs (network-heavy, identity-light), they ingest the right logs but discard the wrong fields (no userAgent, no sourceIPAddress, truncated requestParameters), and they retain the right logs for the wrong window (30 days when adversary dwell time is 90+ days).

Are VPC Flow Logs worth the cost?

Only if you have a TTP-driven reason: lateral movement hunts, crypto-mining detection, data egress analysis. For most cloud threat hunting use cases, identity logs and CloudTrail deliver higher detection value per dollar. Flow logs are a targeted addition, not a baseline. If you do enable them, use custom field formats to keep TCP flags and AWS-service tags.

How long should cloud logs be retained for hunting?

Plan against a dwell time of 84-110 days. Recent incident response reporting puts the median well below that, but retention has to cover the long tail of multi-month intrusions, not the median. Hot retention of 90 days for management and identity logs, plus cold (queryable) retention of 365 days, is the practical floor. Less than that, and you cannot reconstruct multi-month persistence campaigns. The cost of cold storage is roughly 1/20th of hot, so the budget penalty is small.

What is the difference between CloudTrail management events and data events?

Management events record API calls that configure resources (CreateUser, AssumeRole, PutBucketPolicy). Data events record operations on the resources themselves (GetObject, PutObject, Invoke). Management events are baseline and effectively free for the first trail; data events are high-volume, costly, and should be selectively enabled per sensitive resource (crown-jewel S3 buckets, security-critical Lambdas).

Can SaaS detection products replace direct log access?

No. Vendor detections operate on a subset of fields and reasoning paths chosen by the vendor. For threat hunting — the open-ended, hypothesis-driven workflow — you need raw or near-raw access to the same telemetry. SaaS detections are a complement to a hunting workflow, not a substitute. Treat them as one signal among many.

How do I justify the budget for these logs to finance?

Frame it as a coverage portfolio with a measurable output: percentage of MITRE Cloud Matrix techniques detectable from the current log set. Show the gap to a target (typically 75-80% coverage of priority techniques). Pair the budget ask with a retirement plan for redundant or low-value logs — finance accepts log spend more readily when it is paired with cuts elsewhere.

Operate the workflow

From the log set on this page to a hunting workflow on Monday

The platform automates the coverage analysis described above, ships the Athena queries pre-built against your CloudTrail bucket, and tracks detection precision per technique over time. It is the operating system for the hunt cycle — not another vendor detection feed.
