What Cloud Logs You Actually Need for Threat Hunting (And Why Most Teams Fail)
Cloud telemetry is not a checkbox. It is a portfolio decision with cost, coverage, and retention as the three axes. Get the portfolio wrong and your hunts will produce empty result sets against real intrusions. This is the operator’s guide to the log set that actually works.
Why Most Cloud Threat Hunting Programs Fail
The pattern is so consistent we have a shorthand for it internally: logs-rich, hunt-poor. The customer is paying seven figures a year for log ingest. Their SIEM contains terabytes a day. And yet, when an incident hits, the hunter cannot answer the questions that matter: which identity assumed the role at 02:14 UTC, what role chain did they walk, which object did they read.
Three failure modes recur:
- Wrong logs. Teams ingest VPC Flow Logs (cheap-per-flow, expensive-per-TB) and skip identity sign-in events (the actual epicentre of cloud attack chains). The result is network-heavy, identity-blind telemetry.
- Right logs, wrong fields. The team ingests CloudTrail but discards requestParameters, userAgent, or sessionContext to save space. The hunt query that needed those fields returns nothing, not because the activity did not happen, but because the evidence was thrown away upstream.
- Right logs, wrong window. Mandiant’s median dwell time has been 84-110 days for years. If your hot retention is 30 days and your cold retention is searchable only via an offline restore, your hunt cannot span the actual lifespan of the intrusion.
Each of those failure modes costs nothing to fix at architecture time and costs a forensic engagement to discover after the fact.
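The "right logs, wrong fields" failure can be checked on a small sample of events pulled from the SIEM. The sketch below is a minimal illustration, not a definitive audit: it assumes events are available as parsed CloudTrail JSON records, and the helper names (`missing_fields`, `audit_sample`) are hypothetical. Some fields are legitimately absent on individual events (sessionContext, for example, appears only for temporary credentials), so the signal to look for is a field that is missing from every sampled event.

```python
from typing import Any

# Fields the hunts in this article rely on. The dotted path reflects where
# sessionContext sits inside a CloudTrail record (nested under userIdentity).
REQUIRED_FIELDS = (
    "requestParameters",
    "userAgent",
    "sourceIPAddress",
    "userIdentity.sessionContext",
)


def missing_fields(event: dict[str, Any], required=REQUIRED_FIELDS) -> list[str]:
    """Return the required dotted paths whose keys are absent from one event."""
    missing = []
    for path in required:
        node: Any = event
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing


def audit_sample(events) -> dict[str, int]:
    """Count, per required field, how many sampled events lack it."""
    counts: dict[str, int] = {}
    for event in events:
        for path in missing_fields(event):
            counts[path] = counts.get(path, 0) + 1
    return counts
```

If a field's count equals the sample size, the pipeline is almost certainly dropping it upstream rather than the activity simply not carrying it.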
See your own log coverage gaps
The platform maps your ingested log set against the MITRE Cloud Matrix and shows you which techniques you cannot detect today.
The Log Dependency Map: Technique → Log → Cost
Every detection or hunt query has a log dependency. If the log is not ingested with the right fields and retention, the detection does not exist — it is a stub. The map below shows the relationship between a sample of cloud-relevant ATT&CK techniques and the log source required to hunt them.
Three observations matter for budgeting. First, identity logs (sign-in, IAM) and the management-plane audit log (CloudTrail / Activity Log / GCP Audit) carry the heaviest load — they are required for most persistence, privilege escalation, and lateral-movement techniques. Second, VPC Flow Logs cover surprisingly few cloud-native techniques on their own. They are essential for east-west crypto-mining or beacon detection, but for cloud-control-plane attacks they are largely orthogonal. Third, storage data events (S3, Blob, GCS) are the most expensive single source — enabling them globally is almost never the right call. Enable per-bucket, scoped to crown-jewel data.
AWS / Azure / GCP Coverage Matrix
The cross-cloud coverage matrix below maps a selection of MITRE ATT&CK Cloud Matrix techniques to the equivalent log sources in AWS, Azure, and GCP. Cell values indicate practical coverage: FULL means the log routinely contains the field needed to detect the technique; PART means coverage requires combining multiple log streams; NONE means the technique cannot be detected from this source.
| Technique (ID) | AWS | Azure | GCP |
|---|---|---|---|
| T1078.004 — Cloud Accounts (valid creds) | CloudTrail + IAM — FULL | Sign-In Logs + AAD Audit — FULL | Cloud Audit (Admin Activity) — FULL |
| T1098.001 — Additional Cloud Credentials | CloudTrail (CreateAccessKey) — FULL | AAD Audit (Application credential) — FULL | Cloud Audit (CreateServiceAccountKey) — FULL |
| T1098.003 — Additional Cloud Roles | CloudTrail (AttachRolePolicy) — FULL | AAD Audit (Add member to role) — FULL | Cloud Audit (SetIamPolicy) — FULL |
| T1199 — Trusted Relationship (cross-account) | CloudTrail (AssumeRole, sts:ExternalId) — FULL | Activity Log (delegated permissions) — PART | Cloud Audit (Workload Identity Federation) — PART |
| T1530 — Data from Cloud Storage Object | S3 Data Events (paid) — PART | Storage Diagnostics Logs — PART | Cloud Audit (Data Access, opt-in) — PART |
| T1537 — Transfer Data to Cloud Account | CloudTrail + VPC Flow — PART | NSG Flow + Storage Diag — PART | VPC Flow + Cloud Audit — PART |
| T1580 — Cloud Infrastructure Discovery | CloudTrail Read Events — FULL | Activity Log (read ops) — FULL | Cloud Audit (Data Access) — PART |
| T1538 — Cloud Service Dashboard | CloudTrail Console events — FULL | Sign-In + Activity Log — FULL | Cloud Audit + Sign-In — FULL |
| T1578 — Modify Cloud Compute Infrastructure | CloudTrail (RunInstances, ModifyImageAttribute) — FULL | Activity Log (VM Write) — FULL | Cloud Audit (compute.instances) — FULL |
| T1496 — Resource Hijacking (crypto-mining) | VPC Flow + GuardDuty — PART | NSG Flow + Defender — PART | VPC Flow + SCC — PART |
| T1525 — Implant Internal Image | CloudTrail (PutImage, ECR) — FULL | ACR audit + Activity Log — PART | Artifact Registry audit — FULL |
| T1190 — Exploit Public-Facing Application | WAF + ALB access + CloudTrail — PART | App Gateway WAF + Front Door — PART | Cloud Armor + LB logs — PART |
| T1556.007 — Hybrid Identity (MFA bypass) | SSO + CloudTrail — PART | AAD Sign-In + Conditional Access — FULL | Cloud Identity + Audit — PART |
| T1606.002 — SAML Token Forgery (Golden SAML) | CloudTrail + IdP logs — PART | AAD Sign-In (token issuer) — FULL | Workspace + Audit — PART |
| T1535 — Unused / Unsupported Cloud Regions | CloudTrail (region field) — FULL | Activity Log (location) — FULL | Cloud Audit (location) — FULL |
The expanded matrix includes all 60+ MITRE Cloud Matrix sub-techniques, the specific event name per cloud, and the field-level dependency — the version we walk customers through during platform onboarding.
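One way to turn the matrix into a gap report is to record, per technique, the set of log sources it depends on and compare that against what is actually ingested. The sketch below is a simplified illustration using a handful of rows from the AWS column; the source labels (`cloudtrail_mgmt`, `vpc_flow`, and so on) are hypothetical names, and the model assumes a technique is huntable only when every listed source is present.

```python
# Technique -> log sources required, simplified from the AWS column of the
# matrix. The source labels are hypothetical identifiers for this sketch.
REQUIRES = {
    "T1078.004": {"cloudtrail_mgmt"},
    "T1098.001": {"cloudtrail_mgmt"},
    "T1530": {"s3_data_events"},
    "T1537": {"cloudtrail_mgmt", "vpc_flow"},
    "T1496": {"vpc_flow", "guardduty"},
    "T1190": {"waf", "alb_access", "cloudtrail_mgmt"},
}


def coverage(ingested: set[str], requires=REQUIRES):
    """Return (covered techniques, {technique: missing sources})."""
    covered = sorted(t for t, need in requires.items() if need <= ingested)
    gaps = {
        t: sorted(need - ingested)
        for t, need in requires.items()
        if not need <= ingested
    }
    return covered, gaps


covered, gaps = coverage({"cloudtrail_mgmt", "vpc_flow"})
print("covered:", covered)
for technique, missing in gaps.items():
    print(f"{technique}: missing {', '.join(missing)}")
```

The same structure extends to field-level dependencies by replacing the source labels with (source, field) pairs.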
What Happens When the Logs Are Missing
The abstract argument for full log coverage is unconvincing until you walk through what blind spots actually let through. Three concrete attack scenarios — each based on patterns observed across recent cloud incident response engagements:
The silent S3 exfiltration
An adversary obtains long-lived access keys from a compromised developer laptop. They use the keys to enumerate S3 buckets via ListBuckets (visible in CloudTrail management events) and then read 240GB of intellectual property via GetObject across 18,000 objects. Because S3 data events are not enabled on the bucket (default state, and the per-request pricing is the deterrent), none of the 18,000 reads appear in the SIEM. The ListBuckets call gets flagged by a generic recon rule but is dismissed by triage as routine developer activity. The exfiltration is detected six months later via a third-party data leak notification.
Impact — 240GB IP exfiltration · 6-month detection delay · root-cause is one disabled log toggle
Token replay across tenants
Adversary phishes a contractor’s session cookie for a corporate Microsoft 365 tenant. They replay the token from a residential proxy IP across 14 different cloud applications. Each individual sign-in falls below the Conditional Access risk threshold because they pass MFA via the captured session. Azure AD Sign-In Logs would show the impossible-velocity pattern (one country at 14:02, another at 14:09) — but the customer routes only “interactive” sign-ins to the SIEM and the captured-session reuse is categorised as “non-interactive.” The token is active for 11 days before re-auth. The blind spot was a single sign-in event type filter.
Impact — 11-day persistent session · 14 SaaS applications reached · filter on log ingestion ruleset
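The effect of the ingestion filter in this scenario can be reproduced on toy data. The sketch below is an illustration under stated assumptions: the `SignIn` record, its field names, the country codes, and the 30-minute window are all hypothetical and do not mirror any vendor's sign-in schema. It flags consecutive sign-ins by the same user from different countries within the window, then shows that the same check finds nothing once non-interactive events are filtered out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SignIn:
    user: str
    when: datetime
    country: str
    interactive: bool


def country_jumps(events, window=timedelta(minutes=30)):
    """Yield (earlier, later) consecutive sign-ins for one user whose
    countries differ and whose timestamps are no more than `window` apart."""
    last_seen = {}
    for event in sorted(events, key=lambda e: e.when):
        prev = last_seen.get(event.user)
        if prev and prev.country != event.country and event.when - prev.when <= window:
            yield prev, event
        last_seen[event.user] = event


events = [
    SignIn("contractor", datetime(2024, 1, 1, 14, 2), "GB", True),
    SignIn("contractor", datetime(2024, 1, 1, 14, 9), "RO", False),  # replayed session
]

all_hits = list(country_jumps(events))
interactive_only = list(country_jumps(e for e in events if e.interactive))
print(len(all_hits), len(interactive_only))
```

A production version would compare geographic distance against elapsed time rather than country codes, but the dependency on the non-interactive stream is the same.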
The east-west crypto-miner that never showed up
An attacker exploits a public-facing application (T1190), pivots to an internal Kubernetes node, and deploys a Monero miner that connects east-west to a coordinator pod and outbound to a mining pool over port 443 with TLS. VPC Flow Logs are enabled but configured at 1:1000 sampling to control cost. The miner generates roughly 600 connection records per hour to the pool. At that sampling rate, fewer than one of those records per hour survives on average, which is far too few to trigger volumetric alerts. Bills go up 11% over six weeks before finance escalates.
Impact — 11% compute bill inflation · 6-week dwell · sampling rate chosen for cost, not detection
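The arithmetic behind this blind spot is worth making explicit. The sketch below is a back-of-the-envelope model, assuming each flow record is sampled independently with the same probability, which is a simplification of how any real sampler behaves; the function names are hypothetical.

```python
def expected_sampled(records_per_hour: float, sample_rate: float, hours: float = 1.0) -> float:
    """Expected number of records that survive sampling over `hours`."""
    return records_per_hour * sample_rate * hours


def p_at_least_one(records: int, sample_rate: float) -> float:
    """Probability that at least one of `records` survives sampling,
    assuming independent sampling per record."""
    return 1.0 - (1.0 - sample_rate) ** records


rate = 1 / 1000
print(f"expected per hour: {expected_sampled(600, rate):.2f}")
print(f"expected per day:  {expected_sampled(600, rate, 24):.1f}")
print(f"P(any record in a given hour): {p_at_least_one(600, rate):.0%}")
```

Under these assumptions the miner leaves well under one record per hour on average, and in many hours no record at all, so a threshold tuned for volumetric anomalies has nothing to count.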
Cost vs Security Trade-off: Order of Magnitude
The argument “we cannot afford comprehensive logs” almost always comes from a misallocation rather than a true budget ceiling. The scatter plot below shows the order-of-magnitude cost per terabyte ingested for the major cloud log sources, alongside the coverage they deliver against the MITRE Cloud Matrix.
Order-of-magnitude pricing (USD, planning estimates):
- CloudTrail management events: Approximately $80 per TB stored, free for the first copy to S3. The cheapest high-coverage source in cloud.
- Identity sign-in logs: Approximately $220 per TB for AAD diagnostic export or AWS SSO via CloudTrail. Highest coverage-per-dollar in the portfolio.
- VPC Flow Logs: Approximately $3,100 per TB when shipped to a SIEM at full fidelity, depending on per-GB ingest rates of the destination platform. Coverage is narrow.
- CloudTrail data events (S3 / Lambda): Approximately $2,800 per TB and they scale linearly with object operations. Selective enablement only.
- Full packet mirroring (VPC Traffic Mirroring / Azure vTAP): Approximately $18,000 per TB at hyperscale. Justified only for high-value workloads or regulatory mandates.
The implication is direct. If you have $500,000 a year for cloud telemetry, you do not buy roughly 161 TB of VPC Flow Logs; you buy about 6.2 PB of CloudTrail or 2.3 PB of identity logs. Network telemetry is a tactical addition for specific TTPs, not a baseline.
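The budget comparison is simple division, and it is worth rerunning with your own rates. The sketch below uses the planning estimates listed above as inputs; the dictionary keys and the `affordable_tb` helper are hypothetical names, and the figures are order-of-magnitude estimates rather than quoted prices.

```python
# Planning estimates in USD per TB, taken from the list above.
COST_PER_TB = {
    "cloudtrail_management": 80,
    "identity_sign_in": 220,
    "cloudtrail_data_events": 2_800,
    "vpc_flow_logs": 3_100,
    "packet_mirroring": 18_000,
}


def affordable_tb(budget_usd: float, cost_per_tb=COST_PER_TB) -> dict[str, float]:
    """Terabytes each source would yield if the whole budget went to it."""
    return {source: budget_usd / rate for source, rate in cost_per_tb.items()}


for source, tb in affordable_tb(500_000).items():
    print(f"{source:24s} {tb:>8,.0f} TB")
```

Dividing the TB figure by 1,000 gives the petabyte values used in the comparison.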
Stop paying for logs you cannot hunt
The platform’s coverage analyzer models your current ingest spend against detection coverage and recommends which logs to add, which to trim, and which to push to cold storage.
The Minimum Viable Log Set for Cloud Hunting
If you start a cloud hunting program from zero, this is the order in which to onboard log sources. Each tier delivers coverage that the next tier compounds.
Tier 1 — The non-negotiables (week one)
- Management-plane audit: CloudTrail (all regions, including the ones you do not use), Azure Activity Log (subscription level), GCP Cloud Audit Logs (Admin Activity, all projects). Retain hot for 90 days, cold-searchable for 12 months.
- Identity sign-in: AWS IAM Identity Center sign-in events via CloudTrail, Azure AD / Entra ID Sign-In Logs (both interactive and non-interactive), GCP Cloud Identity audit. Pair with the identity provider in front (Okta, Ping, Auth0) at the same retention.
- DNS query logs: Route 53 Resolver query logs, Azure DNS analytics, GCP Cloud DNS logging. Cheap, ubiquitously useful for C2 detection.
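The Tier 1 retention targets for management-plane and identity logs (90 days hot, 12 months cold-searchable) are easy to drift away from as pipelines change. The sketch below is one way to check them, assuming retention settings have been exported into a plain dictionary; the source names, the `hot_days` / `cold_days` keys, and the `retention_gaps` helper are all hypothetical.

```python
# Retention floor for Tier 1 management-plane and identity logs, in days.
TIER1_FLOOR = {"hot_days": 90, "cold_days": 365}


def retention_gaps(sources: dict[str, dict[str, int]], floor=TIER1_FLOOR):
    """Return {source: {setting: configured_days}} for settings below the floor.
    A setting that is absent is treated as 0 days."""
    gaps = {}
    for name, cfg in sources.items():
        short = {k: cfg.get(k, 0) for k, minimum in floor.items() if cfg.get(k, 0) < minimum}
        if short:
            gaps[name] = short
    return gaps


configured = {
    "cloudtrail": {"hot_days": 90, "cold_days": 365},
    "entra_sign_in": {"hot_days": 30, "cold_days": 365},
    "okta_system_log": {"hot_days": 90},  # no cold tier configured
}
print(retention_gaps(configured))
```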
Tier 2 — The targeted additions (month one)
- CloudTrail data events on crown-jewel S3 buckets and Lambda functions only. Tag the buckets, route only tagged events to ingest.
- WAF and load-balancer access logs for any internet-facing application. Tier-2 retention (30 days hot, 90 cold).
- Kubernetes audit logs at the API-server level if you run EKS / AKS / GKE.
Tier 3 — The justified extensions (quarter one)
- VPC Flow Logs with custom fields enabled (TCP flags, packet size, pkt-src-aws-service) for the subnets running production workloads. Sample only on pre-prod.
- GuardDuty / Defender / SCC findings as enrichment, not as primary detection logic.
- Container runtime logs (Falco, Sysdig) for high-trust workloads.
Example: East-West Hunt via Athena on VPC Flow Logs
Once VPC Flow Logs are in place (whether shipped to a SIEM or, more economically, kept in S3 and queried via Athena), the workhorse hunt is east-west anomaly detection. The query below identifies instances initiating connections to internal IPs they have never connected to before within a 14-day baseline — a high-signal indicator of lateral movement or compromised-instance reconnaissance.
Athena SQL — East-West Anomaly Hunt (VPC Flow Logs)
-- Hypothesis: lateral movement from a compromised EC2 instance manifests
-- as connections to internal (RFC 1918) destinations the source has never reached.
-- Baseline window: 14 days. Detection window: most recent 6 hours.
WITH baseline AS (
    -- Every (source, destination) pair seen before the detection window.
    SELECT
        srcaddr,
        dstaddr
    FROM vpc_flow_logs
    WHERE start_time >= DATE_ADD('day', -14, CURRENT_TIMESTAMP)
      AND start_time <  DATE_ADD('hour', -6, CURRENT_TIMESTAMP)
      AND action = 'ACCEPT'
      AND regexp_like(dstaddr, '^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)')
    GROUP BY 1, 2
),
recent AS (
    -- Accepted internal flows in the detection window, per destination port.
    SELECT
        srcaddr,
        dstaddr,
        dstport,
        SUM(bytes)      AS bytes_out,
        SUM(packets)    AS packet_count,
        COUNT(*)        AS flow_count,
        MIN(start_time) AS first_seen,
        MAX(end_time)   AS last_seen
    FROM vpc_flow_logs
    WHERE start_time >= DATE_ADD('hour', -6, CURRENT_TIMESTAMP)
      AND action = 'ACCEPT'
      AND regexp_like(dstaddr, '^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)')
    GROUP BY 1, 2, 3
)
SELECT
    r.srcaddr AS source_ip,
    r.dstaddr AS new_internal_dest,
    r.dstport,
    r.flow_count,
    r.bytes_out,
    r.first_seen,
    r.last_seen,
    -- enrichment: number of novel (destination, port) rows for this source
    COUNT(*) OVER (PARTITION BY r.srcaddr) AS fanout_score
FROM recent r
LEFT JOIN baseline b
    ON r.srcaddr = b.srcaddr AND r.dstaddr = b.dstaddr
WHERE b.srcaddr IS NULL              -- destination never seen in baseline
  AND r.dstport NOT IN (53, 123)     -- exclude DNS, NTP noise
  AND r.flow_count >= 3              -- ignore one-shot probes
ORDER BY fanout_score DESC, r.bytes_out DESC
LIMIT 500;
The query has three properties that matter. It is cost-bounded, provided the table is partitioned by date and the time predicate prunes to the 14-day window; Athena bills by data scanned, so without partition pruning the query reads the whole table. It is precision-tuned by excluding DNS/NTP and requiring at least three flows. And it includes a fanout score that surfaces compromised hosts touching many novel internal IPs, the canonical lateral-movement footprint.
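The anti-join at the heart of the query can be prototyped on a small export before running it against production data. The sketch below mirrors the logic in Python under stated assumptions: flows are plain tuples that have already been filtered to accepted, internal-destination traffic, and the `novel_destinations` helper is a hypothetical name. It orders results by fanout and then by address, since byte counts are not modelled here.

```python
from collections import Counter


def novel_destinations(baseline_flows, recent_flows, min_flows=3, noise_ports=(53, 123)):
    """baseline_flows: iterable of (src, dst) pairs seen in the baseline window.
    recent_flows: iterable of (src, dst, dstport), one tuple per flow record.
    Returns rows of (src, dst, dstport, flow_count, fanout_score)."""
    seen = set(baseline_flows)
    # Count recent flows per (src, dst, port) that have no baseline match.
    counts = Counter(
        flow
        for flow in recent_flows
        if (flow[0], flow[1]) not in seen and flow[2] not in noise_ports
    )
    hits = [(src, dst, port, n) for (src, dst, port), n in counts.items() if n >= min_flows]
    # Fanout: how many surviving rows share the same source.
    fanout = Counter(src for src, _, _, _ in hits)
    rows = [(src, dst, port, n, fanout[src]) for src, dst, port, n in hits]
    return sorted(rows, key=lambda r: (-r[4], r[0], r[1], r[2]))


baseline = [("10.0.1.5", "10.0.2.9")]
recent = (
    [("10.0.1.5", "10.0.2.9", 443)] * 5    # known pair, ignored
    + [("10.0.1.5", "10.0.3.7", 22)] * 3   # novel
    + [("10.0.1.5", "10.0.3.8", 445)] * 4  # novel
    + [("10.0.1.5", "10.0.3.9", 53)] * 10  # DNS noise
    + [("10.0.4.2", "10.0.3.7", 22)] * 2   # below the flow threshold
)
for row in novel_destinations(baseline, recent):
    print(row)
```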
Where This Fits in the Hunting Stack
This article is part of the Cloud Threat Hunting series. The log set described here is the input to the hunting workflow; the techniques you hunt for come from threat intelligence; the detection logic you author from those hunts becomes part of your detection engineering library; and the coverage you achieve across the MITRE Cloud Matrix is tracked in your MITRE coverage dashboard. The platform architecture page describes how those pipelines connect end-to-end.
FAQ
What is the minimum viable log set for cloud threat hunting?
Management-plane audit logs (CloudTrail, Azure Activity, GCP Audit), identity sign-in logs (IAM, Entra ID, Cloud Identity), and DNS query logs. These three sources cover roughly 70-80% of MITRE Cloud Matrix techniques at moderate cost. VPC Flow Logs and storage data events are second-tier additions justified by specific TTPs you are hunting.
Why do most cloud threat hunting programs fail?
Three reasons: they ingest the wrong logs (network-heavy, identity-light), they ingest the right logs but discard the wrong fields (no userAgent, no sourceIPAddress, truncated requestParameters), and they retain the right logs for the wrong window (30 days when adversary dwell time is 90+ days).
Are VPC Flow Logs worth the cost?
Only if you have a TTP-driven reason: lateral movement hunts, crypto-mining detection, data egress analysis. For most cloud threat hunting use cases, identity logs and CloudTrail deliver higher detection value per dollar. Flow logs are a targeted addition, not a baseline. If you do enable them, use custom field formats to keep TCP flags and AWS-service tags.
How long should cloud logs be retained for hunting?
Industry median dwell time is 84-110 days across recent incident response reporting. Hot retention of 90 days for management and identity logs, plus cold (queryable) retention of 365 days, is the practical floor. Less than that, and you cannot reconstruct multi-month persistence campaigns. The cost of cold storage is roughly 1/20th of hot, so the budget penalty is small.
What is the difference between CloudTrail management events and data events?
Management events record API calls that configure resources (CreateUser, AssumeRole, PutBucketPolicy). Data events record operations on the resources themselves (GetObject, PutObject, Invoke). Management events are baseline and effectively free for the first trail; data events are high-volume, costly, and should be selectively enabled per sensitive resource (crown-jewel S3 buckets, security-critical Lambdas).
Can SaaS detection products replace direct log access?
No. Vendor detections operate on a subset of fields and reasoning paths chosen by the vendor. For threat hunting — the open-ended, hypothesis-driven workflow — you need raw or near-raw access to the same telemetry. SaaS detections are a complement to a hunting workflow, not a substitute. Treat them as one signal among many.
How do I justify the budget for these logs to finance?
Frame it as a coverage portfolio with a measurable output: percentage of MITRE Cloud Matrix techniques detectable from the current log set. Show the gap to a target (typically 75-80% coverage of priority techniques). Pair the budget ask with a retirement plan for redundant or low-value logs — finance accepts log spend more readily when it is paired with cuts elsewhere.
From the log set on this page to a hunting workflow on Monday
The platform automates the coverage analysis described above, ships the Athena queries pre-built against your CloudTrail bucket, and tracks detection precision per technique over time. It is the operating system for the hunt cycle — not another vendor detection feed.










