Detection Engineering · Operator Playbook
If your detection portfolio is a black box, every conversation about coverage, hiring, tooling and budget becomes a guess. This piece is the metrics layer underneath that conversation: the formulas that matter, the math behind alert fatigue, and a scorecard you can run against your SIEM today.
You cannot improve what you do not measure. Most detection programmes ship rules, then forget them. The portfolios that actually catch adversaries treat detections as a measurable product: precision, recall, retirement candidates, ATT&CK saturation, mean-time-to-detect. This piece walks through the metrics, the formulas, and a working scorecard you can use this week.
Why measure detection quality at all
A detection rule is software. It has a runtime cost (telemetry, compute, analyst time), a failure mode (false positives, missed truth), and a lifespan (the adversary technique it watches eventually mutates). Most security teams ship rules but never instrument them. The result is a portfolio that grows monotonically, alert volume that grows at least as fast, and a vague feeling that “coverage is improving” with no evidence to back it up.
Detection engineering is the discipline of treating rules as a measurable product. That means every detection ships with metrics, every rule has a kill switch, and every quarter the portfolio is pruned. The metrics below are the minimum viable instrument panel — none of them are negotiable.
For the broader programme context see our overview of detection engineering and the corresponding MITRE ATT&CK coverage view.
Precision, recall, F1 — what they actually mean for detections
Borrowed from information retrieval, these three metrics translate cleanly to detections once you fix the vocabulary. Every alert is either a true positive (TP, real malicious activity) or a false positive (FP, benign activity that tripped the rule). Every real attack the rule failed to alert on is a false negative (FN). True negatives are effectively uncountable (every uneventful second of telemetry is technically a TN), so we ignore them.
Precision
Precision answers: when the rule fires, how often is it right?
Precision = TP / (TP + FP)
Concrete: a PowerShell encoded-command rule fires 200 times in 90 days. Analysts confirm 184 were tied to real incidents and 16 were benign automation. Precision = 184 / (184 + 16) = 0.92. That is a strong rule. Compare to a generic “suspicious base64 in command line” rule firing 4,800 times with 31 confirmed TPs: precision = 31 / 4800 = 0.006. Same telemetry source, two orders of magnitude apart on quality.
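The two rules above can be checked in a few lines of Python. This is a minimal sketch; the function name `precision` is illustrative and not part of any SIEM API.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of alerts that were real: TP / (TP + FP)."""
    total = tp + fp
    if total == 0:
        raise ValueError("rule has no dispositioned alerts")
    return tp / total


# Figures from the two rules discussed above.
encoded_cmd = precision(tp=184, fp=16)
generic_b64 = precision(tp=31, fp=4800 - 31)

print(f"encoded-command rule: {encoded_cmd:.3f}")
print(f"generic base64 rule:  {generic_b64:.3f}")
```

Raising on an empty denominator is a deliberate choice here: a rule with no dispositioned alerts has no precision yet, and reporting it as 0 or 1 would be misleading.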
Recall
Recall answers: of all the real attacks of this type, what fraction did we catch?
Recall = TP / (TP + FN)
This is the hard one. You can only count attacks you discovered. The honest path is to measure recall against three controlled inputs: purple-team exercises with known ground truth, red-team retros where ATT&CK techniques are mapped to alerts that should have fired, and historical incidents replayed in a staging SIEM. Concrete: in a purple-team round of 25 emulated T1078 (Valid Accounts) scenarios, your rule catches 11. Recall = 11 / 25 = 0.44. Now you have an actionable number, not a feeling.
F1 score
F1 is the harmonic mean of precision and recall — it punishes you for being good at one and bad at the other.
F1 = 2 × (precision × recall) / (precision + recall)
For the PowerShell rule above (precision 0.92, suppose recall measured at 0.70 in lab): F1 = 2 × (0.92 × 0.70) / (0.92 + 0.70) = 0.795. F1 is useful for tracking a single rule across versions. It is misleading for portfolio comparisons because it treats high precision and high recall as equally valuable — in production, an FP storm is usually more expensive than an FN, so weight accordingly.
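One way to "weight accordingly" is the general F-beta measure, where a beta below 1 weights precision more heavily than recall. The sketch below reproduces the recall and F1 figures above and adds an F0.5 for comparison. The choice of beta = 0.5 is an assumption for illustration, not a recommendation from this piece.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of real attacks that were caught: TP / (TP + FN)."""
    return tp / (tp + fn)


def f_beta(precision: float, recall_value: float, beta: float = 1.0) -> float:
    """General F-measure. beta < 1 favours precision, beta > 1 favours recall."""
    if precision == 0 and recall_value == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall_value / (b2 * precision + recall_value)


# Purple-team round: 11 of 25 emulated T1078 scenarios caught.
t1078_recall = recall(tp=11, fn=25 - 11)

# PowerShell rule: precision 0.92, lab recall 0.70.
f1 = f_beta(0.92, 0.70)
f_half = f_beta(0.92, 0.70, beta=0.5)  # beta = 0.5 is an illustrative assumption

print(f"recall={t1078_recall:.2f}  F1={f1:.3f}  F0.5={f_half:.3f}")
```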
Coverage vs effectiveness — why a detection can be “covered” but useless
Coverage is a binary on a matrix: do you have at least one rule mapped to ATT&CK technique T1059? Effectiveness is a continuous measurement: does that rule fire reliably against the variants real adversaries use? The two diverge sharply in practice.
Three failure modes hide inside the green cells of a coverage heat map:
- Narrow signature: a rule that triggers on one literal command line and misses every actor who reads about_PowerShell_Encoded on Microsoft Learn.
- Dependency drift: the rule relies on a telemetry field that the EDR vendor renamed or stopped populating six months ago.
- Suppression rot: exclusions added during noisy weeks have grown into a Swiss-cheese pattern that now hides the very behaviour the rule was meant to catch.
The fix is to instrument every “covered” technique with at least one validation event per quarter — synthetic, atomic-red-team, or breach-and-attack-simulation — and to display the validation result next to the coverage cell. Anything that has not been validated this quarter is effectively grey.
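The "validated this quarter or grey" rule can be sketched as a small status function. The function names, the status labels and the use of calendar quarters are assumptions for illustration; substitute whatever your coverage tooling records.

```python
from datetime import date
from typing import Optional, Tuple


def quarter_of(d: date) -> Tuple[int, int]:
    """Return (year, quarter) for a calendar date."""
    return d.year, (d.month - 1) // 3 + 1


def cell_status(has_rule: bool, last_validated: Optional[date],
                passed: Optional[bool], today: date) -> str:
    """Status to display in one coverage cell (labels are hypothetical)."""
    if not has_rule:
        return "uncovered"
    if last_validated is None or quarter_of(last_validated) != quarter_of(today):
        return "grey"  # covered on paper, not validated this quarter
    return "validated" if passed else "failed validation"


today = date(2025, 5, 20)  # fixed date so the example is reproducible
print(cell_status(True, date(2025, 4, 3), True, today))
print(cell_status(True, date(2025, 1, 15), True, today))
print(cell_status(True, date(2025, 5, 1), False, today))
```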
Alert fatigue: the math behind it
Alert fatigue is usually discussed as a morale problem. It is actually a Bayesian base-rate problem with a predictable arithmetic outcome. Here is the version that matters when you are sizing a SOC.
The base-rate setup
Suppose your environment has 1,000,000 sessions per day. Of those, on average 10 are malicious — a base rate of 0.001%. You deploy a rule with 99% precision in lab (TP / (TP + FP) = 0.99) and 99% recall in lab. Most leaders read those numbers and feel good. Let us run the math.
Out of 10 malicious sessions, recall 0.99 catches 9.9 ≈ 10 TPs. On a balanced lab test set with recall 0.99, a precision of 0.99 implies a false positive rate of roughly 1%: for every 99 true positives the rule raises 1 false positive from an equally sized benign pool. Applied to the 999,990 benign sessions in production, 1% is roughly 10,000 FPs. So a rule with lab-grade 99% on both metrics produces, in production, about 10 true positives and about 10,000 false positives per day. The actual precision in production is 10 / (10 + 10,000) ≈ 0.001, roughly a thousand times worse than the lab number.
Why this happens
Precision is base-rate dependent. Lab precision measures the rule against a balanced test set. Production precision measures the rule against the real prior, which is overwhelmingly benign. Bayes’ rule formalises this:
P(real | alert) = P(alert | real) × P(real) / P(alert)
Plug the numbers: P(alert | real) = 0.99, P(real) = 10/1,000,000 = 0.00001, P(alert) ≈ 0.01. P(real | alert) = (0.99 × 0.00001) / 0.01 = 0.00099. Out of every 1,000 alerts, one is real. That is the math behind “the analyst clicked through and there was nothing there.”
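The same calculation in Python, as a minimal sketch. The function and parameter names are illustrative. Unlike the rounded arithmetic above, it computes P(alert) exactly from recall, false-positive rate and prevalence, so the result differs slightly in later decimal places.

```python
def production_precision(prevalence: float, recall: float, fp_rate: float) -> float:
    """P(real | alert) via Bayes' rule.

    prevalence: P(real), the share of sessions that are malicious
    recall:     P(alert | real)
    fp_rate:    P(alert | benign)
    """
    p_alert_and_real = recall * prevalence
    p_alert_and_benign = fp_rate * (1.0 - prevalence)
    return p_alert_and_real / (p_alert_and_real + p_alert_and_benign)


# 10 malicious sessions in 1,000,000; 99% recall; 1% false-positive rate.
p = production_precision(prevalence=10 / 1_000_000, recall=0.99, fp_rate=0.01)
print(f"P(real | alert) = {p:.5f}")  # roughly 0.00099
```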
What to do about it
- Raise the rule’s prior. Constrain the rule to high-risk surfaces — domain controllers, privileged identities, internet-exposed hosts. You are shrinking the denominator, which raises the achievable precision.
- Stack signals. Two independent weak signals combined often yield acceptable precision when neither alone does. This is the principle behind enriched threat intelligence joined to behavioural detection.
- Reduce the false-positive rate, not just the lab precision figure. A 10× drop in FP rate moves production precision from 0.001 to 0.01, still bad in absolute terms but ten times less analyst burn.
- Set a queue-depth budget. An L1 analyst can sustain 30–50 meaningful triage decisions per shift. Past ~200 alerts per analyst-day, miss rates climb sharply. Treat queue depth as an SLO; retire noisy rules when it exceeds budget.
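To see how much the second and third levers in the list above move the number, the sketch below compares the single rule, the same rule with a 10× lower false-positive rate, and two stacked signals that must both fire. The stacked case assumes the two signals are conditionally independent given the class, which real telemetry rarely satisfies exactly, so treat its result as an upper bound.

```python
def posterior(prevalence: float, recall: float, fp_rate: float) -> float:
    """P(real | alert) for a rule with the given recall and false-positive rate."""
    hit = recall * prevalence
    noise = fp_rate * (1.0 - prevalence)
    return hit / (hit + noise)


PREVALENCE = 10 / 1_000_000  # base rate from the setup above

single = posterior(PREVALENCE, recall=0.99, fp_rate=0.01)
tenfold = posterior(PREVALENCE, recall=0.99, fp_rate=0.001)

# Two signals ANDed together. ASSUMPTION: conditional independence, so
# recalls multiply and false-positive rates multiply.
stacked = posterior(PREVALENCE, recall=0.99 * 0.99, fp_rate=0.01 * 0.01)

print(f"single rule:      {single:.4f}")
print(f"10x lower FPR:    {tenfold:.4f}")
print(f"stacked signals:  {stacked:.4f}")
```

Even under the optimistic independence assumption, the stacked rule lands below 0.1 precision at this base rate, which is why stacking is usually combined with constraining the rule to high-risk surfaces.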
ATT&CK technique saturation
Once you map rules to ATT&CK techniques, you can spot two opposite pathologies in the same view: over-saturation (twenty rules on T1059, all firing on overlapping conditions, producing alert storms with no extra coverage) and under-saturation (zero rules on T1530 Data from Cloud Storage despite a cloud-heavy estate).
The heat-map below is a representative pattern from a real portfolio. Dark cells are over-saturated techniques; pale cells are blind spots. The diagonal is roughly what you would expect — most portfolios cluster around execution and defence evasion because those are where vendor content is densest. The interesting cells are the unexpected ones: an exfiltration column with zero coverage in a data-rich environment is a strategic risk that the line-level metrics will never surface.
The remediation pattern is unglamorous: for over-saturated techniques, run a deduplication pass — most rules will turn out to be slight variants of the same logic and can be merged or retired without losing coverage. For blind spots, write one rule, instrument it with the validation framework, and let it bake for 30 days before extending. Resist the urge to fill the whole matrix; an unowned rule is worse than a missing one.
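A first pass at spotting both pathologies is a simple count of rules per technique. This is a sketch under stated assumptions: the `over_threshold` of 5 rules is an arbitrary illustration, and the input is assumed to be one technique ID per rule exported from your rule inventory.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def saturation(rule_techniques: Iterable[str], techniques_in_scope: List[str],
               over_threshold: int = 5) -> Dict[str, Tuple[int, str]]:
    """Map each in-scope technique to (rule count, status)."""
    counts = Counter(rule_techniques)
    report = {}
    for technique in techniques_in_scope:
        n = counts.get(technique, 0)
        if n == 0:
            status = "blind spot"
        elif n > over_threshold:
            status = "over-saturated"  # candidate for a deduplication pass
        else:
            status = "ok"
        report[technique] = (n, status)
    return report


# Hypothetical inventory: twenty rules on T1059, two on T1078, none on T1530.
rules = ["T1059"] * 20 + ["T1078"] * 2
report = saturation(rules, ["T1059", "T1078", "T1530"])
for technique, (count, status) in report.items():
    print(technique, count, status)
```

A count only says where to look. Whether twenty T1059 rules are genuinely redundant still needs the deduplication pass described above.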
Sample detection scorecard
The scorecard is the single artefact that pulls everything together: per-rule precision proxy (fp_rate), volume (tp_count), latency (mttd, mttr), and the verdict (the retirement_candidate column: No, Tune or Yes). This is what should sit on every detection-engineering team’s wall.
| rule_id | technique | fp_rate | tp_count (90d) | mttd | mttr | retirement_candidate |
|---|---|---|---|---|---|---|
| HFL-EDR-0041 | T1059.001 PowerShell | 0.04 | 312 | 2 min | 27 min | No |
| HFL-EDR-0102 | T1078 Valid Accounts | 0.61 | 9 | 14 min | 3 h 12 m | Yes |
| HFL-NDR-0027 | T1071.001 Web Protocols | 0.18 | 74 | 6 min | 52 min | Tune |
| HFL-CLOUD-0014 | T1098 Account Manipulation | 0.09 | 41 | 4 min | 38 min | No |
| HFL-EDR-0203 | T1027 Obfuscated Files | 0.83 | 3 | 22 min | 5 h 41 m | Yes |
| HFL-CLOUD-0061 | T1530 Data from Cloud Storage | 0.07 | 118 | 3 min | 19 min | No |
| HFL-EDR-0118 | T1055 Process Injection | 0.31 | 22 | 9 min | 1 h 04 m | Tune |
| HFL-IAM-0009 | T1110.003 Password Spraying | 0.12 | 67 | 5 min | 41 min | No |
Two rows in this scorecard tell the whole story. HFL-EDR-0102 (T1078 Valid Accounts) has a 61% false-positive rate and a 3-hour MTTR — it is burning analyst time and producing almost no incidents. Retire it and replace it with a stacked detection that joins identity risk to session anomaly. HFL-EDR-0203 (T1027 Obfuscated Files) is worse: 83% FP rate, 3 TPs in 90 days, MTTR over five hours. There is no tuning that saves this rule; it should be retired and rewritten from scratch with a tighter behavioural prior.
Compute precision from SIEM history — SQL and KQL
Below is a query you can adapt to most SIEM data lakes to compute the per-rule view above. It is written in Postgres syntax; Snowflake and BigQuery need small changes to the casts and the timestamp arithmetic. The KQL equivalent for Microsoft Sentinel and Defender XDR is in the comment block at the bottom.
-- Compute per-rule precision from 90 days of SIEM alert history.
-- Assumes an 'alerts' table joined to an 'incidents' table,
-- with at most one incident row per alert.
-- An alert is a TP if it was linked to a confirmed incident.
WITH rule_stats AS (
  SELECT
    a.rule_id,
    a.rule_name,
    a.attack_technique,
    COUNT(*) AS total_alerts,
    SUM(CASE WHEN i.disposition = 'true_positive' THEN 1 ELSE 0 END) AS tp,
    SUM(CASE WHEN i.disposition = 'false_positive' THEN 1 ELSE 0 END) AS fp,
    SUM(CASE WHEN i.disposition = 'benign_true_positive' THEN 1 ELSE 0 END) AS btp,
    -- AVG ignores NULLs, so unclosed incidents do not distort MTTR.
    AVG(EXTRACT(EPOCH FROM (a.alerted_at - a.event_time))) / 60.0 AS mttd_minutes,
    AVG(EXTRACT(EPOCH FROM (i.closed_at - a.alerted_at))) / 60.0 AS mttr_minutes
  FROM alerts a
  LEFT JOIN incidents i ON i.alert_id = a.alert_id
  WHERE a.alerted_at >= NOW() - INTERVAL '90 days'
  GROUP BY a.rule_id, a.rule_name, a.attack_technique
)
SELECT
  rule_id,
  rule_name,
  attack_technique,
  total_alerts,
  tp,
  fp,
  btp,
  ROUND(tp::numeric / NULLIF(tp + fp, 0), 3) AS precision,
  ROUND(fp::numeric / NULLIF(total_alerts, 0), 3) AS fp_rate,
  ROUND(mttd_minutes::numeric, 1) AS mttd_min,
  ROUND(mttr_minutes::numeric, 1) AS mttr_min,
  CASE
    WHEN tp = 0 AND total_alerts > 20 THEN 'retire'
    -- No TP or FP dispositions: precision is NULL, so do not default to 'keep'.
    WHEN tp + fp = 0 THEN 'unscored'
    WHEN tp::numeric / NULLIF(tp + fp, 0) < 0.10 THEN 'retire'
    WHEN tp::numeric / NULLIF(tp + fp, 0) < 0.40 THEN 'tune'
    ELSE 'keep'
  END AS verdict
FROM rule_stats
ORDER BY fp DESC, total_alerts DESC;

-- KQL equivalent (Microsoft Sentinel / Defender XDR):
-- SecurityIncident stores the analyst verdict in Classification, and
-- AlertIds is an array, so it is expanded before the join.
-- SecurityAlert
-- | where TimeGenerated > ago(90d)
-- | join kind=leftouter (
--     SecurityIncident
--     | summarize arg_max(TimeGenerated, *) by IncidentNumber
--     | mv-expand AlertId = AlertIds to typeof(string)
--   ) on $left.SystemAlertId == $right.AlertId
-- | summarize tp = countif(Classification == "TruePositive"),
--             fp = countif(Classification == "FalsePositive"),
--             total = count() by AlertName, Tactics, Techniques
-- | extend precision = round(todouble(tp) / iff(tp+fp==0, 1, tp+fp), 3)
-- | order by fp desc
Two notes on running this honestly. First, the join to incidents assumes analysts dispositioned every alert. If a large fraction is left as “new” or “in progress,” your precision number is upward-biased because unhandled alerts skew toward FPs, and fp_rate is understated because those alerts still count in total_alerts. Filter to dispositioned alerts only, and report the share of alerts that were dispositioned so readers know how much weight the number can bear. Second, the benign_true_positive bucket (legitimate activity that the rule correctly identified but which turned out to be authorised) matters for tuning: it tells you the rule logic is sound but the scope is wrong.
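Both notes can be made visible per rule with a small helper that reports precision next to the share of alerts that were actually dispositioned. A minimal sketch: the disposition labels mirror the SQL above, while the function name and the output keys are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

# Dispositions that count as "handled"; anything else (e.g. "new") is open.
CLOSED = {"true_positive", "false_positive", "benign_true_positive"}


def rule_quality(dispositions: List[str]) -> Dict[str, Optional[float]]:
    """Precision over dispositioned alerts, plus how many were dispositioned."""
    counts = Counter(dispositions)
    total = len(dispositions)
    tp = counts["true_positive"]
    fp = counts["false_positive"]
    dispositioned = sum(counts[d] for d in CLOSED)
    return {
        "precision": tp / (tp + fp) if tp + fp else None,
        "disposition_coverage": dispositioned / total if total else None,
        "btp_share": counts["benign_true_positive"] / dispositioned if dispositioned else None,
    }


# Hypothetical rule: 20 alerts, 5 of them never triaged.
sample = (["true_positive"] * 8 + ["false_positive"] * 2
          + ["benign_true_positive"] * 5 + ["new"] * 5)
quality = rule_quality(sample)
print(quality)
```

A precision of 0.8 at 75% disposition coverage is a weaker claim than 0.8 at 100%, and a high benign-true-positive share points at scope rather than logic.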
Operating cadence — how to actually use the scorecard
Metrics without a forcing function decay into shelfware. The cadence that works:
- Weekly: review new rules in the 30-day bake-in window. Anything with precision under 0.30 or fp_rate above 0.50 gets tuned or rolled back.
- Monthly: full portfolio scorecard reviewed by the detection-engineering lead. Anything tagged “retire” in the verdict column is queued for sunset. Anything tagged “tune” gets an owner and a two-week deadline.
- Quarterly: ATT&CK saturation review with the threat-intel team. Blind spots are prioritised against the current threat-actor landscape, not the matrix as a whole.
- Annually: sunset anything that has not been touched and has fired zero TPs in 12 months. Coverage that does not catch anything is fictional coverage.
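The weekly and annual checks in the cadence above are mechanical enough to automate. The sketch below encodes the stated thresholds (precision under 0.30 or fp_rate above 0.50 for the bake-in gate; zero TPs and untouched for 12 months for sunset). The function names and return labels are hypothetical.

```python
def weekly_gate(precision: float, fp_rate: float) -> str:
    """Bake-in gate for rules in their first 30 days."""
    if precision < 0.30 or fp_rate > 0.50:
        return "tune_or_roll_back"
    return "pass"


def annual_sunset(tp_last_12_months: int, touched_last_12_months: bool) -> bool:
    """True if the rule fired zero TPs and nobody touched it in 12 months."""
    return tp_last_12_months == 0 and not touched_last_12_months


print(weekly_gate(precision=0.25, fp_rate=0.40))
print(weekly_gate(precision=0.55, fp_rate=0.20))
print(annual_sunset(tp_last_12_months=0, touched_last_12_months=False))
```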
For cloud-native portfolios this cadence pairs naturally with the cloud threat-hunting series — many of the “retire” candidates in cloud estates are on-prem rules that never made the transition cleanly and should be replaced rather than tuned.
Downloadable scorecard template
If you want this generated automatically against a live SIEM, the same artefact ships inside HuntIntel — see the platform architecture for how the rule-quality module joins your alert history to incident dispositions.
FAQ
What is a good precision target for a detection rule?
For high-signal detections (credential theft on a domain controller, ransomware staging) you should expect 0.85 or higher. For broad behavioural rules (PowerShell obfuscation, suspicious DNS) anything above 0.40 is workable provided the analyst workflow can triage cheaply. The right answer is set by the cost of a false positive: if a single FP burns 30 minutes of analyst time and the rule fires 200 times a month, a precision of 0.30 is corrosive.
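The cost argument in the last sentence is easy to put a number on. A minimal sketch, with an illustrative function name; it assumes every false positive costs the same triage time.

```python
def monthly_fp_hours(alerts_per_month: int, precision: float,
                     minutes_per_fp: float) -> float:
    """Analyst hours per month spent on false positives for one rule."""
    false_positives = alerts_per_month * (1.0 - precision)
    return false_positives * minutes_per_fp / 60.0


# 200 alerts a month at 0.30 precision, 30 minutes per false positive.
hours = monthly_fp_hours(alerts_per_month=200, precision=0.30, minutes_per_fp=30)
print(f"{hours:.0f} analyst-hours per month")
```

At those inputs the rule produces 140 false positives and consumes about 70 analyst-hours a month, which is the scale behind the word "corrosive".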
Why is recall so hard to measure?
Because you only see attacks you detected. The denominator — “all real attacks” — is unknowable in production. Three workable proxies: (1) purple-team exercises with known ground truth, (2) red-team callouts replayed against the SIEM, and (3) cross-source corroboration where one telemetry source catches what another missed. Treat recall as a directional metric, not an absolute number.
Should I use F1 score or something else?
F1 is fine for comparing two versions of the same rule. It is misleading when comparing rules across very different base rates. For portfolio-level reporting, prefer a weighted blend that includes false-positive rate, analyst minutes burned per alert, and incident severity recovered. Because F1 is symmetric, it will score a noisy high-recall rule the same as a quiet high-precision one, even though the noisy rule costs far more analyst time.
How often should I review the scorecard?
Monthly for portfolio-level review, weekly for new rules in the bake-in period (first 30 days), and continuously for anything tagged “tune” or “retire.” Anything that has not been touched in 12 months is a candidate for retirement regardless of metrics — the threat surface it covered has almost certainly drifted.
What is the alert-fatigue threshold?
Empirically, an L1 analyst can sustain about 30 to 50 meaningful triage decisions per shift before quality drops. Once a queue routinely exceeds 200 alerts per analyst-day, miss rates climb sharply and time-to-acknowledge balloons. Use that as your forcing function: when queue depth crosses the threshold, retire or tune the noisiest contributors rather than hiring more analysts.
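The forcing function can be sketched as a budget check that names the noisiest contributors once the queue exceeds roughly 200 alerts per analyst-day. This is a deliberately simple sketch: it ranks by raw daily volume only, the rule IDs are hypothetical, and a real version would probably rank by false-positive volume instead.

```python
from typing import Dict, List


def rules_to_cut(daily_alerts_by_rule: Dict[str, int], analysts: int,
                 budget_per_analyst: int = 200) -> List[str]:
    """Noisiest rules to retire or tune until the queue fits the budget."""
    budget = analysts * budget_per_analyst
    excess = sum(daily_alerts_by_rule.values()) - budget
    cut = []
    ranked = sorted(daily_alerts_by_rule.items(), key=lambda kv: kv[1], reverse=True)
    for rule_id, volume in ranked:
        if excess <= 0:
            break
        cut.append(rule_id)
        excess -= volume
    return cut


# Hypothetical queue: 700 alerts a day.
queue = {"RULE-A": 500, "RULE-B": 150, "RULE-C": 50}
print(rules_to_cut(queue, analysts=2))  # budget 400, so the noisiest rule goes
print(rules_to_cut(queue, analysts=4))  # budget 800, nothing to cut
```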
Does coverage of an ATT&CK technique mean we are protected?
No. Coverage means at least one rule references the technique. Effectiveness is whether that rule actually catches the technique under realistic conditions. A T1059 rule that only fires on powershell.exe -enc covers PowerShell execution on paper and misses every actor who has read a defender blog post. Pair every coverage report with a validation result.
How does this fit with MITRE ATT&CK Evaluations?
ATT&CK Evaluations are a great external benchmark of vendor capability, but they tell you nothing about your own rule portfolio. The metrics here are about your detections in your environment against your adversaries. Use Evaluations to choose tooling; use this scorecard to manage what you have built on top.
Score your detection portfolio in one afternoon.
HuntIntel ingests your rule inventory, joins it to your SIEM history, and produces a per-rule scorecard with precision, recall, MTTD, MTTR and retirement candidates. Operator-grade, no professional services required.










