Detection Engineering: From Threat Intelligence to Production-Ready Detections

A practitioner’s guide to the discipline that sits between threat intelligence and SOC triage — how to take a hypothesis, prove out the data, write the logic, validate against real adversary behavior, and operate it with metrics that survive contact with the alert queue.

Operator note. This page is intentionally tool-agnostic. Sample queries are written in KQL and SPL pseudo-syntax to illustrate logic, not to be copy-pasted into a production tenant. For platform-managed detection content shipped against your own telemetry, the HackForLab platform exposes a curated, ATT&CK-mapped detection library.

01 / Definition

What Detection Engineering really means — and what it is not

The phrase “detection engineering” is now used loosely enough that it has started to lose meaning. In most organizations it gets applied to anything from writing a Splunk dashboard to enabling a vendor rule pack. That is not the discipline being described here.

Detection engineering, used precisely, is the practice of treating threat detections as engineered software artifacts. A detection has a hypothesis, a data contract, version-controlled logic, a test suite, a service-level objective, and a retirement policy. It is reviewed, measured, and decommissioned. It is owned by an engineer, not by a vendor.
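
To make the "engineered artifact" idea concrete, here is a minimal sketch of the metadata that could travel with a rule in version control. The class name `DetectionRecord` and every field name are illustrative assumptions, not a schema defined on this page or by any particular tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DetectionRecord:
    """Hypothetical metadata record stored next to the rule logic in Git."""
    rule_id: str
    hypothesis: str            # falsifiable statement about adversary behavior
    attack_techniques: tuple   # ATT&CK IDs, e.g. ("T1003.001",)
    data_source: str           # telemetry covered by the data contract
    owner: str                 # a named engineer, not a vendor
    precision_slo: float       # target precision for the production tier
    review_by: date            # re-validate or retire after this date

    def missing_fields(self) -> list:
        """Return the names of required text fields that were left empty."""
        required = ("rule_id", "hypothesis", "attack_techniques", "data_source", "owner")
        return [name for name in required if not getattr(self, name)]


rule = DetectionRecord(
    rule_id="lsass-handle-read",
    hypothesis="Unprivileged process opens lsass.exe with PROCESS_VM_READ",
    attack_techniques=("T1003.001",),
    data_source="endpoint process-access events",
    owner="",  # deliberately empty to show the check
    precision_slo=0.80,
    review_by=date(2030, 1, 1),
)
print(rule.missing_fields())  # ['owner']
```

A CI job could refuse to merge any record for which `missing_fields()` is non-empty, which is one way to enforce "owned by an engineer".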

The contrast with SOC alerting

SOC alerting and detection engineering are adjacent but distinct functions. The SOC consumes detections; detection engineering produces them. When the two are conflated, two predictable failures emerge: analyst time is consumed by tuning vendor rules nobody owns, and threat intelligence sits in a TIP with no path into the SIEM.

Dimension        | SOC Alerting          | Detection Engineering
-----------------|-----------------------|-----------------------------------------------------
Primary artifact | Triaged ticket        | Versioned detection rule
Input            | Alert from SIEM/EDR   | Threat hypothesis + telemetry
Output           | Incident verdict      | Production detection with SLO
Success metric   | MTTR, escalation rate | Precision, recall, coverage delta
Time horizon     | Minutes to hours      | Days to weeks per detection
Skill profile    | Triage, IR, forensics | Adversary emulation, data engineering, query authoring

A mature security org needs both. The detection engineer’s job is to make the SOC analyst’s job possible — high signal, low false-positive rate, and explicit coverage maps so the team knows what they can and cannot see. Without that upstream investment, the SOC drowns in vendor noise and the threat intel team writes reports nobody operationalizes.

Where the discipline borrows from software engineering

The mental model that has held up best in practice is “detections as code.” Specifically:

  • Source control. Every detection lives in Git. Pull requests, code review, blame history, the works. This single change kills more than half of the chronic problems most SOCs have with rule sprawl.
  • Continuous integration. Detection logic is unit-tested against synthetic events on every commit. Atomic Red Team, Caldera, and lab-generated PCAP/log corpora are the test fixtures.
  • Service-level objectives. Each production detection carries an SLO: target precision, expected weekly volume, maximum tolerated MTTD. Breaches trigger a tuning ticket, not silent decay.
  • Sunset policy. Detections that drift below their precision SLO for two consecutive review cycles are retired. The library shrinks as often as it grows. This is non-negotiable.
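
The sunset rule in the last bullet is simple enough to express as a check that runs at every review cycle. The following is a minimal sketch; the function name and the shape of `precision_history` (one measured precision per review cycle, oldest first) are assumptions made for illustration.

```python
def should_retire(precision_history, precision_slo, cycles=2):
    """Return True when the most recent `cycles` reviews all fell below the SLO.

    precision_history: measured precision per review cycle, oldest to newest.
    """
    if len(precision_history) < cycles:
        return False  # not enough history to apply the policy
    return all(p < precision_slo for p in precision_history[-cycles:])


print(should_retire([0.90, 0.70, 0.75], precision_slo=0.80))  # True
print(should_retire([0.70, 0.85, 0.70], precision_slo=0.80))  # False
```

The second call returns False because the two breaches are not consecutive: one recovery in between resets the count.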

The single largest cultural shift this requires is accepting that a detection that fires correctly but is no longer useful is technical debt, not an asset. Most SIEMs accumulate thousands of rules nobody owns. Detection engineering is partly the practice of saying no to that accumulation.

02 / Lifecycle

The detection lifecycle: hypothesis → data → logic → validation → metrics

Every production-grade detection moves through five stages. Skipping any of them produces one of the failure modes documented in the next section. The lifecycle is a closed loop — metrics feed back into new hypotheses, retired detections inform data-coverage gaps, and the cycle restarts.

Fig. 1: The closed-loop detection lifecycle (five stages: Hypothesis, Data, Logic, Validation, Metrics; closed-loop and metrics-driven).

Stage 1 — Hypothesis

A detection starts as a falsifiable statement about adversary behavior in your environment. Not “find Mimikatz,” but “an unprivileged user process opening a handle to lsass.exe with PROCESS_VM_READ outside of a known software inventory is anomalous and worth alerting on.” The hypothesis names the technique (T1003.001), the telemetry source, the population it excludes, and the analyst question it answers. Most hypotheses come from three places: published threat intelligence, internal incident retros, and red-team campaign outputs. The weekly threat advisories are designed specifically to feed this stage.

Stage 2 — Data

Before a single line of query logic is written, the engineer proves out the data. This is the stage most often skipped. Three questions must be answerable in writing:

  • Which sensor or log source emits the event of interest? With what fidelity and what sample rate?
  • Does the field schema actually contain the fields the hypothesis needs (process command line, parent PID, image hash, etc.)?
  • What is the end-to-end ingestion latency, and is it within the MTTD target?

Writing detection logic against telemetry you have not validated is the single most common cause of detections that look correct in dev and never fire in prod. The data contract is the deliverable of this stage.
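
Two of the three questions above (field presence and ingestion latency) can be checked mechanically against a sample of real events. This is a minimal sketch, assuming each sampled event is a dict carrying `event_time` and `ingest_time` as datetimes; those key names and the function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def check_data_contract(events, required_fields, max_latency):
    """Return human-readable contract violations found in sampled events."""
    violations = []
    for i, event in enumerate(events):
        # A field counts as missing when it is absent or an empty string.
        missing = [f for f in required_fields if event.get(f) in (None, "")]
        if missing:
            violations.append(f"event {i}: missing {', '.join(missing)}")
        latency = event["ingest_time"] - event["event_time"]
        if latency > max_latency:
            violations.append(
                f"event {i}: latency {latency.total_seconds():.0f}s exceeds budget"
            )
    return violations


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample = [
    {"event_time": t0, "ingest_time": t0 + timedelta(seconds=40),
     "command_line": "cmd.exe /c whoami", "parent_pid": 412},
    {"event_time": t0, "ingest_time": t0 + timedelta(minutes=20),
     "command_line": "", "parent_pid": 88},
]
for problem in check_data_contract(
    sample, ["command_line", "parent_pid"], timedelta(minutes=15)
):
    print(problem)
```

Fidelity and sample rate still need a written answer from whoever owns the sensor; a script like this only covers what is observable in the data itself.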

Stage 3 — Logic

Now the query is authored. The bias should be toward simple, explainable logic that maps cleanly to the hypothesis. Multi-stage correlation is justified only when single-event detection is provably insufficient. Authoring happens in a tool-agnostic intermediate format wherever possible — see section 05 — and compiles down to the SIEM-specific query language at deploy time.
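
As an illustration of "simple, explainable logic", the LSASS hypothesis from Stage 1 can be sketched as a single-event predicate. The field names follow Sysmon Event ID 10 (ProcessAccess), where `GrantedAccess` is a hex string. The allowlist entry is a made-up path, and the privilege-level condition from the hypothesis is omitted here for brevity.

```python
PROCESS_VM_READ = 0x0010  # Windows process access right

# Hypothetical allowlist standing in for the "known software inventory".
KNOWN_INVENTORY = {
    r"c:\program files\exampleav\scanner.exe",
}


def lsass_read_alert(event, inventory=KNOWN_INVENTORY):
    """Return True when a process-access event matches the hypothesis."""
    target = event["TargetImage"].lower()
    source = event["SourceImage"].lower()
    granted = int(event["GrantedAccess"], 16)  # e.g. "0x1410"
    return (
        target.endswith("\\lsass.exe")
        and bool(granted & PROCESS_VM_READ)
        and source not in inventory
    )


suspicious = {
    "SourceImage": r"C:\Users\bob\AppData\Local\Temp\tool.exe",
    "TargetImage": r"C:\Windows\System32\lsass.exe",
    "GrantedAccess": "0x1410",
}
print(lsass_read_alert(suspicious))  # True
```

Each of the three conditions maps to one clause of the hypothesis, which keeps pull-request review focused on whether the clauses are right rather than on how they are expressed.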

Stage 4 — Validation

A detection is not in production until it has been exercised by an attack the defenders did not write. That means purple-teaming — the red team executes the technique using realistic tradecraft, and the engineer confirms the detection fires with the expected fidelity. The output is a validation report: which variants fire, which do not, and where the coverage gaps are. Validation is also where the false-positive budget is measured: the detection runs in shadow mode against 30 days of production telemetry, the engineer reviews every fire, and the rule does not promote until the FP rate is within budget.
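
The shadow-mode promotion gate described above can be sketched as a small decision function. The label values, the function name, and the choice to block promotion when a rule never fired in shadow mode are assumptions for illustration; benign-true-positives are not counted as false positives here, which is a policy choice a program would need to make explicitly.

```python
def promotion_decision(labels, fp_budget, observed_days, required_days=30):
    """Decide whether a shadow-mode rule may promote to production.

    labels: one entry per shadow fire: "TP", "FP", "BTP", or None if unreviewed.
    Returns (decision, reason).
    """
    if observed_days < required_days:
        return False, "shadow window incomplete"
    if any(label is None for label in labels):
        return False, "unreviewed fires remain"
    if not labels:
        return False, "no fires observed"
    fp_rate = labels.count("FP") / len(labels)
    if fp_rate > fp_budget:
        return False, f"FP rate {fp_rate:.2f} exceeds budget {fp_budget:.2f}"
    return True, f"FP rate {fp_rate:.2f} within budget"


print(promotion_decision(["TP", "FP", "TP", "TP", "BTP"], fp_budget=0.25, observed_days=30))
# (True, 'FP rate 0.20 within budget')
```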

Stage 5 — Metrics

Once in production, every fire is labeled by the analyst as TP, FP, or benign-true-positive (the activity happened but is authorized). The labels feed the precision, recall, and MTTD metrics described in section 06. Detections that drift below their SLO route back to Stage 1 — the hypothesis is revisited, or the rule is retired.

Why the loop matters. Detections decay. Adversaries change tradecraft, internal IT changes baselines, and what was high-signal six months ago is noise today. The lifecycle is not a one-time pipeline — it is a maintenance discipline. Treat any detection older than 12 months without a re-validation cycle as suspect by default.

03 / Failure modes

How detections fail in production

Most published detection content reads like every rule works. In practice, detection libraries fail along a small number of predictable axes. Naming them is half the battle — once the failure modes have labels, retrospectives become productive.

Fig. 2: The four failure quadrants of a detection library. Precision is on the x-axis and recall on the y-axis; the quadrants are labeled noisy-and-blind, goldilocks, dead-weight, and tunnel-vision.

1. Noisy & Blind (low precision, low recall)

The classic disaster state. The detection generates volume but the volume is mostly false positives, and the true positives it does catch are a small subset of the technique it claims to cover. This is what raw EDR signatures, off-the-shelf SIEM correlation packs, and over-broad regex detections degrade into within months. The remediation is rarely “tune” — it is usually “retire and rewrite from the hypothesis up.”

2. Tunnel Vision (high precision, low recall)

The detection fires only on a single variant of a technique — one specific tool, one binary hash, one literal string. When the adversary recompiles, renames, or rewrites, the rule is silent. Hash-based and string-based detections almost always live here. They have a place in the library (cheap, precise, high TP rate when they fire), but they cannot be the primary control for a technique.

3. Dead-weight (unmeasured, never fires)

The detection is enabled in the SIEM and has not fired in 90 days. Two possibilities: the technique is genuinely not happening in the environment, or the rule is broken. Without a coverage-validation pass, you cannot distinguish. The default assumption should be “broken” — vendor rule packs ship in this state more often than they ship working.

4. Goldilocks (high precision, high recall)

The state the lifecycle is engineered to produce. Tuned thresholds, validated against multiple TTP variants, with an explicit precision SLO and a known coverage map. The honest goal of a detection program is not that every rule lives here — it is that the rules in the production tier do, and the rules that drift out of this quadrant are caught and retired quickly.

Operational failure modes beyond the quadrant

The 2×2 captures the analytic failures. Two operational failures also deserve names:

  • Coverage drift. The detection itself is fine, but the upstream telemetry stopped flowing — an agent broke, a forwarder filled its disk, a log source was decommissioned and nobody updated the detection inventory. Always pair detections with a heartbeat check on their data source.
  • Provenance loss. The detection fires, the analyst escalates, and nobody can reconstruct why the rule exists or what hypothesis it was built to test. Without versioned hypothesis-to-rule provenance, every triage cycle re-derives context from scratch. This is what the “detections as code” model is designed to prevent.
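
The heartbeat check recommended under coverage drift can be as small as comparing each source's newest event against an allowed silence window. A minimal sketch, with hypothetical source names and a function name chosen for illustration:

```python
from datetime import datetime, timedelta, timezone


def silent_sources(last_seen, max_silence, now):
    """Return data sources that have been quiet for longer than allowed.

    last_seen:   source name -> timestamp of the newest ingested event
    max_silence: source name -> timedelta the source may stay quiet
    A source with no recorded event at all is reported as silent.
    """
    return sorted(
        source
        for source, limit in max_silence.items()
        if source not in last_seen or now - last_seen[source] > limit
    )


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {"edr": now - timedelta(minutes=5), "dns": now - timedelta(hours=3)}
max_silence = {
    "edr": timedelta(minutes=15),
    "dns": timedelta(hours=1),
    "idp": timedelta(hours=1),
}
print(silent_sources(last_seen, max_silence, now))  # ['dns', 'idp']
```

Joining this output against the detection inventory (which rules depend on which source) turns a silent forwarder into a named list of detections that are currently blind.
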
04 / Coverage

Mapping detections to MITRE ATT&CK

ATT&CK is the lingua franca of detection engineering for one reason: it gives the program a coverage map that is independent of any vendor, any tool, and any specific incident. The mapping is also where most programs cut corners.

Fig. 3: Coverage depth radar across ten ATT&CK tactics: initial access, execution, persistence, privilege escalation, defense evasion, credential access, discovery, lateral movement, collection, and C2/exfil.

Map to technique, not to tactic

Claiming “we cover credential access” is meaningless. Claiming “we have validated detections for T1003.001 (LSASS Memory), T1003.002 (Security Account Manager), and T1003.003 (NTDS), and we are blind to T1003.006 (DCSync) because we do not have AD DS replication telemetry” is operationally useful. Every detection in the library carries one or more ATT&CK technique IDs as a tag. The tag is part of the detection-as-code metadata, not a downstream spreadsheet.

Distinguish coverage from depth

A single hash-based detection for T1003.001 is not the same as having multiple complementary detections covering process-handle, memory-region, and event-log access vectors. The coverage map should distinguish:

  • Single-vector coverage — one detection, likely to fail under variant tradecraft.
  • Multi-vector coverage — two or more detections targeting different observables for the same technique.
  • Validated coverage — multi-vector coverage that has been exercised end-to-end by purple-team within the last 90 days.
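
The three tiers above can be derived from detection metadata rather than maintained by hand. The sketch below assumes each detection records the observable it targets and the date of its last purple-team validation; the key names, the "blind spot" label for zero detections, and the reading of "validated" as at least two distinct observables validated inside the window are all assumptions.

```python
from datetime import date, timedelta


def coverage_tier(detections, today, validation_window_days=90):
    """Classify coverage for a single technique.

    detections: list of dicts with "observable" (str) and
                "last_validated" (date, or None if never validated).
    """
    observables = {d["observable"] for d in detections}
    if not observables:
        return "blind spot"
    if len(observables) < 2:
        return "single-vector"
    cutoff = today - timedelta(days=validation_window_days)
    validated = {
        d["observable"]
        for d in detections
        if d["last_validated"] is not None and d["last_validated"] >= cutoff
    }
    return "validated" if len(validated) >= 2 else "multi-vector"


today = date(2024, 6, 30)
t1003_001 = [
    {"observable": "process-handle", "last_validated": date(2024, 5, 15)},
    {"observable": "memory-region", "last_validated": date(2023, 12, 1)},
]
print(coverage_tier(t1003_001, today))  # multi-vector
```

In this example the second detection's validation has aged out of the 90-day window, so the technique drops from validated to multi-vector without any rule changing.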

The internal coverage view on the HackForLab platform exposes these tiers explicitly — see the MITRE coverage map for the public-facing version.

Treat sub-techniques as the unit of work

Mapping at the technique level (T1003) without naming the sub-technique (T1003.001 LSASS Memory) is the most common source of false-confidence coverage maps. The tactic is too coarse; the technique is often too coarse; the sub-technique is usually the right granularity. When a technique has no sub-technique, the technique itself is the unit.

The “we cover ATT&CK” trap

A coverage map of all green is almost certainly wrong. Enterprise ATT&CK lists roughly 200 techniques and more than 400 sub-techniques. A realistic mature program achieves validated, multi-vector coverage for the 40-60 techniques most relevant to its threat model; the rest are partial, single-vector, or honest blind spots. Honesty about blind spots is more valuable than a wall of green that does not survive a real intrusion.

05 / Portability

Tool-agnostic vs SIEM-specific detections

Detection logic written natively in one SIEM’s query language is locked to that SIEM. Re-platforming — a common event over a five-year horizon — means rewriting the entire library. Programs that have lived through one migration almost always converge on an intermediate-format approach.

The intermediate format

Detection logic is authored in a tool-agnostic intermediate format (commonly Sigma, a custom YAML schema, or structured pseudocode) and compiled to the specific backend (KQL, SPL, ESQL, EQL, OQL) at deploy time. The intermediate captures the hypothesis, ATT&CK mapping, data source, logic, and metadata. The backend-specific output is a build artifact, not source.

This produces three concrete benefits:

  • Portability. A migration becomes a recompile, not a rewrite.
  • Review focus. Pull-request review concentrates on logic and coverage, not on query-language syntax.
  • Auditability. One canonical source of truth per detection, regardless of how many SIEMs it deploys to.
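
To show what "a recompile, not a rewrite" means mechanically, here is a toy compiler from one intermediate rule to two backend strings. This is not Sigma and not any real converter: the rule layout, the table and index names in `BACKEND_SOURCES`, and the function name are invented for illustration, only field-equals-value conditions are handled, and no value escaping is performed.

```python
# Hypothetical intermediate rule: equality conditions only.
RULE = {
    "title": "LSASS memory read",
    "attack": ["T1003.001"],
    "source": "process_access",
    "where": {"TargetProcess": "lsass.exe", "GrantedAccess": "0x1410"},
}

# Hypothetical mapping from logical source to backend table or index.
BACKEND_SOURCES = {
    "kql": {"process_access": "ProcessAccessEvents"},
    "spl": {"process_access": "index=endpoint sourcetype=process_access"},
}


def compile_rule(rule, backend):
    """Render the rule's conditions in KQL-style or SPL-style pseudo-syntax."""
    source = BACKEND_SOURCES[backend][rule["source"]]
    if backend == "kql":
        conditions = " and ".join(f'{k} == "{v}"' for k, v in rule["where"].items())
        return f"{source} | where {conditions}"
    if backend == "spl":
        conditions = " ".join(f'{k}="{v}"' for k, v in rule["where"].items())
        return f"{source} {conditions}"
    raise ValueError(f"unsupported backend: {backend}")


print(compile_rule(RULE, "kql"))
print(compile_rule(RULE, "spl"))
```

The point of the sketch is the shape: `RULE` is the reviewed source, and both printed strings are disposable build outputs.
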
06 / Metrics

Metrics that matter: precision, recall, alert fatigue

Detection metrics fail when they measure activity rather than outcomes. “Number of rules deployed” is a vanity metric. “Mean precision across the production tier, weighted by alert volume” is an operating metric. The discipline is built on the latter.
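
A short worked example shows why the volume weighting matters. The numbers below are synthetic, and the function name is an assumption.

```python
def weighted_mean_precision(rules):
    """Mean precision weighted by alert volume.

    rules: list of (precision, alert_volume) pairs, one per production rule.
    """
    total_volume = sum(volume for _, volume in rules)
    if total_volume == 0:
        raise ValueError("no alerts in the period")
    return sum(p * volume for p, volume in rules) / total_volume


# Two precise, quiet rules and one noisy, high-volume rule (synthetic data).
tier = [(0.95, 10), (0.90, 20), (0.40, 170)]
unweighted = sum(p for p, _ in tier) / len(tier)
print(round(unweighted, 2), round(weighted_mean_precision(tier), 2))
```

The unweighted mean comes out at 0.75, while the weighted figure is just under 0.48: the analyst queue is dominated by the one noisy rule, and only the weighted number reflects that.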

Precision

Precision is true positives divided by true positives plus false positives: of the alerts this detection generated, what fraction were real. Precision is what determines whether analysts trust the rule. A detection with precision below 0.30 is, in practice, worse than a coin flip, and the SOC will start ignoring it within two weeks. The HackForLab production tier targets precision ≥ 0.80 per rule; rules between 0.50 and 0.80 live in a “review” tier and are not promoted until tuned; below 0.50 they are retired.
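
The precision formula and the tier boundaries quoted above translate directly into code. A minimal sketch with assumed function names; how benign-true-positive labels are counted is left to program policy and is not modeled here.

```python
def precision(tp, fp):
    """TP / (TP + FP) over labeled alerts."""
    total = tp + fp
    if total == 0:
        raise ValueError("no labeled alerts")
    return tp / total


def precision_tier(value):
    """Map a precision value onto the tiers described in the text."""
    if value >= 0.80:
        return "production"
    if value >= 0.50:
        return "review"
    return "retire"


print(precision_tier(precision(tp=8, fp=2)))  # production
print(precision_tier(precision(tp=3, fp=3)))  # review
print(precision_tier(precision(tp=1, fp=3)))  # retire
```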

Recall

Recall is true positives divided by true positives plus false negatives: of all the real instances of the technique, what fraction did the detection catch. Recall is harder to measure than precision because false negatives are, by definition, unobserved. The standard approximation is purple-team validation: the red team executes N realistic variants of the technique, and recall is measured as fires-divided-by-N. Recall measured against synthetic Atomic Red Team test cases tends to be optimistic; recall measured against an unscripted red-team campaign is the number that survives.
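
The fires-divided-by-N approximation can be computed from a purple-team result sheet, and the list of missed variants is usually more useful than the number itself. The variant labels below are illustrative examples, not a result from any real exercise.

```python
def purple_team_recall(results):
    """Approximate recall from a purple-team exercise.

    results: variant name -> True if the detection fired for that variant.
    Returns (recall, sorted list of missed variants).
    """
    if not results:
        raise ValueError("no variants executed")
    fired = sum(1 for hit in results.values() if hit)
    missed = sorted(name for name, hit in results.items() if not hit)
    return fired / len(results), missed


# Illustrative variants of one technique (synthetic outcome).
exercise = {
    "dump via task manager": True,
    "dump via comsvcs.dll": True,
    "dump via procdump": True,
    "custom dumper with renamed binary": False,
}
print(purple_team_recall(exercise))
# (0.75, ['custom dumper with renamed binary'])
```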

Mean time to detect (MTTD)

MTTD is the time from adversary action to alert raised. It decomposes into telemetry latency (sensor to data lake), query latency (data lake to rule evaluation), and alert latency (rule to analyst queue). Each component is independently measured. MTTD targets vary by detection: a credential-dumping rule needs sub-15-minute MTTD; a slow-burn beaconing rule may tolerate hours. Setting one global MTTD target across the library is a category error.
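
The three-way decomposition can be measured per alert if four timestamps are recorded. The sketch below assumes those timestamps are available (the adversary action time typically comes from a purple-team log) and uses hypothetical parameter names; averaging the totals across alerts gives the MTTD for a rule.

```python
from datetime import datetime, timedelta


def detection_latency(action, landed, evaluated, queued):
    """Split time-to-detect for one alert into its three components.

    action:    when the adversary action occurred
    landed:    when the event reached the data lake
    evaluated: when the rule evaluated the event
    queued:    when the alert reached the analyst queue
    """
    return {
        "telemetry": landed - action,
        "query": evaluated - landed,
        "alert": queued - evaluated,
        "total": queued - action,
    }


t = datetime(2024, 1, 1, 12, 0, 0)
parts = detection_latency(
    action=t,
    landed=t + timedelta(minutes=4),
    evaluated=t + timedelta(minutes=9),
    queued=t + timedelta(minutes=10),
)
slo = timedelta(minutes=15)
print(parts["total"], parts["total"] <= slo)  # 0:10:00 True
```

Keeping the components separate shows where to spend effort: in this synthetic example the rule schedule (five minutes) costs more than sensor-to-lake transport (four minutes).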

Alert fatigue

Alert fatigue is not a metric; it is a system-level outcome. The metric that proxies for it is alerts per analyst hour, measured per shift. The literature converges on a sustainable rate of roughly 8-15 actionable alerts per analyst per shift (about 1-2 per hour on an 8-hour shift); above that, triage quality degrades sharply. When the rate exceeds the threshold, the response is never “hire more analysts” first; it is “audit the precision tier and retire the bottom decile.” Detection engineering owns this number as much as the SOC does.
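
The proxy metric is a single division, but writing it down removes ambiguity about the denominator. A minimal sketch, assuming an 8-hour shift and synthetic alert counts; the function name is hypothetical.

```python
def alerts_per_analyst_hour(actionable_alerts, analysts, shift_hours):
    """Actionable alerts divided by total analyst hours in the shift."""
    analyst_hours = analysts * shift_hours
    if analyst_hours <= 0:
        raise ValueError("analyst hours must be positive")
    return actionable_alerts / analyst_hours


# Synthetic shift: 4 analysts, 8 hours each.
quiet = alerts_per_analyst_hour(40, analysts=4, shift_hours=8)
busy = alerts_per_analyst_hour(96, analysts=4, shift_hours=8)
print(quiet, busy)  # 1.25 3.0
```

The busy shift works out to 24 actionable alerts per analyst, well above the 8-15 range quoted above, which is the cue to audit the lowest-precision rules.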

Metric                  | What it measures                       | Target (production tier) | Failure threshold
------------------------|----------------------------------------|--------------------------|-----------------------------------
Precision               | TP / (TP + FP)                         | ≥ 0.80                   | < 0.50 over 30 days
Recall (purple-team)    | TP / (TP + FN) against N variants      | ≥ 0.70                   | < 0.40 against current tradecraft
MTTD                    | Adversary action → alert               | Detection-specific SLO   | 2× SLO sustained
Alerts per analyst hour | SOC load                               | 1-2                      | > 3 sustained
Coverage delta          | Validated multi-vector technique count | +2 / quarter             | Flat for 2 quarters

The metric that gets gamed

Any metric that rewards detection count gets gamed. Rules get split, near-duplicates get shipped, and the library inflates. The single metric most resistant to gaming is retirement rate — how many detections were honestly retired this quarter. A program retiring zero detections is not running the lifecycle; a program retiring 15-25% of the library annually is doing the work.

07 / Field notes

Six things production detection programs get wrong

Compiled from the post-mortems of three detection-engineering rebuilds. None of these are theoretical; all of them have eaten quarter-long remediation projects.

1. Shipping detections without a hypothesis statement

If the detection’s pull request cannot answer “what adversary behavior, in which population, with which telemetry source,” it does not get reviewed. This single gate filters out roughly a third of proposed rules and saves the program from the long tail of “we built it because someone asked.”

2. Tuning by raising thresholds

When a noisy rule is “tuned” by raising a count or score threshold, recall silently degrades while the FP rate appears to improve. The right tuning move is usually to add a discriminating condition (population, baseline, parent process, time-of-day) rather than to raise the threshold. Threshold-only tuning is the leading cause of high-precision low-recall rules.
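
The trade-off can be seen on a small synthetic dataset. Everything below is invented for illustration: the `score` and `baselined` fields, the thresholds, and the nine events. The pattern is what matters: raising the threshold buys precision by dropping true positives, while the added condition removes false positives only.

```python
# Synthetic labeled events: (score, matches known baseline, truly malicious).
EVENTS = [
    {"score": 12, "baselined": False, "malicious": True},
    {"score": 6,  "baselined": False, "malicious": True},
    {"score": 7,  "baselined": False, "malicious": True},
    {"score": 11, "baselined": False, "malicious": True},
    {"score": 15, "baselined": True,  "malicious": False},
    {"score": 8,  "baselined": True,  "malicious": False},
    {"score": 6,  "baselined": True,  "malicious": False},
    {"score": 9,  "baselined": True,  "malicious": False},
    {"score": 5,  "baselined": False, "malicious": False},
]


def evaluate(rule, events):
    """Return (precision, recall) of a rule over labeled events, rounded."""
    fired = [e for e in events if rule(e)]
    tp = sum(1 for e in fired if e["malicious"])
    total_malicious = sum(1 for e in events if e["malicious"])
    precision = tp / len(fired) if fired else 0.0
    recall = tp / total_malicious if total_malicious else 0.0
    return round(precision, 2), round(recall, 2)


original = evaluate(lambda e: e["score"] >= 5, EVENTS)
raised = evaluate(lambda e: e["score"] >= 10, EVENTS)
conditioned = evaluate(lambda e: e["score"] >= 5 and not e["baselined"], EVENTS)
print(original, raised, conditioned)
# (0.44, 1.0) (0.67, 0.5) (0.8, 1.0)
```

On this data, the raised threshold loses half of the malicious events, while the baseline exclusion keeps all of them and still reaches higher precision.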

3. Treating EDR vendor rules as detection content

Vendor rules are inputs to the detection program, not outputs of it. They are unowned, unversioned, and untestable against your specific baseline. The right pattern is to ingest vendor alerts as telemetry and build owned detections that consume them as one signal among several.

4. Building correlation before single-event detection

Multi-event correlation is expensive to author, expensive to validate, and expensive to debug. The instinct to start with correlation is almost always wrong. Build the single-event detection first, prove it in production, then layer correlation only where single-event recall is provably insufficient.

5. Coverage maps that do not name blind spots

An ATT&CK coverage map that lists only the green is half a map. The blind spots (sub-techniques the program cannot detect because the telemetry does not exist) are what drive the data-engineering roadmap. Without an explicit blind-spot list, telemetry investment is reactive rather than strategic.

6. Owning detections without owning data

The detection engineer who does not own the telemetry pipeline is dependent on a team that does not feel the cost of broken data. When an agent silently stops shipping, the detection engineer is the second person to find out. Either fold telemetry ownership into the same team, or build heartbeat checks that fail loud.

PLATFORM

The HackForLab detection workbench

Hypothesis-to-production tooling for the lifecycle described on this page: tool-agnostic authoring, ATT&CK-tagged metadata, purple-team validation harness, precision SLOs per rule, and a coverage map that distinguishes single-vector from validated multi-vector coverage. No vendor lock-in to a single SIEM — detections compile to KQL, SPL, ESQL, and EQL from one canonical source.

08 / FAQ

Frequently asked questions

Is detection engineering a separate team, or a function of the SOC?
In organizations under roughly 50 security staff it is usually a function, often owned by a senior detection engineer or a small two-to-three-person pod that interfaces with both the SOC and threat intelligence. Above that headcount, separating it into its own team with its own backlog, on-call rotation, and SLOs is almost always the right move — the cadences (engineering vs operations) diverge enough that they compete for time on a shared team.
What is the minimum tooling required to start?
Git for version control, a CI runner for unit tests, a synthetic event corpus (Atomic Red Team is the standard starting point), a SIEM or data platform to deploy against, and a labeling workflow for analysts to mark TP/FP on alerts. Everything else — detection-as-code frameworks, Sigma converters, validation harnesses — is value-additive but not required for day one.
How long does a single detection take from hypothesis to production?
A well-scoped single-event detection with available telemetry: roughly 1-3 engineering days plus a 30-day shadow-mode validation window. A multi-stage correlation detection or one requiring new data sources: 2-6 weeks elapsed. The validation window is the part programs are most tempted to compress, and the part that most strongly correlates with downstream alert quality.
Should we use Sigma, or roll our own intermediate format?
Start with Sigma. It has the largest community library, mature converters for the major SIEMs, and it has been battle-tested. Roll your own only when Sigma’s expressiveness becomes a constraint — usually for sequence detections, complex joins, or strongly-typed field schemas. Many mature programs end up running Sigma plus a small custom YAML schema for the cases Sigma does not cover. That is fine.
How do we measure recall when we do not know what we missed?
You approximate. The two most useful approximations are: (1) purple-team validation against a defined set of technique variants — recall is fires-divided-by-variants; (2) incident-derived recall — after every confirmed intrusion, every detection that should have fired but did not is counted as a false negative against its technique. Both approximations are imperfect; both are dramatically better than not measuring at all.
Do we still need detections if we have an EDR with built-in MDR?
Yes, for two reasons. First, an EDR-bundled MDR service detects primarily against your endpoints; it typically does not detect against your identity provider, your cloud control plane, your SaaS audit logs, your network egress, or your CI/CD systems. The detection-engineering surface area extends far beyond endpoint. Second, MDR alerts are inputs to your incident process; owning your own detections gives you observability into the MDR’s own coverage gaps. Trust but verify.
How is detection engineering related to threat hunting?
Threat hunting is the unscheduled, hypothesis-driven search for adversary activity that current detections do not cover. The output of a productive hunt is one of two things: a confirmed intrusion (handed to IR) or a new detection hypothesis (handed to detection engineering). In a mature program the hunt-to-detection conversion rate is one of the more useful long-tail metrics — hunts that never produce detections are research; hunts that produce detections compound program value.
What is the right reporting cadence for detection metrics?
Weekly for operational metrics (alert volume, precision-tier breaches, MTTD outliers), monthly for tier movement (promotion, demotion, retirement), and quarterly for strategic metrics (coverage delta against the threat model, blind-spot list, telemetry investment ROI). The weekly cadence catches drift; the quarterly cadence drives the roadmap. Skipping either creates a program that either firefights without strategy or strategizes without operating.