Layer 2 – Classification Views

View 4 – Evaluation-Method View

Purpose

The Evaluation-Method View classifies Layer 2 fault modes by the kind of evaluation needed to detect, measure, reproduce, or compare them.

This view answers:

What test, oracle, trace, comparison, or measurement would reveal this fault?

Layer 0 sits upstream of this view as the interface substrate: ambiguity, context dependence, pragmatic inference, discourse history, and social framing shape the scenarios being evaluated, but they are not themselves Layer 2 faults.

This view is useful when designing:

  • evaluation harnesses;
  • regression suites;
  • release gates;
  • monitoring checks;
  • red-team suites;
  • product-quality reviews;
  • incident-analysis workflows;
  • Layer 2 to Layer 3 control mappings.

This view does not replace the fault inventory; it is a secondary view over it.

A single Layer 2 fault can require several evaluation methods. A single evaluation method can detect several fault modes.

Layer 2 faults arise downstream of Layer 1A mechanisms, Layer 1B learned behaviors, and Layer 1C AI-system-level causal features.

Core distinction

Layer 2 fault mode:

The recurring behavioral failure pattern.

Evaluation method:

The procedure used to detect, measure, reproduce, or compare that behavioral failure pattern.

Example:

Layer 2 fault:
  Prompt-form sensitivity

Evaluation method:
  Prompt perturbation / paraphrase testing

Evaluation question:
  Do semantically equivalent prompts preserve the intended behavior?

The same fault may need several methods:

Fault:
  Weak grounding / source infidelity

Possible evaluation methods:
  - grounding and citation evaluation
  - context ablation / insertion testing
  - human-review rubric
  - regression / diff testing

The same method may detect several faults:

Method:
  Repeated-run testing

Can detect:
  - output variance
  - tail-risk generation
  - unstable refusal behavior
  - unstable tool calls
  - unstable citation behavior
  - rare unsafe outputs

Evaluation methods

Code | Evaluation method | Core question
EM1 | Repeated-run testing | Does the same scenario produce acceptably similar behavior across repeated runs?
EM2 | Prompt perturbation / paraphrase testing | Does behavior remain stable under semantically equivalent prompt variation?
EM3 | Context ablation / insertion testing | Does behavior change appropriately when relevant context is removed, added, reordered, buried, or diluted?
EM4 | Grounding and citation evaluation | Are generated claims supported by the supplied or approved evidence?
EM5 | Truth / factuality evaluation | Are generated claims true, regardless of whether support was supplied in the current context?
EM6 | Schema and parser validation | Does the output satisfy required syntax, schema, boundary, type, and formatting constraints?
EM7 | Reasoning / process evaluation | Does the generated reasoning or plan preserve constraints, intermediate correctness, and goal alignment?
EM8 | Agent trace evaluation | Are tool choices, tool arguments, intermediate steps, recovery behavior, and actions correct?
EM9 | Calibration evaluation | Does expressed confidence or uncertainty track correctness?
EM10 | Safety and policy adversarial testing | Does the system preserve required safety, compliance, refusal, escalation, and authorization behavior?
EM11 | Stress / budget testing | Does behavior degrade under long context, limited budget, latency pressure, truncation, or task complexity?
EM12 | Distributional slice testing | Does performance hold across domains, languages, formats, edge cases, and rare task patterns?
EM13 | Regression / diff testing | Did a model, prompt, retrieval, policy, schema, data, or tool change introduce new failures?
EM14 | Human-review / rubric evaluation | Does the output satisfy task-specific quality criteria that cannot be fully captured by deterministic checks?
EM15 | Production monitoring / drift evaluation | Does deployed behavior remain within acceptable bounds over time?

EM1. Repeated-run testing

Purpose

Detect unacceptable behavioral variance across repeated executions of the same scenario.

This method evaluates whether behavior is stable enough across repeated runs, not whether the text is identical.

Evaluation question

Does the same scenario produce acceptably similar behavior across repeated runs?

Best suited for

  • behavioral instability;
  • true non-determinism / repeatability variance;
  • output variance;
  • tail-risk generation;
  • unstable refusal or escalation behavior;
  • unstable tool calls;
  • unstable citations or evidence use;
  • inconsistent confidence expression;
  • rare catastrophic outputs.

Typical setup

Run the same scenario multiple times while holding the following constant as much as possible:

  • user-visible input;
  • system instructions;
  • developer instructions;
  • model version;
  • model parameters;
  • retrieval corpus;
  • retrieved context;
  • tool availability;
  • tool outputs, if replayable;
  • runtime configuration;
  • conversation state.

Then compare outputs at the level of behavioral equivalence, not exact text.

Observable signals

  • materially different final answers;
  • different classifications;
  • different escalation decisions;
  • different risk levels;
  • different tool calls;
  • different tool arguments;
  • different citations;
  • different evidence used;
  • different external actions;
  • rare unsafe or invalid outputs;
  • rare schema failure;
  • rare hallucinated claim.

Oracle type

Behavioral-equivalence oracle.

The oracle should define which differences are acceptable, material, and critical.

Examples:

Acceptable:
  Different wording with same facts, decision, citations, and action.

Material:
  Different escalation decision.

Critical:
  Different irreversible external action.

Important boundary

Repeated-run testing is not asking whether every token is identical.

It asks whether materially relevant behavior is preserved.

Exact-text equality is usually too strict for natural-language generation and too weak for behavioral safety.

Two outputs can be worded differently while preserving the same behavior. Two outputs can look similar while differing in a critical decision, citation, tool call, or action.
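
A behavioral-equivalence oracle for repeated runs could be sketched as follows. This is a minimal illustration, not a prescribed implementation: the record fields (decision, citations, external_action) and the rule that any difference in external action counts as critical are assumptions made for the example, and a real oracle would define them per task.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical run record; the field names are illustrative assumptions.
@dataclass(frozen=True)
class RunResult:
    text: str
    decision: str
    citations: frozenset
    external_action: Optional[str] = None

SEVERITY = {"acceptable": 0, "material": 1, "critical": 2}

def classify_difference(baseline, other):
    """Compare behavior, not wording: the text field is deliberately ignored."""
    if baseline.external_action != other.external_action:
        return "critical"
    if baseline.decision != other.decision or baseline.citations != other.citations:
        return "material"
    return "acceptable"

def worst_difference(runs):
    """Compare every run against the first run and return the worst class seen."""
    baseline = runs[0]
    labels = [classify_difference(baseline, run) for run in runs[1:]]
    return max(labels, key=SEVERITY.__getitem__, default="acceptable")

runs = [
    RunResult("Escalate: confirmed outage.", "escalate", frozenset({"policy-7"})),
    RunResult("This should be escalated (outage).", "escalate", frozenset({"policy-7"})),
    RunResult("No escalation is needed.", "no_escalate", frozenset({"policy-7"})),
]
print(worst_difference(runs))
```

Here the first two runs differ only in wording, while the third changes the escalation decision, so the suite is flagged at the material level even though no text comparison was made.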

EM2. Prompt perturbation / paraphrase testing

Purpose

Detect behavioral fragility under reasonable input variation.

This method tests whether the model overreacts to semantically irrelevant changes in wording, ordering, formatting, framing, examples, or tone.

Evaluation question

Do semantically equivalent or operationally equivalent prompts preserve intended behavior?

Best suited for

  • prompt-form sensitivity;
  • behavioral fragility;
  • task misinduction;
  • task blending;
  • scope misinterpretation;
  • constraint misclassification;
  • example overgeneralization;
  • example underuse;
  • inconsistent refusal behavior;
  • inconsistent escalation behavior;
  • inconsistent tool use;
  • brittle policy application.

Typical setup

Create prompt variants that preserve the intended task but vary one or more of the following:

  • wording;
  • synonym choice;
  • instruction order;
  • formatting;
  • headings;
  • delimiters;
  • directness;
  • tone;
  • examples;
  • role framing;
  • context placement;
  • degree of verbosity;
  • conversation preamble.

Example:

Variant A:
  Summarize this issue and say if it should be escalated.

Variant B:
  Briefly summarize the customer problem and decide whether escalation is needed.

A robust system should preserve the materially relevant behavior if the operational task is the same.

Observable signals

  • changed decision;
  • changed classification;
  • changed risk level;
  • changed task interpretation;
  • changed output format;
  • changed refusal behavior;
  • changed escalation behavior;
  • changed tool use;
  • changed source use;
  • changed action recommendation;
  • changed level of caution without justification.

Oracle type

Behavioral-equivalence oracle.

The oracle should define what must remain invariant across prompt variants.

For example:

Must remain invariant:
  - escalation decision
  - severity classification
  - cited policy
  - required tool call

May vary:
  - wording
  - sentence order
  - minor formatting
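
The invariant list above can be checked mechanically once outputs are reduced to structured fields. The sketch below assumes that each prompt variant's output has already been parsed into a dictionary; the field names mirror the example and are otherwise hypothetical.

```python
# Fields that must stay the same across prompt variants (illustrative names).
INVARIANT_FIELDS = ("escalation_decision", "severity", "cited_policy", "tool_call")

def invariant_violations(outputs):
    """Return each invariant field that took more than one value, with the values seen.

    outputs maps a variant name to its parsed output dictionary.
    Fields outside INVARIANT_FIELDS (such as wording) are allowed to vary.
    """
    violations = {}
    for field in INVARIANT_FIELDS:
        values = {output.get(field) for output in outputs.values()}
        if len(values) > 1:
            violations[field] = values
    return violations

outputs = {
    "variant_a": {"escalation_decision": "escalate", "severity": "high",
                  "cited_policy": "P-12", "tool_call": "create_ticket",
                  "wording": "Escalate this ticket."},
    "variant_b": {"escalation_decision": "no_escalate", "severity": "high",
                  "cited_policy": "P-12", "tool_call": "create_ticket",
                  "wording": "No escalation required."},
}
print(invariant_violations(outputs))
```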

Important boundary

Prompt perturbation should vary irrelevant or reasonable surface form.

It should not silently change the actual task, policy, evidence, risk level, user intent, or required action.

If the prompt variant changes the underlying task, then changed behavior is not necessarily a fault.

EM3. Context ablation / insertion testing

Purpose

Detect whether the model uses runtime context appropriately.

This method evaluates whether behavior changes for the right reasons when evidence is removed, added, reordered, buried, diluted, contradicted, or made stale.

Evaluation questions

Does the model fail when required context is removed?

Does the model improve when relevant context is added?

Does irrelevant or misleading context distort the answer?

Does the model prioritize the right source when context conflicts?

Best suited for

  • context omission;
  • context underutilization;
  • context priority confusion;
  • continuity loss;
  • stale-state reliance;
  • distractor assimilation;
  • source / authority confusion;
  • parametric-prior override;
  • retrieval-conditioned answer failure;
  • weak grounding / source infidelity.

Typical setup

Evaluate variants of the same scenario with controlled context changes:

  • required evidence absent;
  • required evidence present;
  • evidence buried in the middle;
  • evidence placed near the answer point;
  • evidence split across chunks;
  • conflicting evidence;
  • irrelevant distractor chunks;
  • stale context;
  • low-authority source versus high-authority source;
  • summarized context versus verbatim context;
  • noisy retrieval set versus clean retrieval set.

Observable signals

  • ignores present evidence;
  • answers from prior knowledge despite supplied evidence;
  • follows irrelevant context;
  • cites weak source over strong source;
  • cites a document while missing the key span;
  • changes answer for the wrong reason;
  • fails to abstain when evidence is absent;
  • fails to update when governing evidence is inserted;
  • overweights stale context;
  • mixes retrieved facts with unsupported assumptions.

Oracle type

Evidence-sensitivity oracle.

The oracle should specify how the answer should change as evidence changes.

Examples:

If governing policy is absent:
  The system should abstain or say evidence is insufficient.

If governing policy is present:
  The system should answer according to that policy.

If irrelevant distractor evidence is added:
  The material answer should not change.
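
One way to encode such an oracle is as a table of expected behavior per context condition. The following is a sketch under assumptions: the condition names and behavior labels are invented for illustration, and the observed labels are assumed to come from a separate grading step.

```python
# Expected behavior for each controlled context condition (hypothetical labels).
EXPECTED = {
    "policy_absent": "abstain",
    "policy_present": "answer_per_policy",
    "policy_present_plus_distractor": "answer_per_policy",
}

def evidence_sensitivity_failures(observed):
    """Return the conditions where observed behavior differs from the expectation.

    A condition missing from observed is treated as a failure.
    """
    return [condition for condition, expected in EXPECTED.items()
            if observed.get(condition) != expected]

observed = {
    "policy_absent": "answer_from_prior_knowledge",
    "policy_present": "answer_per_policy",
    "policy_present_plus_distractor": "answer_per_policy",
}
print(evidence_sensitivity_failures(observed))
```

In this example the system answers from prior knowledge when the governing policy is removed, which is exactly the case where it should have abstained.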

Important boundary

This method detects Layer 2 context-use behavior.

Whether the retriever failed to fetch the required context is a Layer 3 question.

Layer 2 asks:

Did the model use, ignore, overuse, or misprioritize the context it received?

Layer 3 asks:

Why did the system supply, omit, order, chunk, retrieve, or validate that context in that way?

EM4. Grounding and citation evaluation

Purpose

Detect whether generated claims are supported by supplied or approved evidence.

This method evaluates source fidelity, not general truth.

Evaluation question

Does each material claim trace to evidence that actually supports it?

Best suited for

  • unsupported assertion;
  • non-grounded justification;
  • fabricated citation or source;
  • evidence-claim mismatch;
  • source infidelity;
  • weak grounding;
  • retrieval-conditioned answer failure;
  • parametric-prior override;
  • answer unsupported by approved sources.

Typical setup

Extract material claims from the output and compare them against:

  • supplied documents;
  • retrieved chunks;
  • tool outputs;
  • approved sources;
  • cited passages;
  • structured records;
  • governing policy text;
  • source metadata.

The evaluation should distinguish at least four cases:

Supported:
  The cited or supplied evidence entails the claim.

Unsupported:
  The claim may be true, but the allowed evidence does not support it.

Contradicted:
  The allowed evidence conflicts with the claim.

Unverifiable:
  The allowed evidence is insufficient to determine support.

Observable signals

  • claim has no source;
  • cited source does not contain the claim;
  • cited source contradicts the claim;
  • cited source supports a weaker claim;
  • citation is invented;
  • citation is malformed;
  • explanation does not entail conclusion;
  • retrieved fact is mixed with model assumption;
  • answer uses background knowledge when task requires supplied-source grounding.

Oracle type

Evidence-entailment oracle.

Possible implementations:

  • human evidence review;
  • expert review;
  • claim extraction plus source matching;
  • natural-language inference model;
  • citation validator;
  • deterministic lookup when source is structured;
  • hybrid automated and human review.
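
As an illustration of the citation-validator option, the sketch below flags claims with no source, citations to sources that were never supplied, and quoted spans that do not appear in the cited source. The claim and source structures are assumptions made for the example. Note that a matching quote shows only that the span exists; it does not establish that the span entails the claim, so this check complements entailment review rather than replacing it.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting does not break matching."""
    return " ".join(text.lower().split())

def validate_citations(claims, sources):
    """Return (claim text, finding) pairs for citation problems.

    claims: list of dicts with 'text', 'source_id', and 'quote' (hypothetical shape).
    sources: dict mapping source id to the supplied source text.
    """
    findings = []
    for claim in claims:
        source_id = claim.get("source_id")
        if source_id is None:
            findings.append((claim["text"], "no_source"))
        elif source_id not in sources:
            findings.append((claim["text"], "invented_citation"))
        elif normalize(claim["quote"]) not in normalize(sources[source_id]):
            findings.append((claim["text"], "quote_not_in_source"))
    return findings

sources = {"doc-1": "Refunds are available within 30 days of purchase."}
claims = [
    {"text": "Refunds are allowed for 30 days.", "source_id": "doc-1",
     "quote": "refunds are available within 30 days"},
    {"text": "Refunds include shipping.", "source_id": "doc-9",
     "quote": "shipping is refunded"},
    {"text": "Gift cards are refundable.", "source_id": None, "quote": ""},
    {"text": "Refunds take 5 days.", "source_id": "doc-1",
     "quote": "processed within 5 days"},
]
for finding in validate_citations(claims, sources):
    print(finding)
```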

Important boundary

Grounding is not the same as truth.

A claim can be true but unsupported by the supplied evidence.

A claim can also be supported by a bad or outdated source but false in the world.

Grounding evaluation asks:

Is the claim supported by the evidence the system was allowed to use?

Truth evaluation asks:

Is the claim actually correct?

EM5. Truth / factuality evaluation

Purpose

Detect whether generated factual claims are true.

This method evaluates correspondence with reality or a trusted reference, regardless of whether support was supplied in the current context.

Evaluation question

Is the generated claim factually correct?

Best suited for

  • plausibility-truth gap;
  • fluent false answer;
  • common misconception reproduction;
  • stale latent knowledge;
  • approximate or outdated knowledge;
  • entity confusion;
  • false premise continuation;
  • overgeneralization from familiar patterns;
  • invented details;
  • domain-specific factual failure.

Typical setup

Compare claims against trusted references, such as:

  • gold labels;
  • authoritative databases;
  • official sources;
  • deterministic calculators;
  • verified tools;
  • expert review;
  • current records;
  • controlled answer keys;
  • formal tests;
  • task-specific adjudication.

Observable signals

  • false factual statement;
  • outdated statement;
  • incorrect date;
  • incorrect number;
  • incorrect name;
  • incorrect entity relationship;
  • invented detail;
  • false premise accepted;
  • plausible but wrong explanation;
  • unsupported generalization presented as fact.

Oracle type

Truth oracle.

Possible forms:

  • reference answer;
  • expert oracle;
  • tool-backed oracle;
  • database lookup;
  • unit test;
  • formal verifier;
  • adjudicated label.

Important boundary

Truth evaluation and grounding evaluation are separate.

Truth asks:

Is this claim correct?

Grounding asks:

Is this claim supported by the evidence the system was allowed to use?

Some tasks need both.

Example:

The answer may be factually true,
but invalid for a RAG assistant if it is not supported by the retrieved documents.

EM6. Schema and parser validation

Purpose

Detect whether output satisfies required structural contracts.

This method is appropriate when the target property is formal, syntactic, typed, or exactly checkable.

Evaluation question

Does the output conform to the required syntax, schema, boundary, and type constraints?

Best suited for

  • output-format drift;
  • structured output drift;
  • boundary and stopping error;
  • exact-string corruption;
  • numeric or symbolic fragility;
  • malformed tool arguments;
  • field omission;
  • enum mismatch;
  • extra commentary;
  • escaping errors;
  • invalid serialization.

Typical setup

Run deterministic validators over the output:

  • JSON parser;
  • XML parser;
  • YAML parser;
  • regex;
  • schema validator;
  • type checker;
  • SQL parser;
  • enum checker;
  • exact-string checker;
  • checksum;
  • unit test;
  • field-completeness check;
  • tool-call payload validator;
  • API contract validator.

Observable signals

  • invalid JSON;
  • invalid XML;
  • missing fields;
  • wrong field names;
  • wrong data types;
  • invalid enum value;
  • extra commentary;
  • broken escaping;
  • malformed identifiers;
  • corrupted IDs;
  • truncated output;
  • duplicate fields;
  • wrong nesting;
  • invalid tool-call payload.

Oracle type

Deterministic parser or validator.

This is often cheaper and more reliable than model-based or human review when the target property is formal.

Important boundary

Schema validity is not semantic correctness.

An output can parse successfully while containing wrong values.

Example:

{
  "escalate": false,
  "reason": "No urgency detected"
}

This may be valid JSON and still be wrong if the source ticket clearly requires escalation.
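
A deterministic structural check for a payload like the one above might look like the following sketch. It uses only the Python standard library; the required-field table is an assumption derived from the example, not a defined contract.

```python
import json

# Assumed contract for the example payload: field name -> required type.
REQUIRED_FIELDS = {"escalate": bool, "reason": str}

def validate_payload(raw):
    """Return a list of structural errors; an empty list means the payload is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            errors.append(f"wrong type for {name}")
    for name in data:
        if name not in REQUIRED_FIELDS:
            errors.append(f"unexpected field: {name}")
    return errors

print(validate_payload('{"escalate": false, "reason": "No urgency detected"}'))
print(validate_payload('{"escalate": "no"}'))
print(validate_payload('Sure! {"escalate": false}'))
```

The first payload passes this check regardless of whether the ticket actually required escalation, which is the boundary this section describes: the validator confirms structure only.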

Use EM6 for structural validity. Use EM4, EM5, EM7, EM8, or EM14 for semantic correctness, process correctness, or product quality.

EM7. Reasoning / process evaluation

Purpose

Detect multi-step reasoning, planning, decomposition, or constraint-preservation failures.

This method evaluates the produced reasoning artifact, plan, intermediate steps, or process representation. It does not assume the model’s visible reasoning is a faithful transcript of hidden cognition.

Evaluation question

Does the reasoning or plan preserve correctness, constraints, and goal alignment across steps?

Best suited for

  • local plausibility drift;
  • path dependence;
  • error accumulation;
  • invariant loss;
  • plan drift;
  • spurious decomposition;
  • premature closure;
  • looping or repetition;
  • false intermediate assumption;
  • conclusion not supported by prior steps.

Typical setup

Evaluate:

  • intermediate claims;
  • derived assumptions;
  • plan checkpoints;
  • constraint preservation;
  • step ordering;
  • dependencies between steps;
  • whether the final answer follows from prior steps;
  • whether the plan remains aligned with the original objective.

Possible evaluators:

  • expert review;
  • checklist-based rubric;
  • invariant checker;
  • formal verifier;
  • test-case execution;
  • theorem prover;
  • code runner;
  • arithmetic checker;
  • simulation;
  • state-transition checker.

Observable signals

  • early mistake propagates;
  • hidden contradiction;
  • invalid intermediate step;
  • goal drift;
  • dropped constraint;
  • invented subtask;
  • conclusion does not follow;
  • plan no longer matches original objective;
  • repeated step without progress;
  • premature final answer;
  • unsupported leap from evidence to conclusion.

Oracle type

Step-level oracle, invariant checker, expert rubric, formal checker, or task-specific process rubric.
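
For tasks whose steps can be recomputed, a step-level oracle can be very simple. The sketch below assumes a hypothetical arithmetic trace in which each step states an operation, an argument, and the value the model claimed; it recomputes each step from the verified value and reports the first step that does not hold.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def first_invalid_step(start, steps):
    """Return the index of the first step whose claimed result is wrong, or None.

    Each step is recomputed from the last verified value, so a later step that is
    only consistent with an earlier mistake does not mask that mistake.
    """
    value = start
    for index, step in enumerate(steps):
        expected = OPS[step["op"]](value, step["arg"])
        if step["claimed"] != expected:
            return index
        value = expected
    return None

steps = [
    {"op": "add", "arg": 5, "claimed": 15},
    {"op": "mul", "arg": 2, "claimed": 35},  # wrong: 15 * 2 is 30
    {"op": "sub", "arg": 5, "claimed": 30},  # follows from the wrong value 35
]
print(first_invalid_step(10, steps))
```

The last step looks locally plausible because it follows from the previous claimed value, which is how an early mistake propagates unnoticed when only adjacent steps are compared.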

Important boundary

Reasoning traces are not automatically reliable evidence of actual internal reasoning.

This method evaluates the produced process artifact or observable action trace, not hidden cognition.

For high-risk workflows, process evaluation should be paired with final-answer evaluation and, where relevant, agent trace evaluation.

EM8. Agent trace evaluation

Purpose

Detect failures in tool use, action selection, process control, state handling, and recovery.

This method is required when the system is not merely producing text, but also planning, calling tools, reading tool outputs, maintaining state, or taking actions.

Evaluation question

Did the agent choose the right steps, tools, arguments, checks, and actions?

Best suited for

  • wrong tool choice;
  • missing tool call;
  • unnecessary tool call;
  • wrong tool arguments;
  • malformed tool arguments;
  • tool-output misinterpretation;
  • skipped step;
  • unnecessary step;
  • loop;
  • premature stopping;
  • recovery failure;
  • unsafe or unjustified action;
  • action without sufficient evidence;
  • state update error.

Typical setup

Record and evaluate the full trace:

  • user request;
  • task state;
  • plan;
  • tool availability;
  • selected tool;
  • tool arguments;
  • tool outputs;
  • intermediate decisions;
  • retries;
  • error handling;
  • state updates;
  • final response;
  • external actions;
  • authorization events.

Observable signals

  • selected wrong tool;
  • skipped required lookup;
  • passed wrong argument;
  • ignored tool error;
  • retried without changing the failure condition;
  • looped;
  • stopped before completion;
  • acted without sufficient evidence;
  • executed irreversible action without authorization;
  • misread tool output;
  • overwrote correct state with incorrect state;
  • produced correct final text after an unsafe or invalid process.

Oracle type

Trace oracle, process rubric, tool-call validator, state-transition checker, or action-safety oracle.
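
Two of the signals above lend themselves to a mechanical trace check: irreversible actions taken before any authorization event, and retries that repeat a failed call unchanged. The sketch below assumes a hypothetical trace format (a list of event dictionaries) and hypothetical tool names. Treating an identical repeated call as "retry without changing the failure condition" is a proxy, not a full definition.

```python
# Hypothetical names of tools whose effects cannot be undone.
IRREVERSIBLE_TOOLS = {"issue_refund", "delete_account"}

def check_trace(trace):
    """Return (event index, finding) pairs for two trace-level faults."""
    findings = []
    authorized = False
    previous_failed_call = None
    for index, event in enumerate(trace):
        if event["type"] == "authorization":
            authorized = True
        elif event["type"] == "tool_call":
            call = (event["tool"], tuple(sorted(event["args"].items())))
            if event["tool"] in IRREVERSIBLE_TOOLS and not authorized:
                findings.append((index, "irreversible action without authorization"))
            if call == previous_failed_call:
                findings.append((index, "retry without changing the failure condition"))
            # Remember this call only if it failed, so the next event can be compared.
            previous_failed_call = call if event.get("error") else None
    return findings

trace = [
    {"type": "tool_call", "tool": "lookup_order", "args": {"order_id": "A1"}, "error": "timeout"},
    {"type": "tool_call", "tool": "lookup_order", "args": {"order_id": "A1"}, "error": "timeout"},
    {"type": "tool_call", "tool": "issue_refund", "args": {"order_id": "A1"}},
]
for finding in check_trace(trace):
    print(finding)
```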

Important boundary

Agent trace evaluation is not only final-answer evaluation.

An agent can produce a good final message after a bad process.

An agent can also produce a bad final outcome after individually plausible steps.

For agentic systems, final answer, process trace, tool correctness, and action safety should be evaluated separately.

EM9. Calibration evaluation

Purpose

Detect whether expressed confidence, uncertainty, self-assessment, or abstention behavior tracks correctness.

This method treats confidence language as generated behavior, not as a native calibrated probability.

Evaluation question

Does the model’s confidence, uncertainty, or self-assessment correspond to actual reliability?

Best suited for

  • weak confidence calibration;
  • misleading confidence;
  • non-privileged self-evaluation;
  • high-confidence wrong answers;
  • over-hedged correct answers;
  • inconsistent uncertainty across runs;
  • unsupported self-certainty;
  • unreliable self-critique;
  • false assurance after shallow checking.

Typical setup

Collect outputs with one or more of the following:

  • confidence language;
  • uncertainty markers;
  • abstention decisions;
  • self-assessment statements;
  • self-critique results;
  • numeric confidence estimates;
  • risk labels;
  • escalation decisions.

Then compare them with correctness, grounding, policy, or task-success labels.

Observable signals

  • confident wrong answer;
  • over-hedged correct answer;
  • “I checked” without independent verification;
  • self-critique fails to catch error;
  • numeric confidence is not predictive;
  • uncertainty varies across equivalent cases;
  • confidence increases after unsupported reasoning;
  • refusal or abstention does not correlate with actual uncertainty.

Oracle type

Empirical calibration curve, labeled correctness set, expert review, or confidence-vs-accuracy analysis.

Possible metrics:

  • expected calibration error;
  • Brier score;
  • accuracy by confidence bucket;
  • abstention precision/recall;
  • selective prediction curves;
  • confidence/risk-label confusion matrix.
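
The first two metrics can be computed directly once each output has a numeric confidence and a correctness label. The sketch below assumes confidences in the range 0 to 1, for example parsed from the output or mapped from confidence language by a separate step, and uses equal-width buckets for expected calibration error.

```python
def brier_score(confidences, correct):
    """Mean squared difference between confidence and the 0/1 correctness label."""
    return sum((c - float(y)) ** 2 for c, y in zip(confidences, correct)) / len(confidences)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bucket."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        index = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bucket
        bins[index].append((c, y))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, y in bucket if y) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_confidence - accuracy)
    return ece

# Four answers, all stated with 0.9 confidence, only half of them correct.
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [True, True, False, False]
print(round(brier_score(confidences, correct), 4))
print(round(expected_calibration_error(confidences, correct), 4))
```

In this toy set the stated confidence is 0.9 while accuracy is 0.5, so the calibration gap is about 0.4, which is the pattern behind "high-confidence wrong answers".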

Important boundary

Generated confidence language is behavior, not a calibrated reliability measure.

Self-evaluation is not independent verification.

If a task requires verification, use external checks, tools, evidence, labels, or human review rather than relying only on generated self-assessment.

EM10. Safety and policy adversarial testing

Purpose

Detect failures to preserve safety, compliance, refusal, escalation, authorization, privacy, and other behavioral boundaries under adversarial, ambiguous, or high-risk conditions.

Evaluation question

Does the system maintain required behavioral boundaries under adversarial, ambiguous, or high-risk conditions?

Best suited for

  • under-refusal;
  • over-refusal;
  • prompt-injection compliance;
  • control/data confusion;
  • unsafe action readiness;
  • sycophantic agreement;
  • policy inconsistency;
  • unauthorized recommendation;
  • sensitive-data leakage;
  • harmful or biased output;
  • failure to escalate;
  • action without authorization;
  • high-stakes advice beyond scope.

Typical setup

Run scenarios that test boundaries:

  • malicious instructions;
  • ambiguous user intent;
  • conflicting instructions;
  • policy edge cases;
  • sensitive data;
  • high-stakes domains;
  • irreversible actions;
  • user pressure;
  • role-play attempts;
  • retrieved prompt injection;
  • jailbreak-like phrasing;
  • misleading context;
  • social-engineering attempts;
  • low-authority instructions embedded in data.

Observable signals

  • follows malicious embedded instruction;
  • refuses allowed task;
  • complies with disallowed task;
  • gives high-stakes advice beyond scope;
  • leaks private information;
  • skips escalation;
  • takes action without authorization;
  • applies policy inconsistently;
  • treats untrusted text as instruction;
  • agrees with a harmful or false user premise;
  • reveals implementation details or confidential context.

Oracle type

Policy oracle, safety rubric, authorization oracle, privacy oracle, expert review, or red-team scenario set.
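
When the policy oracle can be reduced to a per-scenario refusal expectation, under-refusal and over-refusal can be separated with a small tally. This is a sketch only: the case fields (id, should_refuse, refused) are hypothetical, and the labels are assumed to come from a policy rubric or expert review.

```python
def refusal_errors(cases):
    """Split boundary failures into under-refusal and over-refusal by scenario id."""
    under = [case["id"] for case in cases if case["should_refuse"] and not case["refused"]]
    over = [case["id"] for case in cases if not case["should_refuse"] and case["refused"]]
    return {"under_refusal": under, "over_refusal": over}

cases = [
    {"id": "inj-01", "should_refuse": True, "refused": False},   # complied with disallowed task
    {"id": "edge-02", "should_refuse": False, "refused": True},  # refused an allowed task
    {"id": "base-03", "should_refuse": False, "refused": False},
    {"id": "inj-04", "should_refuse": True, "refused": True},
]
print(refusal_errors(cases))
```

Reporting the two lists separately matters because the two failure directions usually carry different risk and call for different fixes.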

Important boundary

Safety evaluation should distinguish model behavior from system controls.

Example:

Layer 2 behavioral fault:
  The model follows an injected instruction in retrieved text.

Layer 3 system fault:
  The system failed to isolate, quote, neutralize, or sandbox untrusted retrieved text.

The same incident can involve both, but they should be recorded separately.

EM11. Stress / budget testing

Purpose

Detect behavior degradation under token, latency, compute, memory, context, or task-complexity pressure.

This method evaluates whether the system preserves required behavior as resource conditions become difficult.

Evaluation question

Does the system preserve required behavior when the task approaches practical resource limits?

Best suited for

  • truncation-induced loss;
  • compression-induced distortion;
  • budget-induced incompleteness;
  • long-context degradation;
  • shallow verification;
  • skipped reasoning;
  • partial plans;
  • incomplete outputs;
  • missing exceptions;
  • over-compressed summaries;
  • late-answer cutoff;
  • context-window pressure.

Typical setup

Evaluate scenarios with:

  • long documents;
  • long conversations;
  • many retrieved chunks;
  • dense retrieved context;
  • large schemas;
  • multi-step tasks;
  • constrained latency;
  • limited output budget;
  • multiple tool calls;
  • required verification passes;
  • high branching factor;
  • large state objects;
  • repeated summarization or compression.

Observable signals

  • omitted exception;
  • dropped constraint;
  • incomplete answer;
  • truncated output;
  • shallow analysis;
  • skipped verification;
  • over-compressed summary;
  • lost source distinction;
  • premature final answer;
  • missing tool call;
  • missing final step;
  • hallucination after context pressure;
  • degradation only at long length.

Oracle type

Completeness rubric, coverage checker, source-comparison oracle, stress benchmark, or resource-sweep analysis.
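
A resource-sweep analysis can be sketched as a coverage score measured at increasing resource levels, followed by a search for the first level that falls below an acceptance threshold. The substring-based coverage check below is a crude stand-in for a real completeness rubric, and the sweep values are invented for illustration.

```python
def coverage(required_items, output_text):
    """Fraction of required items mentioned in the output (substring match, a rough proxy)."""
    text = output_text.lower()
    hits = sum(1 for item in required_items if item.lower() in text)
    return hits / len(required_items)

def first_degraded_level(sweep, min_coverage):
    """Return the smallest resource level whose coverage is below the threshold, or None.

    sweep: list of (resource_level, coverage) pairs, e.g. context length in tokens.
    """
    for level, score in sorted(sweep):
        if score < min_coverage:
            return level
    return None

required = ["refund window", "gift card exception", "manager approval"]
print(coverage(required, "Refund window is 30 days. Manager approval is needed above $500."))

# Invented sweep: coverage measured at four context lengths.
sweep = [(2000, 0.98), (8000, 0.95), (32000, 0.81), (128000, 0.62)]
print(first_degraded_level(sweep, min_coverage=0.9))
```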

Important boundary

Budget testing detects behavioral degradation.

The product decision to set a low token, latency, or cost budget belongs to Layer 3, product design, or operational policy.

Layer 2 asks:

What behavioral fault appears under resource pressure?

Layer 3 asks:

What resource, architecture, routing, compression, or fallback decision allowed it?

EM12. Distributional slice testing

Purpose

Detect uneven competence across domains, languages, formats, edge cases, user groups, source types, and task framings.

This method prevents aggregate scores from hiding sharp local failures.

Evaluation question

Does performance hold across the relevant slices of the product distribution?

Best suited for

  • uneven competence;
  • distributional failure;
  • distributional overgeneralization;
  • rare-format brittleness;
  • multilingual weakness;
  • benchmark/product mismatch;
  • domain-specific failures;
  • symbolic-task failures;
  • edge-case failure;
  • low-resource-language failure;
  • specialized-document failure.

Typical setup

Define slices by variables relevant to the product:

  • domain;
  • language;
  • script;
  • document type;
  • user type;
  • task type;
  • format;
  • region;
  • edge case;
  • policy category;
  • difficulty;
  • source quality;
  • evidence availability;
  • tool availability;
  • customer segment;
  • rare entity type;
  • temporal freshness;
  • ambiguity level.

Measure performance per slice, not only in aggregate.
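
Per-slice measurement can be as simple as grouping labeled results by a slice variable. In the sketch below the record fields (language, correct) and the data are illustrative assumptions.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Return accuracy per slice value for records carrying a boolean 'correct' field."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        totals[record[slice_key]] += 1
        correct[record[slice_key]] += int(record["correct"])
    return {name: correct[name] / totals[name] for name in totals}

# Invented results: eight English cases pass, two Swahili cases fail.
records = (
    [{"language": "en", "correct": True} for _ in range(8)]
    + [{"language": "sw", "correct": False} for _ in range(2)]
)
aggregate = sum(r["correct"] for r in records) / len(records)
print(aggregate)
print(accuracy_by_slice(records, "language"))
```

The aggregate of 0.8 looks tolerable, while the per-slice view shows that one language fails completely.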

Observable signals

  • strong average performance but weak slice performance;
  • failure on rare formats;
  • failure in specific language or script;
  • degradation on edge cases;
  • overgeneralization from familiar pattern;
  • confident answer outside competence region;
  • policy inconsistency across similar groups;
  • weak performance on real production distribution despite benchmark strength.

Oracle type

Slice-level benchmark, stratified labeled set, expert review, or production-slice audit.

Important boundary

Distributional slice testing should report per-slice results, not only aggregate scores.

The point is to reveal hidden capability cliffs.

A system can be acceptable on average and unacceptable on a critical slice.

EM13. Regression / diff testing

Purpose

Detect whether a change introduced new behavioral failures.

This method compares behavior across versions, configurations, prompts, retrieval setups, tools, schemas, policies, or datasets.

Evaluation question

Did this change improve, preserve, or degrade behavior across the relevant scenario set?

Best suited for

  • prompt regressions;
  • model-version regressions;
  • retrieval regressions;
  • schema regressions;
  • policy regressions;
  • tool-use regressions;
  • hidden behavioral drift;
  • format drift after prompt edits;
  • safety drift;
  • grounding drift;
  • latency/cost quality tradeoff regressions.

Typical setup

Run the same scenario suite before and after a change.

Compare:

  • final answers;
  • decisions;
  • classifications;
  • citations;
  • evidence used;
  • tool calls;
  • tool arguments;
  • refusal behavior;
  • escalation behavior;
  • output-format validity;
  • latency;
  • cost;
  • trace quality;
  • safety outcomes;
  • human rubric scores.

Observable signals

  • formerly passing scenario fails;
  • improved narrow case but worsened adjacent cases;
  • changed tool route;
  • changed citation source;
  • changed refusal decision;
  • changed escalation decision;
  • changed output schema;
  • changed latency or cost profile;
  • changed grounding behavior;
  • new tail-risk failure;
  • new slice-specific failure.

Oracle type

Behavioral diff, regression gate, scenario benchmark, human review, or pairwise comparison.

Important boundary

A diff is not automatically a regression.

The evaluator must classify whether the difference is:

Acceptable:
  Surface variation without material behavior change.

Improvement:
  Better factuality, grounding, safety, usefulness, or efficiency.

Material regression:
  Worse decision, evidence use, schema validity, safety, or task success.

Critical regression:
  New unsafe, non-compliant, irreversible, or high-impact failure.

Regression testing should use task-specific behavioral equivalence criteria, not only text similarity.

It should also assume that change impact may be non-local: the edited prompt, model, retriever, schema, or policy can introduce failures outside the originally touched case or manually inspected slice.
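
A minimal regression gate can compare per-scenario pass/fail results from before and after a change. The sketch below assumes that each scenario has already been graded by a task-specific oracle; scenario ids are hypothetical, and a scenario missing from the after-run is treated as failing.

```python
def regression_report(before, after):
    """Compare pass/fail maps keyed by scenario id and list what changed."""
    newly_failing = sorted(s for s in before if before[s] and not after.get(s, False))
    newly_passing = sorted(s for s in before if not before[s] and after.get(s, False))
    return {"newly_failing": newly_failing, "newly_passing": newly_passing}

def gate_passes(report):
    """Block the change if any formerly passing scenario now fails."""
    return not report["newly_failing"]

before = {"refund-basic": True, "refund-giftcard": False, "escalation-outage": True}
after = {"refund-basic": True, "refund-giftcard": True, "escalation-outage": False}

report = regression_report(before, after)
print(report)
print(gate_passes(report))
```

In this invented example the change fixes the targeted gift-card case but breaks an unrelated escalation scenario, which is the non-local impact described above. Classifying each newly failing scenario as material or critical would still need the task-specific criteria.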

EM14. Human-review / rubric evaluation

Purpose

Evaluate outputs where correctness is semantic, contextual, subjective, policy-sensitive, or task-specific rather than fully deterministic.

This method is needed when parser checks, truth labels, or simple matching are insufficient.

Evaluation question

Does the output satisfy the product’s quality criteria for this task?

Best suited for

  • summarization quality;
  • tone and product fit;
  • usefulness;
  • completeness;
  • relevance;
  • nuanced policy application;
  • ambiguous task success;
  • acceptable variation judgment;
  • user-experience consistency;
  • escalation judgment;
  • explanation quality;
  • answer helpfulness;
  • product-specific quality.

Typical setup

Use a structured rubric with explicit criteria.

Possible criteria:

  • factuality;
  • completeness;
  • relevance;
  • source fidelity;
  • decision accuracy;
  • tone;
  • concision;
  • policy compliance;
  • action safety;
  • downstream usefulness;
  • clarity;
  • user burden;
  • appropriate uncertainty;
  • appropriate refusal or escalation;
  • consistency with product voice.

Rubric scales should be defined clearly.

Example:

Completeness
1 — misses critical required information
2 — includes some relevant information but omits important details
3 — covers main points but misses minor details
4 — complete for practical use
5 — complete and well prioritized

Observable signals

  • missing key information;
  • misleading emphasis;
  • wrong tone;
  • overlong answer;
  • under-informative answer;
  • technically correct but unhelpful answer;
  • inconsistent review judgments;
  • policy nuance missed;
  • answer does not satisfy downstream user need;
  • unclear or overcomplicated explanation.

Oracle type

Human rubric, expert rubric, pairwise preference, task-specific quality score, or reviewer adjudication.

Important boundary

Human review should not remain informal.

If reviewers disagree, the rubric may be underspecified.

A good human-review setup should define:

  • criteria;
  • scale anchors;
  • examples;
  • reviewer instructions;
  • adjudication process;
  • inter-rater agreement expectations;
  • escalation path for ambiguous cases.
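One way to make inter-rater agreement expectations concrete is to compute a chance-corrected agreement statistic over paired reviewer scores. The sketch below uses unweighted Cohen's kappa for two reviewers. It is an illustration only: the function name is hypothetical, and for ordinal rubric scales such as the 1 to 5 example above, a weighted variant may be more appropriate.

```python
from collections import Counter


def cohens_kappa(scores_a, scores_b):
    """Unweighted Cohen's kappa for two reviewers scoring the same items."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(scores_a)
    # Observed agreement: fraction of items with identical scores.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Expected agreement if both reviewers scored independently
    # according to their own score frequencies.
    count_a, count_b = Counter(scores_a), Counter(scores_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical completeness scores from two reviewers on five outputs.
reviewer_1 = [4, 5, 2, 3, 4]
reviewer_2 = [4, 4, 2, 3, 5]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

A persistently low value on one criterion is a hint that its scale anchors or examples need tightening before the scores are used as an oracle.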

EM15. Production monitoring / drift evaluation

Purpose

Detect whether deployed behavior remains acceptable as users, data, tools, prompts, models, and environments change over time.

This method extends evaluation from pre-release testing into ongoing operation.

Evaluation question

Does deployed behavior remain stable, safe, useful, and correct under real production conditions?

Best suited for

  • production drift;
  • retrieval-index or source drift;
  • tool/API drift;
  • policy drift;
  • latent regressions not covered by offline suites;
  • long-tail failures;
  • incident detection;
  • silent degradation.

Typical setup

Use telemetry, sampled review, automated checks, replayed incidents, and trend monitoring.

Production monitoring is the runtime backstop for residual regressions: changes whose effects were broader or harder to predict than offline, change-focused evaluation could cover.

Possible signals include:

  • user-feedback rates;
  • escalation-rate drift;
  • tool-call failure rates;
  • retrieval miss rates;
  • citation-support rates;
  • schema failure rates;
  • safety-policy hits;
  • latency and cost drift;
  • changed input distributions.

Important boundary

Production monitoring is still part of the evaluation view. It detects, measures, and surfaces faults. It becomes a Layer 3 control only when tied to alerts, gates, escalation paths, or other operational responses.
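A minimal sketch of a drift measurement, assuming a rate-style signal (for example schema failure rate) recorded as 0/1 events in a baseline window and a current window. The function name and the fixed absolute threshold are assumptions for illustration. In line with the boundary above, this code only measures and reports; it would become part of a Layer 3 control only once its result is wired to an alert, gate, or escalation path.

```python
def drift_report(baseline_events, current_events, max_abs_increase):
    """Compare a failure rate between two windows of 0/1 events."""
    if not baseline_events or not current_events:
        raise ValueError("both windows must contain at least one event")
    baseline_rate = sum(baseline_events) / len(baseline_events)
    current_rate = sum(current_events) / len(current_events)
    delta = current_rate - baseline_rate
    return {
        "baseline_rate": baseline_rate,
        "current_rate": current_rate,
        "delta": delta,
        # Flag only increases beyond the tolerated absolute change.
        "drifted": delta > max_abs_increase,
    }


# Hypothetical windows: 2% schema failures before, 10% now.
baseline_window = [1] * 2 + [0] * 98
current_window = [1] * 10 + [0] * 90
report = drift_report(baseline_window, current_window, max_abs_increase=0.05)
```

A fixed threshold is the simplest choice; small windows produce noisy rates, so a real deployment would likely need a minimum sample size or a statistical test.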

Fault-family to evaluation-method mapping

This table maps broad Layer 2 fault families to likely evaluation methods.

| Fault family | Primary methods | Secondary methods |
| --- | --- | --- |
| FF1. Behavioral Instability | EM1 Repeated-run testing; EM2 Prompt perturbation | EM13 Regression / diff testing; EM14 Human-review rubric; EM15 Production monitoring |
| FF2. Ambiguous or Misinduced Task Behavior | EM2 Prompt perturbation; EM14 Human-review rubric | EM7 Reasoning / process evaluation; EM13 Regression / diff testing |
| FF3. Hallucination and Unsupported Claims | EM4 Grounding and citation evaluation; EM5 Truth / factuality evaluation | EM9 Calibration evaluation; EM14 Human-review rubric |
| FF4. Weak Grounding / Source Infidelity | EM4 Grounding and citation evaluation; EM3 Context ablation / insertion testing | EM13 Regression / diff testing; EM14 Human-review rubric; EM15 Production monitoring |
| FF5. Weak Calibration and Misleading Confidence | EM9 Calibration evaluation | EM1 Repeated-run testing; EM5 Truth / factuality evaluation; EM14 Human-review rubric |
| FF6. Output Format / Schema Drift | EM6 Schema and parser validation | EM13 Regression / diff testing; EM11 Stress / budget testing |
| FF7. Inconsistent Interaction Behavior | EM14 Human-review rubric; EM1 Repeated-run testing | EM2 Prompt perturbation; EM13 Regression / diff testing |
| FF8. Uneven Competence / Distributional Failure | EM12 Distributional slice testing | EM5 Truth / factuality evaluation; EM14 Human-review rubric |
| FF9. Agentic Process Failure | EM8 Agent trace evaluation | EM10 Safety and policy adversarial testing; EM13 Regression / diff testing; EM15 Production monitoring |
| FF10. Retrieval-Conditioned Answer Failure | EM3 Context ablation / insertion testing; EM4 Grounding and citation evaluation | EM11 Stress / budget testing; EM13 Regression / diff testing; EM15 Production monitoring |

Atomic fault to evaluation-method mapping

This table gives example mappings for common atomic faults. It is not exhaustive.

| Atomic fault | Primary methods | Notes |
| --- | --- | --- |
| Context omission | EM3 Context ablation / insertion | Test absent versus present evidence. |
| Context underutilization | EM3 Context ablation / insertion | Insert evidence and check whether it affects behavior. |
| Context priority confusion | EM3 Context ablation / insertion | Use conflicting high- and low-authority sources. |
| Continuity loss | EM3 Context ablation / insertion; EM11 Stress / budget testing | Test multi-turn and cross-step state retention. |
| Prompt-form sensitivity | EM2 Prompt perturbation | Compare semantically equivalent prompts. |
| Task misinduction | EM2 Prompt perturbation; EM14 Human-review rubric | Check whether inferred task matches intended operation. |
| Constraint misclassification | EM2 Prompt perturbation; EM7 Reasoning / process evaluation | Check hard versus soft constraints, exceptions, and examples. |
| Control/data confusion | EM10 Safety and policy adversarial testing | Test untrusted text, quoted instructions, retrieved injection. |
| Output-format drift | EM6 Schema and parser validation | Use deterministic validators. |
| Boundary and stopping error | EM6 Schema and parser validation; EM11 Stress / budget testing | Detect extra commentary, truncation, premature stop. |
| Exact-string corruption | EM6 Schema and parser validation | Use exact-match, checksum, or identifier validation. |
| Numeric or symbolic fragility | EM6 Schema and parser validation; EM5 Truth / factuality evaluation | Use calculators, formal checks, unit tests. |
| Local plausibility drift | EM7 Reasoning / process evaluation | Check global task alignment over generated sequence. |
| Path dependence | EM7 Reasoning / process evaluation; EM2 Prompt perturbation | Test early assumption variation. |
| Error accumulation | EM7 Reasoning / process evaluation | Evaluate intermediate steps. |
| Unsupported assertion | EM4 Grounding and citation evaluation | Claim must be supported by approved evidence. |
| Plausibility-truth gap | EM5 Truth / factuality evaluation | Claim may sound plausible but be false. |
| Non-grounded justification | EM4 Grounding and citation evaluation | Explanation must actually support conclusion. |
| Fabricated citation or source | EM4 Grounding and citation evaluation | Validate source existence and support. |
| Evidence-claim mismatch | EM4 Grounding and citation evaluation | Source exists but does not support claim. |
| Weak confidence calibration | EM9 Calibration evaluation | Compare confidence to correctness. |
| Non-privileged self-evaluation | EM9 Calibration evaluation; EM5 Truth / factuality evaluation | Self-check is not independent verification. |
| Over-refusal | EM10 Safety and policy adversarial testing; EM14 Human-review rubric | Test allowed requests near policy boundary. |
| Under-refusal | EM10 Safety and policy adversarial testing | Test disallowed or high-risk requests. |
| Clarification failure | EM14 Human-review rubric; EM2 Prompt perturbation | Check when clarification is necessary versus unnecessary. |
| Tone/persona inconsistency | EM14 Human-review rubric; EM1 Repeated-run testing | Product voice is usually rubric-based. |
| Competence cliff | EM12 Distributional slice testing | Test by domain, format, language, edge case. |
| Output variance | EM1 Repeated-run testing | Compare behavioral equivalence across runs. |
| Tail-risk generation | EM1 Repeated-run testing; EM10 Safety testing | Requires repeated trials and adversarial scenarios. |
| Truncation-induced loss | EM11 Stress / budget testing | Increase context or output length pressure. |
| Compression-induced distortion | EM11 Stress / budget testing; EM4 Grounding evaluation | Compare compressed state to source. |
| Tool-selection error | EM8 Agent trace evaluation | Check tool choice against task state. |
| Tool-argument error | EM8 Agent trace evaluation; EM6 Schema validation | Validate argument syntax and semantics. |
| Tool-output misinterpretation | EM8 Agent trace evaluation; EM4 Grounding evaluation | Compare model interpretation to tool output. |
| Action-readiness error | EM8 Agent trace evaluation; EM10 Safety testing | Check evidence, authorization, and reversibility. |
| Recovery failure | EM8 Agent trace evaluation | Test tool errors, missing data, and retry behavior. |

Evaluation-method selection guide

Use this guide when choosing which method to apply first.

If the failure varies across runs

Start with:

EM1 Repeated-run testing

Then add:

EM13 Regression / diff testing
EM10 Safety and policy adversarial testing, if high-risk

If the failure appears after small wording changes

Start with:

EM2 Prompt perturbation / paraphrase testing

Then add:

EM14 Human-review / rubric evaluation
EM13 Regression / diff testing

If the failure concerns evidence use

Start with:

EM3 Context ablation / insertion testing
EM4 Grounding and citation evaluation

Then add:

EM5 Truth / factuality evaluation
EM13 Regression / diff testing

If the failure concerns factual correctness

Start with:

EM5 Truth / factuality evaluation

Then add:

EM4 Grounding and citation evaluation, if evidence-based answer is required
EM9 Calibration evaluation, if confidence is relevant

If the failure concerns output structure

Start with:

EM6 Schema and parser validation

Then add:

EM14 Human-review / rubric evaluation, if semantic field correctness matters
EM13 Regression / diff testing

If the failure concerns a multi-step plan

Start with:

EM7 Reasoning / process evaluation

Then add:

EM8 Agent trace evaluation, if tools or actions are involved
EM11 Stress / budget testing, if complexity is high

If the failure concerns tool use or external action

Start with:

EM8 Agent trace evaluation

Then add:

EM10 Safety and policy adversarial testing
EM6 Schema and parser validation

If the failure concerns unsafe, non-compliant, or unauthorized behavior

Start with:

EM10 Safety and policy adversarial testing

Then add:

EM8 Agent trace evaluation, if tools/actions are involved
EM1 Repeated-run testing, if rare failures matter
EM13 Regression / diff testing, if release gating is needed

If the failure appears only on certain domains, languages, or formats

Start with:

EM12 Distributional slice testing

Then add:

EM5 Truth / factuality evaluation
EM14 Human-review / rubric evaluation

If the failure appears after a model, prompt, data, policy, or tool change

Start with:

EM13 Regression / diff testing

Then add whichever method corresponds to the observed regression:

EM4 for grounding regressions
EM6 for schema regressions
EM8 for tool-use regressions
EM10 for safety regressions
EM12 for slice regressions

Design rules for the Evaluation-Method View

1. By default, evaluate behavior, not surface text

For generative tasks, exact text equality is usually too strict.

The default unit should be intended behavior:

  • facts;
  • decision;
  • risk level;
  • evidence;
  • citations;
  • tool use;
  • external action;
  • refusal or escalation behavior;
  • format validity;
  • user-facing commitment.

2. Separate truth from grounding

Truth evaluation asks:

Is the claim correct?

Grounding evaluation asks:

Is the claim supported by the evidence the system was allowed to use?

Both may be necessary, but they are not the same.
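The separation can be made explicit by recording two independent labels per claim. In the sketch below, set membership is a deliberately crude stand-in for a real truth oracle and a real evidence-entailment check; the function name, the label names, and the example claims are all hypothetical.

```python
def evaluate_claim(claim, trusted_reference, allowed_evidence):
    """Return separate truth and grounding labels for one claim."""
    return {
        # Truth: is the claim correct according to a trusted reference?
        "true": claim in trusted_reference,
        # Grounding: is the claim supported by evidence the system
        # was allowed to use for this answer?
        "grounded": claim in allowed_evidence,
    }


# Hypothetical scenario: the claim is correct, but the retrieved
# context never stated it, so grounding fails while truth passes.
trusted_reference = {"refund window is 30 days"}
allowed_evidence = {"refunds require a receipt"}
labels = evaluate_claim("refund window is 30 days",
                        trusted_reference, allowed_evidence)
```

Keeping the two labels separate lets a harness report all four combinations, including true-but-ungrounded answers, which a single pass/fail score would hide.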

3. Separate final-answer evaluation from process evaluation

Some tasks can be judged from final output alone.

Agentic and high-risk workflows usually require trace-level evaluation.

Final-answer success does not prove process correctness.

4. Use deterministic validators where possible

Schemas, parsers, type checks, enum checks, exact-string validators, and formal tests are cheaper and more reliable than rubric evaluation when the target property is formal.

Use human review where semantic judgment is required, not where deterministic validation would suffice.

5. Use human or expert rubrics where necessary

Some properties require semantic judgment, especially:

  • usefulness;
  • policy nuance;
  • tone;
  • completeness;
  • acceptable variation;
  • downstream value;
  • explanation quality;
  • appropriate refusal or escalation.

Rubrics should be explicit enough that reviewers can apply them consistently.

6. Evaluate rare failures explicitly

Small test samples can miss tail-risk behavior.

Repeated-run testing, adversarial testing, and stress testing should be used when rare failures are severe enough to matter.
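A rough sizing calculation shows why small samples miss tail risk. The sketch assumes independent runs with a fixed per-run failure probability, which real systems only approximate; the function names are illustrative.

```python
import math


def miss_probability(failure_rate, runs):
    """Probability that `runs` independent trials show no failure at all."""
    return (1 - failure_rate) ** runs


def runs_needed(failure_rate, confidence):
    """Smallest number of runs that observes at least one failure
    with probability >= confidence, under the independence assumption."""
    if not 0 < failure_rate < 1 or not 0 < confidence < 1:
        raise ValueError("failure_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


# With a 1% per-run failure rate, 20 runs miss the failure about 82%
# of the time; roughly 300 runs are needed for 95% detection confidence.
p_miss_20 = miss_probability(0.01, 20)
n_95 = runs_needed(0.01, 0.95)
```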

7. Do not confuse evaluation methods with system controls

An evaluation can reveal a fault.

It does not itself prevent the fault unless it is connected to a gate, monitor, retry, validator, fallback, authorization check, or other Layer 3 control.

Example:

EM6 Schema validation as evaluation:
  Detects invalid JSON in test outputs.

Layer 3 validator as control:
  Rejects invalid JSON at runtime and triggers retry or fallback.
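The same check can appear in both roles. The sketch below uses only Python's standard `json` module; `evaluate_json_validity`, `guarded_generate`, and the `generate` callable are hypothetical names, and the retry-then-fallback policy is one assumed control design among many.

```python
import json


def is_valid_json(text):
    """Deterministic validator shared by the evaluation and the control."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def evaluate_json_validity(test_outputs):
    """EM6 as evaluation: counts invalid outputs, changes nothing at runtime."""
    invalid = sum(not is_valid_json(o) for o in test_outputs)
    return {"total": len(test_outputs), "invalid": invalid}


def guarded_generate(generate, max_attempts=2, fallback="{}"):
    """Layer 3 control: reject invalid JSON at runtime, retry, then fall back."""
    for _ in range(max_attempts):
        output = generate()
        if is_valid_json(output):
            return output
    return fallback
```

The evaluation function produces a finding for the harness; only the guarded wrapper changes what a caller actually receives.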

8. Prefer task-specific behavioral equivalence criteria

Many LLM outputs have multiple acceptable forms.

Define what must be preserved for the task:

  • classification;
  • decision;
  • evidence;
  • action;
  • risk level;
  • policy behavior;
  • structured fields;
  • required omissions;
  • required uncertainty.

Do not rely only on text similarity.
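A minimal sketch of a task-specific equivalence check, assuming each output has already been reduced to a dictionary of extracted behavioral fields. The field names and the `differing_fields` helper are hypothetical; which fields must be preserved is a per-task decision.

```python
def differing_fields(output_a, output_b, preserved_fields):
    """List the must-preserve fields on which two outputs disagree."""
    return [f for f in preserved_fields if output_a.get(f) != output_b.get(f)]


def behaviorally_equivalent(output_a, output_b, preserved_fields):
    """Outputs are equivalent when every preserved field matches."""
    return not differing_fields(output_a, output_b, preserved_fields)


# Two hypothetical runs with different wording but the same behavior.
run_1 = {"classification": "billing", "decision": "escalate",
         "text": "I am escalating this billing issue."}
run_2 = {"classification": "billing", "decision": "escalate",
         "text": "This billing problem needs escalation."}
preserved = ["classification", "decision"]
same_behavior = behaviorally_equivalent(run_1, run_2, preserved)
```

The free-text field is excluded from `preserved`, so rewording alone does not count as a behavioral difference.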

9. Classify evaluation failures at the right layer

An observed bad result may involve multiple layers:

Layer 0:
  Interface conditions that shape how meaning must be inferred.

Layer 1A / 1B / 1C:
  Mechanisms, learned behaviors, and AI-system-level causal features
  that make the behavior possible.

Layer 2:
  Behavioral fault mode detected by the evaluation.

Layer 3:
  Missing or inadequate system control.

Layer 4:
  User, business, safety, compliance, or operational impact.

The evaluation method detects evidence of the fault; it is not the fault itself.

Anti-patterns

Anti-pattern 1: Treating exact-match failure as behavioral failure

Bad classification:

The wording changed, so the model failed.

Better classification:

The wording changed, but the answer preserved the same facts, decision,
evidence, and action. This is acceptable variation.

Exact match is appropriate for strict strings, IDs, schemas, code, or serialized data. It is usually not appropriate for open-ended natural-language answers.

Anti-pattern 2: Treating a passing demo as reliability evidence

Bad classification:

The scenario passed once, so the system is reliable.

Better classification:

Run repeated trials and measure behavioral equivalence across runs.

One successful run only proves that the system worked once.

Anti-pattern 3: Collapsing grounding into truth

Bad classification:

The answer is true, so grounding passed.

Better classification:

The answer may be true, but grounding passes only if the allowed evidence supports it.

Anti-pattern 4: Collapsing process evaluation into final-answer evaluation

Bad classification:

The final answer was acceptable, so the agent succeeded.

Better classification:

For agentic workflows, evaluate tool choice, arguments, intermediate steps,
state handling, recovery behavior, and action safety.

Anti-pattern 5: Using human review where deterministic validation is available

Bad classification:

Ask a reviewer whether the JSON is valid.

Better classification:

Use a parser and schema validator.
Ask humans to review semantic quality if needed.

Anti-pattern 6: Treating evaluation as mitigation

Bad classification:

We have an evaluation for hallucination, so hallucination is controlled.

Better classification:

The evaluation detects hallucination.
Runtime mitigation requires Layer 3 controls such as grounding requirements,
citation validation, abstention rules, retrieval repair, or human review.

Anti-pattern 7: Using only aggregate scores

Bad classification:

The system has 90% accuracy, so it is good enough.

Better classification:

Check distributional slices. A critical slice may fail even when aggregate performance is high.
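A short sketch of the slice check, assuming results are available as (slice label, passed) pairs. The slice labels and counts below are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(results):
    """Compute per-slice pass rates from (slice_label, passed) pairs."""
    passed_count = defaultdict(int)
    total_count = defaultdict(int)
    for slice_label, passed in results:
        passed_count[slice_label] += int(passed)
        total_count[slice_label] += 1
    return {s: passed_count[s] / total_count[s] for s in total_count}


# Hypothetical suite: 90 English scenarios pass, 10 German scenarios fail.
results = [("en", True)] * 90 + [("de", False)] * 10
aggregate = sum(passed for _, passed in results) / len(results)  # 0.9
per_slice = accuracy_by_slice(results)  # {"en": 1.0, "de": 0.0}
```

Here the aggregate looks healthy while one slice fails completely, which is the pattern the anti-pattern warns about.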

Recommended metadata fields

Each evaluation method record should support the following metadata when used in an evaluation harness:

evaluation_method_code: EM4
evaluation_method_name: Grounding and citation evaluation
layer_2_faults_detected:
  - unsupported_assertion
  - fabricated_citation
  - evidence_claim_mismatch
scenario_id: string
input_variant_id: string
model_version: string
prompt_version: string
retrieval_version: string
tool_versions: []
source_corpus_version: string
runtime_config: {}
expected_behavior: string
oracle_type: evidence_entailment
pass_fail: pass | fail | inconclusive
severity: cosmetic | minor | material | safety_relevant | compliance_relevant | irreversible_action
observed_signal: string
evidence: []
trace_link: string
reviewer: string
review_notes: string
linked_layer_3_controls: []

This metadata keeps evaluation findings connected to the Layer 2 fault inventory and the Layer 3 control mapping.

Minimal harness fields by method

| Method | Required minimum fields |
| --- | --- |
| EM1 Repeated-run testing | scenario ID, run ID, seed/config if available, behavioral equivalence criteria, outcome comparison |
| EM2 Prompt perturbation | base prompt, prompt variant, invariant behavior criteria, outcome comparison |
| EM3 Context ablation / insertion | context variant, expected evidence sensitivity, answer comparison |
| EM4 Grounding and citation | claim, cited/source evidence, support label, evidence span |
| EM5 Truth / factuality | claim, trusted reference, truth label |
| EM6 Schema/parser | output, schema/parser version, validation result, error message |
| EM7 Reasoning/process | step trace, expected invariant, process label |
| EM8 Agent trace | tool calls, arguments, outputs, state transitions, action labels |
| EM9 Calibration | confidence expression/value, correctness label, calibration bucket |
| EM10 Safety/policy | policy scenario, allowed/disallowed behavior, observed behavior, severity |
| EM11 Stress/budget | resource condition, limit tested, expected preservation behavior, degradation label |
| EM12 Distributional slice | slice label, scenario set, per-slice metric, comparison to aggregate |
| EM13 Regression/diff | baseline version, candidate version, diff type, regression label |
| EM14 Human rubric | rubric version, reviewer ID, criterion scores, adjudication if needed |
| EM15 Production monitoring | metric definition, baseline window, alert threshold, sampling or review process |

Relationship to other Layer 2 documents

This document should be used with:

stack-21-fault-inventory.md
  Defines atomic Layer 2 fault modes.

stack-23-fault-family-index.md
  Defines broad FF1-FF10 family groupings.

stack-25-evaluation-mapping.md
  Can expand this view into a full harness design.

stack-31-layer-3-control-families.md
  Defines the Layer 3 control families that respond to detected Layer 2 faults.

stack-24-classification-views.md should not duplicate full fault definitions. It should define classification views and show how to use them.

Short formulation

The Evaluation-Method View classifies Layer 2 faults by detection strategy.

Layer 2 fault:
  What behavioral failure pattern occurred?

Evaluation method:
  How do we detect, measure, reproduce, or compare it?

Layer 3 control:
  What system design prevents, constrains, recovers from, or monitors it?

The central rule is:

Evaluation methods are views over fault modes, not fault modes themselves.