Layer 2 Evaluation Mapping

Purpose

This document maps Layer 2 feature-derived fault modes to evaluation methods.

Layer 2 fault modes describe recurring behavioral failure patterns that arise downstream of Layer 1A mechanisms, Layer 1B learned or behavioral LLM features, and Layer 1C AI-system-level causal features. This document answers a different question:

Given a Layer 2 fault mode, how do we detect it, measure it, reproduce it, or decide whether the behavior is acceptable?

This document does not define:

  • the Layer 2 fault inventory itself;
  • Layer 3 system controls;
  • product or business impacts;
  • benchmark suites;
  • model-specific scorecards.

Those belong in adjacent documents.

Layer 0
  -> interface conditions that shape how meaning must be inferred

Layer 1A / 1B / 1C
  -> why the fault is possible

Layer 2 fault inventory
  -> what behavioral fault occurred

Layer 2 evaluation mapping
  -> how to detect, measure, or reproduce the fault

Layer 3 control families
  -> what system controls reduce or recover from the fault

Layer 4 impact mapping
  -> why the fault matters to users, business, safety, or operations

AI systems must be evaluated empirically because end-to-end behavior cannot be trusted from implementation structure alone. The practical consequence is that evaluation has to observe behavior under repeated runs, perturbations, slice variation, context changes, and realistic process conditions rather than relying only on exact-match or one-run checks.

Evaluation principles

1. Evaluate behavior, not surface text alone

Many LLM outputs can differ in wording while preserving the same intended behavior. Conversely, two outputs can look similar while differing in decision, escalation, citation, tool use, refusal behavior, or external action.

Evaluations should judge the behavior that matters for the task.

Weak evaluation:
  Did the output text exactly match the reference?

Stronger evaluation:
  Did the output preserve the intended facts, decision, evidence use,
  policy behavior, tool calls, and user-facing commitments?

2. Use task-specific quality criteria

There is often no single exact answer. Evaluation criteria should be defined per task.

Examples:

  • factuality;
  • completeness;
  • relevance;
  • grounding;
  • decision accuracy;
  • policy compliance;
  • schema validity;
  • tool-call correctness;
  • tone or product fit;
  • action safety.

3. Prefer repeatable scenarios over one-off demos

A single successful run only proves that the system worked once. Repeated-run, perturbation, regression, and slice evaluations are needed to expose instability and rare failures.

4. Separate retrieval quality from generation quality

In retrieval-augmented systems, a bad answer may result from:

retrieval failed
  -> the right evidence was not available

generation failed
  -> the right evidence was available but ignored, distorted, or overruled

These should be evaluated separately where traces allow.

5. Evaluate final outputs and intermediate process where relevant

For simple answer generation, final-output evaluation may be sufficient.

For agents, tool use, RAG, multi-step workflows, and high-stakes decisions, evaluation should also inspect:

  • retrieved documents;
  • prompt assembly;
  • tool calls;
  • tool arguments;
  • tool outputs;
  • intermediate state;
  • action decisions;
  • recovery behavior.

6. Distinguish fault detection from fault mitigation

This document defines evaluation mappings.

Layer 3 defines controls.

Layer 2 fault:
  unsupported assertion

Evaluation mapping:
  grounding check, claim-source entailment, citation-support test

Layer 3 control:
  citation validator, retrieval grounding policy, abstention rule,
  source whitelist, human review

7. Evaluate material differences

For many tasks, harmless variation is acceptable.

Material differences include changes in:

  • final answer;
  • classification;
  • risk level;
  • escalation decision;
  • refusal or compliance behavior;
  • evidence used;
  • tool calls;
  • tool arguments;
  • external action;
  • user-facing commitment.

8. Keep exact-match tests where exactness matters

Exact-match tests are still appropriate for:

  • IDs;
  • quoted strings;
  • numerical values;
  • code tokens;
  • schema keys;
  • citations;
  • API parameters;
  • regulated wording;
  • contractual or legal text that must be preserved.

The point is not to reject exact tests. The point is to use them only where the task requires exactness.

Evaluation object model

An evaluation may inspect one or more of the following objects.

ObjectExamples
Inputuser request, task case, document, question
Prompt / task contractsystem prompt, developer prompt, tool instructions, schema
Retrieved contextchunks, documents, source metadata, ranking
Conversation stateprior turns, memory, project state, session state
Generated answerfinal assistant response
Claimsfactual assertions extracted from the answer
Citations / evidence linkssource IDs, quotes, document references
Output structureJSON, XML, table, bullets, schema fields
Reasoning or plan tracevisible plan, hidden trace substitute, task steps, scratchpad proxy
Tool callsselected tool, call order, call timing
Tool argumentsparameters, filters, IDs, dates, amounts
Tool outputsAPI response, search results, database rows, errors
Actions takenemail sent, record modified, ticket escalated, purchase initiated
Refusal / escalation behaviorcomply, refuse, warn, ask clarification, escalate
Confidence languagecertainty, uncertainty, probability, caveats
Runtime budgetlatency, token count, context length, cost, retry count
Repeated-run distributionvariation across repeated executions

Evaluation method families

EM1. Repeated-Run Evaluation

Runs the same scenario multiple times under nominally identical conditions.

Catches:

  • repeatability variance;
  • rare bad samples;
  • unstable decisions;
  • unstable tool use;
  • unstable refusal or escalation behavior;
  • tail-risk generation.

Typical outputs:

  • pass rate;
  • variance rate;
  • severe-failure rate;
  • behavioral equivalence rate;
  • distribution of decisions or tool calls.

EM2. Perturbation / Paraphrase Evaluation

Runs semantically similar or operationally equivalent variants of a scenario.

Catches:

  • prompt-form sensitivity;
  • behavioral fragility;
  • task misinduction;
  • unstable policy application;
  • overfitting to wording;
  • brittle tool routing.

Perturbations may vary:

  • wording;
  • order;
  • formatting;
  • examples;
  • document order;
  • synonym choice;
  • verbosity;
  • irrelevant details;
  • conversation history.

EM3. Context Ablation / Insertion Evaluation

Adds, removes, reorders, or distracts context to test whether the system uses the right information.

Catches:

  • context omission;
  • context underutilization;
  • context priority confusion;
  • distractor assimilation;
  • source-priority confusion;
  • parametric-prior override.

Common variants:

  • evidence present vs absent;
  • critical span near top, middle, or bottom;
  • relevant plus irrelevant chunks;
  • conflicting sources with known authority order;
  • context with explicit exceptions.

EM4. Grounding / Evidence Evaluation

Checks whether generated claims are supported by approved evidence.

Catches:

  • unsupported assertions;
  • hallucinated facts;
  • non-grounded justification;
  • fabricated citations;
  • evidence-claim mismatch;
  • source infidelity.

Typical steps:

  1. Extract claims from the output.
  2. Identify cited or available evidence.
  3. Check whether evidence supports each claim.
  4. Score unsupported, contradicted, or overextended claims.

EM5. Truth / Factuality Evaluation

Evaluates whether generated claims are actually true, regardless of whether support was supplied in the current context.

Catches:

  • plausible but false claims;
  • outdated claims;
  • wrong labels or extracted facts;
  • false calculations or transformations;
  • domain-specific correctness failures.

Common oracles:

  • gold labels;
  • trusted references;
  • expert review;
  • tool-backed verification.

EM6. Schema / Parser Validation

Checks whether the output satisfies a required structure.

Catches:

  • output-format drift;
  • invalid JSON/XML/YAML;
  • missing fields;
  • extra fields;
  • wrong field types;
  • boundary/stopping errors;
  • malformed tool arguments.

Important distinction:

Parser validation:
  Is the object syntactically valid?

Semantic validation:
  Are the field values correct for the task?

Both may be required.

EM7. Reasoning / Process Evaluation

Evaluates whether reasoning, decomposition, intermediate checks, and visible process preserve the task constraints and goal.

Catches:

  • wrong decisions despite fluent wording;
  • scope loss;
  • constraint loss;
  • premature closure;
  • spurious decomposition;
  • path dependence;
  • error accumulation.

Evaluation objects:

  • visible plans;
  • intermediate artifacts;
  • checklists;
  • structured reasoning traces;
  • externally reconstructed process steps.

EM8. Agent Trace Evaluation

Evaluates tool-using or action-taking behavior through traces.

Catches:

  • wrong tool choice;
  • missing tool call;
  • unnecessary tool call;
  • wrong tool arguments;
  • tool-output misinterpretation;
  • loops;
  • premature stopping;
  • unsafe action;
  • recovery failure.

Trace objects:

  • tool selected;
  • call order;
  • arguments;
  • response;
  • retries;
  • state updates;
  • final action;
  • error handling.

EM9. Calibration Evaluation

Checks whether confidence or uncertainty language tracks correctness.

Catches:

  • high-confidence wrong answers;
  • over-hedged correct answers;
  • inconsistent confidence;
  • unsupported certainty;
  • self-evaluation treated as verification.

Possible measures:

  • accuracy by confidence bucket;
  • expected calibration error;
  • abstention quality;
  • uncertainty appropriateness;
  • confidence stability across repeated runs.

EM10. Safety / Policy Evaluation

Tests whether the system follows safety, policy, compliance, and escalation requirements.

Catches:

  • over-refusal;
  • under-refusal;
  • unsafe compliance;
  • policy inconsistency;
  • failure to escalate;
  • sensitive data leakage;
  • unauthorized action.

Evaluation forms:

  • policy scenario tests;
  • adversarial prompt tests;
  • red-team cases;
  • boundary cases;
  • refusal quality rubrics;
  • escalation-decision tests.

EM11. Stress / Budget Evaluation

Tests behavior under context, latency, token, compute, retrieval, or cost constraints.

Catches:

  • truncation-induced loss;
  • compression-induced distortion;
  • shallow reasoning under budget;
  • skipped verification;
  • incomplete output;
  • degraded retrieval from budget limits;
  • latency-driven fallback errors.

Stressors:

  • long input;
  • many documents;
  • dense distractors;
  • low latency budget;
  • low max-output tokens;
  • forced summarization;
  • limited retrieval depth;
  • cost cap.

EM12. Distributional Slice Evaluation

Evaluates performance across domains, formats, languages, populations, task types, and edge cases.

Catches:

  • uneven competence;
  • domain failure;
  • rare-format brittleness;
  • multilingual degradation;
  • benchmark-product mismatch;
  • edge-case collapse.

Slices may include:

  • domain;
  • user segment;
  • language;
  • document type;
  • input length;
  • difficulty;
  • ambiguity level;
  • source quality;
  • task framing;
  • tool/API type.

EM13. Regression Evaluation

Compares behavior across versions.

Assume the blast radius may be wider than the directly edited case: a local prompt, model, retrieval, schema, tool, or policy change can produce regressions in adjacent slices or seemingly unrelated workflows.

Catches:

  • prompt regressions;
  • model regressions;
  • retriever/index regressions;
  • tool schema regressions;
  • policy regressions;
  • output-format regressions;
  • changes in refusal/escalation behavior.

Version dimensions:

  • model;
  • prompt;
  • tools;
  • retrieval index;
  • embedding model;
  • reranker;
  • chunking strategy;
  • policy text;
  • schema;
  • post-processing;
  • data source.

EM14. Human-Review / Rubric Evaluation

Uses structured human or expert review when quality is semantic, contextual, subjective, policy-sensitive, or product-specific.

Catches:

  • tone or persona mismatch;
  • usefulness failures;
  • clarification failures;
  • ambiguous task success;
  • nuanced policy application;
  • acceptable-variation disputes.

Requirements:

  • explicit criteria;
  • scale anchors;
  • reviewer instructions;
  • adjudication path;
  • inter-rater expectations.

EM15. Production Monitoring / Drift Evaluation

Detects whether deployed behavior remains acceptable as users, data, tools, policies, prompts, models, and environments change.

It is the runtime backstop for residual non-local regressions that were not fully exposed by offline regression suites.

Catches:

  • production behavior drift;
  • data and knowledge drift;
  • retrieval-index drift;
  • tool/API drift;
  • latent regressions not covered offline;
  • long-tail failures;
  • incident patterns;
  • silent degradation.

Typical signals:

  • sampled output review;
  • user feedback;
  • incident reports;
  • refusal and escalation-rate drift;
  • tool-call failure rates;
  • retrieval miss rates;
  • citation support rates;
  • schema failure rates;
  • latency and cost drift.

Oracle types

An oracle is the mechanism by which an evaluation decides whether behavior is acceptable.

CodeOracle typeBest for
OR1Exact-match oracleIDs, fixed labels, required strings, deterministic answers
OR2Parser/schema oracleJSON, XML, YAML, API arguments, tool payloads
OR3Deterministic calculation oraclearithmetic, date math, sorting, counting, formal transformations
OR4Reference-answer oracleQA, extraction, classification with known answers
OR5Rubric-based human judgmentsummaries, tone, usefulness, policy nuance
OR6Calibrated LLM-as-judgescalable semantic judgment with validation
OR7Evidence-entailment oraclegrounding, citation support, source fidelity
OR8Retrieval expected-document oracleRAG recall, source coverage, ranking
OR9Behavioral-equivalence oraclestability under wording/run variation
OR10Policy-rule oraclerefusal, escalation, compliance, allowed/disallowed actions
OR11Trace/process oracletool use, agent steps, intermediate decisions
OR12Statistical repeated-run oraclevariance, tail risk, pass-rate confidence
OR13Distributional slice oracleperformance by domain, language, format, segment
OR14Budget/SLO oraclelatency, cost, context usage, retry count

Oracle selection rule

Use the weakest oracle that is sufficient for the task, and the strongest oracle required by the risk.

Exact string required:
  use exact-match or parser/schema oracle

Truth required:
  use reference-answer, expert, or tool-backed oracle

Evidence required:
  use evidence-entailment or retrieval oracle

Process required:
  use trace/process oracle

Reliability required:
  use repeated-run, regression, or monitoring oracle

Fault-to-evaluation matrix

The following matrix assumes a separate Layer 2 fault inventory with atomic Fxx entries. If the inventory uses different codes, preserve the fault names and method mappings.

FaultWhat to evaluatePrimary methodsMain oracle typesObservable signal
F01 Context OmissionWhether required information was available to the modelEM3, EM11, EM15OR8, OR11required evidence absent from retrieved/prompt context
F02 Context UnderutilizationWhether present evidence influenced the answerEM3, EM4, EM7OR7, OR5answer ignores relevant supplied evidence
F03 Context Priority ConfusionWhether the model prioritized the right source/instructionEM3, EM4, EM10OR7, OR10, OR11lower-authority or less relevant context overrides stronger context
F04 Continuity LossWhether required prior state was preservedEM3, EM8, EM13OR11, OR4prior decision/preference/state missing or contradicted
F05 Stale-State RelianceWhether outdated state is treated as currentEM3, EM5, EM15OR4, OR8, OR11answer relies on stale document, memory, or tool output
F06 Distractor AssimilationWhether irrelevant context contaminates the answerEM3, EM4OR7, OR9irrelevant span appears in answer or decision
F07 Source/Authority ConfusionWhether source authority is respectedEM3, EM4, EM11OR7, OR10untrusted or low-authority source treated as governing
F08 Prompt-Form SensitivityWhether semantically equivalent prompts preserve behaviorEM2, EM14OR9, OR12paraphrase changes answer, decision, tool, refusal, or escalation
F09 Task MisinductionWhether the inferred task matches the intended taskEM2, EM7OR4, OR5, OR9model summarizes instead of extracts, explains instead of decides, etc.
F10 Task BlendingWhether multiple tasks are separated and executed correctlyEM2, EM7, EM8OR5, OR11model merges incompatible instructions or partially performs each
F11 Scope MisinterpretationWhether response scope matches task scopeEM2, EM7OR4, OR5answer is too broad, too narrow, or answers a nearby question
F12 Constraint MisclassificationWhether hard/soft constraints are interpreted correctlyEM2, EM7, EM10OR5, OR10hard requirement treated as optional or preference treated as mandatory
F13 Example OvergeneralizationWhether examples are copied or treated as exhaustive rulesEM2, EM3, EM12OR5, OR9output follows accidental example features
F14 Example UnderuseWhether examples are used where they define the taskEM2, EM3, EM7OR5, OR9model ignores demonstrated label space, format, or edge case
F15 Control/Data ConfusionWhether data is treated as instruction or vice versaEM3, EM10OR10, OR11quoted/retrieved/tool text changes model behavior as instruction
F16 Prompt-Injection ComplianceWhether untrusted embedded instructions are followedEM3, EM10OR10, OR11model follows malicious or irrelevant instruction in untrusted content
F17 Output-Format DriftWhether output obeys required structureEM6, EM14OR2invalid format, missing fields, extra commentary
F18 Boundary/Stopping ErrorWhether output begins, ends, and separates content correctlyEM6, EM7OR2, OR5premature stop, overgeneration, mixed payload/commentary
F19 Exact-String CorruptionWhether required strings are preserved exactlyEM6, EM7OR1IDs, names, quotes, paths, or keys altered
F20 Numeric/Symbolic FragilityWhether formal operations are correctEM6, EM7OR3incorrect arithmetic, sorting, counting, comparison, or symbolic edit
F21 Structured-Data Semantic ErrorWhether schema-valid fields are semantically correctEM6, EM7OR2, OR4, OR5valid JSON with wrong field values
F22 Local Plausibility DriftWhether output stays globally aligned with task/evidenceEM7, EM8OR5, OR7answer starts correctly but drifts into unsupported or irrelevant continuation
F23 Path DependenceWhether early assumptions contaminate later outputEM7, EM8OR5, OR11initial false frame persists despite later correction/evidence
F24 Error AccumulationWhether small errors compound across stepsEM7, EM8OR3, OR11later result depends on earlier unchecked mistake
F25 Invariant LossWhether required constraints persist across stepsEM7, EM8, EM10OR5, OR10, OR11plan or answer violates earlier constraint
F26 Plan DriftWhether plan execution stays aligned with goalEM7, EM8OR11agent/reasoning steps depart from objective
F27 Spurious DecompositionWhether task decomposition is validEM7, EM8OR5, OR11plausible subtasks are irrelevant, invalid, or incomplete
F28 Premature ClosureWhether answer/action occurs before enough evidenceEM7, EM8, EM10OR7, OR10, OR11model finalizes without required verification or tool result
F29 Looping/RepetitionWhether repeated steps make progressEM8, EM11, EM15OR11, OR14repeated calls/text without new information
F30 Unsupported AssertionWhether claims have sufficient supportEM4, EM7OR7factual claim lacks evidence in available/approved context
F31 Plausibility-Truth GapWhether plausible claims are trueEM5, EM7, EM12OR4, OR7fluent answer is false
F32 Non-Grounded JustificationWhether explanation actually supports conclusionEM4, EM7OR7, OR5rationale or citation-like text does not entail claim
F33 Fabricated Citation/SourceWhether cited source exists and supports the claimEM4, EM5OR7, OR8nonexistent, malformed, or invented source reference
F34 Evidence-Claim MismatchWhether cited evidence supports cited claimEM4OR7real source cited for unsupported or contradicted claim
F35 Parametric-Prior OverrideWhether learned prior overrides supplied evidenceEM3, EM4, EM12OR7, OR9answer follows common/default belief despite contrary context
F36 Weak Confidence CalibrationWhether confidence tracks correctnessEM9, EM1OR12confidence level not predictive of accuracy
F37 Non-Privileged Self-EvaluationWhether self-checking is treated as independent verificationEM9, EM8OR11, OR7model says it checked without independent evidence/tooling
F38 Sycophantic AgreementWhether model agrees when correction is requiredEM10, EM2OR10, OR5model validates false premise or unsafe user framing
F39 Over-RefusalWhether allowed tasks are incorrectly refusedEM10, EM12, EM14OR10refusal when policy allows compliance
F40 Under-RefusalWhether disallowed/risky tasks are incorrectly fulfilledEM10, EM12OR10compliance when refusal, warning, or escalation required
F41 Clarification FailureWhether model asks needed questions and avoids unnecessary onesEM7, EM10, EM14OR5, OR10asks needless question or proceeds despite ambiguity
F42 Tone/Persona InconsistencyWhether style matches product contractEM12, EM14, EM15OR5, OR9tone, role, or persona shifts inappropriately
F43 Verbosity MismatchWhether response length/detail fits taskEM7, EM14OR5answer too terse, too long, or overexplained
F44 Competence CliffWhether performance collapses on a sliceEM12, EM14, EM15OR13sharp drop by domain, language, format, edge case
F45 Distributional OvergeneralizationWhether familiar patterns are misapplied outside scopeEM12, EM7OR13, OR5uses common template where exception/domain-specific behavior needed
F46 Output VarianceWhether repeated runs preserve intended behaviorEM1OR9, OR12materially different outcomes across same scenario
F47 Tail-Risk GenerationWhether rare severe failures appear under repetition/stressEM1, EM10, EM11OR12, OR10low-frequency severe bad output
F48 Truncation-Induced LossWhether token/context limits remove needed informationEM11, EM3OR8, OR14omitted evidence or incomplete answer due to length limit
F49 Compression-Induced DistortionWhether summaries/state compression preserve critical detailsEM11, EM7OR4, OR5compressed state loses exception, constraint, date, entity, or decision
F50 Budget-Induced IncompletenessWhether latency/cost/token budget causes shallow behaviorEM11, EM15OR14, OR5skipped verification, partial answer, shallow reasoning
F51 Tool-Selection ErrorWhether correct tool is chosenEM8, EM14OR11wrong tool, missing needed tool, unnecessary tool
F52 Tool-Argument ErrorWhether tool arguments are correct and safeEM8, EM6OR2, OR11malformed, incomplete, unsafe, or wrong parameters
F53 Tool-Output MisinterpretationWhether tool result is read and applied correctlyEM8, EM4, EM7OR11, OR7answer contradicts or overgeneralizes tool output
F54 Action-Readiness ErrorWhether action is justified before execution/recommendationEM8, EM10OR10, OR11action taken or recommended without sufficient basis
F55 Recovery FailureWhether system/model recovers from error or missing dataEM8, EM11, EM15OR11, OR14failed tool call, missing evidence, or exception leads to bad final state

Family-level evaluation bundles

Family bundles are broader views over the atomic fault inventory. They are useful for product, risk, and evaluation planning.

Use FF for family-level fault bundles to avoid confusing them with atomic Fxx codes.

Family bundlePrimary methodsNotes
FF1 Behavioral InstabilityEM1, EM2, EM13Evaluate repeated runs and semantically equivalent variants at the level of intended behavior.
FF2 Ambiguous or Misinduced Task BehaviorEM2, EM7, EM8Evaluate whether the model inferred the right operation, scope, and success criteria.
FF3 Hallucination and Unsupported ClaimsEM4, EM5, EM14Separate falsehood, lack of support, fabricated sources, and evidence mismatch.
FF4 Weak Grounding / Source InfidelityEM3, EM4, EM15Separate retrieval failure from generation failure.
FF5 Weak Calibration and Misleading ConfidenceEM9, EM1Confidence language is generated behavior, not independent verification.
FF6 Output Format / Schema DriftEM6, EM14Combine parser validation with semantic field validation.
FF7 Inconsistent Interaction BehaviorEM10, EM12, EM14, EM15Covers tone, refusal, clarification, verbosity, escalation, and product behavior.
FF8 Uneven Competence / Distributional FailureEM12, EM14, EM15Requires slice definitions; average score is insufficient.
FF9 Agentic Process FailureEM7, EM8, EM10, EM15Requires trace-level evaluation, not only final answer grading.
FF10 Retrieval-Conditioned Answer FailureEM3, EM4, EM15Evaluate retrieval availability, context use, and claim grounding separately.

Evaluation record schema

Use this schema for any detailed fault-evaluation mapping.

## Evaluation record

### Fault
Code and name of the Layer 2 fault mode.

### Evaluation question
The diagnostic question this evaluation answers.

### Applicable method families
One or more EM codes.

### Required scenario data
What the test case must contain.

### Required traces
What must be captured from the system.

### Observable signals
Concrete signals that the fault occurred.

### Oracle type
How the evaluation decides pass/fail or score.

### Severity criteria
How to distinguish harmless variation from material failure.

### False positives
Cases that look like this fault but are acceptable or belong elsewhere.

### False negatives
Cases where the fault may be hidden by weak instrumentation or weak oracle design.

### Related faults
Neighboring Layer 2 faults.

### Layer 3 handoff
What kind of system control may be needed if this fault is detected.

Cross-fault evaluation patterns

Pattern 1: Stability evaluation

Best for:

  • F08 Prompt-Form Sensitivity
  • F38 Sycophantic Agreement
  • F39 Over-Refusal
  • F40 Under-Refusal
  • F46 Output Variance
  • F47 Tail-Risk Generation
  • F51 Tool-Selection Error
  • F52 Tool-Argument Error

Core method:

same or equivalent scenario
  -> repeated runs and variants
  -> compare intended behavior
  -> score material differences

Pattern 2: Grounding evaluation

Best for:

  • F02 Context Underutilization
  • F03 Context Priority Confusion
  • F30 Unsupported Assertion
  • F31 Plausibility-Truth Gap
  • F32 Non-Grounded Justification
  • F33 Fabricated Citation/Source
  • F34 Evidence-Claim Mismatch
  • F35 Parametric-Prior Override
  • F53 Tool-Output Misinterpretation

Core method:

extract claim
  -> identify evidence
  -> check support
  -> classify unsupported / contradicted / overextended / supported

Pattern 3: Structured-output evaluation

Best for:

  • F17 Output-Format Drift
  • F18 Boundary/Stopping Error
  • F19 Exact-String Corruption
  • F20 Numeric/Symbolic Fragility
  • F21 Structured-Data Semantic Error
  • F52 Tool-Argument Error

Core method:

parse output
  -> validate schema
  -> validate field semantics
  -> compare exact fields where required

Pattern 4: Agent-process evaluation

Best for:

  • F24 Error Accumulation
  • F25 Invariant Loss
  • F26 Plan Drift
  • F27 Spurious Decomposition
  • F28 Premature Closure
  • F29 Looping/Repetition
  • F51 Tool-Selection Error
  • F52 Tool-Argument Error
  • F53 Tool-Output Misinterpretation
  • F54 Action-Readiness Error
  • F55 Recovery Failure

Core method:

capture trace
  -> evaluate tool choice
  -> evaluate arguments
  -> evaluate state updates
  -> evaluate stopping/recovery
  -> evaluate final action or answer

Pattern 5: Distributional slice evaluation

Best for:

  • F13 Example Overgeneralization
  • F14 Example Underuse
  • F39 Over-Refusal
  • F40 Under-Refusal
  • F42 Tone/Persona Inconsistency
  • F44 Competence Cliff
  • F45 Distributional Overgeneralization
  • F47 Tail-Risk Generation

Core method:

define slices
  -> run same task family across slices
  -> compare pass/fail, severity, and failure type
  -> identify cliffs and unstable boundaries

Pattern 6: Budget stress evaluation

Best for:

  • F01 Context Omission
  • F02 Context Underutilization
  • F22 Local Plausibility Drift
  • F28 Premature Closure
  • F48 Truncation-Induced Loss
  • F49 Compression-Induced Distortion
  • F50 Budget-Induced Incompleteness
  • F55 Recovery Failure

Core method:

increase task/context pressure
  -> constrain latency/tokens/retrieval
  -> inspect missing evidence, skipped verification, and degraded behavior

Boundary with Layer 3 controls

This document should not prescribe system fixes in detail.

It may identify the kind of Layer 3 handoff needed, but the actual controls belong in stack-31-layer-3-control-families.md.

Examples:

Layer 2 faultEvaluation mappingLayer 3 control mapping
Context Omissionexpected-document recall, context inspectionretrieval design, memory rehydration, context assembly checks
Context Underutilizationevidence-answer comparisonprompt structure, source salience, citation forcing, reranking
Control/Data Confusionadversarial context testssource isolation, quoting, sandboxing, instruction stripping
Output-Format Driftparser/schema validationconstrained decoding, validator, retry repair
Unsupported Assertiongrounding evaluationcitation validator, abstention policy, source whitelist
Weak Calibrationcalibration curve, confidence bucket scoringuncertainty policy, external verification, confidence display rules
Tool-Argument Errortrace and schema evaluationtyped tool schemas, guardrails, pre-execution validators
Action-Readiness Erroraction-safety trace evaluationauthorization, confirmation, human review, reversible actions

Worked examples

Example 1: Escalation classifier instability

Scenario

A customer-support assistant must summarize a customer issue and decide whether it should be escalated.

Faults

  • F08 Prompt-Form Sensitivity
  • F09 Task Misinduction
  • F12 Constraint Misclassification
  • F46 Output Variance

Evaluation methods

  • EM1 Repeated-Run Evaluation
  • EM2 Perturbation / Paraphrase Evaluation
  • EM14 Human-Review / Rubric Evaluation

Oracle

  • OR9 Behavioral-Equivalence Oracle
  • OR4 Reference-Answer Oracle
  • OR5 Rubric-Based Human Judgment

Observable failure

Semantically equivalent inputs produce different escalation decisions.

Severity rule

Different wording is harmless. A changed escalation decision is material.

Scenario

A legal assistant answers a contract question using retrieved documents.

Faults

  • F02 Context Underutilization
  • F30 Unsupported Assertion
  • F32 Non-Grounded Justification
  • F34 Evidence-Claim Mismatch
  • F35 Parametric-Prior Override

Evaluation methods

  • EM4 Grounding / Evidence Evaluation
  • EM5 Truth / Factuality Evaluation
  • EM14 Human-Review / Rubric Evaluation

Oracle

  • OR7 Evidence-Entailment Oracle
  • OR8 Retrieval Expected-Document Oracle
  • OR5 Rubric-Based Human Judgment

Observable failure

The answer cites a real clause, but the clause does not support the stated conclusion.

Severity rule

Unsupported legal conclusion is severe even if the prose is fluent and plausible.

Example 3: Structured extraction returns valid JSON with wrong values

Scenario

The model extracts invoice fields into JSON.

Faults

  • F17 Output-Format Drift
  • F19 Exact-String Corruption
  • F20 Numeric/Symbolic Fragility
  • F21 Structured-Data Semantic Error

Evaluation methods

  • EM6 Schema / Parser Validation
  • EM5 Truth / Factuality Evaluation

Oracle

  • OR2 Parser/Schema Oracle
  • OR1 Exact-Match Oracle
  • OR3 Deterministic Calculation Oracle
  • OR4 Reference-Answer Oracle

Observable failure

The JSON parses, but the invoice number is altered and the total is wrong.

Severity rule

Schema validity alone is insufficient. Field values must also be semantically correct.

Example 4: Agent chooses the wrong tool and recovers poorly

Scenario

An agent must look up account status before drafting a response.

Faults

  • F28 Premature Closure
  • F51 Tool-Selection Error
  • F52 Tool-Argument Error
  • F53 Tool-Output Misinterpretation
  • F55 Recovery Failure

Evaluation methods

  • EM8 Agent Trace Evaluation
  • EM7 Reasoning / Process Evaluation
  • EM10 Safety / Policy Evaluation

Oracle

  • OR11 Trace/Process Oracle
  • OR10 Policy-Rule Oracle
  • OR5 Rubric-Based Human Judgment

Observable failure

The agent drafts a definitive answer without using the account-status tool, then fails to recover after the tool returns an error.

Severity rule

A final answer that appears helpful is still a failure if required verification was skipped.

Example 5: Confidence language is misleading

Scenario

A QA assistant answers domain-specific technical questions and expresses confidence.

Faults

  • F31 Plausibility-Truth Gap
  • F36 Weak Confidence Calibration
  • F37 Non-Privileged Self-Evaluation
  • F44 Competence Cliff

Evaluation methods

  • EM9 Calibration Evaluation
  • EM12 Distributional Slice Evaluation
  • EM5 Truth / Factuality Evaluation

Oracle

  • OR12 Statistical Repeated-Run Oracle
  • OR13 Distributional Slice Oracle
  • OR4 Reference-Answer Oracle

Observable failure

The assistant is most confident on questions from the slice where it is least accurate.

Severity rule

Confidence language is harmful when users are likely to rely on it as evidence of correctness.

Minimal implementation checklist

A practical evaluation harness for Layer 2 should capture:

Scenario metadata:
  task type, domain, risk level, expected behavior

Input variants:
  original, paraphrases, adversarial variants, edge cases

Runtime traces:
  prompt, retrieved context, tool calls, tool outputs, state, errors

Output artifacts:
  final answer, citations, structured output, actions, confidence language

Scores:
  correctness, grounding, format validity, policy behavior, tool correctness,
  stability, calibration, latency/cost

Version metadata:
  model, prompt, retrieval index, tools, schemas, policies, data timestamp

Final formulation

Layer 2 evaluation maps behavioral fault modes to the methods and oracles needed to observe them.

It should support four practical goals:

  1. Detect whether a fault occurred.
  2. Measure how often and how severely it occurs.
  3. Compare behavior across versions, prompts, models, tools, and data.
  4. Hand off the detected fault to Layer 3 controls without confusing detection with mitigation.
Layer 2 fault:
  what went wrong behaviorally

Evaluation mapping:
  how we know it went wrong

Layer 3 control:
  what system design prevents, bounds, or recovers from it