Perception and Context Management in Coding Agents

Executive summary

In the language of Martin Fowler and recent writing from LangChain, a harness is often treated as “everything around the model”: state, tool execution, feedback loops, constraints, memory, and orchestration.¹² That is operationally useful, but analytically too coarse for coding agents. Classical AI kept several of these functions separate: sensing and acting in a task environment, maintaining an estimated state under partial observability, controlling what remains in working memory, and closing the loop with monitoring and replanning.³⁴⁵

The cleanest distinction for coding agents is this:

Perception is the process by which an agent obtains observations from its world and turns them into a task-relevant state estimate.
Context management is the meta-process that decides what information should occupy the model’s scarce working set: what to include, exclude, rank, compress, cache, persist, re-inject, or isolate across turns and subagents.⁶⁷⁸

In coding work, perception means reading files, listing directories, querying symbols, running tests, inspecting compiler errors, checking logs, or retrieving docs, then forming hypotheses such as “the failing invariant is enforced in module X” or “the regression is probably in the cache invalidation path.” Retrieval, chunking, summarisation, memory files, compaction, and subagent isolation are therefore mostly context-management mechanisms, even when they operate over perceptual material.⁶⁷⁹

The fact that coding-agent “worlds” differ radically across cases does not dissolve this distinction; it explains why the boundary often feels blurry. In a small repository, access and context selection can nearly coincide because the world is small enough to fit into working memory. In a monorepo, or in production debugging, raw access is abundant but usable salience is scarce, so context management dominates.¹⁰¹¹⁶

What changes across worlds is not the conceptual difference, but which stage is hardest: acquisition, interpretation, or salience control.

Definitions and the central distinction

A precise vocabulary helps. In classical agent theory, the agent is situated in a task environment specified by performance measure, environment, actuators, and sensors. Percepts arrive through sensors, and actions go out through actuators.³ In partially observable settings, the agent does not merely store raw percepts; it maintains a belief state or other state estimate summarising what those percepts imply about the world.⁴

In coding agents, the corresponding environment is not just “the codebase.” It is the codebase plus its harness-mediated affordances: shell, search, tests, documentation, runtime telemetry, permissions, memory artefacts, and tool interfaces.¹¹²¹³

That yields a useful three-part distinction:

Access: what the agent can obtain from the world at all.
Interpretation: what those observations mean for the current task.
Salience management: which interpreted items deserve space in the active context, in what order and representation.

The practical mistake is to call all three “perception.” Once that happens, retrieval failures, localisation errors, prompt overflow, stale memory, and poor summarisation are all lumped together. But they are different failure classes, and they need different fixes.

Retrieval and file access improve access.
Better parsers, localisation heuristics, and verification improve interpretation.
Ranking, compression, pinning, persistence, and subagent isolation improve salience management.¹⁴¹⁵¹⁶⁶

The pipeline below captures the distinction:

  flowchart LR
    A[Access<br/>files, tools, logs, docs, runtime telemetry] --> B[Interpretation<br/>parse, localise, infer, hypothesise]
    B --> C[Context management<br/>select, rank, compress, persist, re-inject]
    C --> D[Model inference]
    D --> E[Action<br/>edits, tests, queries, tool calls]
    E --> F[Feedback<br/>compiler, tests, review, runtime signals]
    F --> B
    F --> C
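
The first three boxes of the pipeline can be kept apart in code as well as in vocabulary. What follows is a minimal Python sketch rather than a reference implementation: the names `Observation`, `Finding`, `access`, `interpret`, and `select`, the keyword-count relevance score, and the whitespace-based token estimate are all illustrative assumptions, not something prescribed by the cited sources.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Access: raw material obtained from the world (hypothetical type)."""
    source: str
    text: str


@dataclass
class Finding:
    """Interpretation: what an observation means for the current task."""
    claim: str
    relevance: float  # assumed score in [0, 1]
    tokens: int       # rough size estimate: whitespace-separated words


def access(world: dict[str, str], paths: list[str]) -> list[Observation]:
    """Return what the agent can actually obtain; unknown paths are skipped."""
    return [Observation(p, world[p]) for p in paths if p in world]


def interpret(observations: list[Observation], keyword: str) -> list[Finding]:
    """Toy interpretation: score each observation by keyword mentions."""
    findings = []
    for obs in observations:
        hits = obs.text.count(keyword)
        if hits:
            findings.append(Finding(
                claim=f"{obs.source} mentions {keyword!r} {hits} time(s)",
                relevance=min(1.0, hits / 3),
                tokens=len(obs.text.split()),
            ))
    return findings


def select(findings: list[Finding], budget: int) -> list[Finding]:
    """Salience management: greedily keep the most relevant findings that fit."""
    chosen, used = [], 0
    for f in sorted(findings, key=lambda f: f.relevance, reverse=True):
        if used + f.tokens <= budget:
            chosen.append(f)
            used += f.tokens
    return chosen
```

With this separation, a retrieval bug shows up as a missing `Observation`, a localisation bug as a wrong `Finding`, and an overflow or ranking bug as a poor result from `select`, so each failure class can be tested on its own.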

Classical terms and LLM-agent analogues

Classical construct	Classical role	LLM-agent analogue	Why it is not identical to “perception”
Sensors / percepts	Acquire observations from the environment	File reads, `grep`, AST queries, test outputs, logs, web/docs retrieval, MCP resources	These provide raw access only; they do not decide what remains in the working set.³¹³
Belief state / state estimator	Summarise history into a sufficient statistic under partial observability	Issue-localisation hypotheses, scratchpads, structured notes, repo/runtime state summaries	Interpretation sits between raw observations and prompt inclusion.⁴
Working memory	Hold currently active, task-relevant state	The context window plus state objects exposed to the model	This is the managed result of selection, not sensing itself.¹⁷⁶⁸
BDI beliefs / desires / intentions	Separate information about the world, motivations, and commitments	Retrieved facts, task specs, TODO lists, plans, current subgoals	Intentional commitments and planning are downstream of perception.¹⁸
Blackboard + agenda control	Shared workspace plus control over which hypotheses to pursue	Shared scratchpads, tool-result stores, retrieval/reranking policies, planner queues	The control policy over the workspace is context management, not sensing.¹⁹
Cognitive architecture modules	Couple perception, memory, reasoning, and action under fixed interfaces	Tool APIs, memory stores, runtime state, subagents, sandboxes	Architectures separate modules precisely because perception alone is insufficient.¹⁷²⁰²¹
MAPE-K	Monitor, analyse, plan, execute over shared knowledge	Tool observations, diagnosis, patch planning, execution, persistent agent memory	“Monitor” is closest to perception; Analyse, Plan, and Knowledge belong to interpretation and context control.⁵

Classical lineage in AI

Classical AI did not treat perception as “everything before action.” In Artificial Intelligence: A Modern Approach, the PEAS framing explicitly separates sensors and actuators from the rest of the agent design; percepts are what the architecture makes available to the program, while rational behaviour depends on the whole task environment and the agent program.³ That is already enough to show why “context” is not just perception: task specification, success criteria, and internal control are distinct from sensing.

The sharper classical analogue is the POMDP. Kaelbling, Littman, and Cassandra argue that in partially observable domains an agent must use memory of previous actions and observations to disambiguate the world; the agent maintains a belief state that summarises prior experience, and this belief state is a sufficient statistic for acting optimally.⁴ For coding agents, this is the right abstraction: the repo’s latent state is not directly visible from any single file read or log line, so the agent must infer and update a working hypothesis of where the problem is and what constraints matter. Perception, then, is not the belief state itself, but the observation-update process that produces it.
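
For reference, the belief update behind this paragraph can be written in the notation of Kaelbling, Littman, and Cassandra, with state-transition function T, observation function O, and current belief b. Reading s as a candidate fault location, a as a diagnostic action such as running a test, and o as the resulting output is an interpretive assumption made here, not part of the original formalism.

```latex
b'(s') \;=\; \frac{O(s', a, o)\,\sum_{s \in \mathcal{S}} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)}
```

The denominator is a normalising constant, so each new observation reweights the existing hypotheses rather than replacing them.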

The BDI tradition makes another useful cut. Rao and Georgeff characterise beliefs as the informative component of system state and distinguish them from desires and intentions.¹⁸ That maps naturally onto coding agents: retrieved facts and inferred repo state correspond to beliefs; user requests and acceptance criteria correspond to desires; the current patch plan, selected subtask, or active TODO corresponds to intention. Context management is what keeps these categories coherent and available; perception is how the belief component gets updated.

The blackboard architecture sharpens the control point. Hayes-Roth’s blackboard architecture distinguishes domain and control problems, shared knowledge, and solutions; the point is not merely to collect inputs, but to control which hypotheses and knowledge sources fire when.¹⁹ This is close to modern agent harnesses: the hard problem is not only reading files, but deciding which snippets, hypotheses, and tools deserve another turn of reasoning.

Cognitive architectures make the same separation in a different vocabulary. Soar presents intelligence as a task-independent infrastructure with working memory, long-term memory, processing modules, and interfaces to perception and action; its working memory maintains situational awareness, including perceptual inputs and intermediate reasoning results.¹⁷ Broader cognitive-architecture reviews describe architectures as theories of the representations and mechanisms underlying cognition.²⁰ Again, perception supplies inputs, but the architecture must decide what remains active, how it is transformed, and how it supports action.

Finally, MAPE-K from IBM offers a clean engineering decomposition. Monitor collects details from managed resources and turns them into analysable symptoms; Analyse determines whether change is needed; Plan selects a procedure; Execute applies effectors; Knowledge stores shared policies, logs, symptoms, and topology.⁵ For coding agents, perception is closest to Monitor plus the formation of symptoms. Context management sits mainly in Knowledge and in the policies governing Analyse and Plan.

Modern LLM-agent literature

Modern LLM-agent work largely reproduces these older distinctions, though under new names. ReAct explicitly interleaves reasoning traces and actions so that external observations can update the next reasoning step.²² Toolformer learns when to call tools and how to incorporate the results.²³ These are active-perception systems: they do not just wait for input, they take epistemic actions to obtain new information. But once the observation comes back, the system still faces a separate problem: what to keep from the tool output, and how to weave it into subsequent inference.

Retrieval-augmented models make the access/context split even clearer. REALM and RAG augment parametric memory with explicit non-parametric memory accessed during inference.²⁴²⁵ In those systems, retrieval is a mechanism of access to external knowledge; it is not identical to interpretation, and still less to context management. Retrieved passages must still be ranked, filtered, and integrated into generation.

Chain-of-thought belongs on the interpretation side, not the perception side. Wei et al. show that intermediate reasoning traces can improve performance on complex reasoning tasks, and ReAct shows that such reasoning can be interleaved with actions.²⁶²² But chain-of-thought is an internal computational trace, not an environmental observation. In agents, it often becomes part of the future context, which means it can later be managed as context even though it did not originate as perception.

Memory papers move squarely into context management. Reflexion stores verbal feedback in an episodic memory buffer for later trials.²⁷ Voyager grows a skill library and uses execution feedback plus self-verification to refine code.²⁸ MemoryBank, Generative Agents, LongMem, and MemGPT all address the deficit created by finite context windows by adding persistence, retrieval, updating, or multi-tier storage.²⁹³⁰³¹³² None of these works primarily solves perception in the classical sense; they solve how to write, store, recall, and compress information across turns and sessions.

The most explicit bridge to classical AI is CoALA. It proposes a language-agent framework with modular memory components, structured action spaces for external environments and internal memory, and a generalised decision process.²¹ This is essentially the old cognitive-architecture insight translated into LLM terms: agents need distinct mechanisms for memory, action, and control, not just a larger prompt.

A crucial caution from the literature is that self-critique is not a substitute for external grounding. Some work shows gains from iterative self-feedback and self-refinement, and separate work shows that LLMs can perform self-verification in some settings.³³³⁴ But Stechly, Valmeekam, and Kambhampati find that performance collapses substantially under self-critique on reasoning and planning tasks, whereas sound external verification produces much stronger improvements.³⁵ For coding agents, this strongly suggests that tests, compilers, linters, and execution traces are not optional extras; they are central feedback channels in the harness.

Selected core papers and official sources

Title	Year	Short summary	Why relevant here
[Intelligent Agents in Artificial Intelligence: A Modern Approach]³	2021	Defines PEAS, sensors, actuators, rational agents, and task environments	Baseline classical definition of perception and environment
[BDI Agents: From Theory to Practice]¹⁸	1995	Separates beliefs, desires, and intentions and gives an abstract interpreter loop	Useful map from observation to commitment and plan
[A Blackboard Architecture for Control]¹⁹	1985	Treats control over shared knowledge as a first-class AI problem	Direct analogue of context-selection and agenda control
[An Architectural Blueprint for Autonomic Computing]⁵	2005	Formalises Monitor-Analyse-Plan-Execute over shared knowledge	Strong engineering analogue of perception vs control vs memory
[Introduction to the Soar Cognitive Architecture]¹⁷	2022	Describes working memory, long-term memory, learning, perception interfaces, and action	Good cognitive-architecture lens on coding agents
[Chain-of-Thought Prompting Elicits Reasoning in Large Language Models]²⁶	2022	Shows intermediate reasoning traces improve complex reasoning	Clarifies that reasoning traces are internal interpretation, not perception
[ReAct: Synergizing Reasoning and Acting in Language Models]²²	2022	Interleaves reasoning with actions and observations	Canonical active-perception loop for LLM agents
[Toolformer]²³	2023	Trains models to decide when and how to call external tools	Makes access to external functions part of the policy
[Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks]²⁵	2020	Combines parametric and non-parametric memory for generation	Key source for access-via-retrieval vs generation distinction
[Reflexion]²⁷	2023	Uses verbal reinforcement and episodic memory buffers across trials	Memory/feedback as context management rather than perception
[Voyager]²⁸	2023	Uses curriculum, skill library, environment feedback, and self-verification	Strong example of memory plus verification in an agent loop
[Cognitive Architectures for Language Agents]²¹	2024	Proposes modular memory and structured action spaces for language agents	Best explicit bridge from classical AI to LLM agents
[Lost in the Middle]¹⁵	2024	Shows long-context models strongly depend on evidence position	Core evidence that salience and ordering matter
[LongBench]³⁶	2024	Multitask benchmark for long-context understanding	Shows long context remains difficult and retrieval sometimes helps
[RULER]³⁷	2024	More demanding synthetic benchmark for long-context models	Shows claimed context size often overstates usable context
[NoLiMa]³⁸	2025	Reduces literal-overlap shortcuts in long-context retrieval tests	Important caution against overestimating retrieval ability
[LLMLingua]¹⁶	2023	Prompt-compression method with high compression and little loss	Important evidence that compression can be useful but must be engineered
[RepoCoder]³⁹	2023	Iterative retrieval-generation for repository-level completion	Direct coding-agent evidence that retrieval helps but must be iterative
[RepoBench]¹⁰	2024	Separates repository-level retrieval, completion, and end-to-end pipeline tasks	Excellent benchmark for disentangling access from integration
[CodeRAG-Bench]¹⁴	2024	Large-scale analysis of retrieval for code generation across sources	Shows retrieval helps, but lexical mismatch and integration remain hard
[SWE-bench]¹¹	2023	Real GitHub issues requiring repo understanding and edits	High-value end-to-end benchmark for coding agents
[EvalPlus]⁴⁰	2023	Strengthens code-generation tests and shows weak tests inflate scores	Important caution for evaluating coding agents and harnesses

For engineering practice rather than academic theory, the most useful official sources are the Fowler/Thoughtworks article on harness engineering, Anthropic’s essays on context engineering and contextual retrieval, OpenAI’s prompt-caching documentation, the official Model Context Protocol specification, and recent LangChain essays on harness and context engineering.¹⁶⁴¹⁴²¹²²⁷

Engineering implications for coding agents

The most useful engineering interpretation of the classical distinction is that a coding-agent harness should be decomposed into at least six functions:

Guidance
Access
Interpretation support
Salience control
Execution
Feedback

Fowler’s coding-agent article frames the user harness in terms of feedforward guides and feedback sensors.¹ LangChain defines a harness as the code, configuration, and execution logic that add state, tools, and constraints around a raw model.² Anthropic defines context engineering as the iterative curation of the full token state during inference, including system instructions, tools, data, and message history.⁶

Put together, these sources imply that “harness” is best treated as an umbrella over several conceptually separable subsystems.

  flowchart TB
    H[Harness functions]
    H --> G[Guidance<br/>instructions, rules, skills]
    H --> A[Access<br/>tools, resources, retrieval]
    H --> I[Interpretation support<br/>parsers, localisation, structured notes]
    H --> S[Salience control<br/>ranking, chunking, compaction, memory selection]
    H --> X[Execution<br/>shell, editors, sandboxes, effectors]
    H --> F[Feedback<br/>tests, reviews, runtime signals, verifiers]

This taxonomy is a synthesis of current harness and context-engineering practice rather than a single paper’s formalism, but it fits the evidence well.¹⁶⁷

Several concrete techniques belong specifically to context management, not perception.

Chunking

Chunking breaks material into retrieval units sized for relevance and latency trade-offs. Good chunking increases the chance that retrieved units are both semantically meaningful and small enough for efficient downstream use.⁴³
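
One simple way to realise this trade-off is a sliding window over lines with a small overlap, sketched below. The function name `chunk_lines` and the parameters `max_lines` and `overlap` are assumptions for illustration; production systems often split on syntactic boundaries such as functions or classes instead.

```python
def chunk_lines(text: str, max_lines: int = 40, overlap: int = 5) -> list[tuple[int, int, str]]:
    """Split text into overlapping line windows.

    Returns (first_line, last_line, chunk_text) tuples with 1-based,
    inclusive line numbers so a chunk can be traced back to its source.
    """
    if max_lines <= 0 or not 0 <= overlap < max_lines:
        raise ValueError("need max_lines > 0 and 0 <= overlap < max_lines")
    lines = text.splitlines()
    step = max_lines - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(lines):
        end = min(start + max_lines, len(lines))
        chunks.append((start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break  # the last window already reached the end of the text
        start += step
    return chunks
```

Larger windows carry more surrounding meaning per retrieved unit but cost more tokens downstream; the overlap reduces the chance that a relevant statement is cut off from the lines that explain it.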

Contextual retrieval

Contextual retrieval tries to preserve enough local context around chunks to reduce failed retrievals. Anthropic reports large reductions in failed retrievals, especially when combined with reranking.⁴¹ This is still not perception proper: it is a salience-improving transform over retrievable memory.
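
The idea can be illustrated with a deliberately small sketch. In the cited Anthropic essay the situating context is generated by a model; here a deterministic stand-in (file path plus enclosing symbol) is used instead, and `contextualize` and `lexical_overlap` are hypothetical helper names, with a toy word-overlap score standing in for a real sparse or dense retriever.

```python
import re


def contextualize(chunk: str, path: str, symbol: str) -> str:
    """Prepend situating context to a chunk before it is indexed.

    Stand-in for a model-generated description: only the file path and
    the enclosing symbol are used.
    """
    return f"File: {path}\nSymbol: {symbol}\n{chunk}"


def lexical_overlap(query: str, document: str) -> int:
    """Toy relevance score: number of distinct query words found in the document."""
    def tokenize(s: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", s.lower()))
    return len(tokenize(query) & tokenize(document))


chunk = "return total * (1 - rate)"
indexed = contextualize(chunk, "billing/discounts.py", "apply_discount")
```

The bare chunk shares no words with a query such as "billing discount", whereas the contextualised version does, which is the kind of failed retrieval the technique is meant to reduce.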

Compaction

Compaction summarises prior dialogue or tool history into a smaller continuation state. Anthropic recommends tuning compaction first for recall and only then for precision; Claude Code uses compression plus recently accessed files; LangChain groups this under “compress.”⁶⁸⁷ Again, this is management of working memory under a budget, not sensing.

Persistent memory

Persistent memory stores facts or rules outside the immediate context window. Claude Code documents always-loaded instruction files and auto memory loaded at session start, with some artefacts re-injected after compaction.⁹⁸ MemGPT makes the same design explicit via storage tiers.³² What practitioners often call pinning is best understood as forcing certain artefacts to remain in the active or re-injected working set by policy: system prompts, root-level instruction files, or durable summaries.
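
Compaction and pinning can be sketched together. The example below is an assumption-laden illustration, not a description of how Claude Code or MemGPT work internally: `Message`, its `pinned` flag, and `compact` are hypothetical names, and the `summarize` argument stands in for a model call that would produce the continuation summary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    role: str
    text: str
    pinned: bool = False  # hypothetical flag: must survive compaction verbatim


def compact(
    history: list[Message],
    keep_recent: int,
    summarize: Callable[[list[Message]], str],
) -> list[Message]:
    """Shrink a history to: pinned messages, one summary, the most recent turns."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    pinned = [m for m in history if m.pinned]
    rest = [m for m in history if not m.pinned]
    # Guard keep_recent == 0 explicitly, because rest[:-0] would be empty.
    old = rest[:-keep_recent] if keep_recent else rest
    recent = rest[-keep_recent:] if keep_recent else []
    if not old:
        return pinned + recent  # nothing old enough to summarise
    summary = Message("system", summarize(old))
    # Pinned items are re-injected at the front, ahead of the summary.
    return pinned + [summary] + recent


def stub_summarize(messages: list[Message]) -> str:
    """Placeholder for a model-generated summary."""
    return f"Summary of {len(messages)} earlier messages"
```

The design choice worth noting is that pinning is a policy applied before summarisation: pinned artefacts never pass through the lossy step, while everything else competes for the remaining budget.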

Caching

Caching should be distinguished from pinning. Prompt caching in OpenAI and Anthropic documentation reduces latency and cost for repeated prompt prefixes, especially when static content is front-loaded, but caching does not decide relevance; it merely amortises repeated use of already chosen context.⁴²⁴⁴ It is therefore an optimisation over stable context, not a replacement for good selection or compression.
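
Why front-loading static content matters can be shown without any provider API. The sketch below compares the shared prefix of two consecutive prompts under two orderings; it measures characters, whereas real prompt caches match on token prefixes and apply provider-specific minimum lengths, so this is a simplification. The names `build_prompt` and `shared_prefix_len` and the sample strings are illustrative assumptions.

```python
import os


def build_prompt(first_parts: list[str], later_parts: list[str]) -> str:
    """Join prompt parts in order; whatever comes first forms the reusable prefix."""
    return "\n".join([*first_parts, *later_parts])


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common character prefix of two prompts."""
    return len(os.path.commonprefix([a, b]))


static = ["SYSTEM: project rules", "TOOLS: grep, pytest"]  # stable across turns
turn_1 = ["USER: fix the cache bug"]
turn_2 = ["USER: add request logging"]

# Static content first: both turns share the whole static block as a prefix.
static_first = shared_prefix_len(build_prompt(static, turn_1),
                                 build_prompt(static, turn_2))

# Dynamic content first: the prompts diverge almost immediately.
dynamic_first = shared_prefix_len(build_prompt(turn_1, static),
                                  build_prompt(turn_2, static))
```

Neither ordering changes which content was selected; only the amount of reusable prefix differs, which is why caching is an optimisation layered on top of selection rather than a form of it.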

Tool and resource interfaces

Tool interfaces matter because they define what counts as “the world” for the agent. The official Model Context Protocol specification separates resources as context/data from tools as executable functions and places user consent and tool safety at the protocol level.¹²¹³ That separation is conceptually valuable: resources are closer to perception and memory access, tools to actuation, while the harness must still decide what to read, call, and expose to the model.

Cases, failure modes, and evaluation

Small repository

In a small repo, perception and context management are easier to conflate because most of the relevant code can often be read cheaply. The main bottleneck is usually interpretation: localising the bug, understanding conventions, and mapping the issue description to the right symbols and files.

In this regime, richer retrieval systems may help only marginally; strong tests and lightweight persistent project guidance often matter more. RepoCoder’s gains over in-file baselines and Claude Code’s documented memory mechanisms support this: useful cross-file information matters even in simpler settings, but the integration burden is manageable.³⁹⁹

Monorepo and large codebase work

In monorepos, access is no longer the main issue; salience is. The harness can surface millions of candidate tokens, but the model cannot hold them all in useful working memory. Here, hierarchical retrieval, chunking, selector tools, scratchpads, and subagents become central.

Anthropic’s just-in-time retrieval strategy, LangChain’s write/select/compress/isolate framework, and repo-level benchmarks like RepoBench all point in the same direction: large-codebase success depends on controlling the working set, not simply expanding raw access.⁶⁷¹⁰

Production debugging

Production debugging is the most clearly partially observable coding-agent world. The true causal state is latent; logs, metrics, traces, alerts, and failing tests are noisy proxies. The classical POMDP intuition applies directly: the agent should maintain and revise hypotheses, and it should sometimes take epistemic actions that gather information rather than immediately patching code.⁴

MAPE-K is also a strong analogue here: monitoring creates symptoms, analysis forms diagnoses, planning selects interventions, and execution applies them.⁵ In this setting, thinking of context management as equivalent to perception is especially misleading, because the hard problem is not getting more bytes but maintaining a faithful causal picture under uncertainty.
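
The hypothesis-maintenance idea can be made concrete with a Bayes update over a small, fixed set of candidate causes. This is a sketch under strong assumptions: the hypothesis names, the likelihood values, and the `commit_threshold` are invented for illustration, and a real agent would have to estimate such likelihoods rather than read them off.

```python
import math


def update_beliefs(prior: dict[str, float], likelihood: dict[str, float]) -> dict[str, float]:
    """Bayes rule: posterior(h) is proportional to prior(h) * P(observation | h)."""
    unnormalised = {h: p * likelihood.get(h, 0.0) for h, p in prior.items()}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: v / total for h, v in unnormalised.items()}


def entropy_bits(beliefs: dict[str, float]) -> float:
    """Remaining uncertainty over the hypotheses, in bits."""
    return -sum(p * math.log2(p) for p in beliefs.values() if p > 0)


def next_step(beliefs: dict[str, float], commit_threshold: float = 0.8) -> str:
    """Patch only when one hypothesis dominates; otherwise take an epistemic action."""
    best, p = max(beliefs.items(), key=lambda kv: kv[1])
    return f"patch:{best}" if p >= commit_threshold else "gather_more_evidence"


prior = {"cache_invalidation": 1 / 3, "db_timeout": 1 / 3, "bad_deploy": 1 / 3}
# Hypothetical observation: errors began before the latest deploy.
posterior = update_beliefs(
    prior, {"cache_invalidation": 0.6, "db_timeout": 0.6, "bad_deploy": 0.05}
)
```

After this one observation the deploy hypothesis is nearly ruled out, but two causes remain equally likely, so `next_step` recommends gathering more evidence rather than patching.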

Architecture and design tasks

Architecture tasks are often perception-light but context-heavy. There may be no failing stack trace or narrow bug location. Instead, the agent must infer constraints from existing conventions, dependency structure, performance goals, and team norms, then preserve those high-level constraints over a long horizon.

Persistent design memory, project rules, feedforward guidance, and architecture-specific verifiers matter more than raw retrieval volume. Fowler’s feedforward/feedback framing is most useful here: architecture work needs guides that encode desired structure and feedback that checks boundary violations.¹

Harness-mediated worlds

The observation from the executive summary that coding-agent worlds differ radically across cases is fundamentally right. A coding agent’s world is not a fixed physical environment; it is partly constructed by the harness.

One harness may expose raw files, shell, logs, and runtime dashboards. Another may expose only summarised resources. One may allow ephemeral web search; another may prohibit it but provide managed memory files. One may return long tool payloads; another may aggressively truncate.

Because of that, what appears as perception in one system may appear as pre-curated context in another. The conceptual distinction still holds, but the implementation boundary moves with the harness.¹²⁶²⁸

Failure modes and mitigation

Access failures

Access failures include missing tools, stale or incomplete indexes, lexical-mismatch retrieval, permissions gaps, and absent runtime observability. CodeRAG-Bench shows that retrievers struggle especially when lexical overlap is weak.¹⁴

Mitigations include hybrid sparse+dense retrieval, contextual retrieval, better chunking, just-in-time file navigation, and richer resource interfaces.⁴¹⁴³⁶¹²
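
One common way to combine a sparse and a dense retriever is reciprocal rank fusion (RRF), which needs only the two ranked lists and not their raw scores. RRF is named here as one possible fusion method, not as the approach used by any of the cited systems; the file names are made up, and `k = 60` is the conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse = ["a.py", "b.py", "c.py"]   # e.g. from a keyword index
dense = ["c.py", "a.py", "d.py"]    # e.g. from an embedding index
fused = reciprocal_rank_fusion([sparse, dense])
```

Documents that both retrievers rank well rise to the top, while a document found by only one retriever is still kept, which helps when lexical overlap with the query is weak.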

Interpretation failures

Interpretation failures include wrong issue localisation, confusion between symptom and cause, brittle parsing of logs or stack traces, and overconfidence in a single hypothesis.

POMDP-style belief maintenance, structured notes, multiple candidate hypotheses, and external verifiers all help.⁴⁶ The self-verification literature suggests that external, sound checks are more reliable than self-critique alone.³⁵

Salience failures

Salience failures include position bias, context overflow, context poisoning, contradiction between memories, compaction loss, and stale persistent rules.

Lost in the Middle, LongBench, RULER, and NoLiMa all show different facets of this problem: long context is not automatically usable context.¹⁵³⁶³⁷³⁸ Mitigations include stronger reranking, explicit pinning of durable policies, compaction tuned for recall before precision, subagent isolation, and selective clearing of bulky tool results.⁶⁸

Evaluation

A rigorous evaluation stack should measure stages separately as well as end to end.

For access/retrieval, standard IR metrics such as recall@k, MRR, and nDCG remain appropriate; dense-retrieval work such as DPR and benchmark suites such as BEIR and MTEB show why heterogeneous evaluation matters.⁴⁵⁴⁶⁴⁷
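
For a single query with binary relevance judgements, these metrics are short enough to state directly. The sketch below assumes a ranked list of document identifiers and a set of relevant identifiers; MRR is the mean of `reciprocal_rank` over all queries, and graded-relevance nDCG would need per-document gains that are not modelled here.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Recall@k answers whether the right file was reachable at all, while reciprocal rank and nDCG reward placing it early, which matters once the retrieved list is truncated to fit a context budget.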

For context use, long-context benchmarks such as LongBench, RULER, and NoLiMa are more informative than raw context-window claims.³⁶³⁷³⁸

For coding outcomes, use task-level measures such as pass@k on HumanEval-style tasks, stricter functional correctness via EvalPlus, repository-level retrieval/completion/pipeline measures via RepoBench, and issue-resolution rates via SWE-bench and SWE-bench Verified.⁴⁸⁴⁰¹⁰¹¹⁴⁹ The Verified subset is especially useful because its tasks were screened by human annotators for solvability and test quality.
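
pass@k is usually computed with the unbiased estimator introduced alongside HumanEval: draw n samples per task with n at least k, count the c samples that pass the tests, and estimate the probability that at least one of k randomly chosen samples passes. The function name and argument checks below are this sketch's own choices.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed, k: evaluation budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level figure is the mean of this value over tasks. Because the estimate depends entirely on which samples count as passing, stronger tests of the kind EvalPlus adds can lower it without any change to the model.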

The most important benchmark-design lesson is to separate stage errors. RepoBench does this explicitly with retrieval, completion, and pipeline tasks.¹⁰ Without such decomposition, a system that fails because it cannot retrieve the right file is indistinguishable from one that retrieved perfectly but then compressed away the critical detail, or one that preserved the right detail but made a poor patch. For harness evaluation, that separation is indispensable.
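
The three cases in this paragraph can be told apart mechanically if the harness logs what was retrieved and what reached the prompt. The sketch below assumes the benchmark provides gold file paths per task; `Stage`, `attribute_failure`, and the file-level granularity are illustrative assumptions, not part of RepoBench.

```python
from enum import Enum


class Stage(Enum):
    NONE = "no failure"
    ACCESS = "gold file never retrieved"
    SALIENCE = "retrieved but missing from the final context"
    GENERATION = "in context but the patch was wrong"


def attribute_failure(
    gold_files: set[str],
    retrieved: set[str],
    in_context: set[str],
    tests_passed: bool,
) -> Stage:
    """Attribute a failed task to the earliest stage that lost the gold files."""
    if tests_passed:
        return Stage.NONE
    if not gold_files <= retrieved:
        return Stage.ACCESS
    if not gold_files <= in_context:
        return Stage.SALIENCE
    return Stage.GENERATION
```

Counting failures per stage across a benchmark run shows whether effort is better spent on retrieval, on compaction and ranking, or on the model's patching ability.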

Open problems and research directions

1. Salience modelling

Current systems mostly use heuristic retrieval, ranking, or summarisation. We still lack a robust, task-general theory of which observations should survive into the model’s active context, especially under long-horizon coding work. Existing long-context studies show that more tokens are not enough; what matters is whether the right tokens are made salient in the right representation and position.¹⁵³⁶³⁷³⁸

2. Faithful compression

Compaction and summarisation are necessary for long-running coding agents, but they are also lossy transformations. Anthropic’s guidance to optimise recall first is sound, yet there is still no widely adopted benchmark for “compression fidelity for future software work,” where a fact may look irrelevant now but become critical twenty turns later.⁶⁷

3. Benchmark contamination and realism

SWE-bench is valuable, but coding-agent evaluation remains vulnerable to weak tests, benchmark contamination, and mismatch between benchmark tasks and real IDE workflows.¹¹⁴⁹ EvalPlus already shows how much weak tests can inflate apparent code correctness.⁴⁰ The broader lesson is that harness research needs stronger stage-wise and end-to-end benchmarks grounded in modern, contamination-aware repo settings.

4. Security and trust in tool-mediated worlds

As MCP and related systems standardise resources and tools, the attack surface grows: unsafe tools, misleading tool descriptions, and over-broad permissions make both perception and execution untrustworthy.¹²¹³ Classical AI already warned that action and information gathering are tightly coupled; for coding agents, trustworthy perception increasingly depends on trustworthy interfaces.

Conclusion

For coding agents, perception should mean observation plus state estimation. Context management should mean the control of scarce working memory: selection, ranking, compression, persistence, re-injection, and isolation.

The two are deeply coupled, and different coding worlds shift the practical boundary, but they are not the same thing. Treating them as separate layers yields better theory, better harness design, and better evaluation. Treating them as one undifferentiated “harness” obscures the real engineering trade-offs.¹²⁶¹⁰

Sources

1. Martin Fowler, “Harness Engineering,” https://martinfowler.com/articles/harness-engineering.html
2. LangChain, “The Anatomy of an Agent Harness,” https://www.langchain.com/blog/the-anatomy-of-an-agent-harness
3. Russell and Norvig, Artificial Intelligence: A Modern Approach, Chapter 2, “Intelligent Agents,” https://aima.cs.berkeley.edu/4th-ed/pdfs/newchap02.pdf
4. Kaelbling, Littman, and Cassandra, “Planning and Acting in Partially Observable Stochastic Domains,” https://people.csail.mit.edu/lpk/papers/aij98-pomdp.pdf
5. IBM, “An Architectural Blueprint for Autonomic Computing,” https://users.cs.fiu.edu/~sadjadi/Teaching/Autonomic%20Grid%20Computing/CIS-6612-Summer-2006/AC-Blueprint-WhitePaper-V7.pdf
6. Anthropic, “Effective Context Engineering for AI Agents,” https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
7. LangChain, “Context Engineering for Agents,” https://www.langchain.com/blog/context-engineering-for-agents
8. Anthropic, “Claude Code: Context Window,” https://code.claude.com/docs/en/context-window
9. Anthropic, “Claude Code: Memory,” https://code.claude.com/docs/en/memory
10. Liu et al., “RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems,” https://proceedings.iclr.cc/paper_files/paper/2024/file/d191ba4c8923ed8fd8935b7c98658b5f-Paper-Conference.pdf
11. Jimenez et al., “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” https://arxiv.org/abs/2310.06770
12. Model Context Protocol, “Specification,” https://modelcontextprotocol.io/specification/2025-11-25
13. Model Context Protocol, “Server Resources,” https://modelcontextprotocol.io/specification/2025-06-18/server/resources
14. Wang et al., “CodeRAG-Bench: Can Retrieval Augment Code Generation?” https://arxiv.org/abs/2406.14497
15. Liu et al., “Lost in the Middle: How Language Models Use Long Contexts,” https://arxiv.org/abs/2307.03172
16. Jiang et al., “LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models,” https://arxiv.org/abs/2310.05736
17. Laird, “Introduction to the Soar Cognitive Architecture,” https://arxiv.org/pdf/2205.03854
18. Rao and Georgeff, “BDI Agents: From Theory to Practice,” https://cdn.aaai.org/ICMAS/1995/ICMAS95-042.pdf
19. Hayes-Roth, “A Blackboard Architecture for Control,” https://www.sciencedirect.com/science/article/pii/0004370285900633
20. Taatgen and Anderson, “Why Do Children Learn to Say ‘Broke’? A Model of Learning the Past Tense Without Feedback,” ACT-R-related cognitive architecture source, https://act-r.psy.cmu.edu/wordpress/wp-content/uploads/2012/12/849taatgen.pdf
21. Sumers et al., “Cognitive Architectures for Language Agents,” https://arxiv.org/abs/2309.02427
22. Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models,” https://arxiv.org/abs/2210.03629
23. Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools,” https://arxiv.org/abs/2302.04761
24. Guu et al., “REALM: Retrieval-Augmented Language Model Pre-Training,” https://arxiv.org/abs/2002.08909
25. Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” https://arxiv.org/abs/2005.11401
26. Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” https://arxiv.org/abs/2201.11903
27. Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning,” https://arxiv.org/abs/2303.11366
28. Wang et al., “Voyager: An Open-Ended Embodied Agent with Large Language Models,” https://arxiv.org/abs/2305.16291
29. Zhong et al., “MemoryBank: Enhancing Large Language Models with Long-Term Memory,” https://arxiv.org/abs/2305.10250
30. Park et al., “Generative Agents: Interactive Simulacra of Human Behavior,” https://arxiv.org/abs/2304.03442
31. Wang et al., “Augmenting Language Models with Long-Term Memory,” https://arxiv.org/abs/2306.07174
32. Packer et al., “MemGPT: Towards LLMs as Operating Systems,” https://arxiv.org/pdf/2310.08560; project page: https://research.memgpt.ai/
33. Madaan et al., “Self-Refine: Iterative Refinement with Self-Feedback,” https://arxiv.org/abs/2303.17651
34. Weng et al., “Large Language Models Are Better Reasoners with Self-Verification,” https://arxiv.org/abs/2212.09561
35. Stechly, Valmeekam, and Kambhampati, “On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks,” https://arxiv.org/abs/2402.08115
36. Bai et al., “LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding,” https://arxiv.org/abs/2308.14508
37. Hsieh et al., “RULER: What’s the Real Context Size of Your Long-Context Language Models?” https://arxiv.org/abs/2404.06654
38. Modarressi et al., “NoLiMa: Long-Context Evaluation Beyond Literal Matching,” https://arxiv.org/abs/2502.05167
39. Zhang et al., “RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation,” https://arxiv.org/abs/2303.12570
40. Liu et al., “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation,” https://arxiv.org/abs/2305.01210
Anthropic, “Contextual Retrieval,” https://www.anthropic.com/news/contextual-retrieval ↩︎ ↩︎ ↩︎
OpenAI, “Prompt Caching,” https://developers.openai.com/api/docs/guides/prompt-caching ↩︎ ↩︎
Pinecone, “Chunking Strategies for LLM Applications,” https://www.pinecone.io/learn/chunking-strategies/ ↩︎ ↩︎
Anthropic, “Prompt Caching,” https://platform.claude.com/docs/en/build-with-claude/prompt-caching ↩︎
Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering,” https://arxiv.org/abs/2004.04906 ↩︎
Thakur et al., “BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models,” https://arxiv.org/abs/2104.08663 ↩︎
Muennighoff et al., “MTEB: Massive Text Embedding Benchmark,” https://arxiv.org/abs/2210.07316 ↩︎
Chen et al., “Evaluating Large Language Models Trained on Code,” https://arxiv.org/abs/2107.03374 ↩︎
SWE-bench Verified, https://www.swebench.com/verified.html ↩︎ ↩︎