Representation Design is the process of deriving searchable access surfaces from source material so that different task-shaped information needs can reach the right evidence.
The user request is not the retrieval query; the system first derives a task context requirement, then matches that requirement against searchable representations. Your current docs already point toward this with “chunks, summaries, entities, metadata, tables, synthetic questions, relationships, or structured fields,” but the section can make this a design method rather than a list.
These keep the source close to its original form. They are useful when the system needs grounded evidence, citations, or exact context.
Examples:
document chunks
section-level chunks
paragraph passages
sentence-level evidence units
code blocks
table objects
figure/image regions
page-level units
document outline nodes
These are not just “chunks.” The unit should reflect the evidence shape. A legal clause, a table row, a troubleshooting step, and a code function should not necessarily be represented with the same chunking strategy.
Use when the task requires:
citation
grounding
close reading
answering from source text
exact reference
evidence inspection
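To make "the unit should reflect the evidence shape" concrete, here is a minimal sketch that splits a source differently depending on what kind of evidence it holds. The `SourceUnit` type, the `kind` labels, the locator format, and the splitting rules are all assumptions made for illustration, not something your draft prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceUnit:
    """One source-preserving unit (hypothetical type for this sketch)."""
    kind: str      # e.g. "paragraph", "table_row", "step"
    text: str
    locator: str   # where the unit lives in the source, for citation


def split_units(doc_id: str, kind: str, raw: str) -> list[SourceUnit]:
    """Split raw text into units whose shape depends on the evidence type."""
    if kind == "prose":
        # Blank lines are assumed to separate paragraphs.
        parts = [p.strip() for p in raw.split("\n\n") if p.strip()]
        unit_kind = "paragraph"
    elif kind == "table":
        # The first line is assumed to be the header; it is repeated in every
        # row unit so that a single row stays interpretable on its own.
        lines = [r for r in raw.splitlines() if r.strip()]
        parts = [f"{lines[0]}\n{row}" for row in lines[1:]]
        unit_kind = "table_row"
    elif kind == "procedure":
        # Each non-empty line is treated as one troubleshooting or workflow step.
        parts = [s.strip() for s in raw.splitlines() if s.strip()]
        unit_kind = "step"
    else:
        # Unknown shapes fall back to a single document-level unit.
        parts = [raw.strip()]
        unit_kind = "document"
    return [
        SourceUnit(unit_kind, text, f"{doc_id}#{unit_kind}-{i}")
        for i, text in enumerate(parts, start=1)
    ]
```

Carrying a locator on every unit is what keeps citation and evidence inspection possible later, whichever splitting rule produced the unit.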
These expose exact language and exact identifiers.
Examples:
keyword index
n-gram index
identifier index
SKU / account / ticket / policy ID fields
quoted phrase index
acronym expansion index
synonym dictionary
domain vocabulary map
This matters because semantic search trades exactness for similarity: a near-match can outrank the passage that contains the exact identifier. A task such as “find policy SEC-17B” or “where is this error code mentioned?” should not rely only on embeddings.
Use when the task requires:
exact match
known-item lookup
identifier search
error-code search
legal or policy term matching
auditability
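A small sketch of an identifier index shows why exact lookup deserves its own surface. The regular expression is an assumption about what identifiers look like (it matches the "SEC-17B" example above), and the passages are invented sample data.

```python
import re
from collections import defaultdict

# Assumed identifier shape: 2-5 capital letters, a hyphen, digits, and an
# optional trailing letter. A real corpus would need its own patterns.
ID_PATTERN = re.compile(r"\b[A-Z]{2,5}-\d+[A-Z]?\b")


def build_identifier_index(passages: dict[str, str]) -> dict[str, set[str]]:
    """Map each exact identifier to the ids of the passages that mention it."""
    index: dict[str, set[str]] = defaultdict(set)
    for passage_id, text in passages.items():
        for identifier in ID_PATTERN.findall(text):
            index[identifier].add(passage_id)
    return dict(index)


# Invented sample passages for illustration.
passages = {
    "p1": "Policy SEC-17B requires encryption at rest.",
    "p2": "See SEC-17 for the earlier wording; SEC-17B supersedes it.",
    "p3": "Encryption keeps stored data safe.",
}
index = build_identifier_index(passages)
```

Here `index["SEC-17B"]` points at `p1` and `p2` only, and `SEC-17` stays a separate key. The topically related passage `p3` never appears, which is the behavior a known-item lookup needs.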
dense embeddings of passages
dense embeddings of sections
hybrid lexical-semantic representations
concept embeddings
domain-adapted embeddings
task-specific embeddings
These are useful when the user does not know the corpus vocabulary. They are weaker when authority, exactness, versioning, or structured constraints matter.
Use when the task requires:
conceptual discovery
approximate matching
policy/explanation lookup
finding related cases
matching user language to domain language
surface summary: what this section says
conceptual summary: what idea this section represents
operational summary: what the user can do with it
evidentiary summary: what claim this section supports
This is important: a summary is not just compression. It is a new access surface. It lets retrieval hit the gist, scope, or role of a source unit when the original wording is too detailed or too noisy.
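One way to picture a summary as an access surface is to store each summary as its own searchable record that points back to the source unit. In the sketch below the `SummarySurface` type, the `handbook#4.2` locator, and the summary texts are hypothetical, and plain word overlap stands in for whatever retriever you actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummarySurface:
    """A summary stored as its own searchable record (hypothetical type)."""
    source_id: str   # the source unit this summary points back to
    role: str        # "surface", "conceptual", "operational", or "evidentiary"
    text: str


def tokens(text: str) -> set[str]:
    """Lowercased words with trailing punctuation removed."""
    return {w.strip(".,?!").lower() for w in text.split()}


def search_summaries(
    query: str, surfaces: list[SummarySurface]
) -> list[tuple[str, str, int]]:
    """Rank summaries by shared words; overlap stands in for a real retriever."""
    q = tokens(query)
    scored = [(s.source_id, s.role, len(q & tokens(s.text))) for s in surfaces]
    return sorted((hit for hit in scored if hit[2] > 0), key=lambda hit: -hit[2])


# Hand-written summaries of one source unit; in practice these would be derived.
surfaces = [
    SummarySurface("handbook#4.2", "surface",
                   "Describes the travel reimbursement deadline."),
    SummarySurface("handbook#4.2", "operational",
                   "Tells employees how to file an expense claim after a trip."),
]
hits = search_summaries("how do I file an expense claim", surfaces)
```

Both records resolve to the same source unit, but only the operational summary matches this query, so the summary type decides which information needs can reach the source.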
Instead of only representing what the document says, generate representations of how users might look for it.
Examples:
synthetic questions
sample user queries
FAQ-style access points
hypothetical search intents
query paraphrases
problem statements this passage could answer
task descriptions supported by this source
This is a major category because user language and source language often diverge. Your own draft says raw corpus material is rarely searchable in the representation users need; source material is stored in one form, while users search through their own vocabulary, task context, and implicit intent.
Use when the task requires:
matching user phrasing to source material
supporting vague or underspecified queries
bridging novice language to expert corpus language
retrieving answers from documents that do not phrase things as questions
Example:
Source text:
"Employees may submit reimbursement requests within 30 days of travel completion."
Derived synthetic queries:
- "How long do I have to submit travel expenses?"
- "Can I still file a reimbursement after my trip?"
- "What is the deadline for expense claims?"
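The example above can be sketched as a small query-shaped index. The `QueryShapedEntry` type and the `travel-policy#reimbursement` locator are hypothetical, and Jaccard word overlap is only a stand-in for a real lexical or embedding match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryShapedEntry:
    """A derived query that points back to the passage able to answer it."""
    synthetic_query: str
    source_id: str


def tokens(text: str) -> set[str]:
    """Lowercased words with surrounding punctuation removed."""
    return {w.strip('.,?!"').lower() for w in text.split()}


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words the two sets have in common."""
    return len(a & b) / len(a | b) if a | b else 0.0


def best_source(user_query: str, entries: list[QueryShapedEntry]) -> tuple[str, float]:
    """Return the source id whose synthetic query is closest to the user query."""
    q = tokens(user_query)
    best = max(entries, key=lambda e: jaccard(q, tokens(e.synthetic_query)))
    return best.source_id, jaccard(q, tokens(best.synthetic_query))


SOURCE_ID = "travel-policy#reimbursement"  # hypothetical locator
SOURCE_TEXT = "Employees may submit reimbursement requests within 30 days of travel completion."
entries = [
    QueryShapedEntry("How long do I have to submit travel expenses?", SOURCE_ID),
    QueryShapedEntry("Can I still file a reimbursement after my trip?", SOURCE_ID),
    QueryShapedEntry("What is the deadline for expense claims?", SOURCE_ID),
]

user_query = "deadline for expense claims"
source_id, score = best_source(user_query, entries)
direct_overlap = jaccard(tokens(user_query), tokens(SOURCE_TEXT))
```

In this toy case the user query shares no words with the source sentence, yet it reaches the passage through the third synthetic query. That is the vocabulary gap this representation family is meant to close.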
Documents often contain atomic claims embedded in prose. Extracting them creates a representation suitable for verification, comparison, and contradiction detection.
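As a minimal sketch, assuming claims have already been extracted into subject, attribute, and value fields (the extraction step itself is not shown), contradiction detection becomes a grouping problem. The `Claim` type, the source ids, and the conflicting "60 days" claim are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """An atomic claim lifted out of prose (hypothetical schema)."""
    subject: str
    attribute: str
    value: str
    source_id: str


def find_contradictions(claims: list[Claim]) -> list[tuple[Claim, Claim]]:
    """Pair up claims that share subject and attribute but disagree on value."""
    grouped: dict[tuple[str, str], list[Claim]] = defaultdict(list)
    for claim in claims:
        grouped[(claim.subject, claim.attribute)].append(claim)
    conflicts: list[tuple[Claim, Claim]] = []
    for group in grouped.values():
        for i, first in enumerate(group):
            for second in group[i + 1:]:
                if first.value != second.value:
                    conflicts.append((first, second))
    return conflicts


claims = [
    Claim("reimbursement request", "submission window", "30 days", "travel-policy-current"),
    # Invented conflicting claim from a hypothetical older document.
    Claim("reimbursement request", "submission window", "60 days", "travel-policy-old"),
    Claim("reimbursement request", "trigger", "travel completion", "travel-policy-current"),
]
conflicts = find_contradictions(claims)
```

Keeping `source_id` on every claim means a detected conflict can be traced back to the two passages that disagree.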
Your draft already identifies graph traversal as a possible retrieval mechanism and graph structure as a corpus representation, but this can be elevated into a full representation family.
This matters because many retrieval failures come from selecting stale or duplicated material, not from generation failure.
13. Authority, provenance, and trust representations
These expose whether a source should be trusted.
Examples:
source owner
approval status
policy hierarchy
source type
canonicality
citation metadata
provenance chain
access permissions
review status
confidence score
Use when the task needs to:
verify authority
choose between conflicting sources
cite evidence
respect permissions
support auditability
This connects directly to your broader context model: acquired context may be stale, conflicting, unauthorized, or only partially relevant, so selection and structuring must preserve signals needed for trust.
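A rough sketch of how trust signals might be applied at selection time follows. The `TrustSignals` fields mirror a few of the examples listed above, while the field names, status values, roles, and source ids are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustSignals:
    """Trust metadata kept beside a retrievable source (hypothetical fields)."""
    source_id: str
    approval_status: str            # e.g. "approved" or "draft"
    canonical: bool                 # True for the designated source of record
    allowed_roles: frozenset[str]   # access permissions


def select_trusted(candidates: list[TrustSignals], user_role: str) -> list[TrustSignals]:
    """Keep approved sources the role may read; list canonical ones first."""
    usable = [
        c for c in candidates
        if c.approval_status == "approved" and user_role in c.allowed_roles
    ]
    # sorted() is stable, so retrieval order is preserved within each group.
    return sorted(usable, key=lambda c: not c.canonical)


# Invented candidates, listed in the order a retriever might return them.
candidates = [
    TrustSignals("wiki/expenses-notes", "approved", False, frozenset({"employee", "finance"})),
    TrustSignals("policy/expenses-v3", "approved", True, frozenset({"employee", "finance"})),
    TrustSignals("policy/expenses-v4-draft", "draft", True, frozenset({"finance"})),
    TrustSignals("finance/audit-memo", "approved", False, frozenset({"finance"})),
]
selected = [c.source_id for c in select_trusted(candidates, "employee")]
```

For the `employee` role, the draft and the finance-only memo are dropped and the canonical policy moves ahead of the wiki notes. Filtering before ranking keeps unauthorized material out of the candidate set entirely.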
The section could say that representations are derived along several dimensions:
1. Unit
What is being represented?
passage, section, document, table row, entity, claim, rule, event, workflow step
2. Transformation
How is it derived?
chunking, summarization, extraction, normalization, abstraction, generation, linking
3. Access mode
How will it be searched?
keyword, vector, hybrid, filter, graph traversal, SQL, table search, entity lookup
4. Task fit
What task need does it serve?
find, answer, verify, compare, summarize, decide, extract, route, act
5. Quality signals
What makes it safe to use?
provenance, authority, freshness, permissions, confidence, contradiction markers
6. Consumer fit
Who uses it downstream?
LLM, reranker, planner, verifier, workflow engine, UI, evaluator
That gives you a more systematic replacement for the shallow table.
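If it helps to see the six dimensions as a structure, the sketch below records them as fields of one spec. The `RepresentationSpec` name and the two sample specs are assumptions; the field values are drawn from the lists above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepresentationSpec:
    """One derived representation, described along the six dimensions."""
    unit: str                         # what is being represented
    transformation: str               # how it is derived
    access_mode: str                  # how it will be searched
    task_fit: tuple[str, ...]         # which task needs it serves
    quality_signals: tuple[str, ...]  # what makes it safe to use
    consumers: tuple[str, ...]        # who uses it downstream


# Two sample specs, chosen only to illustrate the shape.
SPECS = [
    RepresentationSpec(
        unit="passage",
        transformation="chunking",
        access_mode="vector",
        task_fit=("find", "answer"),
        quality_signals=("provenance", "freshness"),
        consumers=("LLM", "reranker"),
    ),
    RepresentationSpec(
        unit="claim",
        transformation="extraction",
        access_mode="keyword",
        task_fit=("verify", "compare"),
        quality_signals=("provenance", "contradiction markers"),
        consumers=("verifier",),
    ),
]


def specs_for_task(task: str, specs: list[RepresentationSpec]) -> list[RepresentationSpec]:
    """Return the representations declared as fitting a given task verb."""
    return [s for s in specs if task in s.task_fit]
```

Writing specs down this way makes gaps visible: if no spec lists "verify" or "compare" under task fit, the corpus has no surface for those tasks.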
You could replace the table with something like this:
Representation Design
A corpus should not be treated as one flat collection of embeddable text. Source material contains many latent structures: terms, entities, facts, claims, rules, tables, relationships, versions, authority signals, procedures, and examples of possible user intent. Representation design is the process of making these latent structures explicit and searchable.
A representation is therefore not merely a storage format. It is an access surface: a derived view of source material that supports a particular class of task-shaped information need.
Representations can be derived through several transformations: chunking, summarization, extraction, normalization, abstraction, generation, and linking.
Different representations expose different evidence shapes. A system that only embeds chunks exposes semantic similarity, but may fail on exact identifiers, tables, versions, authority, relationships, policy rules, or user phrasing. A stronger retrieval layer derives multiple representations from the same corpus and routes each task context requirement to the representation most likely to satisfy it.
The design question is not “which index should we build?” but:
What must be made searchable so that this class of task can obtain sufficient, trustworthy, fresh, and usable context?
That would make the section more structural and less like an example table.
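The routing step in the proposed text, where each task context requirement goes to the representation most likely to satisfy it, could be sketched with a few explicit rules. The `TaskContextRequirement` fields and the representation names are hypothetical, and a real router would likely weigh more signals than these three.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskContextRequirement:
    """What the task needs from retrieval (hypothetical fields)."""
    has_identifier: bool = False    # the request names an exact ID or code
    vague_phrasing: bool = False    # the request uses novice or loose wording
    needs_authority: bool = False   # the answer must come from a trusted source


def route(req: TaskContextRequirement) -> list[str]:
    """Return the representations to consult, in the order they are applied."""
    targets: list[str] = []
    if req.has_identifier:
        targets.append("identifier_index")
    if req.vague_phrasing:
        targets.append("synthetic_queries")
    if not targets:
        # With no stronger signal, fall back to semantic similarity.
        targets.append("passage_embeddings")
    if req.needs_authority:
        # Trust metadata is applied last, as a filter over whatever was found.
        targets.append("trust_metadata_filter")
    return targets
```

For instance, a request that names a policy ID and must be authoritative routes to the identifier index followed by the trust filter, and never touches the embedding index.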