Raj, a freelance PM building an AI-powered code reviewer, is facing a bottleneck. His goal is for the AI to review pull requests (PRs) by looking at the entire codebase to catch cross-file architectural issues.

The codebase is 500,000 lines of code. The AI’s context window—the "working memory" it can process in a single request—is 128,000 tokens (roughly 90,000 words).

"Raj," the engineering lead says, "we can't just 'send the codebase.' Even if we could, a full-context request will cost $15 per review and take 45 seconds. We need to decide what to 'forget' and what to 'remember' in every request."

Raj realized that Context Management isn't a technical optimization; it's a Product Strategy. What you include in the context window determines what the AI "sees," and what it sees determines the value it provides.

In AI product management, you aren't just managing features; you're managing the Real-Estate of Relevance.


1. What is the Context Window (And Why Should You Care)?

Think of the LLM like a genius with a very small desk. The model knows everything in its "training data" (its long-term memory), but it can only "think" about what is currently on its desk (the context window).

Every token (a word fragment, roughly four characters on average) on that desk costs you three things:

  1. Money: Input tokens are expensive.
  2. Latency: More tokens = slower processing.
  3. Accuracy ("Lost in the Middle"): Research shows that LLMs lose focus on information placed in the middle of a large context.
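
The three costs above can be sketched as a back-of-envelope calculator. The prices and generation speed below are illustrative assumptions, not any vendor's actual rates:

```python
# Back-of-envelope cost/latency estimate for one LLM request.
# Prices and token speed are assumptions for illustration only.
def estimate_request(input_tokens: int,
                     output_tokens: int,
                     price_per_1k_in: float = 0.01,   # assumed $/1k input tokens
                     price_per_1k_out: float = 0.03,  # assumed $/1k output tokens
                     tokens_per_second: float = 60.0) -> dict:
    cost = (input_tokens / 1000) * price_per_1k_in \
         + (output_tokens / 1000) * price_per_1k_out
    latency_s = output_tokens / tokens_per_second  # generation dominates latency
    return {"cost_usd": round(cost, 4), "latency_s": round(latency_s, 1)}

# A near-full 128k window with a 2k-token review as output:
print(estimate_request(input_tokens=120_000, output_tokens=2_000))
```

Even with these modest assumed prices, stuffing the window costs dollars per request at scale, which is why the curation strategy below matters.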

As a PM, your job is to define the Context Curation Strategy—the logic that decides which 5% of your data carries 95% of the signal the AI needs.


2. The Three Archetypes of Context Management

How you manage the "box" depends on your product's use case.

Archetype A: The "Infinite" Memory (RAG)

You keep everything in a database and only "pull" the relevant chunks into the context window based on the user's query.

  • When to use: Knowledge bases, customer support, large documentation search.
  • PM Decision: How large should the chunks be? 200 words? 1,000 words? (See Feature-to-Feasibility Guide).

Archetype B: The "Sliding" Window

You only keep the most recent interactions in the context. As new information comes in, the oldest information "slides" out of memory.

  • When to use: Chatbots, conversational interfaces.
  • PM Decision: How many turns of history do we keep? If the user refers to something said 20 minutes ago, should the AI remember it?
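
The sliding window can be sketched as a token-budgeted walk from newest to oldest turn. `count_tokens` below is a rough stand-in for a real tokenizer:

```python
# Sliding-window chat history: keep only the most recent turns that fit
# a token budget. Oldest turns "slide" out first.
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic: ~4 chars per token

def sliding_window(turns: list[str], budget: int) -> list[str]:
    kept, used = [], 0
    for turn in reversed(turns):       # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break                      # everything older is forgotten
        kept.append(turn)
        used += cost
    return list(reversed(kept))        # restore chronological order
```

The PM decision is the value of `budget`: it directly sets how far back the AI can "remember."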

Archetype C: The "Summary" Chain

Instead of keeping raw history, the AI periodically summarizes the previous context and only carries the summary forward.

  • When to use: Long, complex workflows or ongoing projects.
  • PM Decision: What is the "Critical Information" that must survive the summary?

3. The Economics of the Window: Token Budgeting

In 2026, the PM owns the Token Budget. You must specify how tokens are allocated across the request components.

The Typical Token Stack

  • System Instructions (The Spec): 500 - 2,000 tokens.
  • Grounding Data (The Context): 5,000 - 100,000 tokens.
  • User Query (The Signal): 50 - 500 tokens.
  • Output Buffer (The Result): 500 - 4,000 tokens.
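
The stack above can be sanity-checked with simple arithmetic. The allocations below are illustrative picks from those ranges, assuming a 128k-token window:

```python
# Example token stack for a 128k-token window. The numbers are
# illustrative; the point is that the budget must fit inside the
# window with headroom to spare.
WINDOW = 128_000

budget = {
    "system_instructions": 2_000,
    "grounding_data": 100_000,
    "user_query": 500,
    "output_buffer": 4_000,
}

used = sum(budget.values())
headroom = WINDOW - used
assert used <= WINDOW, "token stack overflows the context window"
print(f"used={used}, headroom={headroom}")
```

Note that grounding data dominates: shaving the instruction budget buys almost nothing, while trimming grounding data is where latency and cost wins actually come from.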

The Strategy: If you spend too many tokens on instructions, you have less room for data. If you use too much data, your latency kills the UX. (See AI Trade-offs Guide).


4. Optimization Technique: Pruning and Ranking

Instead of sending 10 documents, use a Reranker model to send the top 3 most relevant documents.

  • The Workflow: Retrieval (get 20 candidates) → Rerank (score them for relevance to the query) → Curation (send only the top 3 to the context window).
  • The PM's Role: Define the "Relevance Score." What makes a document relevant? Is it the date? The author? The keyword match?

5. The "Context-at-Home" vs. "Context-in-Flight"

Large context windows (e.g., Gemini’s 1M or 2M tokens) are tempting, but they create a Latency Trap.

As a PM, you must decide:

  • Context-at-Home: Pre-calculate the context (e.g., through fine-tuning or vector indexing). Faster, cheaper, but less dynamic.
  • Context-in-Flight: Send everything in the prompt. Slower, more expensive, but highly adaptive to the latest user state.

6. The Prodinja Angle: Autonomous Context Curation

Managing the "Real-Estate of Relevance" is the core of PRD Engine 2 at PMSynapse. Our Context Architect automatically analyzes your data sources and identifies the "Highest-Signal Chunks" for every PRD task.

It predicts the token cost before you ship and identifies where you can use "slimmer" context strategies to reduce latency without sacrificing quality. It ensures that the genius AI has exactly the right papers on its desk—no more, no less.

For the foundational guide on managing the teams that build these infrastructure-heavy features, see the Complete Guide to Stakeholder Management and the AI PM Pillar Guide.


Key Takeaways

  • The Window is Not Infinite: Even if it fits, "Lost in the Middle" and Latency will degrade your product.
  • Tokens are Your Marginal COGS: Every word you send is a business decision.
  • Choose Your Archetype: RAG for knowledge, Sliding Window for chat, Summarization for workflows.
  • Reranking is a High-Value Move: The best context management is "High-Quality Curation" before the AI ever sees the data.
  • PMs Define Relevance: Engineering builds the retrieval; the PM defines what constitutes a "good" piece of context.

References & Further Reading

  1. Lost in the Middle: How Language Models Use Long Contexts (Stanford Research)
  2. Token Economics for AI Product Managers (a16z Blog)
  3. Vector Databases and the Future of Context (TechCrunch)