The Problem

Multi-tenant LLM applications face a fundamental contradiction. On one side, you want to serve each user a system prompt tailored to their role, their subscription tier, their active skill set — progressive disclosure of capabilities as the user earns or unlocks them. On the other, your inference bill is quietly being destroyed by the fact that KV-cache only works when the prefix is identical across requests.

The moment you inject user.role = "admin" or skills: ["web_search", "code_exec"] into a system prompt, you’ve broken cache reuse for every other user who doesn’t share that exact string. At scale, this isn’t a rounding error — it’s the difference between a $400/month inference bill and a $4,000 one.

The problem isn’t that progressive disclosure is wrong. It’s that injecting dynamic content into the system prompt is the wrong place to do it.

How KV-Cache Actually Works

The KV-cache stores the key and value tensors that attention computes for a given token sequence. When a subsequent request begins with the same prefix, the model skips recomputing those tokens entirely — you only pay for the new tokens appended at the end.

Anthropic’s prompt caching pricing reflects this: cached input tokens cost roughly 10x less than uncached ones. OpenAI’s automatic prompt caching has similar mechanics. The implication is stark: a static system prompt shared by every user in a tier is worth engineering around.
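As a rough sketch of what that discount means per request — the per-token rate below is a made-up placeholder, and the 10x ratio is the approximation from above, not a published price sheet:

```typescript
// Illustrative cost model: cached input tokens are billed at a fraction of
// the uncached rate. Both rates here are assumed placeholders.
const UNCACHED_RATE = 3.0e-6; // $ per uncached input token (assumed)
const CACHED_RATE = UNCACHED_RATE / 10; // cached reads ~10x cheaper (assumed)

function requestCost(
  prefixTokens: number,
  suffixTokens: number,
  cacheHit: boolean,
): number {
  // On a miss, every token is billed at the full rate.
  if (!cacheHit) return (prefixTokens + suffixTokens) * UNCACHED_RATE;
  // On a hit, only the new suffix pays full price.
  return prefixTokens * CACHED_RATE + suffixTokens * UNCACHED_RATE;
}

// A 4,000-token shared prefix with a 200-token user message:
const miss = requestCost(4000, 200, false);
const hit = requestCost(4000, 200, true);
```

For this prefix/suffix shape the hit costs one-seventh of the miss — the longer the static prefix relative to the per-user suffix, the closer you get to the full 10x.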

/* Naive approach — breaks cache for every user */
const systemPrompt = buildSystemPrompt({
  role: user.role,           // "admin" | "viewer" | "editor"
  skills: user.activeSkills, // dynamic array
  orgName: user.org.name,    // "Acme Corp"
});

/* Every user gets a unique prefix → 0% cache hits */

What Breaks the Cache

Any token that varies between users, if placed before the stable content, breaks cache reuse for everything after it. This includes user names, role labels, dynamic skill lists, org names, timestamps, and session IDs — all things developers instinctively put at the top of a system prompt.
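A quick way to see the damage: the reusable span is exactly the longest common prefix between two requests. A sketch, using characters as a stand-in for tokens (real caches operate on token sequences, and the prompt strings here are illustrative):

```typescript
// Length of the shared prefix between two prompts — everything after the
// first differing position must be recomputed on the next request.
function sharedPrefixLength(a: string, b: string): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

const base = "You are a helpful assistant for ";
const promptA = base + "Acme Corp. Role: admin.";
const promptB = base + "Globex Inc. Role: viewer.";

// Only the static preamble is shared; the interpolated org name breaks
// cache reuse for everything after it.
sharedPrefixLength(promptA, promptB); // === base.length
```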

The Fix: Move Dynamic Content to Tool Results

The insight is that the system prompt doesn’t need to carry the dynamic content. The conversation does.

Keep your system prompt static and role-tier-scoped. Serve dynamic, user-specific context as the first assistant turn or as a synthetic tool result injected before the user’s first message.

This preserves a long, cacheable system prompt prefix while still giving the model the user-specific context it needs. The cache hit window is everything up to the first user message — which is where all the expensive prompt engineering lives.

/* Better approach — static prefix, dynamic tool result */
const messages = [
  {
    role: "user",
    content: [
      {
        type: "tool_result",
        tool_use_id: "ctx_init",
        content: JSON.stringify({
          user_role: user.role,
          active_skills: user.activeSkills,
          org: user.org.name,
        }),
      },
      { type: "text", text: userMessage },
    ],
  },
];

/* System prompt stays identical for all users in tier → cache hits */

The model treats the tool result as factual context about the current session: it reads it, respects it, and reasons from it, just as it would if the same fields were injected into the system prompt. The difference is entirely architectural.
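On Anthropic's API specifically, you can go further and mark the static prefix with an explicit cache_control breakpoint. A request-body sketch — the model id and prompt text are placeholders, the synthetic tool_use_id follows the pattern above, and the cache_control field shape comes from Anthropic's prompt caching documentation:

```typescript
// The static, tier-scoped system prompt is marked as a cache breakpoint;
// the per-user context rides in the first user turn, after the cached span.
const staticSystemPrompt = "You are the Pro-tier assistant ..."; // placeholder

const body = {
  model: "claude-3-5-sonnet-latest", // placeholder model id
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: staticSystemPrompt,
      cache_control: { type: "ephemeral" }, // cache everything up to here
    },
  ],
  messages: [
    {
      role: "user",
      content: [
        {
          type: "tool_result",
          tool_use_id: "ctx_init", // synthetic id, as in the pattern above
          content: JSON.stringify({ user_role: "admin" }),
        },
        { type: "text", text: "Hello!" },
      ],
    },
  ],
};
```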

Structuring Tiers for Maximum Cache Reuse

With this pattern established, you can now think about your system prompts as tier-scoped shared resources rather than per-user strings.

  • Free tier — one static system prompt, shared by all free users
  • Pro tier — one static system prompt with expanded capabilities described
  • Enterprise tier — one static system prompt per org vertical

User-specific state (name, preferences, recent context) all flows through the conversation as tool results or injected assistant turns. You get the personalization without the cache fragmentation.
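Tier scoping can then be as simple as a lookup from tier to a fixed prompt string — a sketch, with illustrative tier names and prompt text:

```typescript
// One static prompt per tier. No user fields are ever interpolated into
// these strings, so every user in a tier shares the same cacheable prefix.
type Tier = "free" | "pro" | "enterprise";

const TIER_PROMPTS: Record<Tier, string> = {
  free: "You are a helpful assistant. Capabilities: chat only.",
  pro: "You are a helpful assistant. Capabilities: chat, web search, code execution.",
  enterprise: "You are a helpful assistant configured for enterprise deployments.",
};

function systemPromptFor(tier: Tier): string {
  // Deliberately takes no user argument — that is the whole point.
  return TIER_PROMPTS[tier];
}
```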


This pattern emerged from a real cost audit on a multi-tenant chatbot running ~50,000 sessions/day. Moving from per-user system prompt injection to tier-scoped static prompts reduced the effective uncached token ratio from 94% to around 18% — a significant reduction in inference spend without any change to user-facing behavior.