The Model Sprawl Problem

Every production LLM application eventually hits the same inflection point: one model isn’t enough.

You start with GPT-4 for everything. Then you realize you’re paying $60/million input tokens for tasks that a 70B open-source model handles equally well. You add Claude for long-context summarization. Mistral for structured extraction. A fine-tuned Llama for your domain-specific classifier.

Suddenly your codebase is littered with provider-specific SDK calls, each with different retry logic, different auth patterns, different response shapes.

[Figure: LLM routing architecture diagram]

The real cost of multi-model isn’t the inference bill — it’s the engineering overhead of maintaining N different integrations.

What a Router Actually Does

An LLM router sits between your application and the providers. Every request passes through it, and the router decides:

  1. Which model handles this request
  2. Which provider serves that model (you might have Claude on both Anthropic and AWS Bedrock)
  3. What fallback chain activates if the primary choice fails
# Without a router — hardcoded, fragile
if task_type == "summarize" and token_count > 100_000:
    response = anthropic.messages.create(model="claude-3-opus-20240229", ...)
elif task_type == "extract":
    response = openai.chat.completions.create(model="gpt-4-turbo", ...)
elif task_type == "classify":
    response = together.chat.completions.create(model="meta-llama/...", ...)

# With a router — declarative, observable
response = router.completion(
    model=resolve_model(task_type, token_count),
    messages=messages,
)

The second version looks simpler, but the real value is in what resolve_model can do: read from a config that changes without deploys, apply cost ceilings, respect rate limits, and log everything to your observability stack.

The Routing Table Pattern

The core abstraction is a routing table — a mapping from request attributes to model selections:

ROUTING_TABLE = {
    "summarize": {
        "default": "claude-sonnet-4-20250514",
        "long_context": "claude-sonnet-4-20250514",  # 200K window
        "budget": "gpt-4o-mini",
    },
    "extract": {
        "default": "gpt-4-turbo",
        "high_accuracy": "claude-sonnet-4-20250514",
        "budget": "mistral-large-latest",
    },
    "classify": {
        "default": "ft:llama-3-70b:my-org:classifier:abc123",
        "fallback": "gpt-4o-mini",
    },
    "generate": {
        "default": "claude-sonnet-4-20250514",
        "creative": "claude-sonnet-4-20250514",
        "fast": "gpt-4o-mini",
    },
}

This table lives in a config file or database — not in your application code. When Anthropic drops prices or you finish fine-tuning a new model, you update the table. No redeploy needed.
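A resolve_model lookup over this table can be a few lines of Python. The tier-selection rules below (the 100K-token threshold, the budget_mode flag) are illustrative assumptions, not fixed conventions — your own rules will come from the config:

```python
# Sketch of resolve_model over a routing table (subset of the table above).
# The tier rules — 100K-token threshold, budget_mode flag — are illustrative
# assumptions; in practice they'd be read from config, not hardcoded.
ROUTING_TABLE = {
    "summarize": {
        "default": "claude-sonnet-4-20250514",
        "long_context": "claude-sonnet-4-20250514",
        "budget": "gpt-4o-mini",
    },
    "classify": {
        "default": "ft:llama-3-70b:my-org:classifier:abc123",
        "fallback": "gpt-4o-mini",
    },
}

def resolve_model(task_type: str, token_count: int, budget_mode: bool = False) -> str:
    tiers = ROUTING_TABLE[task_type]
    if budget_mode and "budget" in tiers:
        return tiers["budget"]
    if token_count > 100_000 and "long_context" in tiers:
        return tiers["long_context"]
    return tiers["default"]
```

Because the lookup is pure data, swapping a model means editing the table, and the function never changes.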

LiteLLM as the Unification Layer

In our stack, LiteLLM serves as the provider abstraction. It normalizes 100+ providers behind OpenAI’s API shape:

import litellm

# Same interface regardless of provider
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",  # or "gpt-4-turbo", "together_ai/meta-llama/..."
    messages=[{"role": "user", "content": prompt}],
    metadata={"request_id": req_id, "task": "summarize"},
)

The metadata field is critical — it flows through to your observability layer (we use Langfuse) so you can track cost-per-task, latency-per-model, and error rates per provider.

Fallback Chains and Circuit Breaking

The routing table defines the happy path. Fallback chains handle everything else:

FALLBACK_CHAINS = {
    "claude-sonnet-4-20250514": ["claude-sonnet-4-20250514", "gpt-4-turbo"],
    "gpt-4-turbo": ["gpt-4-turbo", "claude-sonnet-4-20250514"],
    "ft:llama-3-70b:*": ["gpt-4o-mini"],  # fine-tuned model → generic fallback
}

When the primary model returns a 429 (rate limited), 529 (overloaded), or times out, the router transparently retries with the next model in the chain. The application never sees the failure.
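The retry walk over a chain is a short loop. In this sketch, ProviderError and the call_model stub are placeholders for your client library's real error taxonomy (429s, 529s, timeouts):

```python
# Sketch of a fallback walk. ProviderError and call_model are placeholders
# for your provider client's real exceptions and completion call.
class ProviderError(Exception):
    """Stands in for retryable failures: 429, 529, timeouts."""

FALLBACK_CHAINS = {
    "gpt-4-turbo": ["gpt-4-turbo", "claude-sonnet-4-20250514"],
}

def complete_with_fallback(model: str, messages: list, call_model) -> str:
    chain = FALLBACK_CHAINS.get(model, [model])
    last_error = None
    for candidate in chain:
        try:
            return call_model(candidate, messages)
        except ProviderError as err:
            last_error = err  # retryable failure: move to the next model
    raise last_error  # every model in the chain failed
```

The caller only sees an exception when the entire chain is exhausted — that is what "the application never sees the failure" means in practice.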

Add a circuit breaker on top: if a provider fails 3 times in 60 seconds, skip it entirely for the next 5 minutes. This prevents cascading timeouts from dragging down your entire system while you wait for the provider to recover.

Cost Controls

The routing layer is also where you enforce budgets:

class CostGuard:
    def __init__(self, daily_limit_usd: float = 100.0):
        self.daily_limit = daily_limit_usd

    async def check(self, model: str, estimated_tokens: int) -> bool:
        today_spend = await get_daily_spend()  # from Langfuse
        estimated_cost = estimate_cost(model, estimated_tokens)  # tokens × per-model pricing

        if today_spend + estimated_cost > self.daily_limit:
            # Downgrade to cheaper model instead of failing
            return False
        return True

When the budget threshold is hit, the router doesn’t reject the request — it downgrades to a cheaper model. The user gets a slightly less capable response instead of an error. This is a product decision encoded in infrastructure.
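The downgrade itself can be a lookup keyed off the CostGuard result. The downgrade pairs here are illustrative — in practice you would pick the cheaper model per task from the routing table:

```python
# Sketch of budget-driven downgrading. The downgrade pairs are
# illustrative assumptions; choose them per task from your routing table.
DOWNGRADE_MAP = {
    "claude-sonnet-4-20250514": "gpt-4o-mini",
    "gpt-4-turbo": "gpt-4o-mini",
}

def apply_cost_guard(model: str, within_budget: bool) -> str:
    """Return the requested model, or its cheaper stand-in once the
    daily budget check fails. Models with no mapping pass through."""
    if within_budget:
        return model
    return DOWNGRADE_MAP.get(model, model)
```

The router calls CostGuard.check, then routes to apply_cost_guard's result — the request always completes, just sometimes on a cheaper model.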

Observability: The Missing Piece

A router without observability is a black box. Every routed request should produce a trace with:

  • Model selected and why (which routing rule matched)
  • Latency (time to first token, total time)
  • Token counts (input, output, cached)
  • Cost (calculated from token counts × model pricing)
  • Success/failure and fallback activations

We pipe all of this into Langfuse, which gives us dashboards like “cost per task type over time” and “P95 latency by model.” When we see that gpt-4-turbo extract tasks have crept above 8 seconds P95, we know it’s time to evaluate alternatives.

When Not to Route

Not every application needs a router. If you’re using one model, one provider, and your monthly bill is under $500 — just call the API directly. The complexity of a routing layer isn’t free.

Build a router when:

  • You use 2+ providers in production
  • Your monthly LLM spend exceeds $2,000
  • You need fallback resilience (can’t afford downtime when a provider has an outage)
  • You want to A/B test models without code changes

Implementation Checklist

If you’re building this, here’s the sequence that works:

  1. Start with LiteLLM proxy — deploy it as a sidecar or standalone service
  2. Define your routing table in a YAML config, not code
  3. Add Langfuse callbacks for cost and latency tracking from day one
  4. Implement fallback chains for your top 2 most-used models
  5. Add cost guards once you have 2 weeks of spend data to set reasonable thresholds
  6. Circuit breakers come last — you need failure data to tune the thresholds
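Steps 1–3 can start from a config along these lines — a sketch of the LiteLLM proxy's config shape (the model aliases are our own; check the current LiteLLM docs for the exact keys before deploying):

```yaml
# Sketch of a LiteLLM proxy config. Aliases like summarize-default are
# illustrative; key names should be verified against current LiteLLM docs.
model_list:
  - model_name: summarize-default          # alias your application requests
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: summarize-budget
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

litellm_settings:
  success_callback: ["langfuse"]           # cost/latency tracking (step 3)
```

Because the application only ever asks for aliases like summarize-default, repointing an alias at a new model is a config change, not a deploy.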

The whole system can be production-ready in a week. The hard part isn’t the code — it’s deciding which model is “good enough” for each task, and that’s a product conversation, not an engineering one.