Model Registry & Providers

ContextRouter never instantiates LLM provider SDKs directly. All AI generation is routed through the unified Model Registry.

Model Keys

Models are selected by registry key: "<provider>/<name>"

Provider	Key Pattern	Modalities
Vertex AI	`vertex/gemini-2.5-flash`	Text + Image + Audio + Video
OpenAI	`openai/gpt-5-mini`	Text + Image + Audio (ASR)
Anthropic	`anthropic/claude-sonnet-4`	Text + Image
Groq	`groq/llama-3.3-70b-versatile`	Text + Image + Audio (ASR)
Perplexity	`perplexity/sonar`	Text
OpenRouter	`openrouter/openai/gpt-5.1`	Text + Image
RLM	`rlm/gpt-5-mini`	Text
Local (Ollama)	`local/llama3.1`	Text + Image
Local (vLLM)	`local-vllm/meta-llama/Llama-3.1-8B-Instruct`	Text + Image
RunPod	`runpod/custom-model`	Text + Image
HuggingFace	`hf/distilgpt2`	Task-dependent
HF Hub	`hf-hub/model-name`	Task-dependent
OpenAI Batch	`openai-batch/gpt-5-mini`	Text

Multimodal Interface

All providers share a unified multimodal contract:

from contextunity.router.modules.models.types import ModelRequest, TextPart, ImagePart

# Text-only request
request = ModelRequest(
    parts=[TextPart(text="Hello, world!")],
    system="You are a helpful assistant",
    temperature=0.7,
)

# Multimodal request (text + image)
request = ModelRequest(
    parts=[
        TextPart(text="What's in this image?"),
        ImagePart(mime="image/jpeg", data_b64="...", uri="https://example.com/image.jpg"),
    ]
)

Fallback System

Strategies

Strategy	Behavior	Streaming
`fallback` (sequential)	Try candidates in order	Same — never switch mid-stream
`parallel`	Run all concurrently, return first success	Falls back to sequential
`cost-priority`	Same as `fallback`; order cheapest → most expensive	Sequential

Error Handling

# Quota exhaustion → immediate fallback (no retries)
except ModelQuotaExhaustedError:
    continue

# Rate limiting → fallback with delay
except ModelRateLimitError:
    continue

Project vs Global Fallback

Project: Specifies fallback_keys in contextunity.project.yaml manifest — per-node control
Global: CU_ROUTER_ALLOW_GLOBAL_FALLBACK=true + CU_ROUTER_FALLBACK_LLMS — safety net

If a node exhausts its fallback_keys and no global fallback is configured, the request fails gracefully.

Model Type Hierarchy

BaseModel (Identity: provider, model_name, model_key, _provider_info)
├── BaseLLM (Text generation: generate(), stream(), retry loop)
│   ├── OpenAILLM, AnthropicLLM, VertexLLM, GroqLLM, ...
│   └── FallbackModel (orchestrator: budget_usd, candidate sequencing)
└── BaseEmbeddings (Vector: embed_text(), embed_batch())

BaseLLM centralises:

Auto cost estimation — calls estimate_cost() when provider returns total_cost=None
ProviderInfo auto-attach — no per-provider boilerplate
Retry loop — backoff, trigger classification, timeout budget

Retry & Cost Governance

Three-tier retry hierarchy

Tier	Source	Example
Global	`policy.models.retry`	Default for all model types
Per-type	`policy.models.llm.retry`	Override for LLMs specifically
Per-node	`node.config.retry`	Highest priority, e.g. JSON-output nodes

policy:
  models:
    retry:
      max_attempts: 2
      backoff: exponential
      timeout_sec: 30
    llm:
      retry:
        retry_on: [rate_limit, timeout, network, response_format]
        max_attempts: 3

nodes:
  - name: planner
    config:
      output_format: "json"
      retry:
        max_attempts: 3
        retry_on: [rate_limit, timeout, network, response_format]

Error taxonomy

Error	Retryable	Fallbackable	Trigger key
`ConnectionError` / 5xx	✅	✅	`network`
`ModelTimeoutError`	✅	✅	`timeout`
`ModelRateLimitError`	✅	✅	`rate_limit`
`ModelResponseFormatError`	✅ (opt-in)	✅	`response_format`
`ModelQuotaExhaustedError`	❌	✅	—
`ModelBudgetExceededError`	❌	❌ (hard stop)	—

Per-node cost budget

Set budget_usd in ModelsPolicy to cap total cost across all model candidates (primary + fallbacks + retries). Exceeding the budget triggers ModelBudgetExceededError — a hard stop with no fallback.

policy:
  models:
    budget_usd: 0.50
    llm:
      default: openai/gpt-5-mini
      fallback: [vertex/gemini-2.5-flash]

Retry vs Fallback flow

FallbackModel.generate(request, budget_usd=0.50)
  └── for each candidate:
        ├── model.generate(request, retry_policy=...)
        │     └── BaseLLM retry loop (backoff, trigger check)
        ├── success → track cumulative cost → check budget_usd
        ├── ModelBudgetExceededError → HARD STOP
        ├── catch ModelError → next candidate
        └── all exhausted → ModelExhaustedError

Reasoning Models (gpt-5, o1, o3)

Reasoning models require special handling:

# Use max_completion_tokens, not max_tokens
# Include extra budget for chain-of-thought reasoning
if is_reasoning_model:
    bind_kwargs["max_completion_tokens"] = 8000  # 4k reasoning + 4k response
    # Temperature must be 1 for reasoning model API

RLM (Recursive Language Models)

RLM wraps any base LLM with recursive REPL capabilities. For processing massive contexts (50k+ items) where standard LLMs experience context degradation.

Reference: arXiv:2512.24601 | GitHub

Key Benefits:

GPT-5-mini with RLM outperforms GPT-5 on long-context tasks
Context stored as Python variable, not in prompt
Model can grep, filter, iterate, and recursively analyze
60-70% cost reduction for bulk processing

model = model_registry.create_llm(
    "rlm/gpt-5-mini",
    config=config,
    environment="docker",  # Isolated execution (recommended)
)

Environment	Use Case	Safety
`local`	Development	Low (same process)
`docker`	Production	High (isolated container)
`modal`	Cloud scaling	High

Local Models

vLLM (OpenAI-compatible)

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --port 8000

Set LOCAL_VLLM_BASE_URL=http://localhost:8000/v1. Use local-vllm/meta-llama/Llama-3.1-8B-Instruct.

Ollama

ollama serve && ollama pull llama3.1

Set LOCAL_OLLAMA_BASE_URL=http://localhost:11434/v1. Use local/llama3.1.

API Key Resolution

Router resolves keys via two-tier fallback:

Priority	Source	Path
1	Shield	`{tenant}/api_keys/{provider}`
2	Router env	`OPENAI_API_KEY`, etc.