Model Registry & Providers
ContextRouter never instantiates LLM provider SDKs directly. All AI generation is routed through the unified Model Registry.
Model Keys
Models are selected by registry key: "<provider>/<name>"
| Provider | Key Pattern | Modalities |
|---|---|---|
| Vertex AI | vertex/gemini-2.5-flash | Text + Image + Audio + Video |
| OpenAI | openai/gpt-5-mini | Text + Image + Audio (ASR) |
| Anthropic | anthropic/claude-sonnet-4 | Text + Image |
| Groq | groq/llama-3.3-70b-versatile | Text + Image + Audio (ASR) |
| Perplexity | perplexity/sonar | Text |
| OpenRouter | openrouter/openai/gpt-5.1 | Text + Image |
| RLM | rlm/gpt-5-mini | Text |
| Local (Ollama) | local/llama3.1 | Text + Image |
| Local (vLLM) | local-vllm/meta-llama/Llama-3.1-8B-Instruct | Text + Image |
| RunPod | runpod/custom-model | Text + Image |
| HuggingFace | hf/distilgpt2 | Task-dependent |
| HF Hub | hf-hub/model-name | Task-dependent |
| OpenAI Batch | openai-batch/gpt-5-mini | Text |
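
Several keys in the table contain slashes inside the model name (for example openrouter/openai/gpt-5.1), so a key has to be split on the first slash only. A minimal sketch of that parsing, assuming a hypothetical helper `split_model_key` that is not part of the registry API:

```python
def split_model_key(key: str) -> tuple[str, str]:
    """Split "<provider>/<name>" on the first slash only (hypothetical helper)."""
    provider, sep, name = key.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>/<name>', got {key!r}")
    return provider, name


# The provider is everything before the first slash; the rest is the model name.
provider, name = split_model_key("openrouter/openai/gpt-5.1")
```

Here `provider` is "openrouter" and `name` is "openai/gpt-5.1"; the same rule keeps local-vllm/meta-llama/Llama-3.1-8B-Instruct intact.
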
Multimodal Interface
All providers share a unified multimodal contract:

    from contextunity.router.modules.models.types import ModelRequest, TextPart, ImagePart

    # Text-only request
    request = ModelRequest(
        parts=[TextPart(text="Hello, world!")],
        system="You are a helpful assistant",
        temperature=0.7,
    )

    # Multimodal request (text + image)
    request = ModelRequest(
        parts=[
            TextPart(text="What's in this image?"),
            ImagePart(mime="image/jpeg", data_b64="...", uri="https://example.com/image.jpg"),
        ]
    )

Fallback System
Strategies
| Strategy | Behavior | Streaming |
|---|---|---|
| fallback (sequential) | Try candidates in order | Same; never switches mid-stream |
| parallel | Run all concurrently, return first success | Falls back to sequential |
| cost-priority | Same as fallback, ordered cheapest → most expensive | Sequential |
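
The parallel strategy ("run all concurrently, return first success") could be sketched with the standard library as follows. This is an illustration under assumptions, not the router's implementation: `first_success` is a hypothetical name, and each candidate is modelled as a plain zero-argument callable.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Sequence


def first_success(candidates: Sequence[Callable[[], str]]) -> str:
    """Run all candidates concurrently; return the first result that does not raise."""
    errors: list[Exception] = []
    pool = ThreadPoolExecutor(max_workers=max(1, len(candidates)))
    try:
        futures = [pool.submit(candidate) for candidate in candidates]
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception as exc:  # a failed candidate just means "wait for the next one"
                errors.append(exc)
        raise RuntimeError(f"all {len(candidates)} candidates failed: {errors}")
    finally:
        # Do not block on slower candidates; calls already running finish in the background.
        pool.shutdown(wait=False, cancel_futures=True)
```

Streaming is deliberately left out: as the table notes, a stream cannot switch candidates midway, so a parallel policy degrades to sequential there.
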
Error Handling

    # Quota exhaustion → immediate fallback (no retries)
    except ModelQuotaExhaustedError:
        continue

    # Rate limiting → fallback with delay
    except ModelRateLimitError:
        continue

Project vs Global Fallback
- Project: Specifies fallback_keys in the contextunity.project.yaml manifest (per-node control)
- Global: CU_ROUTER_ALLOW_GLOBAL_FALLBACK=true + CU_ROUTER_FALLBACK_LLMS (safety net)
If a node exhausts its fallback_keys and no global fallback is configured, the request fails gracefully.
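
One way to picture how the two levels combine is as a single ordered candidate list. The sketch below is an assumption-laden illustration: `resolve_candidates` is a hypothetical helper, and the comma-separated format of CU_ROUTER_FALLBACK_LLMS is a guess, not something the text above specifies.

```python
from typing import Mapping, Sequence


def resolve_candidates(
    primary: str,
    fallback_keys: Sequence[str],
    env: Mapping[str, str],
) -> list[str]:
    """Order candidates: primary, then per-node fallbacks, then the global safety net."""
    candidates = [primary, *fallback_keys]
    if env.get("CU_ROUTER_ALLOW_GLOBAL_FALLBACK", "").lower() == "true":
        # Assumption: the variable holds a comma-separated list of registry keys.
        raw = env.get("CU_ROUTER_FALLBACK_LLMS", "")
        candidates += [key.strip() for key in raw.split(",") if key.strip()]
    # Drop duplicates while keeping the first occurrence of each key.
    return list(dict.fromkeys(candidates))
```

Passing `os.environ` as `env` would read the real process environment; a plain dict makes the behaviour easy to test.
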
Model Type Hierarchy

    BaseModel (Identity: provider, model_name, model_key, _provider_info)
    ├── BaseLLM (Text generation: generate(), stream(), retry loop)
    │   ├── OpenAILLM, AnthropicLLM, VertexLLM, GroqLLM, ...
    │   └── FallbackModel (orchestrator: budget_usd, candidate sequencing)
    └── BaseEmbeddings (Vector: embed_text(), embed_batch())

BaseLLM centralises:
- Auto cost estimation: calls estimate_cost() when the provider returns total_cost=None
- ProviderInfo auto-attach: no per-provider boilerplate
- Retry loop: backoff, trigger classification, timeout budget
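
To make the retry-loop bullet concrete, here is a minimal standalone sketch of exponential backoff inside a time budget. It is not BaseLLM's code: `call_with_retry`, `TransientError`, and the `base_delay` parameter are hypothetical names chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable error (rate limit, timeout, network)."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    timeout_sec: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying TransientError with exponential backoff within timeout_sec."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    deadline = time.monotonic() + timeout_sec
    attempt = 1
    while True:
        try:
            return fn()
        except TransientError:
            delay = base_delay * 2 ** (attempt - 1)  # doubles on every failed attempt
            # Give up when attempts are used up or the next wait would overrun the budget.
            if attempt >= max_attempts or time.monotonic() + delay > deadline:
                raise
            sleep(delay)
            attempt += 1
```

The `sleep` parameter exists only so the waiting can be replaced in tests; production code would leave it at the default.
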
Retry & Cost Governance
Three-tier retry hierarchy
| Tier | Source | Example |
|---|---|---|
| Global | policy.models.retry | Default for all model types |
| Per-type | policy.models.llm.retry | Override for LLMs specifically |
| Per-node | node.config.retry | Highest priority, e.g. JSON-output nodes |
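
The precedence in the table can be read as a layered merge in which later tiers override earlier ones. The sketch below assumes a key-by-key (shallow) merge, which the table does not spell out; `effective_retry` is a hypothetical helper.

```python
from typing import Any, Mapping, Optional


def effective_retry(
    global_retry: Mapping[str, Any],
    type_retry: Optional[Mapping[str, Any]] = None,
    node_retry: Optional[Mapping[str, Any]] = None,
) -> dict[str, Any]:
    """Merge retry settings: global < per-type < per-node (assumed shallow merge)."""
    merged = dict(global_retry)
    for override in (type_retry, node_retry):
        if override:
            merged.update(override)  # higher tiers replace individual keys
    return merged


policy = effective_retry(
    {"max_attempts": 2, "backoff": "exponential", "timeout_sec": 30},
    type_retry={"max_attempts": 3, "retry_on": ["rate_limit", "timeout", "network"]},
)
```

Under this assumption an LLM node inherits `backoff` and `timeout_sec` from the global tier while taking `max_attempts` from the per-type tier.
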

    policy:
      models:
        retry:
          max_attempts: 2
          backoff: exponential
          timeout_sec: 30
        llm:
          retry:
            retry_on: [rate_limit, timeout, network, response_format]
            max_attempts: 3

    nodes:
      - name: planner
        config:
          output_format: "json"
          retry:
            max_attempts: 3
            retry_on: [rate_limit, timeout, network, response_format]

Error taxonomy
| Error | Retryable | Fallbackable | Trigger key |
|---|---|---|---|
| ConnectionError / 5xx | ✅ | ✅ | network |
| ModelTimeoutError | ✅ | ✅ | timeout |
| ModelRateLimitError | ✅ | ✅ | rate_limit |
| ModelResponseFormatError | ✅ (opt-in) | ✅ | response_format |
| ModelQuotaExhaustedError | ❌ | ✅ | n/a |
| ModelBudgetExceededError | ❌ | ❌ (hard stop) | n/a |
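
The taxonomy translates naturally into a lookup table. The sketch below encodes the rows above by error name and assumes that `retry_on` acts as an allow-list of trigger keys; `ErrorRule`, `should_retry`, and `should_fallback` are hypothetical names, not router APIs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ErrorRule:
    retryable: bool
    fallbackable: bool
    trigger: Optional[str]  # trigger key used in retry_on, if any


TAXONOMY = {
    "ConnectionError": ErrorRule(True, True, "network"),
    "ModelTimeoutError": ErrorRule(True, True, "timeout"),
    "ModelRateLimitError": ErrorRule(True, True, "rate_limit"),
    "ModelResponseFormatError": ErrorRule(True, True, "response_format"),  # opt-in
    "ModelQuotaExhaustedError": ErrorRule(False, True, None),
    "ModelBudgetExceededError": ErrorRule(False, False, None),  # hard stop
}


def should_retry(error_name: str, retry_on: Sequence[str]) -> bool:
    """Retry only retryable errors whose trigger key is listed in retry_on (assumption)."""
    rule = TAXONOMY[error_name]
    return rule.retryable and rule.trigger in retry_on


def should_fallback(error_name: str) -> bool:
    """Everything except a budget overrun may move on to the next candidate."""
    return TAXONOMY[error_name].fallbackable
```

With `retry_on=["rate_limit", "timeout", "network"]`, a ModelResponseFormatError is not retried, which matches its opt-in status in the table.
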
Per-node cost budget
Set budget_usd in ModelsPolicy to cap total cost across all model candidates (primary + fallbacks + retries). Exceeding the budget triggers ModelBudgetExceededError — a hard stop with no fallback.
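
A cap like this amounts to a running total that is checked after every call. The following sketch shows the idea with a local stand-in exception that merely shares the name ModelBudgetExceededError; `BudgetTracker` is a hypothetical class, and checking the cap after adding each cost is an assumption about ordering.

```python
class ModelBudgetExceededError(Exception):
    """Local stand-in for the router's error of the same name."""


class BudgetTracker:
    """Accumulate cost across primary, fallback, and retry calls."""

    def __init__(self, budget_usd: float) -> None:
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def add(self, cost_usd: float) -> None:
        """Record one call's cost and stop hard once the cap is exceeded."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.budget_usd:
            raise ModelBudgetExceededError(
                f"spent ${self.spent_usd:.2f} of ${self.budget_usd:.2f}"
            )
```

Because the exception is raised rather than swallowed, a caller iterating over candidates would stop immediately instead of trying the next model.
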

    policy:
      models:
        budget_usd: 0.50
        llm:
          default: openai/gpt-5-mini
          fallback: [vertex/gemini-2.5-flash]

Retry vs Fallback flow

    FallbackModel.generate(request, budget_usd=0.50)
    └── for each candidate:
        ├── model.generate(request, retry_policy=...)
        │   └── BaseLLM retry loop (backoff, trigger check)
        ├── success → track cumulative cost → check budget_usd
        ├── ModelBudgetExceededError → HARD STOP
        ├── catch ModelError → next candidate
        └── all exhausted → ModelExhaustedError

Reasoning Models (gpt-5, o1, o3)
Reasoning models require special handling:

    # Use max_completion_tokens, not max_tokens
    # Include extra budget for chain-of-thought reasoning
    if is_reasoning_model:
        bind_kwargs["max_completion_tokens"] = 8000  # 4k reasoning + 4k response
        # Temperature must be 1 for reasoning model API

RLM (Recursive Language Models)
RLM wraps any base LLM with recursive REPL capabilities. It is intended for processing massive contexts (50k+ items), where standard LLMs experience context degradation.
Reference: arXiv:2512.24601 | GitHub
Key Benefits:
- GPT-5-mini with RLM outperforms GPT-5 on long-context tasks
- Context stored as Python variable, not in prompt
- The model can grep, filter, iterate, and recursively analyze
- 60-70% cost reduction for bulk processing

Create an RLM-backed model through the registry:

    model = model_registry.create_llm(
        "rlm/gpt-5-mini",
        config=config,
        environment="docker",  # Isolated execution (recommended)
    )

| Environment | Use Case | Safety |
|---|---|---|
| local | Development | Low (same process) |
| docker | Production | High (isolated container) |
| modal | Cloud scaling | High |
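
The "context as a Python variable" idea can be illustrated without any RLM library. The toy below is only an analogy: the data is synthetic, `prompt_summary` and `grep` are made-up helpers, and the recursive sub-calls a real RLM would make are omitted.

```python
import re

# Synthetic context: 50,000 items held in a variable, never pasted into a prompt.
context = [
    f"order {i}: status={'failed' if i % 1000 == 0 else 'ok'}" for i in range(50_000)
]


def prompt_summary(items: list[str]) -> str:
    """What a model would be shown: the size and a tiny preview, not the data itself."""
    return f"context is a list of {len(items)} strings, e.g. {items[1]!r}"


def grep(items: list[str], pattern: str) -> list[str]:
    """The kind of operation a model could run in the REPL to narrow the context."""
    rx = re.compile(pattern)
    return [item for item in items if rx.search(item)]


failed = grep(context, r"status=failed")
```

The summary stays a few dozen characters long regardless of how large `context` grows, while `failed` narrows 50,000 items to the 50 that matter for a follow-up question.
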
Local Models
vLLM (OpenAI-compatible)

    python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --port 8000

Set LOCAL_VLLM_BASE_URL=http://localhost:8000/v1. Use local-vllm/meta-llama/Llama-3.1-8B-Instruct.
Ollama

    ollama serve &          # start the server in the background (it does not exit on its own)
    ollama pull llama3.1

Set LOCAL_OLLAMA_BASE_URL=http://localhost:11434/v1. Use local/llama3.1.
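
Both local providers are addressed through a base-URL environment variable plus a registry key. A small sketch of how a key prefix might map to its endpoint, where `resolve_local_endpoint` and the `LOCAL_PROVIDERS` table are hypothetical and only the variable names come from the text above:

```python
from typing import Mapping

# Provider prefix -> environment variable holding its OpenAI-compatible base URL.
LOCAL_PROVIDERS = {
    "local": "LOCAL_OLLAMA_BASE_URL",
    "local-vllm": "LOCAL_VLLM_BASE_URL",
}


def resolve_local_endpoint(key: str, env: Mapping[str, str]) -> tuple[str, str]:
    """Return (base_url, model_name) for a local registry key."""
    provider, _, name = key.partition("/")
    env_var = LOCAL_PROVIDERS.get(provider)
    if env_var is None or not name:
        raise ValueError(f"not a local model key: {key!r}")
    base_url = env.get(env_var)
    if not base_url:
        raise RuntimeError(f"{env_var} is not set")
    return base_url.rstrip("/"), name
```

Passing `os.environ` as `env` reads the real settings; the explicit error for a missing variable makes a forgotten export easier to spot than a connection failure would.
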
API Key Resolution
Router resolves keys via two-tier fallback:
| Priority | Source | Path |
|---|---|---|
| 1 | Shield | {tenant}/api_keys/{provider} |
| 2 | Router env | OPENAI_API_KEY, etc. |
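
The two-tier lookup can be sketched as below. Shield is modelled as a plain mapping from path to secret purely for illustration, `resolve_api_key` is a hypothetical helper, and the `<PROVIDER>_API_KEY` naming is an assumption generalised from the single OPENAI_API_KEY example in the table.

```python
from typing import Mapping, Optional


def resolve_api_key(
    tenant: str,
    provider: str,
    shield: Mapping[str, str],
    env: Mapping[str, str],
) -> Optional[str]:
    """Prefer the tenant's key in Shield; fall back to the Router environment."""
    shield_path = f"{tenant}/api_keys/{provider}"
    key = shield.get(shield_path)
    if key:
        return key
    # Assumption: environment variables follow the <PROVIDER>_API_KEY pattern.
    return env.get(f"{provider.upper()}_API_KEY")
```

Returning `None` when neither source has a key leaves it to the caller to decide whether that is an error.
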