Model Registry & Providers

ContextRouter never instantiates LLM provider SDKs directly. All AI generation is routed through the unified Model Registry.

Model Keys

Models are selected by registry key: "<provider>/<name>"

| Provider | Key Pattern | Modalities |
|---|---|---|
| Vertex AI | vertex/gemini-2.5-flash | Text + Image + Audio + Video |
| OpenAI | openai/gpt-5-mini | Text + Image + Audio (ASR) |
| Anthropic | anthropic/claude-sonnet-4 | Text + Image |
| Groq | groq/llama-3.3-70b-versatile | Text + Image + Audio (ASR) |
| Perplexity | perplexity/sonar | Text |
| OpenRouter | openrouter/openai/gpt-5.1 | Text + Image |
| RLM | rlm/gpt-5-mini | Text |
| Local (Ollama) | local/llama3.1 | Text + Image |
| Local (vLLM) | local-vllm/meta-llama/Llama-3.1-8B-Instruct | Text + Image |
| RunPod | runpod/custom-model | Text + Image |
| HuggingFace | hf/distilgpt2 | Task-dependent |
| HF Hub | hf-hub/model-name | Task-dependent |
| OpenAI Batch | openai-batch/gpt-5-mini | Text |
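
Several keys in the table carry slashes inside the model name (for example the OpenRouter and vLLM entries), so a key presumably has to be split on its first slash only. The helper below is a minimal sketch of that idea; `parse_model_key` and `ModelKey` are hypothetical names used for illustration, not part of the registry API.

```python
from typing import NamedTuple


class ModelKey(NamedTuple):
    provider: str
    name: str


def parse_model_key(key: str) -> ModelKey:
    """Split a registry key into provider and model name (hypothetical helper)."""
    # Split on the first "/" only: names such as "openai/gpt-5.1" contain slashes.
    provider, sep, name = key.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>/<name>', got {key!r}")
    return ModelKey(provider, name)


print(parse_model_key("openrouter/openai/gpt-5.1"))
print(parse_model_key("local-vllm/meta-llama/Llama-3.1-8B-Instruct"))
```

Rejecting keys without a provider prefix early keeps a bare model name such as `gpt-5-mini` from being routed to the wrong provider by accident.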

Multimodal Interface

All providers share a unified multimodal contract:

from contextunity.router.modules.models.types import ModelRequest, TextPart, ImagePart

# Text-only request
request = ModelRequest(
    parts=[TextPart(text="Hello, world!")],
    system="You are a helpful assistant",
    temperature=0.7,
)

# Multimodal request (text + image)
request = ModelRequest(
    parts=[
        TextPart(text="What's in this image?"),
        ImagePart(mime="image/jpeg", data_b64="...", uri="https://example.com/image.jpg"),
    ]
)
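
The `data_b64` field in the example above suggests that inline image data is passed as a base64 string. The sketch below shows one way to produce such a value from raw bytes. The `ImagePart` dataclass here is a local stand-in that only mirrors the three fields used above, and `image_part_from_bytes` is a hypothetical helper; neither is the real type from the router package.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImagePart:
    """Stand-in mirroring the fields shown above (not the real class)."""
    mime: str
    data_b64: Optional[str] = None
    uri: Optional[str] = None


def image_part_from_bytes(raw: bytes, mime: str = "image/jpeg") -> ImagePart:
    """Encode raw image bytes as an ASCII base64 string for inline transport."""
    return ImagePart(mime=mime, data_b64=base64.b64encode(raw).decode("ascii"))


part = image_part_from_bytes(b"\xff\xd8\xff\xe0 fake jpeg bytes")
print(part.mime, len(part.data_b64))
```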

Fallback System

Strategies

| Strategy | Behavior | Streaming |
|---|---|---|
| fallback (sequential) | Try candidates in order | Same; never switches mid-stream |
| parallel | Run all concurrently, return first success | Falls back to sequential |
| cost-priority | Same as fallback; order cheapest → most expensive | Sequential |
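
The parallel strategy ("run all concurrently, return first success") can be sketched with plain `asyncio`. This is an illustration under assumptions, not the router's implementation: `Candidate` is a hypothetical type (an async callable that takes a prompt and returns text), and a generic `RuntimeError` stands in for whatever error the router raises when every candidate fails.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

# Hypothetical candidate type: an async callable taking a prompt, returning text.
Candidate = Callable[[str], Awaitable[str]]


async def first_success(candidates: Sequence[Candidate], prompt: str) -> str:
    """Run all candidates concurrently and return the first successful result."""
    tasks = [asyncio.create_task(c(prompt)) for c in candidates]
    errors: List[BaseException] = []
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished
            except Exception as exc:  # a failed candidate does not end the race
                errors.append(exc)
        raise RuntimeError(f"all {len(tasks)} candidates failed: {errors!r}")
    finally:
        for task in tasks:
            task.cancel()  # no-op for tasks that have already finished
```

Cancelling the losers in `finally` avoids paying for generations nobody will read. A race like this cannot be applied to a stream that has already started emitting tokens, which is consistent with the table's note that parallel falls back to sequential when streaming.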

Error Handling

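# Simplified sketch of the candidate loop
for model in candidates:
    try:
        return model.generate(request)
    # Quota exhaustion → immediate fallback (no retries)
    except ModelQuotaExhaustedError:
        continue
    # Rate limiting → fallback with delay
    except ModelRateLimitError:
        continue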

Project vs Global Fallback

  • Project: specify fallback_keys in the contextunity.project.yaml manifest for per-node control
  • Global: set CU_ROUTER_ALLOW_GLOBAL_FALLBACK=true and CU_ROUTER_FALLBACK_LLMS as a safety net

If a node exhausts its fallback_keys and no global fallback is configured, the request fails gracefully by raising ModelExhaustedError.
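
One way to picture how the two levels combine is a helper that builds the ordered candidate list for a node. The sketch below is illustrative only. It assumes that the flag is the literal string `true` and that `CU_ROUTER_FALLBACK_LLMS` is a comma-separated list of registry keys; the source does not specify either format. `resolve_candidates` is a hypothetical name.

```python
import os
from typing import List, Mapping, Sequence


def resolve_candidates(
    primary: str,
    fallback_keys: Sequence[str],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Build the ordered candidate list for one node (illustrative only)."""
    candidates = [primary, *fallback_keys]
    # Assumption: flag value is "true"; global list is comma-separated.
    if env.get("CU_ROUTER_ALLOW_GLOBAL_FALLBACK", "").lower() == "true":
        raw = env.get("CU_ROUTER_FALLBACK_LLMS", "")
        candidates += [key.strip() for key in raw.split(",") if key.strip()]
    # Drop duplicates while preserving order.
    return list(dict.fromkeys(candidates))


print(resolve_candidates("openai/gpt-5-mini", ["vertex/gemini-2.5-flash"], env={}))
```

Project-level keys come first so the per-node choice always wins, and the global list is only appended as a last resort.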

Model Type Hierarchy

BaseModel (Identity: provider, model_name, model_key, _provider_info)
├── BaseLLM (Text generation: generate(), stream(), retry loop)
│   ├── OpenAILLM, AnthropicLLM, VertexLLM, GroqLLM, ...
│   └── FallbackModel (orchestrator: budget_usd, candidate sequencing)
└── BaseEmbeddings (Vector: embed_text(), embed_batch())

BaseLLM centralises:

  • Auto cost estimation — calls estimate_cost() when provider returns total_cost=None
  • ProviderInfo auto-attach — no per-provider boilerplate
  • Retry loop — backoff, trigger classification, timeout budget

Retry & Cost Governance

Three-tier retry hierarchy

| Tier | Source | Example |
|---|---|---|
| Global | policy.models.retry | Default for all model types |
| Per-type | policy.models.llm.retry | Override for LLMs specifically |
| Per-node | node.config.retry | Highest priority, e.g. JSON-output nodes |

policy:
  models:
    retry:
      max_attempts: 2
      backoff: exponential
      timeout_sec: 30
    llm:
      retry:
        retry_on: [rate_limit, timeout, network, response_format]
        max_attempts: 3
nodes:
  - name: planner
    config:
      output_format: "json"
      retry:
        max_attempts: 3
        retry_on: [rate_limit, timeout, network, response_format]
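
The precedence between the three tiers can be sketched as a layered dictionary merge. This assumes the tiers are merged key by key, with more specific tiers overriding individual settings; the router might instead replace the whole retry block, which the source does not say. `resolve_retry` is a hypothetical helper.

```python
from typing import Any, Dict, Mapping, Optional

RetryConfig = Mapping[str, Any]


def resolve_retry(
    global_retry: Optional[RetryConfig],
    type_retry: Optional[RetryConfig],
    node_retry: Optional[RetryConfig],
) -> Dict[str, Any]:
    """Merge the three tiers; later (more specific) tiers override earlier ones."""
    merged: Dict[str, Any] = {}
    for tier in (global_retry, type_retry, node_retry):
        merged.update(tier or {})  # a missing tier contributes nothing
    return merged


effective = resolve_retry(
    {"max_attempts": 2, "backoff": "exponential", "timeout_sec": 30},
    {"retry_on": ["rate_limit", "timeout", "network", "response_format"], "max_attempts": 3},
    None,  # this node defines no retry block of its own
)
print(effective)
```

Under this assumption a node without its own retry block still inherits `backoff` and `timeout_sec` from the global tier while picking up the LLM-specific `max_attempts`.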

Error taxonomy

| Error | Retryable | Fallbackable | Trigger key |
|---|---|---|---|
| ConnectionError / 5xx | ✅ | | network |
| ModelTimeoutError | ✅ | | timeout |
| ModelRateLimitError | ✅ | ✅ | rate_limit |
| ModelResponseFormatError | ✅ (opt-in) | | response_format |
| ModelQuotaExhaustedError | ❌ | ✅ | |
| ModelBudgetExceededError | ❌ | ❌ (hard stop) | |
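
To make the relationship between trigger keys and `retry_on` concrete, here is a minimal sketch of a retry loop with trigger classification and exponential backoff. The exception classes are local stand-ins named after the taxonomy above, and `trigger_key`, `call_with_retry`, and `base_delay` are hypothetical; the actual BaseLLM loop may differ (for example in how it treats 5xx responses or the timeout budget).

```python
import time
from typing import Callable, Collection, Optional, TypeVar

T = TypeVar("T")


# Stand-in exception classes named after the taxonomy (not the real ones).
class ModelError(Exception): ...
class ModelTimeoutError(ModelError): ...
class ModelRateLimitError(ModelError): ...
class ModelResponseFormatError(ModelError): ...
class ModelQuotaExhaustedError(ModelError): ...


def trigger_key(exc: BaseException) -> Optional[str]:
    """Map an exception to its retry trigger key, or None if never retried."""
    if isinstance(exc, ModelTimeoutError):
        return "timeout"
    if isinstance(exc, ModelRateLimitError):
        return "rate_limit"
    if isinstance(exc, ModelResponseFormatError):
        return "response_format"
    if isinstance(exc, ConnectionError):
        return "network"
    return None  # e.g. ModelQuotaExhaustedError has no trigger key


def call_with_retry(
    fn: Callable[[], T],
    retry_on: Collection[str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying only on errors whose trigger key is listed in retry_on."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            key = trigger_key(exc)
            if key is None or key not in retry_on or attempt == max_attempts:
                raise  # not retryable here: let the fallback layer decide
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise AssertionError("unreachable")
```

Because `response_format` is only retried when it appears in `retry_on`, leaving it out of the list reproduces the opt-in behavior noted in the table.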

Per-node cost budget

Set budget_usd in ModelsPolicy to cap total cost across all model candidates (primary + fallbacks + retries). Exceeding the budget triggers ModelBudgetExceededError — a hard stop with no fallback.

policy:
  models:
    budget_usd: 0.50
    llm:
      default: openai/gpt-5-mini
      fallback: [vertex/gemini-2.5-flash]

Retry vs Fallback flow

FallbackModel.generate(request, budget_usd=0.50)
└── for each candidate:
    ├── model.generate(request, retry_policy=...)
    │   └── BaseLLM retry loop (backoff, trigger check)
    ├── success → track cumulative cost → check budget_usd
    ├── ModelBudgetExceededError → HARD STOP
    ├── catch ModelError → next candidate
    └── all exhausted → ModelExhaustedError
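
The flow above can be approximated in a few lines of Python. This is a sketch under assumptions rather than the FallbackModel source: the exception classes and `Result` are local stand-ins, candidates are modelled as plain callables, and the `cost_usd` attribute on failed attempts is an invented way to let failures count toward the budget.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


class ModelError(Exception):
    def __init__(self, message: str, cost_usd: float = 0.0) -> None:
        super().__init__(message)
        self.cost_usd = cost_usd  # assumption: failed attempts report their spend


class ModelBudgetExceededError(ModelError): ...
class ModelExhaustedError(ModelError): ...


@dataclass
class Result:
    text: str
    total_cost: float  # USD


Candidate = Callable[[str], Result]  # hypothetical: retries happen inside the call


def _check_budget(spent: float, budget_usd: Optional[float]) -> None:
    if budget_usd is not None and spent > budget_usd:
        raise ModelBudgetExceededError(f"spent ${spent:.2f} of ${budget_usd:.2f}")


def generate_with_fallback(
    candidates: Sequence[Candidate], request: str, budget_usd: Optional[float] = None
) -> Result:
    spent = 0.0
    for generate in candidates:
        try:
            result = generate(request)
        except ModelBudgetExceededError:
            raise  # hard stop: never try another candidate
        except ModelError as exc:
            spent += exc.cost_usd
            _check_budget(spent, budget_usd)
            continue  # next candidate
        spent += result.total_cost
        _check_budget(spent, budget_usd)
        return result
    raise ModelExhaustedError("all candidates failed")
```

Catching `ModelBudgetExceededError` before the generic `ModelError` clause is what keeps the budget check a hard stop instead of just another reason to fall back.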

Reasoning Models (gpt-5, o1, o3)

Reasoning models require special handling:

# Use max_completion_tokens, not max_tokens
# Include extra budget for chain-of-thought reasoning
if is_reasoning_model:
    bind_kwargs["max_completion_tokens"] = 8000  # 4k reasoning + 4k response
    # Temperature must be 1 for reasoning model API
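
A self-contained version of this branching might look like the sketch below. Detecting reasoning models by name prefix is an assumption made for illustration (the prefixes come from the heading above), and `build_bind_kwargs` with its default values is a hypothetical helper, not the router's code.

```python
from typing import Any, Dict

# Assumption: reasoning models are detected by name prefix.
REASONING_PREFIXES = ("gpt-5", "o1", "o3")


def is_reasoning_model(model_name: str) -> bool:
    return model_name.startswith(REASONING_PREFIXES)


def build_bind_kwargs(
    model_name: str,
    max_tokens: int = 4000,
    temperature: float = 0.7,
    reasoning_budget: int = 4000,
) -> Dict[str, Any]:
    """Return generation kwargs, switching parameter names for reasoning models."""
    if is_reasoning_model(model_name):
        return {
            # Response budget plus extra room for chain-of-thought tokens.
            "max_completion_tokens": reasoning_budget + max_tokens,
            # The requested temperature is ignored; reasoning models require 1.
            "temperature": 1,
        }
    return {"max_tokens": max_tokens, "temperature": temperature}


print(build_bind_kwargs("gpt-5-mini"))
print(build_bind_kwargs("gpt-4o"))
```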

RLM (Recursive Language Models)

RLM wraps any base LLM with recursive REPL capabilities. It is intended for processing massive contexts (50k+ items), where standard LLMs experience context degradation.

Reference: arXiv:2512.24601 | GitHub

Key Benefits:

  • GPT-5-mini with RLM outperforms GPT-5 on long-context tasks
  • Context stored as Python variable, not in prompt
  • Model can grep, filter, iterate, and recursively analyze
  • 60-70% cost reduction for bulk processing

model = model_registry.create_llm(
    "rlm/gpt-5-mini",
    config=config,
    environment="docker",  # Isolated execution (recommended)
)

| Environment | Use Case | Safety |
|---|---|---|
| local | Development | Low (same process) |
| docker | Production | High (isolated container) |
| modal | Cloud scaling | High |

Local Models

vLLM (OpenAI-compatible)

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --port 8000

Set LOCAL_VLLM_BASE_URL=http://localhost:8000/v1. Use local-vllm/meta-llama/Llama-3.1-8B-Instruct.

Ollama

ollama serve            # runs in the foreground; keep it running, then in a second terminal:
ollama pull llama3.1

Set LOCAL_OLLAMA_BASE_URL=http://localhost:11434/v1. Use local/llama3.1.

API Key Resolution

Router resolves keys via two-tier fallback:

| Priority | Source | Path |
|---|---|---|
| 1 | Shield | {tenant}/api_keys/{provider} |
| 2 | Router env | OPENAI_API_KEY, etc. |
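
The lookup order in the table can be sketched as follows. Shield is modelled here as a plain mapping from path to secret, which is a stand-in for whatever client the router actually uses, and deriving the environment variable name as `<PROVIDER>_API_KEY` is an assumption extrapolated from `OPENAI_API_KEY`. `resolve_api_key` is a hypothetical name.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    provider: str,
    tenant: str,
    shield: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the tenant's key from Shield, else the router-wide env key."""
    # Priority 1: per-tenant secret stored under {tenant}/api_keys/{provider}.
    secret = shield.get(f"{tenant}/api_keys/{provider}")
    if secret:
        return secret
    # Priority 2: router environment (assumed naming: OPENAI_API_KEY, ...).
    env_name = provider.upper().replace("-", "_") + "_API_KEY"
    return env.get(env_name)


shield = {"acme/api_keys/openai": "sk-tenant"}
print(resolve_api_key("openai", "acme", shield, env={"OPENAI_API_KEY": "sk-router"}))
print(resolve_api_key("openai", "other", shield, env={"OPENAI_API_KEY": "sk-router"}))
```

Checking the tenant-scoped secret first lets individual tenants bring their own keys while the router-level variable remains a shared default.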