
Model Registry & Providers

ContextRouter never instantiates LLM provider SDKs directly. All AI generation is routed through the unified Model Registry.

Model Keys

Models are selected by registry key: "<provider>/<name>"

| Provider | Key Pattern | Modalities |
| --- | --- | --- |
| Vertex AI | `vertex/gemini-2.5-flash` | Text + Image + Audio + Video |
| OpenAI | `openai/gpt-5-mini` | Text + Image + Audio (ASR) |
| Anthropic | `anthropic/claude-sonnet-4` | Text + Image |
| Groq | `groq/llama-3.3-70b-versatile` | Text + Image + Audio (ASR) |
| Perplexity | `perplexity/sonar` | Text |
| OpenRouter | `openrouter/openai/gpt-5.1` | Text + Image |
| RLM | `rlm/gpt-5-mini` | Text |
| Local (Ollama) | `local/llama3.1` | Text + Image |
| Local (vLLM) | `local-vllm/meta-llama/Llama-3.1-8B-Instruct` | Text + Image |
| RunPod | `runpod/custom-model` | Text + Image |
| HuggingFace | `hf/distilgpt2` | Task-dependent |
| HF Hub | `hf-hub/model-name` | Task-dependent |
| OpenAI Batch | `openai-batch/gpt-5-mini` | Text |
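Note that keys split on the first slash only, since model names (for example `openrouter/openai/gpt-5.1`) may themselves contain slashes. A minimal sketch of that parsing rule (the `parse_model_key` helper is illustrative, not part of the ContextRouter API):

```python
def parse_model_key(key: str) -> tuple[str, str]:
    # Split on the FIRST slash only: the provider never contains a slash,
    # but the model name may (e.g. HuggingFace org/model paths).
    provider, _, name = key.partition("/")
    if not provider or not name:
        raise ValueError(f"invalid registry key: {key!r}")
    return provider, name

print(parse_model_key("openai/gpt-5-mini"))
# ('openai', 'gpt-5-mini')
print(parse_model_key("local-vllm/meta-llama/Llama-3.1-8B-Instruct"))
# ('local-vllm', 'meta-llama/Llama-3.1-8B-Instruct')
```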

Multimodal Interface

All providers share a unified multimodal contract:

```python
from contextunity.router.modules.models.types import ModelRequest, TextPart, ImagePart

# Text-only request
request = ModelRequest(
    parts=[TextPart(text="Hello, world!")],
    system="You are a helpful assistant",
    temperature=0.7,
)

# Multimodal request (text + image)
request = ModelRequest(
    parts=[
        TextPart(text="What's in this image?"),
        ImagePart(mime="image/jpeg", data_b64="...", uri="https://example.com/image.jpg"),
    ]
)
```
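The `data_b64` field carries base64-encoded image bytes. A minimal sketch of producing it, using a plain dict in place of the real `ImagePart` so the example stays self-contained (the helper name is hypothetical, not ContextRouter API):

```python
import base64

def image_part_from_bytes(data: bytes, mime: str = "image/jpeg") -> dict:
    # A dict stands in for the real ImagePart dataclass; only the
    # base64 encoding step is the point of this sketch.
    return {"mime": mime, "data_b64": base64.b64encode(data).decode("ascii")}

part = image_part_from_bytes(b"\xff\xd8\xff\xe0")  # JPEG magic bytes
print(part["data_b64"])  # /9j/4A==
```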

Fallback System

Strategies

| Strategy | Behavior | Streaming |
| --- | --- | --- |
| fallback (sequential) | Try candidates in order | Same — never switch mid-stream |
| parallel | Run all concurrently, return first success | Falls back to sequential |
| cost-priority | Same as fallback; order cheapest → most expensive | Sequential |
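The parallel strategy can be sketched as a first-success race over candidate callables (the `first_success` helper and its candidates are illustrative stand-ins for real model calls, not registry internals):

```python
import concurrent.futures

def first_success(candidates):
    """Run every candidate concurrently; return the first successful result."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn) for fn in candidates]
        for fut in concurrent.futures.as_completed(futures):
            try:
                result = fut.result()
            except Exception:
                continue  # a failed candidate simply drops out of the race
            for other in futures:
                other.cancel()  # best-effort cancel of the remaining candidates
            return result
    raise RuntimeError("all candidates failed")

def flaky():
    raise TimeoutError("provider down")

print(first_success([flaky, lambda: "ok"]))  # ok
```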

Error Handling

Each error class takes its own recovery path inside the sequential fallback loop (sketch; the loop variables and the model_for helper are illustrative):

```python
for key in fallback_keys:  # candidate registry keys, in order
    try:
        return model_for(key).generate(request)
    except ModelQuotaExhaustedError:
        # Quota exhaustion → immediate fallback (no retries)
        continue
    except ModelRateLimitError:
        # Rate limiting → fallback with delay
        time.sleep(retry_delay)
        continue
```

Project vs Global Fallback

  • Project: fallback_keys in the contextunity.project.yaml manifest — per-node control
  • Global: CU_ROUTER_ALLOW_GLOBAL_FALLBACK=true plus CU_ROUTER_FALLBACK_LLMS — safety net

If a node exhausts its fallback_keys and no global fallback is configured, the request fails gracefully.
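The resolution order above can be sketched as follows, assuming CU_ROUTER_FALLBACK_LLMS holds a comma-separated list of registry keys (the separator and the helper name are assumptions, not documented behavior):

```python
import os

def resolve_fallback_keys(project_keys: list[str]) -> list[str]:
    if project_keys:  # per-node keys from the project manifest take precedence
        return project_keys
    if os.environ.get("CU_ROUTER_ALLOW_GLOBAL_FALLBACK", "").lower() == "true":
        # Assumed format: comma-separated registry keys
        raw = os.environ.get("CU_ROUTER_FALLBACK_LLMS", "")
        return [k.strip() for k in raw.split(",") if k.strip()]
    return []  # nothing left to try: the request fails

os.environ["CU_ROUTER_ALLOW_GLOBAL_FALLBACK"] = "true"
os.environ["CU_ROUTER_FALLBACK_LLMS"] = "openai/gpt-5-mini, groq/llama-3.3-70b-versatile"
print(resolve_fallback_keys([]))
# ['openai/gpt-5-mini', 'groq/llama-3.3-70b-versatile']
```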

Reasoning Models (gpt-5, o1, o3)

Reasoning models require special handling:

```python
if is_reasoning_model:
    # Use max_completion_tokens, not max_tokens, and include
    # extra budget for chain-of-thought reasoning
    bind_kwargs["max_completion_tokens"] = 8000  # 4k reasoning + 4k response
    # Temperature must be 1 for the reasoning-model API
    bind_kwargs["temperature"] = 1
```
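A runnable contrast of the two parameter styles (the bind_kwargs_for helper and its prefix check are illustrative; the registry's real detection logic is not shown on this page):

```python
def bind_kwargs_for(model_name: str, budget: int = 8000) -> dict:
    # Assumed detection: reasoning models are identified by name prefix
    is_reasoning = model_name.startswith(("gpt-5", "o1", "o3"))
    if is_reasoning:
        # Reasoning API: max_completion_tokens covers chain-of-thought
        # plus the visible answer; temperature is fixed at 1
        return {"max_completion_tokens": budget, "temperature": 1}
    return {"max_tokens": budget}

print(bind_kwargs_for("gpt-5-mini"))     # {'max_completion_tokens': 8000, 'temperature': 1}
print(bind_kwargs_for("claude-sonnet-4"))  # {'max_tokens': 8000}
```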

RLM (Recursive Language Models)

RLM wraps any base LLM with recursive REPL capabilities, targeting massive contexts (50k+ items) where standard LLMs suffer context degradation.

Reference: arXiv:2512.24601 | GitHub

Key Benefits:

  • GPT-5-mini with RLM outperforms GPT-5 on long-context tasks
  • Context stored as Python variable, not in prompt
  • Model can grep, filter, iterate, and recursively analyze
  • 60-70% cost reduction for bulk processing
```python
model = model_registry.create_llm(
    "rlm/gpt-5-mini",
    config=config,
    environment="docker",  # Isolated execution (recommended)
)
```
| Environment | Use Case | Safety |
| --- | --- | --- |
| local | Development | Low (same process) |
| docker | Production | High (isolated container) |
| modal | Cloud scaling | High |

Local Models

vLLM (OpenAI-compatible)

```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct --port 8000
```

Set LOCAL_VLLM_BASE_URL=http://localhost:8000/v1. Use local-vllm/meta-llama/Llama-3.1-8B-Instruct.
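Because the endpoint is OpenAI-compatible, it can be exercised with nothing but the standard library. A sketch (the chat_request helper is illustrative; the payload follows the OpenAI chat-completions schema):

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    # Build (but do not send) a chat-completions request for the local server
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("http://localhost:8000/v1",
                   "meta-llama/Llama-3.1-8B-Instruct", "Hello!")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
# urllib.request.urlopen(req) would send it once the server above is running.
```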

Ollama

```shell
ollama serve &   # serve blocks, so background it (or use a separate terminal)
ollama pull llama3.1
```

Set LOCAL_OLLAMA_BASE_URL=http://localhost:11434/v1. Use local/llama3.1.

API Key Resolution

Router resolves keys via two-tier fallback:

| Priority | Source | Path |
| --- | --- | --- |
| 1 | Shield | {tenant}/api_keys/{provider} |
| 2 | Router env | OPENAI_API_KEY, etc. |