NLP Enrichment

ContextBrain enriches documents during ingestion using local NLP pipelines — no LLM calls required unless explicitly configured.

Pipelines

| Pipeline       | Library               | Purpose                                    | Model            |
|----------------|-----------------------|--------------------------------------------|------------------|
| NER            | spaCy                 | Entity extraction (ORG, GPE, PERSON, etc.) | en_core_web_sm   |
| Topics         | KeyBERT               | Keyword/topic extraction                   | all-MiniLM-L6-v2 |
| Classification | sentence-transformers | Zero-shot categorization                   | all-MiniLM-L6-v2 |

All pipelines are local (no external API calls) and optional (degrade gracefully if dependencies are missing).
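Graceful degradation can be implemented with an availability check along these lines (a minimal sketch using importlib, not the actual ContextBrain code):

```python
import importlib.util

def pipeline_available(module_name: str) -> bool:
    """Return True if the optional NLP dependency can be imported."""
    return importlib.util.find_spec(module_name) is not None

# Activate only the pipelines whose libraries are actually installed.
ACTIVE_PIPELINES = {
    name: module
    for name, module in {"ner": "spacy", "topics": "keybert"}.items()
    if pipeline_available(module)
}
```

Checking for the module spec rather than importing eagerly keeps startup fast and avoids raising on missing optional extras.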

Zero-Shot Classification

Documents are automatically classified against 12 predefined domain labels:

commerce, infrastructure, security, documentation, data engineering, machine learning, web development, devops, api design, database, observability, testing
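The mechanism behind embedding-based zero-shot classification can be sketched as: embed the document and each candidate label, then return the label with the highest cosine similarity. The bag-of-words embed below is a toy stand-in for illustration only; the real pipeline uses dense all-MiniLM-L6-v2 vectors:

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; the real pipeline uses dense vectors."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(text: str, labels: list[str]) -> str:
    """Zero-shot: return the label whose embedding is closest to the text's."""
    doc = embed(text)
    return max(labels, key=lambda label: cosine(doc, embed(label)))
```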

Custom labels can be provided per use case:

from contextunity.brain.service.nlp import NLPEnricher
enricher = NLPEnricher(category_labels=["finance", "healthcare", "legal"])
result = enricher.enrich("Patient records must comply with HIPAA regulations.")
print(result.top_category) # "healthcare"

IntelligenceHub

IntelligenceHub is the enrichment coordinator that orchestrates all NLP pipelines during ingestion. It falls back to regex-based extractors if NLP dependencies are not installed:

from contextunity.brain.modules.intelligence.hub import IntelligenceHub
hub = IntelligenceHub()
result = await hub.enrich_content(text)
# result keys: entities, keyphrases, keywords, topics, language

The hub is invoked automatically during document ingestion via IngestionService._enrich_metadata().
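The regex fallback could look something like the following (hypothetical patterns; the real IntelligenceHub ships its own extractors) while preserving the same result keys:

```python
import re

# Hypothetical fallback patterns, shown for illustration only.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def regex_enrich(text: str) -> dict:
    """Minimal regex-based enrichment for when spaCy/KeyBERT are unavailable."""
    return {
        "entities": EMAIL_RE.findall(text) + URL_RE.findall(text),
        "keyphrases": [],
        "keywords": sorted({w.lower() for w in re.findall(r"[A-Za-z]{6,}", text)})[:10],
        "topics": [],
        "language": "en",  # assumed default when no language detector is installed
    }
```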

Installation

NLP libraries are optional extras:

# Install NLP support
uv pip install contextunity-brain[nlp]
# Download spaCy model
uv run python -m spacy download en_core_web_sm

Without these dependencies, Brain still functions — enrichment simply returns empty results and logs a warning.

Embedding Cache

Brain caches all computed embeddings to avoid redundant API calls:

Query text → sha256 hash → cache key

┌───────────────┐        ┌───────────────┐
│ Redis Cache   │ ─HIT──▶│ Return cached │
│ (primary)     │        │ embedding     │
└───────────────┘        └───────────────┘
        │ MISS
        ▼
┌───────────────┐        ┌───────────────┐
│ In-Memory     │ ─HIT──▶│ Return cached │
│ LRU (2048)    │        │ embedding     │
└───────────────┘        └───────────────┘
        │ MISS
        ▼
┌───────────────┐        ┌───────────────┐
│ OpenAI API    │ ──────▶│ Store in both │
│ (or local)    │        │ caches + ret. │
└───────────────┘        └───────────────┘

Cache Details

| Property         | Value                                                |
|------------------|------------------------------------------------------|
| Redis key format | emb:{sha256_of_model:text} (~20 KB per entry)        |
| TTL              | 7 days (configurable via EmbeddingCache.TTL_SECONDS) |
| Fallback         | In-memory dict (max 2048 entries) if Redis is down   |
| Stats logging    | Every 50 requests, with hit-rate percentage          |
| Config           | REDIS_URL environment variable                       |
Cache statistics can be inspected at runtime:

from contextunity.brain.service.embedders import get_embedding_cache
cache = get_embedding_cache()
print(cache.stats) # e.g. {'hits': 45, 'misses': 5, 'hit_rate': '90%'}
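Given the documented key format, key derivation can be sketched with hashlib (an assumed reading of the format, not necessarily the exact implementation):

```python
import hashlib

def cache_key(model: str, text: str) -> str:
    """Derive a Redis key of the form emb:{sha256 of "model:text"}."""
    digest = hashlib.sha256(f"{model}:{text}".encode("utf-8")).hexdigest()
    return f"emb:{digest}"
```

Hashing the model name together with the text keeps embeddings from different models from colliding under the same key.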