NLP Enrichment

ContextBrain enriches documents during ingestion using local NLP pipelines — no LLM calls required unless explicitly configured.

Pipelines

| Pipeline       | Library               | Purpose                                    | Model            |
|----------------|-----------------------|--------------------------------------------|------------------|
| NER            | spaCy                 | Entity extraction (ORG, GPE, PERSON, etc.) | en_core_web_sm   |
| Topics         | KeyBERT               | Keyword/topic extraction                   | all-MiniLM-L6-v2 |
| Classification | sentence-transformers | Zero-shot categorization                   | all-MiniLM-L6-v2 |

All pipelines are local (no external API calls) and optional (degrade gracefully if dependencies are missing).
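Graceful degradation can be implemented with an availability check along these lines (a minimal sketch using importlib, not the actual ContextBrain code):

```python
import importlib.util

def pipeline_available(module_name: str) -> bool:
    """Return True if the optional NLP dependency can be imported."""
    return importlib.util.find_spec(module_name) is not None

# Activate only the pipelines whose libraries are actually installed.
ACTIVE_PIPELINES = {
    name: module
    for name, module in {"ner": "spacy", "topics": "keybert"}.items()
    if pipeline_available(module)
}
```

Checking for the module spec rather than importing eagerly keeps startup fast and avoids raising on missing optional extras.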

Zero-Shot Classification

Documents are automatically classified against 12 predefined domain labels:

commerce, infrastructure, security, documentation, data engineering, machine learning, web development, devops, api design, database, observability, testing
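The mechanism behind embedding-based zero-shot classification can be sketched as: embed the document and each candidate label, then return the label with the highest cosine similarity. The bag-of-words embed below is a toy stand-in for illustration only; the real pipeline uses dense all-MiniLM-L6-v2 vectors:

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; the real pipeline uses dense vectors."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(text: str, labels: list[str]) -> str:
    """Zero-shot: return the label whose embedding is closest to the text's."""
    doc = embed(text)
    return max(labels, key=lambda label: cosine(doc, embed(label)))
```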

Custom labels can be provided per use case:

from contextunity.brain.service.nlp import NLPEnricher
enricher = NLPEnricher(category_labels=["finance", "healthcare", "legal"])
result = enricher.enrich("Patient records must comply with HIPAA regulations.")
print(result.top_category) # "healthcare"

IntelligenceHub

IntelligenceHub is the enrichment coordinator that orchestrates all NLP pipelines during ingestion. It falls back to regex-based extractors if NLP dependencies are not installed:

from contextunity.brain.modules.intelligence.hub import IntelligenceHub
hub = IntelligenceHub()
result = await hub.enrich_content(text)
# result keys: entities, keyphrases, keywords, topics, language

The hub is invoked automatically during document ingestion via IngestionService._enrich_metadata().
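The regex fallback could look something like the following (hypothetical patterns; the real IntelligenceHub ships its own extractors) while preserving the same result keys:

```python
import re

# Hypothetical fallback patterns, shown for illustration only.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def regex_enrich(text: str) -> dict:
    """Minimal regex-based enrichment for when spaCy/KeyBERT are unavailable."""
    return {
        "entities": EMAIL_RE.findall(text) + URL_RE.findall(text),
        "keyphrases": [],
        "keywords": sorted({w.lower() for w in re.findall(r"[A-Za-z]{6,}", text)})[:10],
        "topics": [],
        "language": "en",  # assumed default when no language detector is installed
    }
```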

Installation

NLP libraries are optional extras:

# Install NLP support
uv pip install contextunity-brain[nlp]
# Download spaCy model
uv run python -m spacy download en_core_web_sm

Without these dependencies, Brain still functions — enrichment simply returns empty results and logs a warning.

Embedding Cache

Brain caches all computed embeddings to avoid redundant API calls:

Query text → sha256 hash → cache key

┌───────────────┐        ┌───────────────┐
│ Redis Cache   │ ─HIT──▶│ Return cached │
│ (primary)     │        │ embedding     │
└───────────────┘        └───────────────┘
        │ MISS
        ▼
┌───────────────┐        ┌───────────────┐
│ In-Memory     │ ─HIT──▶│ Return cached │
│ LRU (2048)    │        │ embedding     │
└───────────────┘        └───────────────┘
        │ MISS
        ▼
┌───────────────┐        ┌───────────────┐
│ OpenAI API    │ ──────▶│ Store in both │
│ (or local)    │        │ caches + ret. │
└───────────────┘        └───────────────┘

Cache Details

| Property         | Value                                                |
|------------------|------------------------------------------------------|
| Redis key format | emb:{sha256_of_model:text} (~20 KB per entry)        |
| TTL              | 7 days (configurable via EmbeddingCache.TTL_SECONDS) |
| Fallback         | In-memory dict (max 2048 entries) if Redis is down   |
| Stats logging    | Every 50 requests, with hit-rate percentage          |
| Config           | REDIS_URL environment variable                       |
Cache statistics can be inspected at runtime:

from contextunity.brain.service.embedders import get_embedding_cache
cache = get_embedding_cache()
print(cache.stats) # e.g. {'hits': 45, 'misses': 5, 'hit_rate': '90%'}
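Given the documented key format, key derivation can be sketched with hashlib (an assumed reading of the format, not necessarily the exact implementation):

```python
import hashlib

def cache_key(model: str, text: str) -> str:
    """Derive a Redis key of the form emb:{sha256 of "model:text"}."""
    digest = hashlib.sha256(f"{model}:{text}".encode("utf-8")).hexdigest()
    return f"emb:{digest}"
```

Hashing the model name together with the text keeps embeddings from different models from colliding under the same key.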