dnomia-knowledge
Local knowledge engine for codebases with hybrid search, knowledge graph, and interaction tracking, enabling Claude Code to search and interact with project knowledge locally.
Local knowledge engine for codebases. Indexes your markdown and source code into a single SQLite database with hybrid search (FTS5 keyword + vector semantic), knowledge graph, and developer interaction tracking.
Built for Claude Code via MCP (Model Context Protocol). Works entirely on your machine. Your code never leaves your computer.
Why
Developer tools treat every session as a blank slate. You search for the same function, re-read the same config file, and re-discover the same architectural pattern across dozens of conversations. Grep finds exact strings but misses semantic matches. Embeddings find related concepts but miss exact terms. Neither remembers what you searched for last week.
Existing solutions don't fit solo developers:
- GitHub Copilot / Cursor indexing is cloud-dependent and opaque. Your code leaves your machine.
- RAG pipelines require infrastructure (vector DBs, embedding APIs, chunking services) that cost money and attention.
- IDE search is per-project, per-session, with no memory of what matters to you.
dnomia-knowledge runs entirely on your machine: one SQLite database, one embedding model, hybrid search that combines keyword precision with semantic recall. It tracks which files you actually read and edit, boosts search results by your usage patterns, and syncs automatically on every commit. No cloud, no API keys, no infrastructure to manage.
What it does
Hybrid search across your projects. Not just grep, not just embeddings. FTS5 (BM25) and sqlite-vec (cosine KNN) run in parallel, merged with Reciprocal Rank Fusion. Finds code and documentation that keyword search misses and vector search misranks.
Knowledge graph over your codebase. Chunks are connected by markdown links, shared tags, categories, import statements, and semantic similarity. Community detection (Louvain) and PageRank surface the structure of your project.
Interaction tracking learns what matters to you. Every file you read and edit is logged. Search results are boosted by your actual usage patterns. Trace analytics show which files are hot, which knowledge gaps exist, and which areas are decaying.
Cross-project search lets you query across all your indexed repositories at once. Related projects can be linked via config for unified search results.
Continuous indexing keeps everything fresh. Git post-commit hooks and a periodic job (launchd on macOS) re-index changed files automatically. No daemon, no persistent memory usage.
Knowledge lifecycle (schema v4, opt-in) adds confidence scoring, supersession, and contradiction detection on top of the index. Chunks decay when ignored and strengthen when you read or edit them. Duplicate slugs and code-symbol collisions surface automatically. Ranking can be tuned by confidence, superseded chunks drop out of default search, and the lifecycle MCP tool exposes state to Claude Code. Inspired by the LLM Wiki v2 proposal extending Karpathy's LLM Wiki concept. Enable with [lifecycle] enabled = true in .knowledge.toml and DNOMIA_KNOWLEDGE_LIFECYCLE=1 for the hook.
Lifecycle at a glance
# See contradictions (e.g. duplicate slugs across posts)
dnomia-knowledge contradictions --project my-site
# Inspect a chunk's confidence and event history
dnomia-knowledge confidence 1234
# Mark an outdated doc as superseded by its rewrite
dnomia-knowledge supersede 1234 5678 --yes
# Apply daily decay sweep (or install launchd via --plist)
dnomia-knowledge forget --project my-site
From Claude Code via MCP, lifecycle(chunk_id, action='info') shows a chunk's state, action='reinforce' bumps its confidence, and action='supersede' and action='restore' edit the supersession pointer.
Quick start
# Clone and install
git clone https://github.com/ceaksan/dnomia-knowledge.git
cd dnomia-knowledge
python3.11 -m venv .venv
source .venv/bin/activate
pip install -e .
# Index your first project
dnomia-knowledge index /path/to/your/project
# Search
dnomia-knowledge search "authentication middleware"
# See what files you access most
dnomia-knowledge trace hot
The embedding model (intfloat/multilingual-e5-base, ~500MB) downloads automatically on first run.
Connect to Claude Code
Add to ~/.claude/settings.json under mcpServers:
{
"dnomia-knowledge": {
"command": "/path/to/dnomia-knowledge/.venv/bin/python",
"args": ["-m", "dnomia_knowledge.server"],
"env": {
"DNOMIA_KNOWLEDGE_PROJECT": "my-project"
}
}
}
Claude Code now has access to 6 MCP tools: search, index_project, project_info, graph_query, read_file, and fetch_and_index.
Claude Code hooks (optional)
Track file interactions automatically by adding hooks to ~/.claude/settings.json:
{
"hooks": {
"PreToolUse": [
{
"matcher": "Read|Grep",
"hooks": [
{
"type": "command",
"command": "/path/to/.venv/bin/python -m dnomia_knowledge.hooks.pre_tool_use"
}
]
}
],
"PostToolUse": [
{
"matcher": "Read|Edit",
"hooks": [
{
"type": "command",
"command": "/path/to/.venv/bin/python -m dnomia_knowledge.hooks.post_tool_use"
}
]
}
]
}
}
The PreToolUse hook redirects large file reads (>300 lines) to the knowledge base search. The PostToolUse hook logs every Read/Edit for interaction tracking and search ranking.
Project configuration
Create .knowledge.toml in your project root to control what gets indexed:
[project]
name = "my-project"
type = "saas" # content | saas | static
[content]
paths = ["docs/"]
extensions = [".md", ".mdx"]
[code]
preset = "python" # web | python | django | mixed
paths = ["src/"]
max_chunk_lines = 50
[graph]
enabled = true
edge_types = ["link", "tag", "semantic", "import"]
semantic_threshold = 0.75
[indexing]
ignore_patterns = ["node_modules", "dist", "__pycache__", ".venv"]
max_file_size_kb = 500
[injection]
enabled = false # experimental; off by default
max_hint_tokens = 500
cache_ttl_seconds = 300
hot_limit = 3
recent_limit = 3
Without .knowledge.toml, the indexer defaults to .md and .mdx files only.
Experimental: passive context injection
With [injection] enabled = true, the server augments the search tool description on every tools/list call with a compact hint listing the top hot files and recent edits for DNOMIA_KNOWLEDGE_PROJECT. The model sees this passively, without calling search first. The hint is capped at 500 tokens, cached for 5 minutes, and falls back silently on error. See ADR-001.
CLI reference
Indexing
dnomia-knowledge index <path> # Index a project directory
dnomia-knowledge index <path> --full # Force full reindex (skip incremental)
dnomia-knowledge index-all # Index all registered projects
dnomia-knowledge index-all --changed # Only projects with changes since last index
Search
dnomia-knowledge search <query> # Search all projects
dnomia-knowledge search <query> -p my-project # Filter by project
dnomia-knowledge search <query> -d code # Filter by domain (code|content|all)
dnomia-knowledge search <query> --lang python # Filter by language
Trace analytics
dnomia-knowledge trace hot # Most accessed files (reads + edits + searches)
dnomia-knowledge trace gaps # Searches that returned zero results
dnomia-knowledge trace decay # Files with declining activity over time
dnomia-knowledge trace queries # Most frequent search patterns
All trace commands accept --project/-p, --days/-d (default 30), and --limit/-l (default 20).
Git history analysis
dnomia-knowledge git-sync <path> # Sync git log into the database
dnomia-knowledge analyze churn # Most modified files by insertions + deletions
dnomia-knowledge analyze hotspots # Directory-level churn aggregation
dnomia-knowledge analyze crossover # Fuse git churn with trace read data
Crossover analysis assigns signals to files based on change frequency vs read frequency:
| Signal | Meaning |
|---|---|
| BLIND | High churn, zero reads. Changing but never consulted. |
| TURBULENT | High churn, low reads. Unstable and under-monitored. |
| HOT | High churn, high reads. Core active area. |
| STABLE | Low churn, high reads. Settled reference code. |
| ZOMBIE | Zero churn, some reads. Read but never touched. |
| COLD | Low churn, low reads. Inactive. |
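The signal table can be read as a small decision function. The sketch below is one possible interpretation: the thresholds high_churn and high_reads, the order in which rules are checked, and the treatment of files with zero churn and zero reads as COLD are all assumptions for illustration, not values taken from the project.

```python
def classify(churn: int, reads: int, high_churn: int = 100, high_reads: int = 10) -> str:
    """Map a file's churn (lines changed) and read count to a crossover signal.

    The thresholds are hypothetical; tune them to your own repositories.
    """
    if churn >= high_churn:
        if reads == 0:
            return "BLIND"  # changing, never consulted
        return "HOT" if reads >= high_reads else "TURBULENT"
    if churn == 0 and reads > 0:
        return "ZOMBIE"  # read, never touched
    return "STABLE" if reads >= high_reads else "COLD"
```

Because the zero-churn rule is checked before the read threshold, a file that is read often but never changed is reported as ZOMBIE rather than STABLE in this sketch.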
Knowledge graph
dnomia-knowledge graph rebuild # Rebuild all edges for a project
dnomia-knowledge graph communities # Run Louvain community detection + PageRank
Continuous indexing
dnomia-knowledge install-hooks # Git post-commit hooks on all projects
dnomia-knowledge install-hooks --uninstall
dnomia-knowledge install-launchd # macOS launchd job (every 5 min)
dnomia-knowledge install-launchd --uninstall
Other
dnomia-knowledge project-info # List all projects with stats
dnomia-knowledge read-file <path> # Smart file reading with chunk awareness
dnomia-knowledge export # CSV export of all chunks
How it works
Search pipeline
Query
-> embed with "query: " prefix (768d vector, cached)
-> FTS5 BM25 search (keyword matching)
-> sqlite-vec KNN search (semantic similarity)
-> RRF merge (k=60): score = sum(1/(k + rank + 1))
-> Fallback: prefix matching if both return empty
-> Interaction boost: re-rank by read/edit frequency (30-day window)
-> Return top N with snippets
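The RRF merge step can be sketched in a few lines. This is a minimal illustration of the formula above with 0-based ranks, not the project's search.py; the function name rrf_merge and the sample chunk IDs are made up for the example.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion."""
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):  # rank is 0-based
            scores[chunk_id] += 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fts_hits = [11, 42, 7]   # hypothetical keyword results, best first
vec_hits = [42, 99, 11]  # hypothetical semantic results, best first
merged = rrf_merge([fts_hits, vec_hits])
```

Chunks that appear in both lists (42 and 11 here) accumulate two terms and rise above chunks found by only one search, which is the point of fusing ranks instead of comparing raw BM25 and cosine scores.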
Indexing pipeline
Project directory
-> Scan: filter by extension, size, .gitignore, config patterns
-> For each changed file (MD5 hash comparison):
-> .md/.mdx -> heading-based chunker (##/### splits, frontmatter)
-> code -> Tree-sitter AST chunker (functions, classes, methods)
-> Embed passages (batch=8, "passage: " prefix)
-> Atomic transaction: delete old + insert chunks + insert vectors
-> Build graph edges (link, tag, category, semantic, import)
-> Update project metadata + git commit hash
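The change-detection step compares each file's MD5 hash with the one stored at the last index. A minimal sketch of that idea follows; file_md5, changed_files, and the known_hashes dictionary (standing in for the file_index table) are hypothetical names for this example.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash a file in blocks so large files are not loaded into memory at once."""
    digest = hashlib.md5(usedforsecurity=False)  # change detection, not security
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def changed_files(paths: list[Path], known_hashes: dict[str, str]) -> list[Path]:
    """Return files that are new or whose content differs from the stored hash."""
    return [p for p in paths if known_hashes.get(str(p)) != file_md5(p)]
```

Only the files returned by changed_files would need to be re-chunked and re-embedded, which keeps incremental runs cheap compared with a --full reindex.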
Continuous indexing
git commit -> post-commit hook -> file lock -> background reindex
launchd (every 5 min) -> index-all --changed -> git HEAD comparison -> reindex
Only one index process runs at a time. File locks prevent concurrent embedding model loads (protects 8GB RAM machines).
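One common way to enforce a single indexer on macOS and Linux is an advisory lock taken with flock. The sketch below shows the pattern under that assumption; the helper name index_lock and the lock-file path are illustrative, and the project's actual locking code may differ.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def index_lock(lock_path: str):
    """Yield True if this caller holds the lock, False if another holder exists."""
    with open(lock_path, "w") as fh:
        try:
            # Non-blocking exclusive lock: fail fast instead of queueing up
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```

A post-commit hook could wrap its work in "with index_lock(path) as acquired:" and exit immediately when acquired is False, so a second embedding model is never loaded while a reindex is already running.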
Architecture
Single SQLite database with three search layers:
| Layer | Technology | Purpose |
|---|---|---|
| Keyword | FTS5 (BM25) | Porter stemmer, unicode61 tokenizer |
| Semantic | sqlite-vec | Cosine KNN on 768d normalized vectors |
| Graph | NetworkX | Louvain communities, PageRank, BFS traversal |
Embedding model: intfloat/multilingual-e5-base (768d). Lazy loaded on first search, auto-unloads after 10 minutes idle. Runs on 8GB RAM.
Code parsing: Tree-sitter with language pack. Extracts functions, classes, methods, structs, interfaces, enums with proper boundaries. Falls back to sliding-window chunking for unsupported languages.
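The sliding-window fallback can be pictured as fixed-size line windows that overlap slightly so context is not cut at chunk boundaries. The sketch below assumes a 50-line window (matching max_chunk_lines in the sample config) and a 5-line overlap; the overlap value and the function name are assumptions, not documented behavior.

```python
def sliding_window_chunks(
    lines: list[str], max_lines: int = 50, overlap: int = 5
) -> list[tuple[int, int, str]]:
    """Split lines into overlapping windows of (start_line, end_line, text).

    Line numbers are 1-based and inclusive.
    """
    if not 0 <= overlap < max_lines:
        raise ValueError("overlap must be in [0, max_lines)")
    chunks = []
    step = max_lines - overlap
    for start in range(0, len(lines), step):
        end = min(start + max_lines, len(lines))
        chunks.append((start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break  # the last window already reaches the end of the file
    return chunks
```

For a 120-line file with these defaults, the windows cover lines 1-50, 46-95, and 91-120.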
Module structure
src/dnomia_knowledge/
server.py MCP server (6 tools, thread-safe singletons)
store.py SQLite persistence, schema v3, migrations, triggers
search.py Hybrid FTS5 + vector, RRF merge, interaction boost
indexer.py Scan -> chunk -> embed -> store pipeline
graph.py Edge builder, Louvain community detection, PageRank
embedder.py Lazy sentence-transformer, LRU cache, auto-unload
cli.py Rich CLI with 10+ commands
registry.py .knowledge.toml config loader (Pydantic v2)
models.py Chunk, SearchResult, IndexResult, InteractionType
chunker/
md_chunker.py Heading-based markdown splitter
ast_chunker.py Tree-sitter AST chunker with fallback
languages.py Per-language AST node type mappings
hooks/
pre_tool_use.py Redirects large file reads to search
post_tool_use.py Logs read/edit interactions
Data model
| Table | Purpose |
|---|---|
| projects | Registered projects with path, type, graph config, last indexed commit |
| chunks | Indexed content and code pieces with metadata |
| chunks_vec | sqlite-vec virtual table for vector embeddings (768d) |
| chunks_fts | FTS5 virtual table mirroring chunk content |
| file_index | Per-file MD5 hash tracking for incremental indexing |
| edges | Knowledge graph edges (link, tag, category, semantic, import) |
| chunk_interactions | Read/edit/search_hit tracking for boost and analytics |
| search_log | Query history for gap analysis and pattern tracking |
| git_commits | Parsed git log entries |
| git_file_changes | Per-file diff stats from git history |
Triggers auto-sync FTS5 on chunk insert/update/delete. Vector cleanup triggers on chunk delete.
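The trigger-based sync follows the external-content pattern described in the SQLite FTS5 documentation. The sketch below uses a deliberately reduced chunks table (id and content only), so the schema and trigger names are assumptions rather than the project's real definitions, and it assumes the Python sqlite3 build includes FTS5.

```python
import sqlite3

SCHEMA = """
CREATE TABLE chunks (id INTEGER PRIMARY KEY, content TEXT NOT NULL);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    content, content='chunks', content_rowid='id',
    tokenize='porter unicode61'
);
-- Keep the FTS index in step with the base table
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, content) VALUES (new.id, new.content);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
    INSERT INTO chunks_fts(rowid, content) VALUES (new.id, new.content);
END;
"""


def search(conn: sqlite3.Connection, query: str) -> list[int]:
    """Return matching chunk IDs, best BM25 match first."""
    rows = conn.execute(
        "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ? "
        "ORDER BY bm25(chunks_fts)",
        (query,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO chunks (content) VALUES ('Authentication middleware validates tokens')"
)
```

With the Porter stemmer in the tokenizer chain, a query for "token" matches the stored word "tokens", and the triggers mean application code only ever writes to chunks.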
Environment variables
| Variable | Default | Description |
|---|---|---|
| DNOMIA_KNOWLEDGE_DB | ~/.local/share/dnomia-knowledge/knowledge.db | Database path |
| DNOMIA_KNOWLEDGE_PROJECT | (none) | Default project for MCP search |
Requirements
- Python 3.11+
- macOS or Linux (launchd is macOS only, git hooks work everywhere)
- ~500MB disk for embedding model (downloaded once)
- 8GB RAM minimum (embedding model loads lazily)
Development
pip install -e ".[dev]"
python -m pytest tests/ -v # 276 tests
ruff check src/ tests/ # Linting
License
MIT