mcp-codesearch
MCP server for semantic code search with AST-aware chunking, hybrid vectors, and query syntax.
Prerequisites
- Python 3.12+
- Linux or macOS (uses POSIX file locking via vector-core; not compatible with Windows)
- Qdrant vector database (default: `localhost:6333`)
- An OpenAI-compatible embedding API (e.g., llama.cpp, Ollama, or any `/v1/embeddings` endpoint; default: `localhost:8080`)
Installation
Requires vector-core.
pip install git+https://github.com/michaelkrauty/vector-core.git@v1.2.11
pip install git+https://github.com/michaelkrauty/mcp-codesearch.git
Or clone both repos and install locally:
git clone https://github.com/michaelkrauty/vector-core.git
git clone https://github.com/michaelkrauty/mcp-codesearch.git
pip install -e vector-core/
pip install -e mcp-codesearch/
Quick Start
# Register with Claude Code:
claude mcp add codesearch -- mcp-codesearch
# Or add to your MCP client config (e.g., claude_desktop_config.json):
# {
# "mcpServers": {
# "codesearch": {
# "command": "mcp-codesearch",
# "env": {
# "VECTOR_QDRANT_URL": "http://localhost:6333",
# "VECTOR_EMBEDDING_URL": "http://localhost:8080",
# "VECTOR_EMBEDDING_MODEL": "your-model-name",
# "VECTOR_EMBEDDING_DIM": "768"
# }
# }
# }
# }
Features
- Hybrid Search: Dense embeddings + sparse TF-IDF with RRF fusion
- AST-Aware Chunking: Tree-sitter extracts functions, classes, methods with context
- 18 Languages with AST Support: Python, JS/TS, Go, Rust, Java, C/C++, Ruby, PHP, Swift, Kotlin, Scala, C#, SQL, JSON, YAML, TOML (line-based fallback for Bash, HTML, CSS, and other file types)
- Query Syntax: `function:name`, `class:name`, `file:pattern`, `path:prefix`, `-path:exclude`
- Incremental Indexing: Change detection via mtime+size before hashing
- Query Preprocessing: Synonym expansion (`fn` → `function`, `db` → `database`)
- Flexible Ignores: Nested `.gitignore`, `.git/info/exclude`, and `.codesearchignore` (gitignore syntax) honored at every directory level
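To make the RRF fusion mentioned under Hybrid Search concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. The function name `rrf_fuse`, the chunk IDs, and the constant `k = 60` (the value commonly used in the RRF literature) are assumptions for illustration, not taken from the mcp-codesearch source.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) to an item's score, so items that
    rank well in more than one list rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; break ties by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical chunk IDs from a dense (embedding) and a sparse (TF-IDF) search.
dense = ["a.py::foo", "b.py::bar", "c.py::baz"]
sparse = ["b.py::bar", "d.py::qux", "a.py::foo"]
fused = rrf_fuse([dense, sparse])
# b.py::bar ranks first: it sits near the top of both lists.
```

RRF only needs ranks, not raw scores, which is why it is a common way to combine dense and sparse results whose scores are not on the same scale.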
Tools (11 total)
Search (5)
| Tool | Description |
|---|---|
| `code_search` | Main search with auto-indexing |
| `search_multiple` | Search across multiple codebases |
| `search_changed` | Search in recently changed files (git-aware) |
| `find_similar` | Find code similar to a snippet |
| `find_references` | Find all usages of a symbol |
Index Management (3)
| Tool | Description |
|---|---|
| `index_status` | Check indexing status, file count, pending changes |
| `force_reindex` | Force complete re-indexing |
| `preview_index` | Preview what would be indexed |
Collection Management (3)
| Tool | Description |
|---|---|
| `list_collections` | List all indexed codebases |
| `delete_collection` | Remove index for a codebase |
| `cleanup_orphans` | Remove orphaned collections |
Query Syntax
# Natural language (semantic search)
code_search("websocket reconnection logic")
# Function search
code_search("function:handleRequest")
code_search("fn:handleRequest") # alias
# Class search
code_search("class:WebSocketClient")
code_search("cls:WebSocketClient") # alias
# Path filtering
code_search("auth path:src/services")
code_search("test -path:vendor -path:node_modules")
# Filename filtering (glob, case-insensitive, matches filename only)
# Pushed into the retrieval layer when possible, so a match in the named
# file is found even if it would rank below the candidate pool
code_search("connection pooling file:db.py")
code_search("schema migration file:*.sql")
# Struct search (Rust, C, Go)
code_search("struct:Message")
# Combined
code_search("function:process_data path:src -path:test")
# Exact phrase
code_search('"exact function name"')
Synonym Expansion
Common abbreviations automatically expanded:
- `fn`, `func` → `function`
- `cls` → `class`
- `db` → `database`
- `ws` → `websocket`
- `auth` → `authentication`, `authorization`
- `req`, `res` → `request`, `response`
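A minimal sketch of how this kind of expansion could work. The `SYNONYMS` table mirrors the list above; reading `req`/`res` as mapping to `request`/`response` respectively, keeping the original term alongside its expansions, and the name `expand_query` are all assumptions made for illustration.

```python
# Abbreviation -> expansions, mirroring the README list (mapping shape is assumed).
SYNONYMS: dict[str, list[str]] = {
    "fn": ["function"],
    "func": ["function"],
    "cls": ["class"],
    "db": ["database"],
    "ws": ["websocket"],
    "auth": ["authentication", "authorization"],
    "req": ["request"],
    "res": ["response"],
}


def expand_query(query: str) -> str:
    """Append known expansions after each abbreviation, keeping the original term."""
    out: list[str] = []
    for term in query.split():
        out.append(term)
        for synonym in SYNONYMS.get(term.lower(), []):
            if synonym not in out:  # avoid repeating an expansion
                out.append(synonym)
    return " ".join(out)


expanded = expand_query("db auth retry")
# "db database auth authentication authorization retry"
```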
Additional Query Syntax
# Alternative function search aliases
code_search("def:processData")
code_search("method:handleRequest")
# Type/struct alias
code_search("type:UserConfig")
# Scope filters (restrict to chunk types)
code_search("error scope:function") # Only function chunks
code_search("model scope:class") # Only class chunks
code_search("validate scope:test") # Only test functions
code_search("handler scope:impl") # Non-test code only
# scope:method is an alias for scope:function; scope:struct, scope:enum,
# scope:interface, scope:type and scope:module are aliases for scope:class
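The operators above can be separated from free-text terms with a small tokenizer. The following is a sketch only: `ParsedQuery`, `parse_query`, and the alias table are hypothetical names, and treating `type:` as an alias of `struct:` is an inference from the "Type/struct alias" comment rather than confirmed behavior.

```python
import re
from dataclasses import dataclass, field

# Alias -> canonical operator (assumed from the examples above).
ALIASES = {"fn": "function", "def": "function", "method": "function",
           "cls": "class", "type": "struct"}
FILTER_KEYS = {"function", "class", "struct", "file", "path", "scope"}

# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN_RE = re.compile(r'"([^"]*)"|(\S+)')


@dataclass
class ParsedQuery:
    text: list[str] = field(default_factory=list)
    phrases: list[str] = field(default_factory=list)
    filters: dict[str, list[str]] = field(default_factory=dict)
    exclude_paths: list[str] = field(default_factory=list)


def parse_query(query: str) -> ParsedQuery:
    parsed = ParsedQuery()
    for phrase, word in TOKEN_RE.findall(query):
        if not word:
            if phrase:  # quoted exact phrase
                parsed.phrases.append(phrase)
            continue
        if word.startswith("-path:") and len(word) > len("-path:"):
            parsed.exclude_paths.append(word[len("-path:"):])
            continue
        key, sep, value = word.partition(":")
        key = ALIASES.get(key, key)
        if sep and value and key in FILTER_KEYS:
            parsed.filters.setdefault(key, []).append(value)
        else:
            parsed.text.append(word)  # free text for semantic search
    return parsed


q = parse_query('fn:handleRequest auth path:src -path:vendor "exact name"')
```

Here `q.filters` holds the structured operators, `q.exclude_paths` the negated paths, and `q.text` the remaining natural-language terms.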
Search Modes
| Mode | Description |
|---|---|
| `file` | File-level results (overview) |
| `chunk` | Function/class-level results (detailed) |
| `both` | Combined ranking (default) |
AST Chunking
Tree-sitter extracts semantic units:
- Functions (with docstrings)
- Classes (with methods if small, or overview + separate methods if large)
- Methods (with parent class context)
- Modules (imports, top-level statements)
Files without AST support (e.g., Markdown) fall back to line-based chunking.
Path Boosting
Search results boosted/demoted by path:
| Pattern | Adjustment |
|---|---|
| `src/` | +10% |
| `lib/`, `core/` | +8% |
| `test/`, `tests/` | -10% |
| `vendor/` | -25% |
| `generated/` | -30% |
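A sketch of how such adjustments might be applied to a relevance score. The percentages come from the table; applying them multiplicatively, matching on any directory component, and letting the largest-magnitude adjustment win when several match are assumptions, as is the name `adjust_score`.

```python
from pathlib import PurePosixPath

# Directory name -> relative adjustment, taken from the table above.
PATH_ADJUSTMENTS = {
    "src": 0.10,
    "lib": 0.08,
    "core": 0.08,
    "test": -0.10,
    "tests": -0.10,
    "vendor": -0.25,
    "generated": -0.30,
}


def adjust_score(score: float, path: str) -> float:
    """Scale a score by the adjustment for the file's directory components."""
    directories = PurePosixPath(path).parts[:-1]  # drop the filename
    deltas = [PATH_ADJUSTMENTS[d] for d in directories if d in PATH_ADJUSTMENTS]
    if not deltas:
        return score
    # Assumption: with several matches, the strongest adjustment applies.
    delta = max(deltas, key=abs)
    return score * (1.0 + delta)
```

Under these assumptions a hit in `src/vendor/x.py` is demoted by 25% rather than boosted, because the `vendor/` penalty outweighs the `src/` boost.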
Git Integration
search_changed searches only files changed since a git revision or time. The
changed-file set is applied as a retrieval-layer filter, so results are ranked
within the changed files rather than intersected against a bounded whole-codebase
candidate pool (change sets over 500 files fall back to post-filtering).
search_changed("auth logic", since="HEAD~5")
search_changed("database", since="main")
search_changed("fix", since="abc123")
search_changed("config", since="3.days.ago")
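For revision-style `since` values, the changed-file set could be gathered with `git diff --name-only`. This is a sketch under assumptions: the helper names are hypothetical, time expressions such as `3.days.ago` are not handled here, and only the 500-file threshold comes from the description above.

```python
import subprocess
from pathlib import Path

MAX_FILTER_FILES = 500  # larger change sets fall back to post-filtering


def parse_name_only(output: str) -> list[str]:
    """Turn `git diff --name-only` output into a list of relative paths."""
    return [line for line in output.splitlines() if line.strip()]


def changed_files(repo: Path, since: str) -> list[str]:
    """List files that differ between the working tree and revision `since`."""
    result = subprocess.run(
        ["git", "-C", str(repo), "diff", "--name-only", since, "--"],
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. for an unknown revision
    )
    return parse_name_only(result.stdout)


def filter_strategy(files: list[str]) -> str:
    """Choose between filtering at retrieval time and filtering afterwards."""
    return "retrieval-filter" if len(files) <= MAX_FILTER_FILES else "post-filter"
```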
Configuration
| Variable | Default | Description |
|---|---|---|
| `VECTOR_QDRANT_URL` | `http://localhost:6333` | Qdrant server |
| `VECTOR_EMBEDDING_URL` | `http://localhost:8080` | OpenAI-compatible embeddings API |
| `VECTOR_EMBEDDING_MODEL` | (required) | Embedding model name (e.g., nomic-embed-text, text-embedding-3-small) |
| `VECTOR_EMBEDDING_DIM` | (required) | Vector dimension (must match your model, e.g., 768, 1536) |
Changing the embedding model. A codebase's index is tied to the embedding model it was built with. If you switch `VECTOR_EMBEDDING_MODEL`, the next search or index of that codebase fails fast with a clear error pointing at `force_reindex`. This replaces two worse failure modes: a cryptic Qdrant dimension error when the new model has a different dimension, and silently meaningless results from incompatible embedding spaces when the dimension is the same. The same-dimension case is caught because the model name is recorded in each collection's metadata and checked on reuse. Run `force_reindex` on the affected codebase to rebuild it with the new model; each codebase is reindexed independently.
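The fail-fast check can be pictured as a comparison between the metadata stored with a collection and the current configuration. The sketch below is illustrative only: `CollectionMeta`, `EmbeddingMismatchError`, and `check_embedding_compat` are hypothetical names, and the actual metadata layout in Qdrant is not shown in this README.

```python
from dataclasses import dataclass


class EmbeddingMismatchError(RuntimeError):
    """Raised when an index was built with a different embedding model."""


@dataclass(frozen=True)
class CollectionMeta:
    embedding_model: str
    embedding_dim: int


def check_embedding_compat(stored: CollectionMeta, model: str, dim: int) -> None:
    """Raise before any search runs if the configured model differs."""
    # Comparing the model name catches same-dimension swaps that a
    # dimension check alone would miss.
    if stored.embedding_model != model or stored.embedding_dim != dim:
        raise EmbeddingMismatchError(
            f"Index was built with {stored.embedding_model!r} "
            f"({stored.embedding_dim} dims) but the current config is "
            f"{model!r} ({dim} dims); run force_reindex to rebuild it."
        )
```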
Codesearch-specific settings (configured via environment variables with the CODESEARCH_ prefix):
| Variable | Default | Description |
|---|---|---|
| `CODESEARCH_CLASS_SPLIT_THRESHOLD` | 50 | Lines threshold for splitting large classes |
| `CODESEARCH_CHUNK_MIN_LINES` | 10 | Merge chunks smaller than this |
| `CODESEARCH_CHUNK_MAX_LINES` | 500 | Max lines per fallback chunk |
| `CODESEARCH_CHUNK_OVERLAP_LINES` | 25 | Overlap between fallback chunks |
| `CODESEARCH_SEARCH_CACHE_MAX_SIZE` | 100 | Max cached search results |
| `CODESEARCH_SEARCH_CACHE_TTL_SECONDS` | 300 | Search cache TTL (seconds) |
| `CODESEARCH_SEARCH_CACHE_EVICTION_RATIO` | 0.2 | Fraction of cache to evict when full |
| `CODESEARCH_UPSERT_BATCH_TIMEOUT` | 300 | Batch operation timeout (seconds) |
| `CODESEARCH_UPSERT_CONCURRENCY` | 1 | Max concurrent upsert batches |
| `CODESEARCH_DELETION_CONCURRENCY` | 50 | Concurrent Qdrant operations during incremental indexing |
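To show how the two fallback chunking settings interact, here is a sketch of line-based chunking with overlap. The defaults match `CODESEARCH_CHUNK_MAX_LINES` and `CODESEARCH_CHUNK_OVERLAP_LINES`; the windowing logic and the name `chunk_lines` are assumptions, not the project's implementation.

```python
def chunk_lines(
    lines: list[str], max_lines: int = 500, overlap: int = 25
) -> list[tuple[int, int]]:
    """Return 1-indexed, inclusive (start, end) line ranges for each chunk.

    Consecutive chunks share `overlap` lines so that code spanning a chunk
    boundary appears whole in at least one neighbouring chunk's context.
    """
    if overlap < 0 or max_lines <= overlap:
        raise ValueError("max_lines must be greater than a non-negative overlap")
    ranges: list[tuple[int, int]] = []
    start = 0
    while start < len(lines):
        end = min(start + max_lines, len(lines))
        ranges.append((start + 1, end))
        if end == len(lines):
            break
        start = end - overlap  # step back so the next chunk overlaps this one
    return ranges


ranges = chunk_lines(["x"] * 1000)
# [(1, 500), (476, 975), (951, 1000)]
```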
Change Detection
Fast incremental updates:
- Check mtime + size (skip unchanged files)
- Hash only modified files
- Re-index only changed chunks
Avoids full re-embedding on every search.
Ignoring files
File discovery honors gitignore-syntax exclude rules at every directory level:
- `.gitignore`: nested `.gitignore` files are respected, matching git semantics (deeper files override shallower, `!` negations re-include).
- `.git/info/exclude`: repo-local excludes that are not committed to git.
- `.codesearchignore`: excludes paths from indexing without changing git's behavior. Same syntax as `.gitignore`; useful for vendored code, generated files, or large data you want tracked by git but kept out of the index.
Ignored directories are pruned during traversal, so excluded subtrees cost nothing. The global core.excludesFile is intentionally not consulted, so indexing stays reproducible regardless of per-machine git configuration.
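Pruning during traversal can be illustrated with `os.walk`, which skips any directory removed from its `dirnames` list in place. This sketch uses `fnmatch` on directory names as a deliberately simplified stand-in: it does not implement gitignore semantics (no nesting, no `!` negation), which the project handles through pathspec. The name `walk_pruned` is hypothetical.

```python
import fnmatch
import os
from pathlib import Path


def walk_pruned(root: Path, ignored_dirs: list[str]) -> list[str]:
    """List files under root, never descending into ignored directories."""
    found: list[str] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk not to enter pruned
        # directories, so their contents are never listed at all.
        dirnames[:] = sorted(
            d for d in dirnames
            if not any(fnmatch.fnmatch(d, pattern) for pattern in ignored_dirs)
        )
        for name in sorted(filenames):
            found.append(Path(dirpath, name).relative_to(root).as_posix())
    return found
```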
Storage
| Data | Location |
|---|---|
| Index | Qdrant collection codesearch_{path_hash} |
| Metadata | Stored in Qdrant point payloads |
Each indexed codebase gets a unique collection based on path hash.
Supported Languages
Full tree-sitter AST support (18 languages): Python, JavaScript, TypeScript, Go, Rust, Java, C, C++, Ruby, PHP, Swift, Kotlin, Scala, C#, SQL, JSON, YAML, TOML
Line-based fallback: Bash, HTML, CSS, and all other file types (Markdown, Vue, Svelte, config files, etc.) are indexed with line-based chunking.
Jupyter notebooks (.ipynb): Notebooks are indexed by their code. Code cells are extracted (markdown, raw, and output cells are skipped) and chunked as Python with full AST support, so a notebook's functions and classes are searchable just like any other source file. Code-less or unparseable notebooks are skipped.
Dependencies
Requires vector-core components:
- EmbeddingClient, GlobalVocabulary (embeddings)
- QdrantStorage, HybridSearcher (storage)
External libraries:
- tree-sitter-language-pack (AST parsing)
- pathspec (.gitignore / .codesearchignore support)