mcp-codesearch

MCP server for semantic code search with AST-aware chunking, hybrid vectors, and query syntax.

Prerequisites

  • Python 3.12+
  • Linux or macOS (uses POSIX file locking via vector-core; not compatible with Windows)
  • Qdrant vector database (default: localhost:6333)
  • An OpenAI-compatible embedding API (e.g., llama.cpp, Ollama, or any /v1/embeddings endpoint; default: localhost:8080)

Installation

Requires vector-core.

pip install git+https://github.com/michaelkrauty/vector-core.git@v1.2.11
pip install git+https://github.com/michaelkrauty/mcp-codesearch.git

Or clone both repos and install locally:

git clone https://github.com/michaelkrauty/vector-core.git
git clone https://github.com/michaelkrauty/mcp-codesearch.git
pip install -e vector-core/
pip install -e mcp-codesearch/

Quick Start

# Register with Claude Code:
claude mcp add codesearch -- mcp-codesearch

# Or add to your MCP client config (e.g., claude_desktop_config.json):
# {
#   "mcpServers": {
#     "codesearch": {
#       "command": "mcp-codesearch",
#       "env": {
#         "VECTOR_QDRANT_URL": "http://localhost:6333",
#         "VECTOR_EMBEDDING_URL": "http://localhost:8080",
#         "VECTOR_EMBEDDING_MODEL": "your-model-name",
#         "VECTOR_EMBEDDING_DIM": "768"
#       }
#     }
#   }
# }

Features

  • Hybrid Search: Dense embeddings + sparse TF-IDF with RRF fusion
  • AST-Aware Chunking: Tree-sitter extracts functions, classes, methods with context
  • 18 Languages with AST Support: Python, JS/TS, Go, Rust, Java, C/C++, Ruby, PHP, Swift, Kotlin, Scala, C#, SQL, JSON, YAML, TOML (line-based fallback for Bash, HTML, CSS, and other file types)
  • Query Syntax: function:name, class:name, file:pattern, path:prefix, -path:exclude
  • Incremental Indexing: Change detection via mtime+size before hashing
  • Query Preprocessing: Synonym expansion (fn → function, db → database)
  • Flexible Ignores: Nested .gitignore, .git/info/exclude, and .codesearchignore (gitignore syntax) honored at every directory level
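
To illustrate the RRF fusion step named under Hybrid Search, here is a minimal sketch of Reciprocal Rank Fusion over two ranked result lists. The function name `rrf_fuse`, the chunk IDs, and the constant `k=60` (the value commonly used in the RRF literature) are assumptions for illustration; this README does not state which constant or tie-breaking rule mcp-codesearch uses.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion (illustrative sketch)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical chunk IDs from a dense (embedding) and a sparse (TF-IDF) retriever.
dense = ["auth.py::login", "db.py::connect", "ws.py::reconnect"]
sparse = ["db.py::connect", "util.py::retry", "auth.py::login"]
fused = rrf_fuse([dense, sparse])
print([doc_id for doc_id, _ in fused])
```

Because RRF only looks at ranks, it does not need the dense and sparse scores to be on a comparable scale, which is the usual reason for choosing it over a weighted sum.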

Tools (11 total)

Search (5)

| Tool | Description |
| --- | --- |
| code_search | Main search with auto-indexing |
| search_multiple | Search across multiple codebases |
| search_changed | Search in recently changed files (git-aware) |
| find_similar | Find code similar to a snippet |
| find_references | Find all usages of a symbol |

Index Management (3)

| Tool | Description |
| --- | --- |
| index_status | Check indexing status, file count, pending changes |
| force_reindex | Force complete re-indexing |
| preview_index | Preview what would be indexed |

Collection Management (3)

| Tool | Description |
| --- | --- |
| list_collections | List all indexed codebases |
| delete_collection | Remove index for a codebase |
| cleanup_orphans | Remove orphaned collections |

Query Syntax

# Natural language (semantic search)
code_search("websocket reconnection logic")

# Function search
code_search("function:handleRequest")
code_search("fn:handleRequest")  # alias

# Class search
code_search("class:WebSocketClient")
code_search("cls:WebSocketClient")  # alias

# Path filtering
code_search("auth path:src/services")
code_search("test -path:vendor -path:node_modules")

# Filename filtering (glob, case-insensitive, matches filename only)
# Pushed into the retrieval layer when possible, so a match in the named
# file is found even if it would rank below the candidate pool
code_search("connection pooling file:db.py")
code_search("schema migration file:*.sql")

# Struct search (Rust, C, Go)
code_search("struct:Message")

# Combined
code_search("function:process_data path:src -path:test")

# Exact phrase
code_search('"exact function name"')

Synonym Expansion

Common abbreviations automatically expanded:

  • fn, func → function
  • cls → class
  • db → database
  • ws → websocket
  • auth → authentication, authorization
  • req, res → request, response
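
A minimal sketch of this kind of expansion is shown below. It assumes that the original abbreviation is kept and the expansions are appended after it, and that `req` maps to request and `res` to response; the README does not say whether the server replaces or appends terms, and `SYNONYMS` and `expand_query` are hypothetical names.

```python
# Table mirrors the list above; one-to-one pairing of req/res is an assumption.
SYNONYMS = {
    "fn": ["function"],
    "func": ["function"],
    "cls": ["class"],
    "db": ["database"],
    "ws": ["websocket"],
    "auth": ["authentication", "authorization"],
    "req": ["request"],
    "res": ["response"],
}


def expand_query(query: str) -> str:
    """Append known expansions after each abbreviation, keeping the original word."""
    out = []
    for word in query.split():
        out.append(word)
        out.extend(SYNONYMS.get(word.lower(), []))
    return " ".join(out)


print(expand_query("db auth retry"))
# prints: db database auth authentication authorization retry
```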

Additional Query Syntax

# Alternative function search aliases
code_search("def:processData")
code_search("method:handleRequest")

# Type/struct alias
code_search("type:UserConfig")

# Scope filters (restrict to chunk types)
code_search("error scope:function")    # Only function chunks
code_search("model scope:class")       # Only class chunks
code_search("validate scope:test")     # Only test functions
code_search("handler scope:impl")      # Non-test code only
# scope:method is an alias for scope:function; scope:struct, scope:enum,
# scope:interface, scope:type and scope:module are aliases for scope:class

Search Modes

| Mode | Description |
| --- | --- |
| file | File-level results (overview) |
| chunk | Function/class-level results (detailed) |
| both | Combined ranking (default) |

AST Chunking

Tree-sitter extracts semantic units:

  • Functions (with docstrings)
  • Classes (with methods if small, or overview + separate methods if large)
  • Methods (with parent class context)
  • Modules (imports, top-level statements)
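
The idea of extracting semantic units can be illustrated without tree-sitter. The sketch below swaps in Python's built-in `ast` module as a stand-in, so it only handles Python source and is not the server's implementation; `extract_chunks` and the chunk dictionary keys are hypothetical.

```python
import ast


def extract_chunks(source: str) -> list[dict]:
    """Return one chunk per top-level function, class, and method (Python only)."""
    tree = ast.parse(source)
    chunks = []

    def add(node, kind, parent=None):
        chunks.append({
            "kind": kind,
            "name": node.name,
            "parent": parent,                # enclosing class name for methods
            "start": node.lineno,            # 1-based first line of the definition
            "end": node.end_lineno,          # 1-based last line of the definition
            "docstring": ast.get_docstring(node),
        })

    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            add(node, "function")
        elif isinstance(node, ast.ClassDef):
            add(node, "class")
            for child in node.body:
                if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    add(child, "method", parent=node.name)
    return chunks


SAMPLE = '''
class Client:
    def connect(self):
        """Open the socket."""
        return True

def helper():
    return 1
'''

for chunk in extract_chunks(SAMPLE):
    print(chunk["kind"], chunk["name"], chunk["start"], chunk["end"])
```

Recording the parent class on each method is what lets a method chunk carry its class context, as described in the list above.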

Files without AST support (e.g., Bash, HTML, CSS, Markdown) fall back to line-based chunking.
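
Line-based chunking with overlap can be sketched as a sliding window. The defaults below are taken from CODESEARCH_CHUNK_MAX_LINES (500) and CODESEARCH_CHUNK_OVERLAP_LINES (25) in the Configuration section; the function `chunk_lines` and its exact windowing rule are assumptions, not the server's code.

```python
def chunk_lines(lines: list[str], max_lines: int = 500, overlap: int = 25) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) line ranges that cover `lines`."""
    if overlap >= max_lines:
        raise ValueError("overlap must be smaller than max_lines")
    ranges = []
    start = 0
    while start < len(lines):
        end = min(start + max_lines, len(lines))
        ranges.append((start + 1, end))
        if end == len(lines):
            break
        # Step back so consecutive chunks share `overlap` lines.
        start = end - overlap
    return ranges


print(chunk_lines(["x"] * 1200))
# prints: [(1, 500), (476, 975), (951, 1200)]
```

The overlap means a statement that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding a few lines twice.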

Path Boosting

Search results boosted/demoted by path:

| Pattern | Adjustment |
| --- | --- |
| src/ | +10% |
| lib/, core/ | +8% |
| test/, tests/ | -10% |
| vendor/ | -25% |
| generated/ | -30% |
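
As a rough illustration of how these adjustments might be applied, the sketch below scales a result score by the first matching directory pattern. Treating the percentages as multipliers, matching on whole path segments, and letting demotions take precedence over boosts are all assumptions; the README only gives the table itself. `PATH_ADJUSTMENTS` and `adjust_score` are hypothetical names.

```python
# Ordered so that demotions are checked before boosts (an assumed precedence).
PATH_ADJUSTMENTS = [
    ("generated/", -0.30),
    ("vendor/", -0.25),
    ("test/", -0.10),
    ("tests/", -0.10),
    ("src/", 0.10),
    ("lib/", 0.08),
    ("core/", 0.08),
]


def adjust_score(path: str, score: float) -> float:
    """Scale `score` by the first pattern that matches a directory segment of `path`."""
    segments = "/" + path.strip("/") + "/"
    for pattern, delta in PATH_ADJUSTMENTS:
        if "/" + pattern in segments:
            return score * (1.0 + delta)
    return score


print(round(adjust_score("src/auth/login.py", 1.0), 2))      # 1.1
print(round(adjust_score("src/tests/test_login.py", 1.0), 2))  # 0.9
```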

Git Integration

search_changed searches only files changed since a git revision or time. The changed-file set is applied as a retrieval-layer filter, so results are ranked within the changed files rather than intersected against a bounded whole-codebase candidate pool (change sets over 500 files fall back to post-filtering).

search_changed("auth logic", since="HEAD~5")
search_changed("database", since="main")
search_changed("fix", since="abc123")
search_changed("config", since="3.days.ago")

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| VECTOR_QDRANT_URL | http://localhost:6333 | Qdrant server |
| VECTOR_EMBEDDING_URL | http://localhost:8080 | OpenAI-compatible embeddings API |
| VECTOR_EMBEDDING_MODEL | (required) | Embedding model name (e.g., nomic-embed-text, text-embedding-3-small) |
| VECTOR_EMBEDDING_DIM | (required) | Vector dimension (must match your model, e.g., 768, 1536) |

Changing the embedding model. A codebase's index is tied to the embedding model it was built with. The model name is recorded in each collection's metadata and checked on reuse, so if you switch VECTOR_EMBEDDING_MODEL, the next search or index of that codebase fails fast with a clear error pointing at force_reindex. Without this check, a swap to a model with a different dimension would surface as a cryptic Qdrant dimension error, and a swap to a model with the same dimension would silently return meaningless results from incompatible embedding spaces. Run force_reindex on the affected codebase to rebuild it with the new model; each codebase is reindexed independently.
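
The fail-fast check can be pictured as a comparison between the configured model and what was recorded when the collection was built. The sketch below is only an illustration: the metadata keys `embedding_model` and `embedding_dim`, the exception class, and the function `check_model` are hypothetical, and the real error type and message are not given in this README.

```python
class EmbeddingModelMismatch(RuntimeError):
    """Raised when an index was built with a different embedding model."""


def check_model(collection_meta: dict, model: str, dim: int) -> None:
    """Raise if the collection was built with another model or dimension."""
    stored_model = collection_meta.get("embedding_model")  # hypothetical key
    stored_dim = collection_meta.get("embedding_dim")      # hypothetical key
    if stored_model is None:
        return  # nothing recorded yet: assume this is the first index
    if stored_model != model or stored_dim != dim:
        raise EmbeddingModelMismatch(
            f"index was built with {stored_model!r} (dim {stored_dim}), "
            f"but the configured model is {model!r} (dim {dim}); run force_reindex"
        )


meta = {"embedding_model": "nomic-embed-text", "embedding_dim": 768}
check_model(meta, "nomic-embed-text", 768)  # same model: passes silently
```

Comparing the model name as well as the dimension is what catches the same-dimension swap, which Qdrant itself would accept without complaint.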

Codesearch-specific settings (configured via environment variables with the CODESEARCH_ prefix):

| Variable | Default | Description |
| --- | --- | --- |
| CODESEARCH_CLASS_SPLIT_THRESHOLD | 50 | Lines threshold for splitting large classes |
| CODESEARCH_CHUNK_MIN_LINES | 10 | Merge chunks smaller than this |
| CODESEARCH_CHUNK_MAX_LINES | 500 | Max lines per fallback chunk |
| CODESEARCH_CHUNK_OVERLAP_LINES | 25 | Overlap between fallback chunks |
| CODESEARCH_SEARCH_CACHE_MAX_SIZE | 100 | Max cached search results |
| CODESEARCH_SEARCH_CACHE_TTL_SECONDS | 300 | Search cache TTL (seconds) |
| CODESEARCH_SEARCH_CACHE_EVICTION_RATIO | 0.2 | Fraction of cache to evict when full |
| CODESEARCH_UPSERT_BATCH_TIMEOUT | 300 | Batch operation timeout (seconds) |
| CODESEARCH_UPSERT_CONCURRENCY | 1 | Max concurrent upsert batches |
| CODESEARCH_DELETION_CONCURRENCY | 50 | Concurrent Qdrant operations during incremental indexing |
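
To show how the three search-cache settings could interact, here is a small sketch of a TTL cache that evicts a fraction of its entries when full. The class `SearchCache` is hypothetical, and evicting the oldest entries first is an assumption; the README only states the size, TTL, and eviction ratio.

```python
import time


class SearchCache:
    """TTL cache that drops a fraction of its entries when it reaches max_size."""

    def __init__(self, max_size=100, ttl_seconds=300.0, eviction_ratio=0.2,
                 clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.eviction_ratio = eviction_ratio
        self.clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def __len__(self):
        return len(self._entries)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key, value):
        if key not in self._entries and len(self._entries) >= self.max_size:
            # Evict the oldest fraction in one pass (assumed policy).
            n_evict = max(1, int(self.max_size * self.eviction_ratio))
            oldest = sorted(self._entries, key=lambda k: self._entries[k][0])[:n_evict]
            for old_key in oldest:
                del self._entries[old_key]
        self._entries[key] = (self.clock(), value)


cache = SearchCache()
cache.put("websocket reconnection logic", ["ws.py::reconnect"])
print(cache.get("websocket reconnection logic"))
```

Evicting a batch rather than a single entry means a full cache does not pay the eviction cost on every insert.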

Change Detection

Fast incremental updates:

  1. Check mtime + size (skip unchanged files)
  2. Hash only modified files
  3. Re-index only changed chunks

Avoids full re-embedding on every search.
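
The first two steps can be sketched at file level as follows. This is an illustrative outline, not the server's code: `detect_changes` and the state layout are hypothetical, SHA-256 is an assumed hash, and the third step (re-indexing only the changed chunks) is left out.

```python
import hashlib
import os


def detect_changes(paths, previous):
    """Return (changed_paths, new_state).

    `previous` maps path -> {"mtime": int, "size": int, "sha256": str}.
    """
    changed, state = [], {}
    for path in paths:
        st = os.stat(path)
        old = previous.get(path)
        if old and old["mtime"] == st.st_mtime_ns and old["size"] == st.st_size:
            state[path] = old  # step 1: cheap check passed, skip hashing
            continue
        with open(path, "rb") as fh:  # step 2: hash only the candidates
            digest = hashlib.sha256(fh.read()).hexdigest()
        if old is None or old["sha256"] != digest:
            changed.append(path)  # content really changed (or file is new)
        state[path] = {"mtime": st.st_mtime_ns, "size": st.st_size, "sha256": digest}
    return changed, state
```

A file whose mtime changed but whose hash did not (for example after `touch`) is hashed once and then treated as unchanged, so it is not re-embedded.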

Ignoring files

File discovery honors gitignore-syntax exclude rules at every directory level:

  • .gitignore — nested .gitignore files are respected, matching git semantics (deeper files override shallower, ! negations re-include).
  • .git/info/exclude — repo-local excludes that are not committed to git.
  • .codesearchignore — exclude paths from indexing without changing git's behavior. Same syntax as .gitignore; useful for vendored code, generated files, or large data you want tracked by git but kept out of the index.

Ignored directories are pruned during traversal, so excluded subtrees cost nothing. The global core.excludesFile is intentionally not consulted, so indexing stays reproducible regardless of per-machine git configuration.
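
Pruning during traversal can be sketched with `os.walk`, which skips any directory removed from its `dirnames` list in place. In the sketch below, `walk_files` is a hypothetical helper and `is_ignored` is a caller-supplied predicate standing in for the gitignore-style matcher; the matching rules themselves are not implemented here.

```python
import os
from typing import Callable, Iterator


def walk_files(root: str, is_ignored: Callable[[str, bool], bool]) -> Iterator[str]:
    """Yield non-ignored files under `root` as relative POSIX-style paths.

    `is_ignored(rel_path, is_dir)` decides whether a path is excluded.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        prefix = "" if rel_dir == "." else rel_dir.replace(os.sep, "/") + "/"
        # Assigning in place stops os.walk from descending into pruned directories.
        dirnames[:] = sorted(d for d in dirnames if not is_ignored(prefix + d, True))
        for name in sorted(filenames):
            if not is_ignored(prefix + name, False):
                yield prefix + name
```

Because an ignored directory is dropped before `os.walk` descends into it, nothing beneath it is ever listed or tested against the rules.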

Storage

| Data | Location |
| --- | --- |
| Index | Qdrant collection codesearch_{path_hash} |
| Metadata | Stored in Qdrant point payloads |

Each indexed codebase gets a unique collection based on path hash.
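
A collection name of this shape could be derived as in the sketch below. The hash function (SHA-256), the 16-character truncation, and the path normalization are assumptions made for illustration; the README only specifies the `codesearch_{path_hash}` pattern, and `collection_name` is a hypothetical helper.

```python
import hashlib
import os


def collection_name(codebase_path: str, length: int = 16) -> str:
    """Build a stable collection name from the absolute codebase path."""
    normalized = os.path.abspath(codebase_path)  # also drops a trailing slash
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"codesearch_{digest[:length]}"


print(collection_name("/srv/app"))
```

Hashing the normalized path gives the same collection for `/srv/app` and `/srv/app/`, while two different checkouts of the same project get separate indexes.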

Supported Languages

Full tree-sitter AST support (18 languages): Python, JavaScript, TypeScript, Go, Rust, Java, C, C++, Ruby, PHP, Swift, Kotlin, Scala, C#, SQL, JSON, YAML, TOML

Line-based fallback: Bash, HTML, CSS, and all other file types (Markdown, Vue, Svelte, config files, etc.) are indexed with line-based chunking.

Jupyter notebooks (.ipynb): Notebooks are indexed by their code. Code cells are extracted (markdown, raw, and output cells are skipped) and chunked as Python with full AST support, so a notebook's functions and classes are searchable just like any other source file. Code-less or unparseable notebooks are skipped.

Dependencies

Requires vector-core components:

  • EmbeddingClient, GlobalVocabulary (embeddings)
  • QdrantStorage, HybridSearcher (storage)

External libraries:

  • tree-sitter-language-pack (AST parsing)
  • pathspec (.gitignore / .codesearchignore support)
