
Python Codebase Analysis RAG System
This system analyzes Python code using Abstract Syntax Trees (AST), stores the extracted information (functions, classes, calls, variables, etc.) in a Weaviate vector database, and provides tools for querying and understanding the codebase via a Model Context Protocol (MCP) server. It leverages Google's Gemini models for generating embeddings and natural language descriptions/answers.
Features
- Code Scanning: Parses Python files to identify code elements (functions, classes, imports, calls, assignments) and their relationships. Extracts:
  - Basic info: Name, type, file path, line numbers, code snippet, docstring.
  - Function/Method details: Parameters, return type, signature, decorators.
  - Scope info: Parent scope (class/function) UUID, readable ID (e.g., `file:type:name:line`), base class names.
  - Usage info: Attribute accesses within scopes, call relationships (partially tracked).
- Vector Storage: Uses Weaviate to store code elements and their vector embeddings (when LLM generation is enabled).
- LLM Enrichment (Optional & Background): Generates semantic descriptions and embeddings for functions and classes using Gemini. This runs as background tasks triggered after scanning or manually, and can be enabled/disabled via the `.env` file.
- Automatic Refinement (Optional & Background): When LLM generation is enabled, automatically refines descriptions for new/updated functions using context (callers, callees, siblings, related variables) as part of the background processing.
- RAG Q&A: Answers natural language questions about the codebase using Retrieval-Augmented Generation (requires LLM features to be enabled and background processing to have completed).
- User Clarifications: Allows users to add manual notes to specific code elements.
- Visualization: Generates MermaidJS call graphs based on stored relationships.
- MCP Server: Exposes analysis and querying capabilities through MCP tools, managing codebases and an active codebase context.
- File Watcher (Integrated): Automatically starts when a codebase is scanned (`scan_codebase`) and stops when another codebase is selected (`select_codebase`) or deleted (`delete_codebase`). Triggers re-analysis and database updates for the active codebase when its files change. Can also be controlled manually via the `start_watcher` and `stop_watcher` tools (an illustrative polling sketch follows this list).
- Codebase Dependencies: Allows defining dependencies between scanned codebases (`add_codebase_dependency`, `remove_codebase_dependency`).
- Cross-Codebase Querying: Enables searching (`find_element`) and asking questions (`ask_question`) across the active codebase and its declared dependencies.
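The actual watcher is integrated into the server, but its behaviour can be pictured as a simple polling loop. The sketch below is illustrative only, assuming mtime-based polling driven by the `WATCHER_POLLING_INTERVAL` setting; `rescan` is a hypothetical stand-in for the re-analysis and database-update step.

```python
import os
import time
from pathlib import Path

def watch_codebase(root: str, rescan, interval: float = 5.0):
    """Illustrative polling watcher: call `rescan` for files whose mtime changed.

    `rescan` is a hypothetical callback standing in for the server's
    re-analysis + Weaviate update step; `interval` mirrors the
    WATCHER_POLLING_INTERVAL setting from .env.
    """
    mtimes: dict[Path, float] = {}
    while True:
        for path in Path(root).rglob("*.py"):
            mtime = os.path.getmtime(path)
            if mtimes.get(path) != mtime:
                mtimes[path] = mtime
                rescan(path)  # re-parse this file and update the database
        time.sleep(interval)
```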
Setup
1. Environment: Ensure Python 3.10+ and Docker are installed.
2. Weaviate: Start the Weaviate instance using Docker Compose:
   ```
   docker-compose up -d
   ```
3. Dependencies: Install Python packages:
   ```
   pip install -r requirements.txt
   ```
4. API Key & Configuration: Create a `.env` file in the project root and add your Gemini API key. You can also configure other settings:
   ```
   # --- Required ---
   GEMINI_API_KEY=YOUR_API_KEY_HERE

   # --- Optional ---
   # Set to true to enable background LLM description generation and refinement
   GENERATE_LLM_DESCRIPTIONS=true
   # Max concurrent background LLM tasks (embeddings/descriptions/refinements)
   LLM_CONCURRENCY=5
   # ANALYZE_ON_STARTUP is no longer used. Scanning is done via the scan_codebase tool.
   # Specify Weaviate connection details if not using defaults
   # WEAVIATE_HOST=localhost
   # WEAVIATE_PORT=8080
   # WEAVIATE_GRPC_PORT=50051
   # Specify alternative Gemini models if desired
   # GENERATION_MODEL_NAME="models/gemini-pro"
   # EMBEDDING_MODEL_NAME="models/embedding-001"
   # Adjust Weaviate batch size
   # WEAVIATE_BATCH_SIZE=100
   # SEMANTIC_SEARCH_LIMIT=5
   # SEMANTIC_SEARCH_DISTANCE=0.7
   # Watcher polling interval (seconds)
   # WATCHER_POLLING_INTERVAL=5
   ```
5. Run MCP Server: Start the server in a separate terminal (it must stay running for the tools to be available; a client sketch follows these steps):
   ```
   python src/code_analysis_mcp/mcp_server.py
   ```
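For a quick smoke test of the running tools, something like the following can be used from Python. This is a minimal sketch assuming the official `mcp` Python SDK and a stdio transport (which launches the server as a subprocess); the tool name `scan_codebase` comes from the feature list above, while the argument names shown are assumptions and should be checked against the schemas returned by `list_tools()`.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server over stdio (assumes stdio transport is supported).
    params = StdioServerParameters(
        command="python", args=["src/code_analysis_mcp/mcp_server.py"]
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])
            # Argument names below are illustrative; consult the tool schema
            # from list_tools() for the actual parameters.
            await session.call_tool(
                "scan_codebase",
                arguments={"codebase_name": "my_project", "directory": "./my_project"},
            )

asyncio.run(main())
```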
Architecture Overview
This system analyzes Python code, stores the extracted information in a Weaviate vector database, and provides tools for querying and understanding the codebase via a Model Context Protocol (MCP) server. It leverages Google's Gemini models for generating embeddings and natural language descriptions/answers.
The main modules are:
- `code_scanner.py`: Finds Python files, parses them using AST, extracts structural elements (functions, classes, imports, calls, etc.), and prepares data for Weaviate (see the AST sketch after this list).
- `weaviate_client.py`: Manages the connection to Weaviate, defines the data schema (`CodeFile`, `CodeElement`, `CodebaseRegistry`), and provides functions for batch uploading, querying, updating, and deleting data.
- `rag.py`: Implements Retrieval-Augmented Generation (RAG) for answering questions about the codebase. It uses semantic search to find relevant code elements and an LLM to synthesize an answer.
- `mcp_server.py`: Sets up the FastMCP server, manages codebases in a `CodebaseRegistry` collection, handles the active codebase context (`ACTIVE_CODEBASE_NAME`), integrates file watching logic (including automatic start/stop), manages codebase dependencies, and exposes analysis functionalities as MCP tools with detailed argument descriptions.
- `visualization.py`: Generates MermaidJS call graphs based on stored relationships.
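The core idea behind `code_scanner.py` can be pictured with the standard-library `ast` module. The snippet below is a simplified sketch; the field names are illustrative, not the exact properties stored in Weaviate.

```python
import ast
from pathlib import Path

def scan_file(path: str) -> list[dict]:
    """Extract one simplified record per function/class, similar in spirit
    to what code_scanner.py produces (field names here are illustrative)."""
    source = Path(path).read_text(encoding="utf-8")
    tree = ast.parse(source, filename=path)
    elements = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            elements.append({
                "name": node.name,
                "element_type": type(node).__name__,
                "file_path": path,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "docstring": ast.get_docstring(node),
                # Readable ID in the file:type:name:line format described above.
                "readable_id": f"{path}:{type(node).__name__}:{node.name}:{node.lineno}",
            })
    return elements
```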
The system uses Weaviate's multi-tenancy feature for the `CodeFile` and `CodeElement` collections, where the tenant ID is the user-defined `codebase_name`. A separate, non-multi-tenant `CodebaseRegistry` collection tracks codebase metadata (name, directory, status, summary, watcher status, dependencies). The `ACTIVE_CODEBASE_NAME` global variable in the server determines the primary codebase tenant for queries. Query tools (`find_element`, `ask_question`) can optionally search across the active codebase and its declared dependencies stored in the registry. The `list_codebases` tool can be used to view the status and dependencies of all codebases.
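In terms of the Weaviate Python client (v4), a tenant-scoped lookup against `CodeElement` looks roughly like the sketch below. The connection defaults mirror the `.env` settings above; the property names (`name`, `file_path`) are assumptions about the schema, not guaranteed field names.

```python
import weaviate

# Connection defaults mirror WEAVIATE_HOST / WEAVIATE_PORT / WEAVIATE_GRPC_PORT.
client = weaviate.connect_to_local(host="localhost", port=8080, grpc_port=50051)
try:
    elements = client.collections.get("CodeElement")
    # Each scanned codebase is a tenant; the tenant ID is the codebase_name.
    tenant = elements.with_tenant("my_project")
    # Keyword (BM25) search; property names are illustrative.
    result = tenant.query.bm25(query="parse file", limit=5)
    for obj in result.objects:
        print(obj.properties.get("name"), obj.properties.get("file_path"))
finally:
    client.close()
```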
Background LLM processing is used to generate semantic descriptions and embeddings for code elements. This is an optional feature that can be enabled or disabled via the `.env` file.
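Conceptually, the RAG step in `rag.py` combines a Gemini embedding of the question, a vector search over the enriched elements, and an answer synthesized by the generation model. The sketch below shows that flow using the `google-generativeai` package and the default model names from the configuration; the prompt wording and property names are assumptions, and `tenant` is a tenant-scoped `CodeElement` collection as in the previous sketch.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY_HERE")  # from GEMINI_API_KEY

def answer_question(question: str, tenant) -> str:
    """Illustrative RAG flow: embed the question, retrieve nearby code
    elements from the tenant's CodeElement collection, then ask Gemini."""
    embedding = genai.embed_content(
        model="models/embedding-001",       # EMBEDDING_MODEL_NAME default
        content=question,
        task_type="retrieval_query",
    )["embedding"]
    hits = tenant.query.near_vector(near_vector=embedding, limit=5)
    context = "\n\n".join(
        str(obj.properties.get("description") or obj.properties.get("code_snippet"))
        for obj in hits.objects
    )
    prompt = f"Answer using only this code context:\n{context}\n\nQuestion: {question}"
    model = genai.GenerativeModel("models/gemini-pro")  # GENERATION_MODEL_NAME default
    return model.generate_content(prompt).text
```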
Detailed information on the available tools and their arguments can be retrieved directly from the MCP server using standard MCP introspection methods once the server is running.