🐳 Crawl4AI+SearXNG MCP Server

Web Crawling, Search and RAG Capabilities for AI Agents and AI Coding Assistants

CI/CD Pipeline Python 3.12+ Docker License: MIT

Forked from https://github.com/coleam00/mcp-crawl4ai-rag, with added SearXNG integration and batch scraping and processing capabilities.

A self-contained Docker solution that combines the Model Context Protocol (MCP), Crawl4AI, SearXNG, and Supabase to provide AI agents and coding assistants with complete web search, crawling, and RAG capabilities.

🚀 Complete Stack in One Command: Deploy everything with make prod - no Python setup, no dependencies, no external services required.

🎯 Smart RAG vs Traditional Scraping

Unlike traditional scraping (such as Firecrawl) that dumps raw content and overwhelms LLM context windows, this solution uses intelligent RAG (Retrieval Augmented Generation) to:

  • 🔍 Extract only relevant content using semantic similarity search
  • ⚡ Prevent context overflow by returning focused, pertinent information
  • 🧠 Enhance AI responses with precisely targeted knowledge
  • 📊 Maintain context efficiency for better LLM performance

Flexible Output Options:

  • RAG Mode (default): Returns semantically relevant chunks with similarity scores
  • Raw Markdown Mode: Full content extraction when complete context is needed
  • Hybrid Search: Combines semantic and keyword search for comprehensive results
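The README does not specify how semantic and keyword rankings are merged, but a common approach for this kind of hybrid search is Reciprocal Rank Fusion (RRF). As an illustrative sketch (the function and document names here are hypothetical, not the server's internals):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists into one, RRF-style.

    rankings: list of lists of document ids, best-first.
    k: damping constant; 60 is the conventional choice.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list get the largest boost.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Semantic and keyword search return different orderings of the corpus;
# a document ranked well by both (doc_b) wins the fused ranking.
semantic = ["doc_a", "doc_b", "doc_c"]
keyword = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([semantic, keyword])
```

Documents that appear high in both rankings outrank documents that only one retriever liked, which is why hybrid search tends to be more robust than either mode alone.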

💡 Key Benefits

  • 🔧 Zero Configuration: Pre-configured SearXNG instance included
  • 🐳 Docker-Only: No Python environment setup required
  • 🔍 Integrated Search: Built-in SearXNG for private, fast search
  • ⚡ Production Ready: HTTPS, security, and monitoring included
  • 🎯 AI-Optimized: RAG strategies built for coding assistants

🗺️ Project Roadmap

📍 Current Focus: Agentic Search (Highest Priority)

We are implementing an intelligent, iterative search system that combines local knowledge, web search, and LLM-driven decision making to provide comprehensive answers while minimizing costs.

Why this matters:

  • 🚀 Unique value proposition - no other MCP server offers this
  • 💰 50-70% cost reduction through selective crawling
  • 🎯 High-quality, complete answers without manual iteration
  • 🏆 Positions this as the most advanced RAG-MCP solution

📖 Full Roadmap: See docs/PROJECT_ROADMAP.md - the single source of truth for all development priorities.

📐 Architecture: See docs/AGENTIC_SEARCH_ARCHITECTURE.md for technical details.


Overview

This Docker-based MCP server provides a complete web intelligence stack that enables AI agents to:

  • Search the web using the integrated SearXNG instance
  • Crawl and scrape websites with advanced content extraction
  • Store content in vector databases with intelligent chunking
  • Perform RAG queries with multiple enhancement strategies

Advanced RAG Strategies Available:

  • Contextual Embeddings for enriched semantic understanding
  • Hybrid Search combining vector and keyword search
  • Agentic RAG for specialized code example extraction
  • Reranking for improved result relevance using cross-encoder models
  • Knowledge Graph for AI hallucination detection and repository code analysis

See the Configuration section below for details on how to enable and configure these strategies.

Features

  • Contextual Embeddings: Enhanced RAG with LLM-generated context for each chunk, improving search accuracy by 20-30% (Learn more)
  • Smart URL Detection: Automatically detects and handles different URL types (regular webpages, sitemaps, text files)
  • Recursive Crawling: Follows internal links to discover content
  • Parallel Processing: Efficiently crawls multiple pages simultaneously
  • Content Chunking: Intelligently splits content by headers and size for better processing
  • Vector Search: Performs RAG over crawled content, optionally filtering by data source for precision
  • Source Retrieval: List the sources (domains) already stored, so RAG queries can be filtered with precision
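The header-then-size chunking mentioned above can be sketched in a few lines. This is a simplified illustration, assuming a two-pass split; the server's actual splitter also handles code blocks and paragraph boundaries:

```python
def chunk_markdown(text, max_chars=1000):
    """Split markdown into chunks: first on headers, then by size."""
    # Pass 1: start a new section at each header line.
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    # Pass 2: enforce a hard size ceiling on each section.
    chunks = []
    for section in sections:
        for i in range(0, len(section), max_chars):
            chunks.append(section[i:i + max_chars])
    return chunks

# A short section stays whole; an oversized one is sliced to fit.
doc = "# Intro\nshort text\n# Details\n" + "x" * 2500
chunks = chunk_markdown(doc, max_chars=1000)
```

Splitting on headers first keeps each chunk topically coherent, which is what makes the later similarity search return focused results.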

Tools

The server provides essential web crawling and search tools:

Core Tools (Always Available)

  1. scrape_urls: Scrape one or more URLs and store their content in the vector database. Supports both single URLs and lists of URLs for batch processing.
  2. smart_crawl_url: Intelligently crawl a full website based on the type of URL provided (sitemap, llms-full.txt, or a regular webpage that needs to be crawled recursively)
  3. get_available_sources: Get a list of all available sources (domains) in the database
  4. perform_rag_query: Search for relevant content using semantic search with optional source filtering
  5. search (NEW): Comprehensive web search tool that combines SearXNG search, automated scraping, and RAG processing in a single workflow: (1) query SearXNG, (2) extract URLs from the search results, (3) scrape every found URL with the existing scraping infrastructure, (4) store the content in the vector database, and (5) return either RAG-processed results organized by URL or raw markdown content. Key parameters: query (search terms), return_raw_markdown (bypass RAG and return full content), num_results (search result limit), batch_size (database write batching), max_concurrent (parallel scraping sessions). Ideal for research workflows, competitive analysis, and content discovery.
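The search tool's five-step workflow can be sketched end to end with stubbed backends. Everything below is a stand-in (searxng_search, scrape_url, and friends are placeholders, not the server's real internals), but the control flow matches the steps described above:

```python
def searxng_search(query, num_results):
    # Stub: a real call would hit the SearXNG search API.
    return [{"url": f"https://example.com/{i}"} for i in range(num_results)]

def scrape_url(url):
    # Stub: Crawl4AI would return extracted markdown here.
    return f"markdown content of {url}"

def store_in_vector_db(url, markdown):
    # Stub: chunks would be embedded and upserted here.
    return True

def rag_query(query, urls):
    # Stub: semantic search over the freshly stored chunks.
    return {url: [f"chunk relevant to {query!r}"] for url in urls}

def search(query, num_results=3, return_raw_markdown=False):
    results = searxng_search(query, num_results)       # (1) search
    urls = [r["url"] for r in results]                 # (2) extract URLs
    pages = {url: scrape_url(url) for url in urls}     # (3) scrape
    for url, md in pages.items():
        store_in_vector_db(url, md)                    # (4) store
    if return_raw_markdown:                            # (5a) raw mode
        return pages
    return rag_query(query, urls)                      # (5b) RAG mode

out = search("FastAPI authentication", num_results=2)
```

Note how return_raw_markdown short-circuits after storage: content is always persisted for later RAG queries, even when the caller asks for raw output.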

Conditional Tools

  1. search_code_examples (requires USE_AGENTIC_RAG=true): Search specifically for code examples and their summaries from crawled documentation. This tool provides targeted code snippet retrieval for AI coding assistants.

Knowledge Graph Tools (requires USE_KNOWLEDGE_GRAPH=true, see below)

🚀 NEW: Multi-Language Repository Parsing - The system now supports comprehensive analysis of repositories containing Python, JavaScript, TypeScript, Go, and other languages. See Multi-Language Parsing Documentation for complete details.

  1. parse_github_repository: Parse a GitHub repository into a Neo4j knowledge graph, extracting classes, methods, functions, and their relationships across multiple programming languages (Python, JavaScript, TypeScript, Go, etc.)
  2. parse_local_repository: Parse local Git repositories directly without cloning, supporting multi-language codebases
  3. parse_repository_branch: Parse specific branches of repositories for version-specific analysis
  4. analyze_code_cross_language: NEW! Perform semantic search across multiple programming languages to find similar patterns (e.g., "authentication logic" across Python, JavaScript, and Go)
  5. check_ai_script_hallucinations: Analyze Python scripts for AI hallucinations by validating imports, method calls, and class usage against the knowledge graph
  6. query_knowledge_graph: Explore and query the Neo4j knowledge graph with commands like repos, classes, methods, and custom Cypher queries
  7. get_script_analysis_info: Get information about script analysis setup, available paths, and usage instructions for hallucination detection tools

🔍 Code Search and Validation

Advanced Neo4j-Qdrant Integration for Reliable AI Code Generation

The system provides sophisticated code search and validation capabilities by combining:

  • Qdrant: Semantic vector search for finding relevant code examples
  • Neo4j: Structural validation against parsed repository knowledge graphs
  • AI Hallucination Detection: Prevents AI from generating non-existent methods or incorrect usage patterns

When to Use Neo4j vs Qdrant

| Use Case | Neo4j (Knowledge Graph) | Qdrant (Vector Search) | Combined Approach |
|---|---|---|---|
| Exact Structure Validation | ✅ Perfect - validates class/method existence | ❌ Cannot verify structure | 🏆 Best - structure + semantics |
| Semantic Code Search | ❌ Limited - no semantic understanding | ✅ Perfect - finds similar patterns | 🏆 Best - validated similarity |
| Hallucination Detection | ✅ Good - catches structural errors | ❌ Cannot detect fake methods | 🏆 Best - comprehensive validation |
| Code Discovery | ❌ Requires exact names | ✅ Perfect - fuzzy semantic search | 🏆 Best - discovered + validated |
| Performance | ⚡ Fast for exact queries | ⚡ Fast for semantic search | ⚖️ Balanced - parallel validation |

Enhanced Tools for Code Search and Validation

14. smart_code_search (requires both USE_KNOWLEDGE_GRAPH=true and USE_AGENTIC_RAG=true)

Intelligent code search that combines Qdrant semantic search with Neo4j structural validation:

  • Semantic Discovery: Find code patterns using natural language queries
  • Structural Validation: Verify all code examples against real repository structure
  • Confidence Scoring: Get reliability scores for each result (0.0-1.0)
  • Validation Modes: Choose between "fast", "balanced", or "thorough" validation
  • Intelligent Fallback: Works even when one system is unavailable

15. extract_and_index_repository_code (requires both systems)

Bridge Neo4j knowledge graph data into Qdrant for searchable code examples:

  • Knowledge Graph Extraction: Pull structured code from Neo4j
  • Semantic Indexing: Generate embeddings and store in Qdrant
  • Rich Metadata: Preserve class/method relationships and context
  • Batch Processing: Efficient indexing of large repositories

16. check_ai_script_hallucinations_enhanced (requires both systems)

Advanced hallucination detection using dual validation:

  • Neo4j Structural Check: Validate against actual repository structure
  • Qdrant Semantic Check: Find similar real code examples
  • Combined Confidence: Merge validation results for higher accuracy
  • Code Suggestions: Provide corrections from real code examples
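One way to picture the "combined confidence" idea is a weighted blend of the two checks. The weights below are illustrative, not the server's tuned values:

```python
def combined_confidence(structural_valid, semantic_similarity,
                        w_struct=0.6, w_sem=0.4):
    """Blend Neo4j structural validation with Qdrant similarity.

    structural_valid: bool result of the knowledge-graph check.
    semantic_similarity: similarity in [0, 1] from vector search.
    The 0.6/0.4 split is an assumption for illustration.
    """
    struct_score = 1.0 if structural_valid else 0.0
    return w_struct * struct_score + w_sem * semantic_similarity

# A method that exists in the graph and closely matches real examples
# scores high; a hallucinated method is capped by the semantic term alone.
real = combined_confidence(True, 0.9)
fake = combined_confidence(False, 0.9)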

Basic Workflow

  1. Index Repository Structure:

    parse_github_repository("https://github.com/pydantic/pydantic-ai.git")
    
  2. Extract and Index Code Examples:

    extract_and_index_repository_code("pydantic-ai")
    
  3. Search with Validation:

    smart_code_search(
      query="async function with error handling",
      source_filter="pydantic-ai",
      min_confidence=0.7,
      validation_mode="balanced"
    )
    
  4. Validate AI Code:

    check_ai_script_hallucinations_enhanced("/path/to/ai_script.py")
    

📁 Using Hallucination Detection Tools

The hallucination detection tools require access to Python scripts. The Docker container includes volume mounts for convenient script analysis:

Script Locations:

  • ./analysis_scripts/user_scripts/ - Place your Python scripts here (recommended)
  • ./analysis_scripts/test_scripts/ - For test scripts
  • ./analysis_scripts/validation_results/ - Results are automatically saved here

Quick Start:

  1. Create a script: echo "import pandas as pd" > ./analysis_scripts/user_scripts/test.py
  2. Run validation: Use the check_ai_script_hallucinations tool with script_path="test.py"
  3. Check results: View detailed analysis in ./analysis_scripts/validation_results/

Path Translation: The system automatically translates relative paths to container paths, making it convenient to reference scripts by filename.
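The path translation described above amounts to mapping bare filenames onto the mounted scripts directory. A minimal sketch, assuming the documented /app/analysis_scripts/user_scripts mount point (the function name is illustrative):

```python
from pathlib import PurePosixPath

# Mirrors the documented volume mount for user scripts.
CONTAINER_SCRIPTS_DIR = PurePosixPath("/app/analysis_scripts/user_scripts")

def translate_script_path(script_path: str) -> str:
    """Map a bare filename or relative path to its container path."""
    p = PurePosixPath(script_path)
    if p.is_absolute():
        # Already a container path; pass it through unchanged.
        return str(p)
    return str(CONTAINER_SCRIPTS_DIR / p.name)

path = translate_script_path("test.py")
```

This is why script_path="test.py" in the quick-start example resolves correctly inside the container without the caller knowing the mount layout.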

Quick Start

Prerequisites

  • Docker and Docker Compose
  • Make (optional, for convenience commands)
  • 8GB+ available RAM for all services

1. Start the Stack

Production deployment:

git clone https://github.com/krashnicov/crawl4aimcp.git
cd crawl4aimcp
make prod  # Starts all services in production mode

Development deployment:

make dev   # Starts services with hot reloading and debug logging

2. Configure Claude Desktop (or other MCP client)

Add the MCP server to your claude_desktop_config.json:

{
  "mcpServers": {
    "crawl4ai-mcp": {
      "command": "docker",
      "args": [
        "exec", "-i", "crawl4aimcp-mcp-1",
        "uv", "run", "python", "src/main.py"
      ],
      "env": {
        "USE_KNOWLEDGE_GRAPH": "true"
      }
    }
  }
}

3. Test the Connection

Try these commands in Claude to verify everything works:

Use the search tool to find information about "FastAPI authentication"
Use the scrape_urls tool to scrape https://fastapi.tiangolo.com/tutorial/security/
Parse this GitHub repository: https://github.com/fastapi/fastapi

4. Multi-Language Repository Analysis

Test the new multi-language capabilities:

Parse a multi-language repository: https://github.com/microsoft/vscode
Search for authentication patterns across Python, JavaScript, and TypeScript

🏗️ Architecture

The system consists of several Docker services working together:

Core Services

  • MCP Server: FastMCP-based server exposing all tools
  • Crawl4AI: Advanced web crawling and content extraction
  • SearXNG: Privacy-focused search engine (no external API keys)
  • Supabase: PostgreSQL + pgvector for embeddings and RAG
  • Neo4j: (Optional) Knowledge graph for code structure and hallucination detection
  • Qdrant: (Optional) Alternative vector database with advanced features

Data Flow

Search Query → SearXNG → URL Extraction → Crawl4AI → Content Processing → Vector Storage → RAG Query → Results
Repository → Multi-Language Parser → Neo4j Knowledge Graph → Code Validation → Hallucination Detection

Configuration

The system supports extensive configuration through environment variables:

Core Configuration

# Basic Configuration
USE_SUPABASE=true                    # Enable Supabase for vector storage
USE_QDRANT=false                     # Use Qdrant instead of Supabase (optional)
USE_KNOWLEDGE_GRAPH=true             # Enable Neo4j for code analysis
USE_AGENTIC_RAG=true                 # Enable advanced RAG features

# Search Configuration  
SEARXNG_URL=http://searxng:8080      # Internal SearXNG URL
CRAWL4AI_URL=http://crawl4ai:8000    # Internal Crawl4AI URL

# Multi-Language Repository Parsing
NEO4J_BATCH_SIZE=50                  # Batch size for large repository processing
NEO4J_BATCH_TIMEOUT=120              # Timeout for batch operations
REPO_MAX_SIZE_MB=500                 # Maximum repository size
REPO_MAX_FILE_COUNT=10000            # Maximum number of files
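Flags like these are typically read with small helpers that parse booleans and integers with defaults. A sketch of that pattern (variable names match the README; the helper names and defaults are illustrative):

```python
import os

def env_bool(name, default=False):
    """Parse a truthy env var ("true"/"1"/"yes", case-insensitive)."""
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

def env_int(name, default):
    """Parse an integer env var, falling back to a default."""
    return int(os.environ.get(name, default))

os.environ["USE_KNOWLEDGE_GRAPH"] = "true"
use_kg = env_bool("USE_KNOWLEDGE_GRAPH")
batch_size = env_int("NEO4J_BATCH_SIZE", 50)  # default matches the README
```

Centralizing the parsing keeps "true"/"True"/"1" from behaving differently across services.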

Advanced RAG Configuration

# Contextual Embeddings (improves search accuracy by 20-30%)
USE_CONTEXTUAL_EMBEDDINGS=false      # Requires OpenAI API or compatible LLM
LLM_PROVIDER=openai                  # openai, anthropic, groq, etc.
OPENAI_API_KEY=your_key_here         # Required for contextual embeddings

# Hybrid Search (combines vector + keyword search)
USE_HYBRID_SEARCH=false              # Requires PostgreSQL full-text search

# Cross-encoder Reranking (improves result relevance)
USE_RERANKING=false                  # Uses sentence-transformers reranking
RERANKING_MODEL=cross-encoder/ms-marco-MiniLM-L-12-v2
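Cross-encoder reranking scores each (query, document) pair jointly and re-sorts the retrieved candidates. The sketch below substitutes a toy token-overlap scorer so it runs standalone; the real system would call a sentence-transformers cross-encoder such as the ms-marco-MiniLM model configured above:

```python
def toy_pair_score(query, doc):
    # Stand-in for a cross-encoder's joint relevance score.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query, docs):
    """Re-order retrieved docs by a joint (query, doc) relevance score."""
    return sorted(docs, key=lambda d: toy_pair_score(query, d), reverse=True)

docs = [
    "unrelated text about cooking",
    "fastapi security and authentication tutorial",
]
ordered = rerank("fastapi authentication", docs)
```

Because the reranker sees query and document together, it can correct ordering mistakes made by the cheaper first-stage vector search, at the cost of scoring each candidate individually.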

Multi-Language Repository Support

The system now provides comprehensive support for multi-language repositories:

Supported Languages

  • Python (.py) - Classes, functions, methods, imports, docstrings
  • JavaScript (.js, .jsx, .mjs, .cjs) - ES6+ features, React components
  • TypeScript (.ts, .tsx) - Interfaces, types, enums, generics
  • Go (.go) - Structs, interfaces, methods, packages

Key Features

  • Unified Knowledge Graph: All languages stored in single Neo4j instance
  • Cross-Language Search: Find similar patterns across different languages
  • Language-Aware Analysis: Respects language-specific syntax and conventions
  • Repository Size Safety: Built-in validation prevents resource exhaustion
  • Batch Processing: Optimized for large multi-language repositories

Example Multi-Language Workflow

# Parse a full-stack repository
parse_github_repository("https://github.com/microsoft/vscode")

# Search across languages
analyze_code_cross_language(
  query="authentication middleware",
  languages=["python", "javascript", "typescript", "go"]
)

# Explore repository structure
query_knowledge_graph("explore vscode")

For complete documentation, see Multi-Language Parsing Guide.

Docker Services Detail

Service URLs (Development)

Volume Mounts

./analysis_scripts/          → /app/analysis_scripts/
./data/supabase/             → /var/lib/postgresql/data
./data/neo4j/                → /data
./data/qdrant/               → /qdrant/storage

Performance and Scaling

Resource Requirements

Minimum (Development):

  • 4GB RAM
  • 10GB disk space
  • 2 CPU cores

Recommended (Production):

  • 8GB+ RAM
  • 50GB+ disk space
  • 4+ CPU cores

Optimization Settings

# Large Repository Processing
export NEO4J_BATCH_SIZE=100
export NEO4J_BATCH_TIMEOUT=300
export REPO_MAX_SIZE_MB=1000

# High-Volume Crawling
export CRAWL4AI_MAX_CONCURRENT=20
export SUPABASE_MAX_CONNECTIONS=20
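The batch-size settings above govern how much work is sent to Neo4j or the database per round trip. The underlying pattern is just fixed-size batching; a minimal sketch:

```python
def batched(items, batch_size):
    """Yield fixed-size batches, NEO4J_BATCH_SIZE-style."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 230 graph nodes with a batch size of 100 -> batches of 100, 100, 30.
nodes = list(range(230))
sizes = [len(b) for b in batched(nodes, 100)]
```

Larger batches mean fewer round trips but more memory and longer transactions, which is why NEO4J_BATCH_TIMEOUT should be raised alongside NEO4J_BATCH_SIZE for big repositories.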

Troubleshooting

Common Issues

1. Services not starting:

# Check service status
docker-compose ps

# View logs
docker-compose logs mcp
docker-compose logs searxng

2. MCP connection issues:

# Test MCP server directly
docker exec -it crawl4aimcp-mcp-1 uv run python src/main.py

# Check Claude Desktop logs
tail -f ~/Library/Logs/Claude/mcp*.log

3. Multi-language parsing issues:

# Check Neo4j connection
docker-compose logs neo4j

# Verify language analyzers
docker exec crawl4aimcp-mcp-1 python -c "from src.knowledge_graph.analyzers.factory import AnalyzerFactory; print(AnalyzerFactory().get_supported_languages())"

4. Repository too large:

# Increase limits
export REPO_MAX_SIZE_MB=1000
export REPO_MAX_FILE_COUNT=15000

Getting Help

  • Documentation: Check the /docs directory for detailed guides
  • Issues: Report bugs on GitHub Issues
  • Logs: All services log to Docker, accessible via docker-compose logs [service]

Development

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make changes with proper documentation
  4. Add tests for new functionality
  5. Submit a pull request

Adding Language Support

To add support for new programming languages:

  1. Create analyzer in src/knowledge_graph/analyzers/
  2. Extend AnalyzerFactory to recognize file extensions
  3. Add language-specific patterns and parsing logic
  4. Update documentation and tests

See the Language Analyzer Development Guide for details.
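The steps above can be sketched as a small extension-keyed factory. The class and method names here are illustrative, not the repository's actual API; consult the source for the real interface:

```python
class RubyAnalyzer:
    """Hypothetical analyzer for a newly supported language."""
    extensions = (".rb",)

    def parse(self, source: str):
        # A real analyzer would extract classes, methods, and imports.
        return {"language": "ruby", "lines": source.count("\n") + 1}

class AnalyzerFactory:
    def __init__(self):
        self._by_extension = {}

    def register(self, analyzer_cls):
        # Step 2: make the factory recognize the new file extensions.
        for ext in analyzer_cls.extensions:
            self._by_extension[ext] = analyzer_cls

    def for_file(self, filename):
        ext = "." + filename.rsplit(".", 1)[-1]
        cls = self._by_extension.get(ext)
        return cls() if cls else None

factory = AnalyzerFactory()
factory.register(RubyAnalyzer)
analyzer = factory.for_file("app/models/user.rb")
```

Keeping the extension-to-analyzer mapping in one place means adding a language touches only the new analyzer plus a single registration call.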

Testing

Prerequisites: Start Qdrant for integration tests

# Note: No port mapping - only accessible from Docker network for security
docker run -d --name qdrant-test qdrant/qdrant

Run tests:

# Run unit tests
make test

# Run specific language analyzer tests  
make test-analyzers

# Run integration tests (requires Qdrant running)
make test-integration

# Or run with uv directly
uv run pytest tests/ --cov=src --cov-report=term-missing

License

This project is licensed under the MIT License - see the LICENSE file for details.

Credits

Development Tools

Import Verification

The repository includes comprehensive import verification tests to catch refactoring issues early:

# Run import tests (fast, <1 second)
uv run pytest tests/test_imports.py -v

# Run all modules import test
uv run python -m tests.test_imports

Pre-commit Hooks

Install git hooks for automatic code quality checks:

# Install hooks
./scripts/install-hooks.sh

# Hooks will run automatically on commit:
# ✅ Import verification (blocks commit if fails)
# ⚠️  Ruff linting (warnings only)

# Skip hooks for a specific commit
git commit --no-verify

The pre-commit hook ensures:

  • All modules can be imported without errors
  • No circular imports
  • Code passes basic linting checks
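An import-verification test of this kind usually just walks a package and imports every submodule, recording failures. A self-contained sketch, using a stdlib package as a stand-in for the project's src/ tree:

```python
import importlib
import pkgutil

def verify_package_imports(package_name):
    """Import every submodule of a package, collecting failures."""
    failures = []
    package = importlib.import_module(package_name)
    for info in pkgutil.walk_packages(package.__path__, package_name + "."):
        try:
            importlib.import_module(info.name)
        except Exception as exc:
            failures.append((info.name, repr(exc)))
    return failures

# "json" stands in for the project's own package here.
failures = verify_package_imports("json")
```

Because the walk imports every module exactly once, it also surfaces circular imports, which fail at import time with an ImportError.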
