MCP Web Docs

A self-hosted MCP server that crawls, indexes, and searches documentation from any website locally, including private sites requiring authentication. It provides hybrid search capabilities and local embedding generation to maintain privacy while keeping AI assistant knowledge up to date.


Index Any Documentation. Search Locally. Stay Private.

A self-hosted Model Context Protocol (MCP) server that crawls, indexes, and searches documentation from any website. Unlike remote MCP servers limited to GitHub repos or pre-indexed libraries, web-docs gives you full control over what gets indexed — including private documentation behind authentication.



❌ The Problem

AI assistants struggle with documentation:

  • Remote MCP servers only work with GitHub or pre-indexed libraries
  • Private docs behind authentication can't be accessed
  • Outdated indexes don't reflect your team's latest documentation
  • No control over what gets indexed or when

✅ The Solution

MCP Web Docs crawls and indexes documentation from ANY website locally:

  • Any website - Docusaurus, Storybook, GitBook, custom sites, internal wikis
  • Private docs - Interactive browser login for authenticated sites
  • Always fresh - Re-index anytime with one command
  • Your data, your machine - No API keys, no cloud, full privacy

✨ Features

  • 🌐 Universal Crawler - Works with any documentation site, not just GitHub
  • 🔍 Hybrid Search - Combines full-text search (FTS) with semantic vector search
  • 📂 Collections - Group related docs into named collections for project-based organization
  • 🏷️ Tags & Categories - Organize docs with tags and filter searches by project, team, or category
  • 📦 Version Support - Index multiple versions of the same package (e.g., React 18 and 19)
  • 🔐 Authentication Support - Crawl private/protected docs with interactive browser login (auto-detects your default browser)
  • 📊 Smart Extraction - Automatically extracts code blocks, props tables, and structured content
  • ⚡ Local Embeddings - Uses FastEmbed for fast, private embedding generation (no API keys)
  • 🗄️ Persistent Storage - LanceDB for vectors, SQLite for metadata
  • 🔄 Real-time Progress - Track indexing status with progress updates

🚀 Installation

Prerequisites

  • Node.js >= 22.19.0
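A minimal sketch of checking this prerequisite from a script. The helper name `meetsMinimum` is hypothetical and not part of mcp-web-docs; it only compares dotted numeric versions.

```typescript
// Hypothetical helper: compares two dotted numeric versions such as "22.19.0".
// Pre-release suffixes (e.g. "-rc.1") are not handled in this sketch.
function meetsMinimum(actual: string, minimum: string): boolean {
  const a = actual.replace(/^v/, "").split(".").map(Number);
  const m = minimum.replace(/^v/, "").split(".").map(Number);
  for (let i = 0; i < Math.max(a.length, m.length); i++) {
    const diff = (a[i] ?? 0) - (m[i] ?? 0);
    if (diff !== 0) return diff > 0; // first differing component decides
  }
  return true; // versions are equal
}
```

For example, `meetsMinimum(process.versions.node, "22.19.0")` would report whether the running Node.js satisfies the prerequisite.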

Option 1: Install from NPM (Recommended)

npm install -g @cosmocoder/mcp-web-docs

Option 2: Run with npx

No installation required - just configure your MCP client to use npx (see below).

Option 3: Build from Source

# Clone the repository
git clone https://github.com/cosmocoder/mcp-web-docs.git
cd mcp-web-docs

# Install dependencies (automatically installs Playwright browsers)
npm install

# Build
npm run build

Configure Your MCP Client

<details> <summary><b>Cursor</b></summary>

Add to your Cursor MCP settings (~/.cursor/mcp.json):

Using npx (no install required):

{
  "mcpServers": {
    "web-docs": {
      "command": "npx",
      "args": ["-y", "@cosmocoder/mcp-web-docs"]
    }
  }
}

Using global install:

{
  "mcpServers": {
    "web-docs": {
      "command": "mcp-web-docs"
    }
  }
}

Using local build:

{
  "mcpServers": {
    "web-docs": {
      "command": "node",
      "args": ["/path/to/mcp-web-docs/build/index.js"]
    }
  }
}

</details>

<details> <summary><b>Claude Desktop</b></summary>

Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):

Using npx:

{
  "mcpServers": {
    "web-docs": {
      "command": "npx",
      "args": ["-y", "@cosmocoder/mcp-web-docs"]
    }
  }
}

Using global install:

{
  "mcpServers": {
    "web-docs": {
      "command": "mcp-web-docs"
    }
  }
}

</details>

<details> <summary><b>VS Code</b></summary>

Add to .vscode/mcp.json in your workspace:

Using npx:

{
  "servers": {
    "web-docs": {
      "command": "npx",
      "args": ["-y", "@cosmocoder/mcp-web-docs"]
    }
  }
}

Using global install:

{
  "servers": {
    "web-docs": {
      "command": "mcp-web-docs"
    }
  }
}

</details>

<details> <summary><b>Windsurf</b></summary>

Add to ~/.codeium/windsurf/mcp_config.json:

Using npx:

{
  "mcpServers": {
    "web-docs": {
      "command": "npx",
      "args": ["-y", "@cosmocoder/mcp-web-docs"]
    }
  }
}

Using global install:

{
  "mcpServers": {
    "web-docs": {
      "command": "mcp-web-docs"
    }
  }
}

</details>

<details> <summary><b>Cline</b></summary>

Add to ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json (on macOS):

Using npx:

{
  "mcpServers": {
    "web-docs": {
      "command": "npx",
      "args": ["-y", "@cosmocoder/mcp-web-docs"],
      "disabled": false,
      "autoApprove": []
    }
  }
}

Using global install:

{
  "mcpServers": {
    "web-docs": {
      "command": "mcp-web-docs",
      "disabled": false,
      "autoApprove": []
    }
  }
}

</details>

<details> <summary><b>RooCode</b></summary>

Global configuration: Open RooCode → Click MCP icon → "Edit Global MCP"

Project-level configuration: Create .roo/mcp.json at your project root

Using npx:

{
  "mcpServers": {
    "web-docs": {
      "command": "npx",
      "args": ["-y", "@cosmocoder/mcp-web-docs"]
    }
  }
}

Using global install:

{
  "mcpServers": {
    "web-docs": {
      "command": "mcp-web-docs"
    }
  }
}

</details>
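The client snippets above differ mainly in their top-level key: VS Code uses `servers`, while the other clients shown use `mcpServers`. The sketch below merges a `web-docs` entry into an existing config object without dropping other servers. The names `rootKey` and `withWebDocs` are hypothetical helpers for illustration, not part of mcp-web-docs, and reading or writing the actual config file is left out.

```typescript
type ServerEntry = { command: string; args?: string[] };
type McpConfig = Record<string, unknown>;

// Top-level key per client, taken from the config snippets above.
function rootKey(client: "vscode" | "other"): "servers" | "mcpServers" {
  return client === "vscode" ? "servers" : "mcpServers";
}

// Returns a new config object with a "web-docs" entry added under the right key.
// Existing servers under that key are preserved; the input is not mutated.
function withWebDocs(config: McpConfig, client: "vscode" | "other"): McpConfig {
  const key = rootKey(client);
  const existing = (config[key] ?? {}) as Record<string, ServerEntry>;
  const entry: ServerEntry = { command: "npx", args: ["-y", "@cosmocoder/mcp-web-docs"] };
  return { ...config, [key]: { ...existing, "web-docs": entry } };
}
```

Serializing the result with `JSON.stringify(merged, null, 2)` yields JSON in the same shape as the snippets above.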


⚡ Quick Start

1. Index public documentation

Index the LanceDB documentation from https://lancedb.com/docs/

The AI assistant will call add_documentation and begin crawling.

2. Search for information

How do I create a table in LanceDB?

The AI will use search_documentation to find relevant content.

3. For private docs, authenticate first

I need to index private documentation at https://internal.company.com/docs/
It requires authentication.

A browser window will open for you to log in. The session is saved for future crawls.
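In normal use the assistant issues these tool calls for you. If you script the flow yourself, the order matters: authenticate first, then index. The sketch below assumes a hypothetical `CallTool` function standing in for whatever your MCP client uses to invoke a tool; only the tool names and arguments come from this README.

```typescript
// Hypothetical signature for "call an MCP tool"; a real client would supply this.
type CallTool = (toolName: string, args: Record<string, unknown>) => Promise<unknown>;

// Authenticate first, then index, so the crawl can reuse the saved session.
async function indexPrivateDocs(callTool: CallTool, docsUrl: string): Promise<void> {
  await callTool("authenticate", { url: docsUrl, loginTimeoutSecs: 300 });
  await callTool("add_documentation", { url: docsUrl, auth: { requiresAuth: true } });
}
```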


🔨 Available Tools

add_documentation

Add a new documentation site for indexing.

add_documentation({
  url: "https://docs.example.com/",
  title: "Example Docs",              // Optional
  id: "example-docs",                 // Optional custom ID
  tags: ["frontend", "mycompany"],    // Optional tags for categorization
  version: "2.0",                     // Optional version for versioned packages
  auth: {                             // Optional authentication
    requiresAuth: true,
    // browser auto-detected from OS settings if omitted
    loginTimeoutSecs: 300
  }
})
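For callers who want type checking, the arguments above could be described roughly as follows. This shape is inferred from the example and from the `browser` option mentioned under Troubleshooting; it is not the server's published schema, and which fields are truly required is an assumption.

```typescript
// Shape inferred from the README example; treat as an assumption, not a schema.
interface AuthOptions {
  requiresAuth: boolean;
  browser?: string;          // e.g. "firefox"; auto-detected when omitted
  loginTimeoutSecs?: number; // login timeout in seconds
}

interface AddDocumentationArgs {
  url: string;
  title?: string;
  id?: string;       // custom ID
  tags?: string[];   // tags for categorization
  version?: string;  // e.g. "18", "v6.4", "latest"
  auth?: AuthOptions;
}

const exampleArgs: AddDocumentationArgs = {
  url: "https://docs.example.com/",
  tags: ["frontend", "mycompany"],
  version: "2.0",
  auth: { requiresAuth: true, loginTimeoutSecs: 300 },
};
```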

search_documentation

Search through indexed documentation using hybrid search (FTS + semantic).

search_documentation({
  query: "how to configure authentication",
  url: "https://docs.example.com/",    // Optional: filter to specific site
  tags: ["frontend", "mycompany"],     // Optional: filter by tags
  limit: 10                            // Optional: max results
})
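This README does not say how the full-text and semantic rankings are merged. One common way to combine two ranked lists is reciprocal rank fusion (RRF), sketched below purely as an illustration; it is not claimed to be what mcp-web-docs does. The function name, document IDs, and the constant `k = 60` are assumptions.

```typescript
// Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank(d)),
// with rank starting at 1. Documents ranked well in several lists rise to the top.
function fuseRankings(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, index) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((x, y) => y[1] - x[1]) // highest fused score first
    .map(([docId]) => docId);
}

// Hypothetical result lists from a full-text query and a vector query.
const ftsHits = ["auth-guide", "api-keys", "changelog"];
const vectorHits = ["sso-setup", "auth-guide", "api-keys"];
const fusedHits = fuseRankings([ftsHits, vectorHits]);
// fusedHits: ["auth-guide", "api-keys", "sso-setup", "changelog"]
```

Documents that appear in both lists ("auth-guide", "api-keys") outrank documents that appear in only one, which is the behavior a hybrid search is after.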

authenticate

Open a browser window for interactive login to protected sites. Your default browser is automatically detected from OS settings.

authenticate({
  url: "https://private-docs.example.com/",
  // browser auto-detected from OS settings - only specify to override
  loginTimeoutSecs: 300         // Optional: timeout in seconds
})

list_documentation

List all indexed documentation sites with their metadata including tags.

set_tags

Set or update tags for a documentation site. Tags help categorize and filter documentation.

set_tags({
  url: "https://docs.example.com/",
  tags: ["frontend", "react", "mycompany"]  // Replaces existing tags
})

list_tags

List all available tags with usage counts. Useful to see what tags exist across your indexed docs.

reindex_documentation

Re-crawl and re-index a specific documentation site.

get_indexing_status

Get the current status of indexing operations.

delete_documentation

Delete an indexed documentation site and all its data.

clear_auth

Clear saved authentication session for a domain.

Collection Tools

Collections let you group related documentation sites together for project-based organization. Unlike tags (which categorize individual docs), collections create named workspaces like "My React Project" containing React + Next.js + TypeScript docs.

create_collection

Create a new collection to group documentation sites.

create_collection({
  name: "My React Project",
  description: "React, Next.js, and TypeScript docs for my project"  // Optional
})

add_to_collection

Add indexed documentation sites to a collection.

add_to_collection({
  name: "My React Project",
  urls: [
    "https://react.dev/",
    "https://nextjs.org/docs/",
    "https://www.typescriptlang.org/docs/"
  ]
})

search_collection

Search within a specific collection. Uses the same hybrid search as search_documentation but limited to docs in the collection.

search_collection({
  name: "My React Project",
  query: "server components data fetching",
  limit: 10  // Optional
})

list_collections

List all collections with their document counts.

get_collection

Get details of a specific collection including all its documentation sites.

get_collection({
  name: "My React Project"
})

update_collection

Rename a collection or update its description.

update_collection({
  name: "My React Project",
  newName: "Frontend Stack",           // Optional
  description: "Updated description"   // Optional
})

remove_from_collection

Remove documentation sites from a collection. The sites remain indexed; they are only removed from the collection.

remove_from_collection({
  name: "My React Project",
  urls: ["https://old-library.dev/docs/"]
})

delete_collection

Delete a collection. Only the grouping is deleted; the documentation sites in it remain indexed.

delete_collection({
  name: "Old Project"
})

💡 Tips

Crafting Better Search Queries

The search uses hybrid full-text and semantic search. For best results:

  1. Be specific - Include unique terms from what you're looking for
    • Instead of: "Button props"
    • Try: "Button props onClick disabled loading"

  2. Use exact phrases - Wrap in quotes for exact matching
    • "authentication middleware" finds that exact phrase

  3. Include context - Add related terms to narrow results
    • API docs: "GET /users endpoint authentication headers"
    • Config: "webpack config entry output plugins"
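To make the difference between quoted phrases and loose terms concrete, here is a small sketch of how a query string could be split into the two. It illustrates the distinction only; the function `parseQuery` is hypothetical and says nothing about how mcp-web-docs parses queries internally.

```typescript
// Splits a query into exact phrases (text inside double quotes) and loose terms.
function parseQuery(query: string): { phrases: string[]; terms: string[] } {
  const phrases: string[] = [];
  const terms: string[] = [];
  const pattern = /"([^"]+)"|(\S+)/g; // a quoted phrase, or a run of non-space characters
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(query)) !== null) {
    if (match[1] !== undefined) phrases.push(match[1]);
    else terms.push(match[2]);
  }
  return { phrases, terms };
}

const parsedQuery = parseQuery('"authentication middleware" express setup');
// parsedQuery.phrases: ["authentication middleware"]
// parsedQuery.terms:   ["express", "setup"]
```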

Auto-Invoke with Rules

To avoid typing search instructions in every prompt, add a rule to your MCP client:

Cursor (Cursor Settings > Rules):

When I ask about library documentation or need code examples,
use the web-docs MCP server to search indexed documentation.

Windsurf (.windsurfrules):

Always use web-docs search_documentation when I ask about
API references, configuration, or library usage.

Scoping Searches

If you have multiple sites indexed, filter by URL or tags:

// Filter by specific site URL
search_documentation({
  query: "routing",
  url: "https://nextjs.org/docs/"
})

// Filter by tags (searches all docs with matching tags)
search_documentation({
  query: "Button component",
  tags: ["frontend", "mycompany"]  // Only docs tagged with BOTH tags
})
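The "BOTH tags" behavior noted above is AND semantics: a doc matches only if it carries every requested tag. A minimal in-memory sketch of that rule follows; the `IndexedDoc` type and `filterByAllTags` helper are hypothetical and do not reflect the server's storage layer.

```typescript
type IndexedDoc = { url: string; tags: string[] };

// Keep only docs that carry every requested tag (AND semantics).
function filterByAllTags(docs: IndexedDoc[], required: string[]): IndexedDoc[] {
  return docs.filter((doc) => required.every((tag) => doc.tags.includes(tag)));
}

const indexedDocs: IndexedDoc[] = [
  { url: "https://docs.mycompany.com/ui-components/", tags: ["frontend", "mycompany", "react"] },
  { url: "https://docs.mycompany.com/api/", tags: ["backend", "mycompany", "api"] },
];

const frontendCompanyDocs = filterByAllTags(indexedDocs, ["frontend", "mycompany"]);
// Only the ui-components site matches; the API site lacks the "frontend" tag.
```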

Organizing with Tags

Tags help organize documentation when you have multiple related sites. Add tags when indexing:

// Index frontend package docs
add_documentation({
  url: "https://docs.mycompany.com/ui-components/",
  tags: ["frontend", "mycompany", "react"]
})

// Index backend API docs
add_documentation({
  url: "https://docs.mycompany.com/api/",
  tags: ["backend", "mycompany", "api"]
})

Later, search across all frontend docs:

search_documentation({
  query: "authentication",
  tags: ["frontend"]  // Searches all frontend-tagged docs
})

You can also add tags to existing documentation with set_tags.

Using Collections for Project Organization

Collections provide a higher-level grouping than tags — they let you organize documentation by project or context, making it easy to switch between different work contexts.

Create a collection for your project:

create_collection({
  name: "E-commerce Backend",
  description: "All docs for the backend rewrite project"
})

Add relevant documentation:

add_to_collection({
  name: "E-commerce Backend",
  urls: [
    "https://fastapi.tiangolo.com/",
    "https://docs.sqlalchemy.org/",
    "https://redis.io/docs/"
  ]
})

Search within your project context:

search_collection({
  name: "E-commerce Backend",
  query: "connection pooling best practices"
})

Collections vs Tags:

| Feature | Collections | Tags |
| --- | --- | --- |
| Purpose | Group docs as a project/workspace | Categorize individual docs |
| Structure | Named container with multiple docs | Labels on individual docs |
| Use case | "My React Project" with React + Next.js + TS | "This doc is about React" |
| Searching | search_collection for focused results | tags filter in search_documentation |

You can use both together — a document can have tags AND belong to multiple collections.
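One way to picture the relationship: tags live on each documentation site, while a collection is a separate named list of site URLs, so the same site can appear in several collections. The types and the `collectionsFor` helper below are a hypothetical model for illustration, not the server's actual schema.

```typescript
type DocSite = { url: string; tags: string[] };
type Collection = { name: string; description?: string; urls: string[] };

// Names of every collection that contains the given site.
function collectionsFor(siteUrl: string, all: Collection[]): string[] {
  return all.filter((c) => c.urls.includes(siteUrl)).map((c) => c.name);
}

const reactSite: DocSite = { url: "https://react.dev/", tags: ["frontend", "react"] };

const allCollections: Collection[] = [
  { name: "My React Project", urls: ["https://react.dev/", "https://nextjs.org/docs/"] },
  { name: "Frontend Stack", urls: ["https://react.dev/", "https://www.typescriptlang.org/docs/"] },
];

const memberships = collectionsFor(reactSite.url, allCollections);
// memberships: ["My React Project", "Frontend Stack"]
```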

Versioning Package Documentation

When indexing documentation for versioned packages (React, Vue, Python libraries, etc.), you can specify the version to track which version you've indexed:

// Index React 18 docs
add_documentation({
  url: "https://18.react.dev/",
  title: "React 18 Docs",
  version: "18"
})

// Index React 19 docs (different URL)
add_documentation({
  url: "https://react.dev/",
  title: "React 19 Docs",
  version: "19"
})

The version is displayed in list_documentation output and preserved when re-indexing. Version formats are flexible — use whatever makes sense for your package (e.g., "18", "v6.4", "3.11", "latest").

Note: Version is optional and mainly useful for software packages with multiple versions. For internal documentation, wikis, or single-version products, you can skip the version field.


🚨 Troubleshooting

<details> <summary><b>"Failed to parse document content"</b></summary>

The content extractor couldn't process the page. Try:

  • Re-indexing the documentation
  • Checking if the site uses JavaScript rendering (should work with Playwright)
  • Looking at the crawled data in ~/.mcp-web-docs/crawlee/datasets/

</details>

<details> <summary><b>Authentication not working</b></summary>

  • Make sure you call authenticate before add_documentation
  • The browser window needs to stay open until login is detected
  • For OAuth sites, complete the full flow manually
  • Your default browser is auto-detected; to override it, pass a browser option, for example browser: "firefox"

</details>

<details> <summary><b>Search not returning expected results</b></summary>

  • Try more specific queries with unique terms
  • Use quotes for exact phrase matching
  • Filter by URL to search within a specific documentation site
  • Re-index if the documentation has been updated

</details>

<details> <summary><b>Playwright browser issues</b></summary>

If browsers aren't installed, run:

npx playwright install

</details>


Data Storage

All data is stored locally in ~/.mcp-web-docs/:

~/.mcp-web-docs/
├── docs.db           # SQLite database for document metadata
├── vectors/          # LanceDB vector database
├── sessions/         # Saved authentication sessions
└── crawlee/          # Crawlee datasets (cached crawl data)

📄 License

MIT License - see LICENSE for details.


🙏 Acknowledgments
