DocShark

DocShark is an MCP server that scrapes and indexes documentation websites, enabling AI assistants to perform full-text searches on a local knowledge base built from public docs.

🦈 DocShark

DocShark is an MCP (Model Context Protocol) server that scrapes, indexes, and searches documentation websites. It builds a local, searchable knowledge base from public documentation pages using SQLite FTS5 full-text search with BM25 ranking, so AI assistants can query up-to-date docs.


🚀 Features

  • Automated Crawling: Discovers pages via sitemap.xml with fallback to BFS link crawling.
  • Smart Extraction: Uses Readability and Turndown to extract main content and convert it to clean Markdown, filtering out navbars and sidebars.
  • Semantic Chunking: Splits content based on headings, preserving contextual headers for better AI understanding.
  • High-Performance Search: Built-in SQLite + FTS5 indexing with BM25 relevance ranking, queried locally with no external search service.
  • JS-Rendered Site Support: A tiered fetching strategy detects React/Vue SPAs that return an empty HTML shell and falls back to puppeteer-core when it is installed. No configuration is required.
  • Polite Crawling: Respects robots.txt and implements rate limiting to prevent overloading documentation servers.
  • Standard MCP Tooling: Connects to Claude Desktop, VS Code, Cursor, and any other MCP-compatible client via the standard stdio or HTTP/SSE transports.
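
The heading-based chunking described above can be sketched roughly as follows. This is a minimal illustration, not DocShark's actual implementation: the `Chunk` type and `chunkByHeadings` function are hypothetical names, and the sketch does not enforce any token limits per chunk.

```typescript
// Hypothetical sketch of heading-based chunking; names are illustrative only.
interface Chunk {
  headingPath: string[]; // ancestor headings, e.g. ["Guide", "Install"]
  content: string;
}

const FENCE = "`".repeat(3); // Markdown code-fence marker

function chunkByHeadings(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  const stack: { level: number; title: string }[] = [];
  let buffer: string[] = [];
  let inFence = false;

  // Emit the buffered lines as one chunk under the current heading path.
  const flush = (): void => {
    const content = buffer.join("\n").trim();
    if (content.length > 0) {
      chunks.push({ headingPath: stack.map((h) => h.title), content });
    }
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    // Track fenced code so "# comment" lines inside code are not headings.
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence;
    const match = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      flush();
      const level = match[1].length;
      // Pop headings at the same or deeper level before pushing the new one.
      while (stack.length > 0 && stack[stack.length - 1].level >= level) {
        stack.pop();
      }
      stack.push({ level, title: match[2].trim() });
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```

Each chunk carries its heading path, so a search hit for a fragment such as "run it" can still be shown as belonging to "Guide > Install".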

📦 What We Have Done (Phase 1)

Phase 1: Core Engine is fully implemented and tested.

  • ✅ Custom SQLite Database with FTS5 virtual tables and auto-sync triggers.
  • ✅ Web scraping engine supporting standard fetch() and puppeteer-core.
  • ✅ Markdown processor utilizing Readability + Turndown.
  • ✅ Heading-based semantic chunker (500-1200 tokens per chunk).
  • ✅ Asynchronous job manager and queue system.
  • ✅ Complete HTTP API (REST endpoints + SSE event streams).
  • ✅ Seamless integration of 4 MCP tools: manage_library, search_docs, list_libraries, and get_doc_page.
  • ✅ Robust CLI interface (start, add, rename, search, list).
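
To give a feel for what BM25 ranking does, here is a small standalone sketch of one common BM25 variant. It is illustrative only: SQLite's FTS5 computes its own BM25 internally (and its `bm25()` function reports better matches as numerically smaller values), so these numbers will not match FTS5 output. The function name `bm25Score` and the pre-tokenized inputs are assumptions made for this example.

```typescript
// Illustrative BM25 scoring over pre-tokenized documents.
// Not DocShark code; FTS5 performs the real ranking inside SQLite.
function bm25Score(
  queryTerms: string[],
  doc: string[],
  corpus: string[][],
  k1 = 1.2, // term-frequency saturation
  b = 0.75, // document-length normalization strength
): number {
  const avgLen = corpus.reduce((sum, d) => sum + d.length, 0) / corpus.length;
  let score = 0;
  for (const term of queryTerms) {
    const tf = doc.filter((t) => t === term).length;
    if (tf === 0) continue;
    const docsWithTerm = corpus.filter((d) => d.includes(term)).length;
    // Rare terms get a higher inverse document frequency.
    const idf = Math.log(
      1 + (corpus.length - docsWithTerm + 0.5) / (docsWithTerm + 0.5),
    );
    const lengthNorm = 1 - b + b * (doc.length / avgLen);
    score += (idf * (tf * (k1 + 1))) / (tf + k1 * lengthNorm);
  }
  return score;
}
```

In this variant a higher score means a better match. A document that repeats a query term scores higher than an equally long document that mentions it once, and a document without the term scores zero.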

🏗️ What We Are Doing

We are actively polishing the integration between the core engine and external MCP clients (like VS Code Agents and Claude Desktop).

🔮 What We Plan To Do (Phase 2 & Beyond)

  • Web Dashboard: An intuitive SvelteKit dashboard to manage your synced libraries, view crawl progress in real-time (via SSE), and test searches manually.
  • Incremental Crawling: Smarter refresh jobs that compare ETag and Last-Modified headers to only re-scrape updated pages.
  • Vector Search (RAG): Integration of lightweight vector embeddings for semantic similarity search alongside the existing FTS5 keyword search.
  • Advanced Scraping Setup: Support for custom CSS selectors to define exactly where content lives in non-standard documentation websites.
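
As a rough idea of how the planned incremental crawling could work, the sketch below builds standard HTTP conditional request headers from previously stored validators. The `CachedPage` shape and both function names are hypothetical; the feature is not implemented yet, so this reflects only the general HTTP mechanism and not DocShark's design.

```typescript
// Hypothetical sketch: conditional requests for incremental re-crawling.
interface CachedPage {
  etag?: string; // value of the ETag response header from the last crawl
  lastModified?: string; // value of the Last-Modified response header
}

// Build request headers that ask the server to reply 304 if nothing changed.
function conditionalHeaders(cached: CachedPage): Record<string, string> {
  const headers: Record<string, string> = {};
  if (cached.etag) headers["If-None-Match"] = cached.etag;
  if (cached.lastModified) headers["If-Modified-Since"] = cached.lastModified;
  return headers;
}

// 304 Not Modified means the stored copy is still valid.
function needsRescrape(status: number): boolean {
  return status !== 304;
}
```

With `fetch`, the result of `conditionalHeaders` would be passed as the request's `headers`, and the page would be re-extracted and re-indexed only when `needsRescrape(response.status)` returns true.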

🛠️ Usage

Quick Start (from npm)

You can run DocShark directly without installing it globally using bunx:

# Add a documentation library to the index
bunx docshark add https://valibot.dev/guides/ --depth 2

# Search your indexed docs
bunx docshark search "schema validation"

Installation

DocShark is intended to be installed and run with Bun. To install it globally as a CLI tool:

# Global Bun installation
bun add -g docshark

After installation, you can use the docshark command:

docshark list

# Update the global Bun installation when a new release is published
docshark update

# Script-friendly update check
docshark update --check --quiet

Interactive CLI runs will also let you know when a newer version is available. Update notices are intentionally skipped for MCP stdio mode so they never interfere with protocol output.

For scripts, docshark update --check exits 0 when current, 10 when a newer version is available, and 1 when the version check could not be completed.
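
A script could map those exit codes to a status like this. The exit codes come from the paragraph above; the `UpdateStatus` type and `interpretUpdateCheck` function are hypothetical names used only for this sketch.

```typescript
// Hypothetical helper for scripts that wrap `docshark update --check`.
type UpdateStatus = "current" | "update-available" | "check-failed" | "unknown";

function interpretUpdateCheck(exitCode: number): UpdateStatus {
  switch (exitCode) {
    case 0:
      return "current";
    case 10:
      return "update-available";
    case 1:
      return "check-failed";
    default:
      return "unknown"; // any code not documented above
  }
}

// Possible usage (assumes docshark is on PATH):
//   import { spawnSync } from "node:child_process";
//   const result = spawnSync("docshark", ["update", "--check", "--quiet"]);
//   console.log(interpretUpdateCheck(result.status ?? -1));
```

Treating undocumented codes as "unknown" rather than as failures keeps a wrapper script from misreporting if further exit codes are added later.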

🧠 Agent Skills

DocShark includes official Agent Skills available on the skills.sh registry. These skills teach AI assistants exactly how to set up, use, and troubleshoot the DocShark MCP server.

To install a skill directly into your AI coding assistant:

# Add the 'docshark' skill for using the MCP tools
npx skills add Michael-Obele/docshark --skill docshark

# Add the 'using-docshark' skill for setup and configuration help
npx skills add Michael-Obele/docshark --skill using-docshark

Skill Setup by Code Editor

The npx skills add CLI configures skills automatically for most editors. For reference, this is where each editor stores them:

  • Cursor: Skills are added to .cursor/rules/
  • Windsurf: Skills are added to .windsurfrules
  • VS Code (Cline / Roo Code): Skills are added to .clinerules or .roomodes
  • Trae: Skills are added to .trae/skills/
  • GitHub Copilot: Skills are appended to .github/copilot-instructions.md

Check out the skills/README.md for detailed workflows on how these skills optimize your AI coding experience.

🔌 MCP Integration

VS Code (GitHub Copilot / MCP Extension)

Add DocShark to your .vscode/settings.json or global MCP configuration:

{
  "mcpServers": {
    "docshark": {
      "command": "bunx",
      "args": ["-y", "docshark", "start", "--stdio"]
    }
  }
}

Cursor

  1. Open Cursor Settings > Models > MCP.
  2. Click + Add New MCP Server.
  3. Name: docshark
  4. Type: command
  5. Command: bunx -y docshark start --stdio

Claude Desktop

Edit your Claude Desktop configuration file:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json

Add the DocShark entry under mcpServers:

{
  "mcpServers": {
    "docshark": {
      "command": "bunx",
      "args": ["-y", "docshark", "start", "--stdio"]
    }
  }
}

🛠️ Development

Local Setup

Ensure you have Bun installed.

# Clone the repository
git clone https://github.com/Michael-Obele/docshark.git
cd docshark

# Install dependencies
bun install

# (Optional) Enable auto-detection and scraping of JavaScript-rendered React/Vue single-page apps
bun add puppeteer-core

# Start the DocShark MCP server in HTTP mode for local testing
bun run src/cli.ts start --port 6380

Local CLI Debugging

# Run CLI directly while developing
bun run src/cli.ts list

Tests

Run the core regression suite before merging or publishing changes:

# From the repo root
pnpm test:core

# Or from packages/core
bun test scripts/*.test.ts

The suite covers the current core engine surfaces: SQLite storage and migrations, library management, extraction, chunking, search, crawl helpers, API routes, and MCP tool wrappers.

🔄 Versioning & Changelog

This project uses Google's Release Please to automate versioning and changelog generation.

  • Semantic Versioning: Our versions automatically bump (e.g. 0.0.1 -> 0.0.2 or 0.1.0) based on standard Conventional Commits (feat:, fix:, chore:, etc.).
  • Automated: When Conventional Commits are merged into master, Release Please automatically opens a release PR that updates CHANGELOG.md.
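
The relationship between commit types and version bumps can be sketched as below. This is a simplified illustration of the Conventional Commits convention, not Release Please's actual logic, which is configurable and can treat pre-1.0 versions differently. The `bumpFor` and `applyBump` names are made up for this example.

```typescript
// Simplified sketch: derive a semver bump from Conventional Commit messages.
type Bump = "major" | "minor" | "patch" | "none";

function bumpFor(commits: string[]): Bump {
  let bump: Bump = "none";
  for (const message of commits) {
    const header = message.split("\n")[0];
    // type, optional (scope), optional "!" breaking marker, then ": "
    const m = /^(\w+)(\([^)]*\))?(!)?:\s/.exec(header);
    if (!m) continue;
    if (m[3] === "!" || /^BREAKING[ -]CHANGE:/m.test(message)) return "major";
    if (m[1] === "feat") bump = "minor";
    else if (m[1] === "fix" && bump === "none") bump = "patch";
  }
  return bump; // types such as chore: leave the version unchanged here
}

function applyBump(version: string, bump: Bump): string {
  const [major, minor, patch] = version.split(".").map(Number);
  switch (bump) {
    case "major":
      return `${major + 1}.0.0`;
    case "minor":
      return `${major}.${minor + 1}.0`;
    case "patch":
      return `${major}.${minor}.${patch + 1}`;
    default:
      return version;
  }
}
```

Under these simplified rules, a single `fix:` commit takes 0.0.1 to 0.0.2, while any `feat:` commit in the batch takes it to 0.1.0, matching the example above.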

📜 License

This project is open-source and available under the MIT License.


Built to empower AI agents with the latest knowledge.
