AIMLPM/markcrawl

Crawl any website into clean Markdown, search through pages, read full content, and extract structured data using OpenAI, Claude, Gemini, or Grok — with auto-citation and resume support.


MarkCrawl by iD8 🕷️📝

Turn any website into clean Markdown for LLM pipelines — in one command.

pip install markcrawl
markcrawl --base https://docs.example.com --out ./output --show-progress

MarkCrawl is a crawl-and-structure engine. It crawls a website, strips navigation/scripts/boilerplate, and writes clean Markdown files with a structured JSONL index. Every page includes a citation with the access date. No API keys needed.

Everything else — LLM extraction, Supabase upload, MCP server, LangChain tools — is optional and installed separately.

Quickstart (2 minutes)

pip install markcrawl
markcrawl --base https://httpbin.org --out ./demo --show-progress

Your ./demo folder now contains:

demo/
├── index__a4f3b2c1d0.md    ← clean Markdown of the page
└── pages.jsonl              ← structured index (one JSON line per page)

Each line in pages.jsonl:

{
  "url": "https://httpbin.org/",
  "title": "httpbin.org",
  "crawled_at": "2026-04-04T12:30:00Z",
  "citation": "httpbin.org. httpbin.org. Available at: https://httpbin.org/ [Accessed April 04, 2026].",
  "tool": "markcrawl",
  "text": "# httpbin.org\n\nA simple HTTP Request & Response Service..."
}
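Because each line is a self-contained JSON object, the index can be streamed with the standard library alone. The following is a minimal sketch that assumes only the field names shown in the record above; `load_pages` and the sample record are illustrative and not part of MarkCrawl.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def load_pages(index_path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a pages.jsonl file."""
    with index_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo with a throwaway file that mimics the record shown above.
sample = {
    "url": "https://httpbin.org/",
    "title": "httpbin.org",
    "crawled_at": "2026-04-04T12:30:00Z",
    "tool": "markcrawl",
    "text": "# httpbin.org\n\nA simple HTTP Request & Response Service...",
}
with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "pages.jsonl"
    index_path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    pages = list(load_pages(index_path))

for page in pages:
    print(page["url"], "-", page["title"])
```

For a real crawl, point `load_pages` at `./demo/pages.jsonl` instead of the temporary file.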

<details> <summary>How it compares to other crawlers</summary>

Different tools make different tradeoffs. This table summarizes the main differences:

| | MarkCrawl | FireCrawl | Crawl4AI | Scrapy |
|---|---|---|---|---|
| License | MIT | AGPL-3.0 | Apache-2.0 | BSD-3 |
| Install | pip install markcrawl | SaaS or self-host | pip + Playwright | pip + framework |
| Output | Markdown + JSONL | Markdown + JSON | Markdown | Custom pipelines |
| JS rendering | Optional (--render-js) | Built-in | Built-in | Plugin |
| LLM extraction | Optional add-on | Via API | Built-in | None |
| Best for | Single-site crawl → Markdown | Hosted scraping API | AI-native crawling | Large-scale distributed |

Each tool has strengths: FireCrawl excels as a hosted API, Crawl4AI has deep browser automation, and Scrapy handles massive distributed workloads. MarkCrawl focuses on simple local crawls that produce LLM-ready Markdown.

See benchmarks/SPEED_COMPARISON.md for head-to-head performance data (3 tools, 4 sites, 3 iterations each). </details>

Installation

The core crawler is the only thing you need. Everything else is optional.

pip install markcrawl                # Core crawler (free, no API keys)

Optional add-ons:

pip install markcrawl[extract]       # + LLM extraction (OpenAI, Claude, Gemini, Grok)
pip install markcrawl[js]            # + JavaScript rendering (Playwright)
pip install markcrawl[upload]        # + Supabase upload with embeddings
pip install markcrawl[mcp]           # + MCP server for AI agents
pip install markcrawl[langchain]     # + LangChain tool wrappers
pip install markcrawl[all]           # Everything

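For JavaScript rendering, also run playwright install chromium after installing markcrawl[js]. In shells that treat square brackets as glob patterns (zsh, for example), quote the extra: pip install "markcrawl[js]".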

<details> <summary>Install from source (for development)</summary>

git clone https://github.com/AIMLPM/markcrawl.git
cd markcrawl
python -m venv .venv
source .venv/bin/activate
pip install -e ".[all]"

</details>

Crawling

markcrawl --base https://www.example.com --out ./output --show-progress

Add flags as needed:

markcrawl \
  --base https://www.example.com \
  --out ./output \
  --include-subdomains \
  --render-js \
  --concurrency 5 \
  --proxy http://proxy:8080 \
  --max-pages 200 \
  --format markdown \
  --show-progress

# --include-subdomains   crawl sub.example.com too
# --render-js            render JavaScript (React, Vue, etc.)
# --concurrency 5        fetch 5 pages in parallel
# --proxy URL            route through a proxy
# --max-pages 200        stop after 200 pages
# --format markdown      or "text" for plain text

Resume an interrupted crawl:

markcrawl --base https://www.example.com --out ./output --resume --show-progress
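When crawls are launched from a script, it can help to assemble the command as an argument list rather than a shell string. The sketch below is one way to do that using only flags documented in this README; `build_crawl_command` is a hypothetical helper, not something MarkCrawl ships.

```python
import shlex
from typing import List, Optional


def build_crawl_command(
    base: str,
    out: str,
    *,
    max_pages: Optional[int] = None,
    concurrency: Optional[int] = None,
    render_js: bool = False,
    resume: bool = False,
    show_progress: bool = True,
) -> List[str]:
    """Build an argv list for the markcrawl CLI from keyword options."""
    cmd = ["markcrawl", "--base", base, "--out", out]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if concurrency is not None:
        cmd += ["--concurrency", str(concurrency)]
    if render_js:
        cmd.append("--render-js")
    if resume:
        cmd.append("--resume")
    if show_progress:
        cmd.append("--show-progress")
    return cmd


cmd = build_crawl_command(
    "https://www.example.com", "./output", max_pages=200, resume=True
)
print(shlex.join(cmd))  # shell-safe rendering of the command for logging

# To run it (requires markcrawl to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with URLs that contain `&` or `?`.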

Output

Each page becomes a .md file with a citation header:

# Getting Started

> URL: https://docs.example.com/getting-started
> Crawled: April 04, 2026
> Citation: Getting Started. docs.example.com. Available at: https://docs.example.com/getting-started [Accessed April 04, 2026].

Welcome to the platform. This guide walks you through installation...

Navigation, footer, cookie banners, and scripts are stripped. Only the main content remains.
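The citation header is regular enough to read back programmatically. Here is a minimal sketch that assumes the header uses exactly the three `> Key: value` lines shown above and an English month name in the date; `parse_citation_header` is an illustrative helper, not a MarkCrawl API.

```python
from datetime import date, datetime
from typing import Dict

HEADER_KEYS = ("URL", "Crawled", "Citation")


def parse_citation_header(markdown: str) -> Dict[str, str]:
    """Collect the first '> Key: value' line for each known header key."""
    meta: Dict[str, str] = {}
    for line in markdown.splitlines():
        if not line.startswith(">"):
            continue
        key, sep, value = line.lstrip("> ").partition(": ")
        if sep and key in HEADER_KEYS:
            meta.setdefault(key, value.strip())
    return meta


page = """# Getting Started

> URL: https://docs.example.com/getting-started
> Crawled: April 04, 2026
> Citation: Getting Started. docs.example.com. Available at: https://docs.example.com/getting-started [Accessed April 04, 2026].

Welcome to the platform. This guide walks you through installation...
"""

meta = parse_citation_header(page)
crawled: date = datetime.strptime(meta["Crawled"], "%B %d, %Y").date()
print(meta["URL"])
print(crawled.isoformat())
```

If you already have pages.jsonl, reading the `url`, `crawled_at`, and `citation` fields from it is simpler than parsing the Markdown files.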

<details> <summary>All crawler CLI arguments</summary>

| Argument | Description |
|---|---|
| --base | Base site URL to crawl |
| --out | Output directory |
| --format | markdown or text (default: markdown) |
| --show-progress | Print progress and crawl events |
| --render-js | Render JavaScript with Playwright before extracting |
| --concurrency | Pages to fetch in parallel (default: 1) |
| --proxy | HTTP/HTTPS proxy URL |
| --resume | Resume from saved state |
| --include-subdomains | Include subdomains under the base domain |
| --max-pages | Max pages to save; 0 = unlimited (default: 500) |
| --delay | Minimum delay between requests in seconds (default: 0, adaptive throttle adjusts automatically) |
| --timeout | Per-request timeout in seconds (default: 15) |
| --min-words | Skip pages with fewer words (default: 20) |
| --user-agent | Override the default user agent |
| --use-sitemap / --no-sitemap | Enable/disable sitemap discovery |

</details>

Optional: structured extraction

If you need structured data (not just text), the extraction add-on uses an LLM to pull specific fields from each page.

pip install markcrawl[extract]

markcrawl-extract \
  --jsonl ./output/pages.jsonl \
  --fields company_name pricing features \
  --show-progress

Auto-discover fields across multiple crawled sites:

markcrawl-extract \
  --jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl ./comp3/pages.jsonl \
  --auto-fields \
  --context "competitor pricing analysis" \
  --show-progress

Supports OpenAI, Anthropic (Claude), Google Gemini, and xAI (Grok) via --provider.

<details> <summary>Extraction details</summary>

Provider and model selection

markcrawl-extract --jsonl ... --fields pricing --provider openai         # default
markcrawl-extract --jsonl ... --fields pricing --provider anthropic      # Claude
markcrawl-extract --jsonl ... --fields pricing --provider gemini         # Gemini
markcrawl-extract --jsonl ... --fields pricing --provider grok           # Grok
markcrawl-extract --jsonl ... --fields pricing --model gpt-4o           # override model

| Provider | API key env var | Default model |
|---|---|---|
| OpenAI | OPENAI_API_KEY | gpt-4o-mini |
| Anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-20250514 |
| Google Gemini | GEMINI_API_KEY | gemini-2.0-flash |
| xAI (Grok) | XAI_API_KEY | grok-3-mini-fast |
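Before starting a long extraction run, it can be worth confirming that the key for the chosen provider is actually set. The sketch below copies the provider-to-variable mapping from the table above; `missing_key` is a hypothetical helper, and MarkCrawl may perform its own checks.

```python
import os
from typing import Mapping, Optional

# Environment variable per --provider value, copied from the table above.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "grok": "XAI_API_KEY",
}


def missing_key(provider: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the unset variable, or None if the key is present."""
    var = PROVIDER_ENV_VARS[provider]
    return None if env.get(var) else var


for provider in PROVIDER_ENV_VARS:
    var = missing_key(provider)
    status = "ok" if var is None else f"set {var} first"
    print(f"{provider}: {status}")
```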

All extraction CLI arguments

| Argument | Description |
|---|---|
| --jsonl | Path(s) to pages.jsonl; pass multiple for cross-site analysis |
| --fields | Field names to extract (space-separated) |
| --auto-fields | Auto-discover fields by sampling pages |
| --context | Describe your goal for auto-discovery |
| --sample-size | Pages to sample for auto-discovery (default: 3) |
| --provider | openai, anthropic, gemini, or grok |
| --model | Override the default model |
| --output | Output path (default: extracted.jsonl) |
| --delay | Delay between LLM calls in seconds (default: 0.25) |
| --show-progress | Print progress |

Output format

Extracted rows include LLM attribution:

{
  "url": "https://competitor.com/pricing",
  "citation": "Pricing. competitor.com. Available at: ... [Accessed April 04, 2026].",
  "pricing_tiers": "Starter ($29/mo), Pro ($99/mo), Enterprise (contact sales)",
  "extracted_by": "gpt-4o-mini (openai)",
  "extraction_note": "Field values were extracted by an LLM and may be interpreted, not verbatim."
}
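Extracted rows are plain JSONL, so they can be summarized without any MarkCrawl imports. This is a sketch under the assumption that each row carries `url` and `extracted_by` as in the record above, plus whichever fields you requested; `load_rows` and `values_by_url` are illustrative names.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def load_rows(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL text lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


def values_by_url(rows: List[dict], field: str) -> Dict[str, str]:
    """Map each page URL to its extracted value for `field`, if present."""
    return {row["url"]: row[field] for row in rows if row.get(field)}


# Sample row modeled on the record above; real rows come from extracted.jsonl.
sample_lines = [
    json.dumps({
        "url": "https://competitor.com/pricing",
        "pricing_tiers": "Starter ($29/mo), Pro ($99/mo), Enterprise (contact sales)",
        "extracted_by": "gpt-4o-mini (openai)",
    })
]

rows = load_rows(sample_lines)
print(values_by_url(rows, "pricing_tiers"))
print(Counter(row["extracted_by"] for row in rows))
```

To use it on real output, pass an open file handle: `with open("extracted.jsonl", encoding="utf-8") as f: rows = load_rows(f)`.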

</details>

Optional: Supabase vector search (RAG)

Chunk pages, generate embeddings, and upload to Supabase with pgvector:

pip install markcrawl[upload]

markcrawl --base https://docs.example.com --out ./output --show-progress
markcrawl-upload --jsonl ./output/pages.jsonl --show-progress

Requires SUPABASE_URL, SUPABASE_KEY, and OPENAI_API_KEY. See docs/SUPABASE.md for table setup, query examples, and recommendations.
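To illustrate what chunking with overlap means before embeddings are generated, here is a small word-window sketch. It is not MarkCrawl's `chunk_text` implementation, whose signature and defaults are not shown in this README; the function name, parameters, and sizes below are assumptions chosen for illustration.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word windows; consecutive chunks share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(text, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of embedding some words twice.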

Optional: agent integrations

MarkCrawl includes integrations for AI agents. Each is an optional add-on.

<details> <summary>MCP Server (Claude Desktop, Cursor, Windsurf)</summary>

pip install markcrawl[mcp]

Then register the server in your MCP client's configuration file:

{
  "mcpServers": {
    "markcrawl": {
      "command": "python",
      "args": ["-m", "markcrawl.mcp_server"]
    }
  }
}

Tools: crawl_site, list_pages, read_page, search_pages, extract_data </details>

<details> <summary>LangChain Tool</summary>

pip install markcrawl[langchain]

from markcrawl.langchain import all_tools
from langchain_openai import ChatOpenAI
from langchain.agents import initialize_agent, AgentType

agent = initialize_agent(tools=all_tools, llm=ChatOpenAI(model="gpt-4o-mini"),
                         agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION)
agent.run("Crawl docs.example.com and summarize their auth guide")

</details>

<details> <summary>OpenClaw Skill (WhatsApp, Telegram, Slack)</summary>

npx clawhub install markcrawl-skill

See AIMLPM/markcrawl-clawhub-skill. </details>

<details> <summary>LLM assistant prompt</summary>

Copy the system prompt from docs/LLM_PROMPT.md into any LLM to get an assistant that generates correct MarkCrawl commands. </details>

When NOT to use MarkCrawl

  • Sites behind login/auth — no cookie or session support
  • Aggressive bot protection (Cloudflare, Akamai) — no anti-bot evasion
  • Millions of pages — designed for hundreds to low thousands; use Scrapy for scale
  • PDF content — HTML only (PDF support is on the roadmap)
  • JavaScript SPAs without --render-js — add markcrawl[js] for React/Vue/Angular

Architecture

MarkCrawl is a web crawler. The optional layers (extraction, upload, agents) are separate add-ons that work with the crawler's output.

CORE (free, no API keys)              OPTIONAL ADD-ONS
┌──────────────────────────┐
│ 1. Discover URLs         │          markcrawl[extract]  — LLM field extraction
│    (sitemap or links)    │          markcrawl[upload]   — Supabase/pgvector RAG
│ 2. Fetch & clean HTML    │          markcrawl[js]       — Playwright JS rendering
│ 3. Write Markdown + JSONL│          markcrawl[mcp]      — MCP server for agents
│    + auto-citation       │          markcrawl[langchain] — LangChain tools
└──────────────────────────┘

For internals, see docs/ARCHITECTURE.md.
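To make step 2 of the diagram concrete, the sketch below strips script, style, navigation, and footer content using only the standard library. It illustrates the general idea of boilerplate removal and is not MarkCrawl's actual cleaning logic (see docs/ARCHITECTURE.md for that); the class name, function name, and tag list are assumptions.

```python
from html.parser import HTMLParser
from typing import List

# Tags whose entire contents are dropped (an illustrative list).
SKIPPED_TAGS = {"script", "style", "nav", "footer"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any skipped tag."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_main_text(html: str) -> str:
    """Return the visible text outside skipped tags, one fragment per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


html = """
<html><body>
<nav><a href="/">Home</a></nav>
<main><h1>Getting Started</h1><p>Welcome to the platform.</p></main>
<script>console.log("tracking")</script>
<footer>Copyright</footer>
</body></html>
"""
print(extract_main_text(html))
```

A production cleaner also has to handle malformed markup, cookie banners, and Markdown conversion, which this sketch leaves out.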

Extending MarkCrawl

import json

from markcrawl import crawl

# Run a crawl from Python instead of the CLI
result = crawl("https://example.com", out_dir="./output")
print(f"Saved {result.pages_saved} pages")

# Process output in your own pipeline
with open(result.index_file, encoding="utf-8") as f:
    for line in f:
        page = json.loads(line)
        # your_db is a placeholder for your own client
        # (Pinecone, Weaviate, Elasticsearch, etc.)
        your_db.insert(page)

# Use individual components
from markcrawl import chunk_text
from markcrawl.extract import LLMClient, extract_fields

See docs/ARCHITECTURE.md for the full module map and extensibility guide.

Cost

The core crawler is free. Two optional features have API costs:

| Feature | Cost | When |
|---|---|---|
| Structured extraction | ~$0.01-0.03 per page | markcrawl-extract |
| Supabase upload | ~$0.0001 per page | markcrawl-upload |
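As a rough worked example, the per-page figures above can be multiplied out for a planned crawl. The sketch below uses only the approximate rates from the table; actual provider pricing depends on the model and page length, and `estimate_cost` is a hypothetical helper.

```python
from typing import Tuple

# Per-page figures copied from the table above (USD, approximate).
EXTRACTION_COST_RANGE = (0.01, 0.03)
UPLOAD_COST = 0.0001


def estimate_cost(pages: int, extract: bool = True, upload: bool = True) -> Tuple[float, float]:
    """Return a (low, high) USD estimate for processing `pages` pages."""
    low = high = 0.0
    if extract:
        low += pages * EXTRACTION_COST_RANGE[0]
        high += pages * EXTRACTION_COST_RANGE[1]
    if upload:
        low += pages * UPLOAD_COST
        high += pages * UPLOAD_COST
    return round(low, 2), round(high, 2)


low, high = estimate_cost(200)
print(f"200 pages: roughly ${low:.2f} to ${high:.2f}")
```

At these rates, extraction dominates: uploading 200 pages adds about two cents, while extracting them costs a few dollars.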

Setting up API keys

Only needed for extraction and upload. The core crawler requires no keys.

# .env — in your working directory
OPENAI_API_KEY="sk-..."           # extraction (--provider openai) + upload
ANTHROPIC_API_KEY="sk-ant-..."    # extraction (--provider anthropic)
GEMINI_API_KEY="AI..."            # extraction (--provider gemini)
XAI_API_KEY="xai-..."             # extraction (--provider grok)
SUPABASE_URL="https://..."        # upload
SUPABASE_KEY="eyJ..."             # upload (service-role key)
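
Load the file into your shell before running markcrawl-extract or markcrawl-upload. A plain source .env sets the variables without exporting them to child processes, so turn on auto-export while sourcing:

set -a; source .env; set +a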

<details> <summary>Project structure</summary>

.
├── README.md
├── LICENSE
├── PRIVACY.md
├── SECURITY.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── Dockerfile
├── glama.json
├── pyproject.toml
├── requirements.txt
├── .github/
│   ├── pull_request_template.md
│   └── workflows/
│       ├── ci.yml
│       └── publish.yml
├── docs/
│   ├── ARCHITECTURE.md
│   ├── LLM_PROMPT.md
│   ├── MCP_SUBMISSION.md
│   └── SUPABASE.md
├── tests/
│   ├── test_core.py
│   ├── test_chunker.py
│   ├── test_extract.py
│   └── test_upload.py
└── markcrawl/
    ├── __init__.py
    ├── cli.py
    ├── core.py
    ├── chunker.py
    ├── exceptions.py
    ├── utils.py
    ├── extract.py
    ├── extract_cli.py
    ├── upload.py
    ├── upload_cli.py
    ├── langchain.py
    └── mcp_server.py

</details>

Roadmap

  • [ ] Canonical URL support
  • [ ] Fuzzy duplicate-content detection
  • [ ] PDF support
  • [ ] Authenticated crawling
  • [ ] Multi-provider embeddings

<details> <summary>Shipped features</summary>

  • pip install markcrawl on PyPI
  • 102 automated tests + GitHub Actions CI (Python 3.10-3.13) + ruff linting
  • Markdown and plain text output with auto-citation
  • Sitemap-first crawling with robots.txt compliance
  • Text chunking with configurable overlap
  • Supabase/pgvector upload for RAG
  • JavaScript rendering via Playwright
  • Concurrent fetching and proxy support
  • Resume interrupted crawls
  • LLM extraction (OpenAI, Claude, Gemini, Grok) with auto-field discovery
  • MCP server, LangChain tools, OpenClaw skill </details>

Contributing

See CONTRIBUTING.md. If you used an LLM to generate code, include the prompt in your PR.

Security

See SECURITY.md.

Privacy

MarkCrawl runs locally. No telemetry, no analytics, no data sent anywhere. See PRIVACY.md.

License

MIT. See LICENSE.
