BerryRAG

A local vector database RAG system that integrates with Playwright MCP for web scraping, enabling users to build searchable knowledge bases from web content with multiple embedding providers and Claude-optimized context formatting.

🍓 BerryRAG: Local Vector Database with Playwright MCP Integration

A complete local RAG (Retrieval-Augmented Generation) system that integrates Playwright MCP web scraping with vector database storage for Claude.

✨ Features

  • Zero-cost self-hosted vector database
  • Playwright MCP integration for automated web scraping
  • Multiple embedding providers (sentence-transformers, OpenAI, fallback)
  • Smart content processing with quality filters
  • Claude-optimized context formatting
  • MCP server for direct Claude integration
  • Command-line tools for manual operation

🚀 Quick Start

1. Installation

git clone https://github.com/berrydev-ai/berry-rag.git
cd berry-rag

# Install dependencies
npm run install-deps

# Setup directories and instructions
npm run setup

2. Configure Claude Desktop

Add the following to your claude_desktop_config.json, replacing the cwd value with the absolute path to your own berry-rag checkout:

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    },
    "berry-rag": {
      "command": "node",
      "args": ["mcp_servers/vector_db_server.js"],
      "cwd": "/Users/eberry/BerryDev/berry-rag"
    }
  }
}

3. Start Using

# Example workflow:
# 1. Scrape with Playwright MCP through Claude
# 2. Process into vector DB
npm run process-scraped

# 3. Search your knowledge base
npm run search "React hooks"

📁 Project Structure

berry-rag/
├── src/                          # Python source code
│   ├── rag_system.py            # Core vector database system
│   └── playwright_integration.py # Playwright MCP integration
├── mcp_servers/                  # MCP server implementations
│   └── vector_db_server.ts      # TypeScript MCP server
├── storage/                      # Vector database storage
│   ├── documents.db             # SQLite metadata
│   └── vectors/                 # NumPy embedding files
├── scraped_content/             # Playwright saves content here
└── dist/                        # Compiled TypeScript

🔧 Commands

Streamlit Web Interface

Launch the web interface for easy interaction with your RAG system:

# Start the Streamlit web interface
python run_streamlit.py

# Or directly with streamlit
streamlit run streamlit_app.py

The web interface provides:

  • 🔍 Search: Interactive document search with similarity controls
  • 📄 Context: Generate formatted context for AI assistants
  • ➕ Add Document: Upload files or paste content directly
  • 📚 List Documents: Browse your document library
  • 📊 Statistics: System health and performance metrics

NPM Scripts

| Command | Description |
| --- | --- |
| npm run install-deps | Install all dependencies |
| npm run setup | Initialize directories and instructions |
| npm run build | Compile TypeScript MCP server |
| npm run process-scraped | Process scraped files into vector DB |
| npm run search | Search the knowledge base |
| npm run list-docs | List all documents |

Python CLI

# RAG System Operations
python src/rag_system.py search "query"
python src/rag_system.py context "query"  # Claude-formatted
python src/rag_system.py add <url> <title> <file>
python src/rag_system.py list
python src/rag_system.py stats

# Playwright Integration
python src/playwright_integration.py process
python src/playwright_integration.py setup
python src/playwright_integration.py stats

🤖 Usage with Claude

1. Scraping Documentation

"Use Playwright to scrape the React hooks documentation from https://react.dev/reference/react and save it to the scraped_content directory"

2. Processing into Vector Database

"Process all new scraped files and add them to the BerryRAG vector database"

3. Querying Knowledge Base

"Search the BerryRAG database for information about React useState best practices"

"Get context from the vector database about implementing custom hooks"

🔌 MCP Tools Available to Claude

BerryRAG provides two MCP servers for Claude integration: the Vector DB server and the BerryExa server.

Vector DB Server Tools

  • add_document - Add content directly to vector DB
  • search_documents - Search for similar content
  • get_context - Get formatted context for queries
  • list_documents - List all stored documents
  • get_stats - Vector database statistics
  • process_scraped_files - Process Playwright scraped content
  • save_scraped_content - Save content for later processing
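
To give a feel for what a tool like search_documents does conceptually, here is a minimal sketch of similarity ranking over stored embeddings. It assumes cosine similarity, which is a common choice but is not confirmed by this README; the function names, the in-memory dict of vectors, and the top_k and min_similarity parameters are all hypothetical and not part of BerryRAG's API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec, docs, top_k=3, min_similarity=0.0):
    """Rank documents (hypothetical {doc_id: vector} mapping) by similarity to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in docs.items()]
    # Drop weak matches, then return the best top_k, highest score first.
    scored = [(doc_id, score) for doc_id, score in scored if score >= min_similarity]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

Raising min_similarity trades recall for precision, which is roughly what a similarity control in a search interface would adjust.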

BerryExa Server Tools

  • crawl_content - Advanced web content extraction with subpage support
  • extract_links - Extract internal links for subpage discovery
  • get_content_preview - Quick content preview without full processing

📖 For complete MCP setup and usage guide, see BERRY_MCP.md

🧠 Embedding Providers

The system supports multiple embedding providers with automatic fallback:

  1. sentence-transformers (recommended, free, local)
  2. OpenAI embeddings (requires API key, set OPENAI_API_KEY)
  3. Simple hash-based (fallback, not recommended for production)
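
The fallback order above can be sketched as follows. This is an illustration under stated assumptions, not the code in src/rag_system.py: choose_provider and hash_embedding are hypothetical names, the selection order simply follows the numbered list, and the hashing scheme (MD5 word bucketing into 64 dimensions) is one possible way to build a "simple hash-based" embedding.

```python
import hashlib
import importlib.util
import math
import os


def hash_embedding(text: str, dim: int = 64) -> list[float]:
    """Hypothetical fallback: count words into hashed buckets, then L2-normalize."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def choose_provider() -> str:
    """Return the first usable provider, in the order listed above (assumed order)."""
    # 1. Local sentence-transformers, if the package is installed.
    if importlib.util.find_spec("sentence_transformers") is not None:
        return "sentence-transformers"
    # 2. OpenAI embeddings, if an API key is configured.
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    # 3. Hash-based fallback: it captures word overlap only, not meaning.
    return "hash"
```

A hash-based vector only reflects shared words, which is why it is not recommended beyond smoke testing.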

⚙️ Configuration

Environment Variables

# Optional: for OpenAI embeddings
export OPENAI_API_KEY=your_key_here

Content Quality Filters

The system automatically filters out:

  • Content shorter than 100 characters
  • Navigation-only content
  • Repetitive/duplicate content
  • Files larger than 500KB
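
A minimal sketch of how such filters might be expressed is shown below. Only the 100-character and 500KB limits come from the list above; the function name, the reading of 500KB as 500 * 1024 bytes, and the heuristics used for "navigation-only" and "repetitive" content are assumptions made for illustration.

```python
MIN_CHARS = 100          # from the list above
MAX_BYTES = 500 * 1024   # "500KB", assumed here to mean 500 * 1024 bytes


def passes_quality_filters(text: str) -> bool:
    """Return True if the content survives all four (approximated) filters."""
    stripped = text.strip()
    if len(stripped) < MIN_CHARS:
        return False
    if len(text.encode("utf-8")) > MAX_BYTES:
        return False
    lines = [ln.strip() for ln in stripped.splitlines() if ln.strip()]
    # Assumed heuristic: navigation-only pages consist mostly of very short lines.
    short_lines = sum(1 for ln in lines if len(ln.split()) <= 3)
    if short_lines / len(lines) > 0.8:
        return False
    # Assumed heuristic: content is repetitive when under half of its lines are unique.
    if len(set(lines)) / len(lines) < 0.5:
        return False
    return True
```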

Chunking Strategy

  • Default chunk size: 500 characters
  • Overlap: 50 characters
  • Smart boundary detection (sentences, paragraphs)
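
The chunking parameters above can be sketched as a small function. This is an illustrative approximation, not the project's implementation: chunk_text is a hypothetical name, and the boundary rule (prefer the last paragraph break, then the last sentence end, in the second half of the window) is an assumption about what "smart boundary detection" means.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks, cutting at paragraph or sentence ends when possible."""
    if not 0 <= overlap <= chunk_size // 2:
        raise ValueError("overlap must be between 0 and chunk_size // 2")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Look for a natural boundary in the second half of the window.
            for sep in ("\n\n", ". "):
                cut = window.rfind(sep)
                if cut > chunk_size // 2:
                    end = start + cut + len(sep)
                    break
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        # Step back by the overlap so neighbouring chunks share some context.
        start = end - overlap
    return chunks
```

Restricting the boundary search to the second half of the window keeps chunks from becoming very small, and capping overlap at half the chunk size guarantees that each iteration moves forward.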

📊 Monitoring

Check System Status

# Vector database statistics
python src/rag_system.py stats

# Processing status
python src/playwright_integration.py stats

# View recent documents
python src/rag_system.py list

Storage Information

  • Database: storage/documents.db (SQLite metadata)
  • Vectors: storage/vectors/ (NumPy arrays)
  • Scraped Content: scraped_content/ (Markdown files)

🔍 Example Workflows

Academic Research

  1. Scrape research papers with Playwright
  2. Process into vector database
  3. Query for specific concepts across all papers

Documentation Management

  1. Scrape API documentation from multiple sources
  2. Build unified searchable knowledge base
  3. Get contextual answers about implementation details

Content Aggregation

  1. Scrape blog posts and articles
  2. Create topic-based knowledge clusters
  3. Find related content across sources

🛠️ Development

Building the MCP Server

npm run build

Running in Development Mode

npm run dev  # TypeScript watch mode

Testing

# Test RAG system
python src/rag_system.py stats

# Test integration
python src/playwright_integration.py setup

# Test MCP server
node mcp_servers/vector_db_server.js

🚨 Troubleshooting

Common Issues

Python dependencies missing:

pip install -r requirements.txt

TypeScript compilation errors:

npm install
npm run build

Embedding model download slow: The first run downloads the sentence-transformers model (~90MB), so a slow first start is normal.

No results from search:

  • Check if documents were processed: python src/rag_system.py list
  • Verify content quality filters aren't too strict
  • Try broader search terms

Logs and Debugging

  • Python logs: Check console output
  • MCP server logs: Stderr output
  • Processing status: scraped_content/.processed_files.json

📝 License

MIT License - feel free to modify and extend for your needs.

🤝 Contributing

This is a personal project for Eric Berry, but feel free to fork and adapt for your own use cases.


Happy scraping and searching! 🕷️🔍✨
