Scientific Paper Harvester MCP Server

Provides real-time access to over 200 million scientific papers, with full-text extraction, from major academic sources including arXiv, OpenAlex, and PubMed Central. Users can search papers, fetch metadata, and analyze citations across research disciplines through a unified Model Context Protocol interface.

A comprehensive Model Context Protocol (MCP) server that provides LLMs with real-time access to scientific papers from 6 major academic sources: arXiv, OpenAlex, PMC (PubMed Central), Europe PMC, bioRxiv/medRxiv, and CORE.

🚀 Features

Comprehensive Source Coverage

  • arXiv: Computer science, physics, mathematics preprints and papers
  • OpenAlex: Open catalog of scholarly papers with citation data
  • PMC: PubMed Central biomedical and life science literature
  • Europe PMC: European life science literature database
  • bioRxiv/medRxiv: Biology and medical preprint servers
  • CORE: World's largest collection of open access research papers

Advanced Capabilities

  • Paper Fetching: Get latest papers from any source by category/concept
  • Paper Search: Search papers by title, abstract, author, or full-text across 4 major sources
  • Full-Text Extraction: Extract complete text content with intelligent fallback strategies
  • Citation Analysis: Find top cited papers from OpenAlex since a specific date
  • Paper Lookup: Retrieve full metadata for specific papers by ID
  • Category Discovery: Browse available categories from all sources
  • Smart Rate Limiting: Respectful API usage with per-source rate limiting
  • DOI Resolution: Advanced DOI resolver with Unpaywall → Crossref → Semantic Scholar fallback
  • Dual Interface: Both MCP protocol and CLI access
  • TypeScript: Full type safety with ESM modules

📊 Coverage Statistics

  • Total Sources: 6 academic databases
  • Category Coverage: 100+ categories across all disciplines
  • Paper Access: 200M+ papers with intelligent text extraction
  • Text Extraction Success: >90% for papers with an HTML version available
  • Response Time: <15 seconds average for paper fetching

🛠 Installation

npm install
npm run build

📋 MCP Client Configuration

To use this server with an MCP client (like Claude Desktop), add the following to your MCP client configuration:

For the published package (available on npm):

Option 1: Using npx (recommended for AI tools like Claude)

{
  "mcpServers": {
    "scientific-papers": {
      "command": "npx",
      "args": [
        "-y",
        "@futurelab-studio/latest-science-mcp@latest"
      ]
    }
  }
}

Option 2: Global installation

npm install -g @futurelab-studio/latest-science-mcp

Then configure:

{
  "mcpServers": {
    "scientific-papers": {
      "command": "latest-science-mcp"
    }
  }
}

📖 Usage

CLI Interface

List Categories

# List arXiv categories
node dist/cli.js list-categories --source=arxiv

# List OpenAlex concepts
node dist/cli.js list-categories --source=openalex

# List PMC biomedical categories
node dist/cli.js list-categories --source=pmc

# List Europe PMC life science categories
node dist/cli.js list-categories --source=europepmc

# List bioRxiv/medRxiv categories (includes both servers)
node dist/cli.js list-categories --source=biorxiv

# List CORE academic categories
node dist/cli.js list-categories --source=core

Fetch Latest Papers

# Get latest AI papers from arXiv
node dist/cli.js fetch-latest --source=arxiv --category=cs.AI --count=10

# Get latest biology papers from bioRxiv
node dist/cli.js fetch-latest --source=biorxiv --category="biorxiv:biology" --count=5

# Get latest immunology papers from PMC
node dist/cli.js fetch-latest --source=pmc --category=immunology --count=3

# Get latest papers from CORE by subject
node dist/cli.js fetch-latest --source=core --category=computer_science --count=5

# Search by concept name (OpenAlex)
node dist/cli.js fetch-latest --source=openalex --category="machine learning" --count=3

Fetch Top Cited Papers

# Get the 20 most-cited machine learning papers since 2024-01-01
node dist/cli.js fetch-top-cited --concept="machine learning" --since=2024-01-01 --count=20

# Get top cited papers by concept ID
node dist/cli.js fetch-top-cited --concept=C41008148 --since=2023-06-01 --count=10

Search Papers

# Search by keywords across all fields
node dist/cli.js search-papers --source=arxiv --query="machine learning" --count=10

# Search by paper title
node dist/cli.js search-papers --source=openalex --query="neural networks" --field=title --count=5

# Search by author name
node dist/cli.js search-papers --source=europepmc --query="John Smith" --field=author --count=10

# Search full-text content sorted by citations
node dist/cli.js search-papers --source=core --query="climate change" --field=fulltext --sortBy=citations --count=20

Fetch Specific Paper Content

# Get arXiv paper by ID
node dist/cli.js fetch-content --source=arxiv --id=2401.12345

# Get bioRxiv paper by DOI
node dist/cli.js fetch-content --source=biorxiv --id="10.1101/2021.01.01.425001"

# Get PMC paper by ID
node dist/cli.js fetch-content --source=pmc --id=PMC8245678

# Get CORE paper by ID
node dist/cli.js fetch-content --source=core --id=12345678

# Show text content with preview
node dist/cli.js fetch-content --source=arxiv --id=2401.12345 --show-text --text-preview=500

🔧 Available Tools

list_categories

Lists available categories/concepts from any data source.

Parameters:

  • source: "arxiv" | "openalex" | "pmc" | "europepmc" | "biorxiv" | "core"

Returns:

  • Array of category objects with id, name, and optional description

Example:

{
  "name": "list_categories",
  "arguments": {
    "source": "biorxiv"
  }
}

fetch_latest

Fetches the latest papers from any source for a given category with metadata only (no text extraction).

Parameters:

  • source: "arxiv" | "openalex" | "pmc" | "europepmc" | "biorxiv" | "core"
  • category: Category ID or concept name (varies by source)
  • count: Number of papers to fetch (default: 50, max: 200)

Category Examples by Source:

  • arXiv: "cs.AI", "physics.gen-ph", "math.CO"
  • OpenAlex: "artificial intelligence", "machine learning", "C41008148"
  • PMC: "immunology", "genetics", "neuroscience"
  • Europe PMC: "biology", "medicine", "cancer"
  • bioRxiv/medRxiv: "biorxiv:neuroscience", "medrxiv:psychiatry"
  • CORE: "computer_science", "mathematics", "physics"

Returns:

  • Array of paper objects with metadata (id, title, authors, date, pdf_url)
  • Text field: Empty string (text: "") - use fetch_content for full text
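The bioRxiv/medRxiv category IDs listed above carry a server prefix. The following is a minimal sketch of how a client might split such an ID before building a request. `parsePreprintCategory` is a hypothetical helper, not part of this server's API, and the "server:subject" shape is inferred only from the two examples shown.

```typescript
// Hypothetical client-side helper. The "server:subject" shape is an assumption
// inferred from the examples "biorxiv:neuroscience" and "medrxiv:psychiatry".
type PreprintServer = "biorxiv" | "medrxiv";

interface PreprintCategory {
  server: PreprintServer;
  subject: string;
}

function parsePreprintCategory(category: string): PreprintCategory | null {
  const separator = category.indexOf(":");
  if (separator === -1) return null; // no server prefix present

  const prefix = category.slice(0, separator).toLowerCase();
  const subject = category.slice(separator + 1).trim();

  // Only the two preprint servers named in this README are accepted.
  if (prefix !== "biorxiv" && prefix !== "medrxiv") return null;
  if (subject === "") return null;

  return { server: prefix, subject };
}

console.log(parsePreprintCategory("medrxiv:psychiatry"));
console.log(parsePreprintCategory("cs.AI"));
```

Returning null rather than throwing lets a caller fall back to treating the value as a plain category for one of the other sources.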

fetch_top_cited

Fetches the top cited papers from OpenAlex for a given concept since a specific date.

Parameters:

  • concept: Concept name or OpenAlex concept ID
  • since: Start date in YYYY-MM-DD format
  • count: Number of papers to fetch (default: 50, max: 200)
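A client may want to check the `since` and `count` values before sending a fetch_top_cited request. The sketch below is one way to do that, assuming client-side validation is desired; `isValidSinceDate` and `buildTopCitedArgs` are hypothetical names, and how the server itself treats out-of-range values is not specified here.

```typescript
// Hypothetical pre-flight checks for fetch_top_cited arguments.
// The default (50) and maximum (200) come from the parameter list above.
const DEFAULT_COUNT = 50;
const MAX_COUNT = 200;

interface TopCitedArgs {
  concept: string;
  since: string;
  count: number;
}

// True only for real calendar dates written as YYYY-MM-DD.
function isValidSinceDate(value: string): boolean {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(value)) return false;
  const parsed = new Date(`${value}T00:00:00Z`);
  if (Number.isNaN(parsed.getTime())) return false;
  // Rejects dates such as 2024-02-30 that some engines roll over to March.
  return parsed.toISOString().slice(0, 10) === value;
}

function buildTopCitedArgs(
  concept: string,
  since: string,
  count: number = DEFAULT_COUNT,
): TopCitedArgs {
  if (concept.trim() === "") {
    throw new Error("concept must not be empty");
  }
  if (!isValidSinceDate(since)) {
    throw new Error(`since must be a real date in YYYY-MM-DD format, got "${since}"`);
  }
  if (!Number.isInteger(count) || count < 1 || count > MAX_COUNT) {
    throw new RangeError(`count must be an integer between 1 and ${MAX_COUNT}`);
  }
  return { concept, since, count };
}

console.log(buildTopCitedArgs("machine learning", "2024-01-01", 20));
```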

search_papers

Searches for papers across multiple academic sources with field-specific search and sorting options.

Parameters:

  • source: "arxiv" | "openalex" | "europepmc" | "core"
  • query: Search query string (max 1500 characters)
  • field: "all" | "title" | "abstract" | "author" | "fulltext" (default: "all")
  • count: Number of results to return (default: 50, max: 200)
  • sortBy: "relevance" | "date" | "citations" (default: "relevance")

Search Capabilities by Source:

  • arXiv: Title, abstract, author, and general search with Boolean operators
  • OpenAlex: Advanced search with relevance scoring and citation sorting
  • Europe PMC: Biomedical literature with MeSH terms and full-text search
  • CORE: Global academic papers with advanced query language

Example Queries:

  • Keywords: "machine learning", "climate change"
  • Phrases: "artificial intelligence" (use quotes for exact phrases)
  • Boolean: "deep learning AND neural networks" (arXiv supports this)
  • Authors: "John Smith", "Smith J"

Returns:

  • Array of paper objects with metadata (id, title, authors, date, pdf_url)
  • Text field: Empty string (text: "") - use fetch_content for full text
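The parameter rules for search_papers (four supported sources, a 1500-character query limit, and the documented defaults) can be captured in a small typed builder. This is a sketch under the assumption that a client wants to fail fast before calling the tool; `buildSearchArgs` is a hypothetical helper and the server's own validation may differ.

```typescript
// Hypothetical argument builder for search_papers; limits and defaults are
// taken from the parameter list above.
const SEARCH_SOURCES = ["arxiv", "openalex", "europepmc", "core"] as const;
type SearchSource = (typeof SEARCH_SOURCES)[number];
type SearchField = "all" | "title" | "abstract" | "author" | "fulltext";
type SortBy = "relevance" | "date" | "citations";

interface SearchArgs {
  source: SearchSource;
  query: string;
  field: SearchField;
  count: number;
  sortBy: SortBy;
}

const MAX_QUERY_LENGTH = 1500;
const MAX_RESULTS = 200;

function isSearchSource(value: string): value is SearchSource {
  return (SEARCH_SOURCES as readonly string[]).indexOf(value) !== -1;
}

function buildSearchArgs(
  source: string,
  query: string,
  options: { field?: SearchField; count?: number; sortBy?: SortBy } = {},
): SearchArgs {
  if (!isSearchSource(source)) {
    throw new Error(`search_papers does not support source "${source}"`);
  }
  // Note: .length counts UTF-16 code units, which is an approximation of
  // "characters" for text outside the Basic Multilingual Plane.
  if (query.length === 0 || query.length > MAX_QUERY_LENGTH) {
    throw new RangeError(`query must be 1 to ${MAX_QUERY_LENGTH} characters long`);
  }
  const count = options.count ?? 50;
  if (!Number.isInteger(count) || count < 1 || count > MAX_RESULTS) {
    throw new RangeError(`count must be an integer between 1 and ${MAX_RESULTS}`);
  }
  return {
    source,
    query,
    field: options.field ?? "all",
    count,
    sortBy: options.sortBy ?? "relevance",
  };
}

console.log(buildSearchArgs("core", "climate change", { field: "fulltext", sortBy: "citations" }));
```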

fetch_content

Fetches full metadata and the extracted full text for a specific paper by ID.

Parameters:

  • source: Any of the 6 supported sources
  • id: Paper ID (format varies by source)

ID Formats by Source:

  • arXiv: "2401.12345", "cs/0601001", "1234.5678v2"
  • OpenAlex: "W2741809807" or numeric 2741809807
  • PMC: "PMC8245678" or "12345678"
  • Europe PMC: "PMC8245678", "12345678", or DOI
  • bioRxiv/medRxiv: "10.1101/2021.01.01.425001" or "2021.01.01.425001"
  • CORE: Numeric ID like "12345678"
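The ID formats above lend themselves to a quick client-side sanity check. The patterns in this sketch are assumptions derived only from the example IDs listed, not the server's actual validation rules, and `looksLikeValidId` is a hypothetical helper.

```typescript
type Source = "arxiv" | "openalex" | "pmc" | "europepmc" | "biorxiv" | "core";

// Generic DOI shape: "10.", a registrant code, a slash, then a suffix.
const DOI_PATTERN = /^10\.\d{4,9}\/\S+$/;

// Patterns inferred from the examples in this README (assumption).
const ID_PATTERNS: Record<Source, RegExp[]> = {
  arxiv: [
    /^\d{4}\.\d{4,5}(v\d+)?$/, // e.g. 2401.12345 or 1234.5678v2
    /^[a-z-]+(\.[A-Z]{2})?\/\d{7}(v\d+)?$/, // e.g. cs/0601001
  ],
  openalex: [/^W?\d+$/], // e.g. W2741809807 or 2741809807
  pmc: [/^(PMC)?\d+$/], // e.g. PMC8245678 or 12345678
  europepmc: [/^(PMC)?\d+$/, DOI_PATTERN],
  biorxiv: [/^(10\.1101\/)?\d{4}\.\d{2}\.\d{2}\.\d+$/], // with or without DOI prefix
  core: [/^\d+$/], // numeric only
};

function looksLikeValidId(source: Source, id: string): boolean {
  return ID_PATTERNS[source].some((pattern) => pattern.test(id));
}

console.log(looksLikeValidId("arxiv", "cs/0601001"));
console.log(looksLikeValidId("core", "not-a-number"));
```

A check like this only catches obvious typos; a well-formed ID can still refer to a paper that does not exist.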

📄 Paper Metadata Format

All tools except list_categories return paper objects with the following structure:

{
  id: string;                    // Paper ID
  title: string;                 // Paper title
  authors: string[];             // List of author names
  date: string;                  // Publication date (ISO format)
  pdf_url?: string;              // PDF URL (if available)
  text: string;                  // Extracted full text ("" for fetch_latest and search_papers results)
  textTruncated?: boolean;       // Warning: text was truncated due to size limits
  textExtractionFailed?: boolean; // Warning: text extraction failed
}
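Because `text` can be empty, truncated, or missing after a failed extraction, a client usually needs to classify a paper before using its text. The sketch below shows one way to do that. The `Paper` interface name, the `TextStatus` labels, and the precedence of the two warning flags are assumptions, not part of the server's API.

```typescript
// Mirrors the structure documented above; the interface name is an assumption.
interface Paper {
  id: string;
  title: string;
  authors: string[];
  date: string;
  pdf_url?: string;
  text: string;
  textTruncated?: boolean;
  textExtractionFailed?: boolean;
}

// Hypothetical labels for the states a client may want to distinguish.
type TextStatus = "complete" | "truncated" | "failed" | "metadata-only";

function textStatus(paper: Paper): TextStatus {
  // Assumption: a failure flag takes precedence over everything else.
  if (paper.textExtractionFailed === true) return "failed";
  if (paper.textTruncated === true) return "truncated";
  // fetch_latest and search_papers return an empty text field by design.
  if (paper.text === "") return "metadata-only";
  return "complete";
}

const listed: Paper = {
  id: "2401.12345",
  title: "Example title",
  authors: ["A. Author"],
  date: "2024-01-15",
  text: "",
};

console.log(textStatus(listed));
```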

🧠 Advanced Text Extraction

Multi-Source Strategy

Each source has specialized text extraction approaches:

  • arXiv: HTML from arxiv.org/html with ar5iv.labs.arxiv.org fallback
  • OpenAlex: HTML sources with DOI resolver fallback chain
  • PMC: E-utilities API with XML/HTML extraction
  • Europe PMC: REST API with multiple URL strategies
  • bioRxiv/medRxiv: Direct HTML extraction with abstract fallback
  • CORE: PDF/HTML with source URL fallback

DOI Resolution Chain

The DOI resolver tries the following providers in order, moving to the next when one yields no result:

  1. Unpaywall → Free full-text sources
  2. Crossref → Publisher metadata and links
  3. Semantic Scholar Academic Graph → Alternative access
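The chain above follows a common ordered-fallback pattern. This is a minimal sketch of that pattern, not the server's implementation: the providers are stub functions standing in for real HTTP clients, the `DoiResolver` interface is hypothetical, and the example.org URL is a placeholder.

```typescript
interface DoiResolver {
  name: string;
  // Resolves to a full-text URL, or null when this provider has nothing.
  resolve: (doi: string) => Promise<string | null>;
}

interface Resolution {
  provider: string;
  url: string;
}

async function resolveDoi(doi: string, chain: DoiResolver[]): Promise<Resolution | null> {
  for (const resolver of chain) {
    try {
      const url = await resolver.resolve(doi);
      if (url !== null) return { provider: resolver.name, url };
    } catch {
      // A failing provider is treated like a miss; move on to the next one.
    }
  }
  return null; // every provider missed or failed
}

// Stub providers in the documented order (no network calls are made here).
const stubChain: DoiResolver[] = [
  { name: "unpaywall", resolve: async () => null },
  {
    name: "crossref",
    resolve: async () => {
      throw new Error("simulated timeout");
    },
  },
  {
    name: "semantic-scholar",
    resolve: async (doi) => `https://example.org/fulltext/${doi}`,
  },
];

resolveDoi("10.1101/2021.01.01.425001", stubChain).then((result) => console.log(result));
```

Treating a thrown error the same as a miss keeps one unavailable provider from blocking the rest of the chain.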

Performance & Reliability

  • Text Extraction Success: >90% for HTML-available papers
  • Graceful Degradation: Always returns metadata even if text extraction fails
  • Size Management: 6MB text limit with intelligent truncation
  • Caching: 24-hour LRU cache for DOI resolution
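A 24-hour LRU cache combines two eviction rules: entries expire after a fixed time, and the least recently used entry is dropped when the cache is full. The sketch below illustrates the idea with a `Map`; it is not the server's cache, and the class name, the capacity of 1000, and the injectable clock are assumptions made for the example.

```typescript
interface CacheEntry<V> {
  value: V;
  expiresAt: number;
}

class TtlLruCache<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so the expiry logic can be tested without waiting.
  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;
const doiCache = new TtlLruCache<string>(1000, DAY_MS); // capacity is an arbitrary choice
doiCache.set("10.1101/2021.01.01.425001", "https://example.org/fulltext");
console.log(doiCache.get("10.1101/2021.01.01.425001"));
```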

🔄 Rate Limiting

Respectful API usage with per-source rate limiting:

  • arXiv: 5 requests per minute
  • OpenAlex: 10 requests per minute
  • PMC: 3 requests per second
  • Europe PMC: 10 requests per minute
  • bioRxiv/medRxiv: 5 requests per minute
  • CORE: 10 requests per minute (public), higher with API key
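The limits above can be modeled with one token bucket per source. The following is a sketch of the token bucket technique using the listed figures, assuming continuous refill and a full bucket at start; the server's actual refill behavior is not documented here, and `TokenBucket` and `bucketFor` are hypothetical names.

```typescript
type Source = "arxiv" | "openalex" | "pmc" | "europepmc" | "biorxiv" | "core";

// Limits as listed above; CORE uses the public (no API key) figure.
const LIMITS: Record<Source, { requests: number; windowMs: number }> = {
  arxiv: { requests: 5, windowMs: 60_000 },
  openalex: { requests: 10, windowMs: 60_000 },
  pmc: { requests: 3, windowMs: 1_000 },
  europepmc: { requests: 10, windowMs: 60_000 },
  biorxiv: { requests: 5, windowMs: 60_000 },
  core: { requests: 10, windowMs: 60_000 },
};

class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(capacity: number, windowMs: number, now: () => number = () => Date.now()) {
    this.capacity = capacity;
    this.windowMs = windowMs;
    this.now = now;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefill = now();
  }

  // Returns true and consumes a token if a request may be sent right now.
  tryAcquire(): boolean {
    const current = this.now();
    // Tokens accrue continuously at capacity / windowMs per millisecond.
    const refilled = ((current - this.lastRefill) * this.capacity) / this.windowMs;
    this.tokens = Math.min(this.capacity, this.tokens + refilled);
    this.lastRefill = current;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

function bucketFor(source: Source, now?: () => number): TokenBucket {
  const { requests, windowMs } = LIMITS[source];
  return new TokenBucket(requests, windowMs, now);
}

const arxivBucket = bucketFor("arxiv");
console.log(arxivBucket.tryAcquire());
```

A caller that gets `false` can wait and try again, or surface a retry-after hint to the user.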

CORE API Configuration

For enhanced CORE access, set the CORE_API_KEY environment variable:

export CORE_API_KEY="your-api-key"

🧪 Testing

Run Test Suite

# Run all tests
npm test

# Run integration tests
npm run test -- tests/integration

# Run end-to-end workflow tests
npm run test -- tests/e2e

# Run performance benchmarks
npm run test -- tests/integration/performance.test.ts

Test Coverage

  • Integration Tests: All 6 sources tested end-to-end
  • Performance Tests: Response time and throughput benchmarks
  • Workflow Tests: Real research scenarios across multiple sources
  • Unit Tests: Core components and edge cases

🏗 Architecture

Modular Driver System

  • Clean separation between sources
  • Consistent interface across all drivers
  • Specialized text extraction per source

Advanced Features

  • DOI Resolution: Multi-provider fallback chain
  • Rate Limiting: Token bucket algorithm per source
  • Text Processing: HTML cleaning and normalization
  • Error Handling: Structured responses with actionable suggestions
  • Caching: Intelligent caching for DOI resolution

Technology Stack

  • TypeScript + ESM: Modern JavaScript with full type safety
  • Modular Design: Clean separation of concerns
  • Graceful Degradation: Always functional even with partial failures
  • Response Size Management: Automatic truncation and warnings

📊 Source Comparison

Source          | Papers | Disciplines   | Full-Text  | Citation Data | Preprints | Search
arXiv           | 2.3M+  | STEM          | HTML ✓     | Limited       | ✓✓✓       |
OpenAlex        | 200M+  | All           | Variable   | ✓✓✓           | ✓✓✓       |
PMC             | 7M+    | Biomedical    | XML/HTML ✓ | Limited       | Limited   |
Europe PMC      | 40M+   | Life Sciences | HTML ✓     | Limited       | ✓✓✓       |
bioRxiv/medRxiv | 500K+  | Bio/Medical   | HTML ✓     | Limited       | ✓✓✓       | Limited
CORE            | 200M+  | All           | PDF/HTML ✓ | Limited       | ✓✓✓       |

🔧 Development

Build

npm run build

Test Individual Sources

# Test specific sources
node dist/cli.js list-categories --source=arxiv
node dist/cli.js fetch-latest --source=biorxiv --category="biorxiv:biology" --count=3
node dist/cli.js fetch-content --source=core --id=12345678

# Test search functionality
node dist/cli.js search-papers --source=arxiv --query="artificial intelligence" --count=5
node dist/cli.js search-papers --source=openalex --query="quantum computing" --field=title --count=3

Performance Testing

# Run performance benchmarks
npm run test -- tests/integration/performance.test.ts

# Run the test suite with verbose reporter output
npm run test -- --reporter=verbose

🚨 Error Handling

Comprehensive error handling for all sources:

  • Invalid paper IDs with format suggestions
  • Rate limiting with retry-after information
  • API timeouts and server errors
  • Missing authentication (CORE API key)
  • Network connectivity issues
  • Text extraction failures with fallback strategies

🔍 Troubleshooting

Common Issues

  • Rate limiting: Automatic retry with exponential backoff
  • Missing papers: Try alternative sources for the same content
  • Text extraction failures: Fallback to abstract or metadata
  • CORE API limits: Set CORE_API_KEY environment variable
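Retry with exponential backoff means waiting longer after each failed attempt. The sketch below shows the general technique from a caller's point of view. The base delay, cap, attempt count, and the `withRetry` helper are illustrative assumptions and do not describe the server's internal retry settings.

```typescript
// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
// capped at maxMs. The 1 s base and 30 s cap are arbitrary example values.
function backoffDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 30_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs `operation` up to `maxAttempts` times, sleeping between failures.
// This sketch retries on any error; a real client would likely retry only
// on rate-limit or transient network errors.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // No need to wait after the final attempt.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}

console.log([0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt)));
```

Adding random jitter to each delay is a common refinement when many clients may retry at the same moment.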

Performance Optimization

  • Use appropriate count parameters (smaller for faster responses)
  • Cache results when possible
  • Use fetch_latest for discovery, fetch_content for detailed reading
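The discovery-then-read advice above can be sketched as a two-step workflow. Everything about the transport here is an assumption: `CallTool` stands for whatever function your MCP client exposes to invoke a tool, and the sketch assumes it returns already-parsed paper objects, which a real client may need to extract from the MCP response first.

```typescript
interface Paper {
  id: string;
  title: string;
  authors: string[];
  date: string;
  pdf_url?: string;
  text: string;
}

// Hypothetical transport signature; not part of this server or any MCP SDK.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<unknown>;

async function discoverThenRead(
  callTool: CallTool,
  source: string,
  category: string,
  readLimit: number,
): Promise<Paper[]> {
  // Step 1: a small, metadata-only listing keeps the first response fast.
  const listing = (await callTool("fetch_latest", { source, category, count: 10 })) as Paper[];

  // Step 2: request full text only for the papers that will actually be read.
  // Calls run one at a time to stay friendly to per-source rate limits.
  const detailed: Paper[] = [];
  for (const paper of listing.slice(0, readLimit)) {
    const full = (await callTool("fetch_content", { source, id: paper.id })) as Paper;
    detailed.push(full);
  }
  return detailed;
}
```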

📝 License

MIT


Ready to explore the world's scientific knowledge? Start with any of the 6 sources and discover papers across all academic disciplines! 🔬📚
