ReasonForge
Deterministic math tools for small language models.
ReasonForge gives small LLMs (8B–32B) access to a verified SymPy computation backend via tool calling. Instead of relying on the model to compute, all math is delegated to deterministic tools — the model only reasons about what to compute and how to present results.
Architecture
User Question → LLM (Qwen3) → Tool Calls → SymPy Backend → Verified Results → LLM → Final Answer
Multi-Turn Agentic Loop:
- Reason: The model uses `<think>` tags to analyze the problem and decide on a strategy.
- Execute: The model delegates computation to a deterministic tool (SymPy or Python sandbox).
- Iterate: The model observes the verified tool output and either concludes the answer or calls another tool until solved (up to `MAX_ROUNDS`); a minimal sketch of this loop appears below.
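The loop itself is simple. Below is an illustrative sketch only; `call_llm`, `run_tool`, and the `MAX_ROUNDS` value are stand-ins, not the actual names used in core.py.

```python
# Illustrative sketch of the Reason -> Execute -> Iterate loop. `call_llm` and
# `run_tool` are placeholders for the project's real LLM client and MCP tool
# dispatch (names and the MAX_ROUNDS value are assumptions, not ReasonForge APIs).
MAX_ROUNDS = 5

def answer(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(MAX_ROUNDS):
        reply = call_llm(messages)                # Reason: model plans in <think> tags
        messages.append({"role": "assistant",
                         "content": reply["content"],
                         "tool_calls": reply.get("tool_calls", [])})
        if not reply.get("tool_calls"):           # no delegation -> final answer
            return reply["content"]
        for call in reply["tool_calls"]:          # Execute: SymPy / Python sandbox
            result = run_tool(call["name"], call["arguments"])
            messages.append({"role": "tool", "content": result})
    return reply["content"]                       # Iterate until solved or out of rounds
```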
Tools
| Tool | Operations | Backend |
|---|---|---|
| math_tool | compute, solve, simplify, factor, expand, gcd, lcm, prime_factors, divisors, mod_inverse, nsolve, crt + SymPy builtins (totient, fibonacci, isprime...) | SymPy |
| calculus_tool | differentiate, integrate, limit, series, summation, partial_fraction, trigsimp, ode_solve, laplace | SymPy |
| matrix_tool | determinant, inverse, eigenvalues, eigenvectors, rank, rref, transpose, multiply, add, trace, nullspace, columnspace, charpoly, norm, adjugate, solve (Ax=b) | SymPy |
| statistics_tool | describe, mean, median, mode, std, variance, correlation, regression, percentile, zscore, skewness, kurtosis, geometric_mean, harmonic_mean | Python stdlib |
| code_tool | run, check, ast_inspect — sandboxed Python code execution, syntax checking, and structure analysis | subprocess |
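For intuition, here is roughly what the SymPy backend does when math_tool receives a solve or factor request. This shows only the underlying SymPy calls, not the tool's actual argument schema.

```python
# What math_tool's "solve" and "factor" operations boil down to in SymPy (illustrative).
import sympy as sp

x = sp.symbols("x")
expr = sp.sympify("x**2 - 5*x + 6")  # expression after preprocessing ("^" -> "**")
print(sp.solve(expr, x))             # [2, 3]
print(sp.factor(expr))               # (x - 2)*(x - 3)
```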
Project Structure
MCP/
├── core.py # Shared LLM request logic, expert definitions, tool schemas
├── experts/
│ ├── math/
│ │ ├── server.py # MCP server entry point (math tools)
│ │ └── tools/
│ │ ├── preprocess.py # Expression parser (^ → **, implicit multiplication)
│ │ ├── algebra.py # algebra + number theory
│ │ ├── calculus.py # derivatives, integrals, ODEs
│ │ ├── matrix.py # linear algebra
│ │ └── statistics.py # descriptive & inferential stats
│ └── code/
│ ├── server.py # MCP server entry point (code execution)
│ └── tools/
│ └── code.py # Sandboxed Python runner & syntax checker
├── tests/
│ ├── sanity.py # Tool unit tests (16 checks)
│ ├── math_benchmark.py # A/B math benchmark (MATH-500 dataset)
│ ├── code_benchmark.py # A/B code benchmark (HumanEval)
│ └── results/ # Local benchmark outputs
├── ui/
│ ├── app.py # Gradio chat interface with intermediate thinking steps
│ └── style.css # Custom UI styles (dark mode, thinking blocks)
├── ReasonForge_Colab.ipynb # One-click Colab deployment notebook
├── pyproject.toml
├── requirements.txt
├── run_tests.bat # Local tests launcher (Windows)
└── run_ui.bat # Local UI launcher (Windows)
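tools/preprocess.py (listed above) normalizes user input before it reaches SymPy (^ → **, implicit multiplication). A minimal sketch of the same behavior using SymPy's own parser transformations follows; the real preprocess.py may implement this differently.

```python
# Sketch of expression preprocessing via SymPy's parser transformations
# (illustrative; the actual preprocess.py may roll its own string rewriting).
from sympy.parsing.sympy_parser import (
    parse_expr,
    standard_transformations,
    implicit_multiplication_application,
    convert_xor,
)

transformations = standard_transformations + (
    convert_xor,                          # "^"  -> "**"
    implicit_multiplication_application,  # "2x" -> "2*x"
)

print(parse_expr("2x^2 + 3x - 5", transformations=transformations))
# -> 2*x**2 + 3*x - 5
```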
Quick Start (Local)
# Requires: Ollama running with a supported model (qwen3:8b, qwen3:32b, etc.)
uv sync
uv run python -m ui.app
# Open at http://localhost:7861
Colab Deployment (GPU)
Open ReasonForge_Colab.ipynb in Google Colab Pro with an A100 GPU.
It clones this repo, installs Ollama + qwen3:32b, and launches the UI with a public Gradio link.
Benchmarking
# Math benchmark — MATH-500 (requires Ollama running)
uv run python -m tests.math_benchmark --model llama3.2:3b --n 10
uv run python -m tests.math_benchmark --model qwen3:32b --n 50 --think
# Code benchmark — HumanEval (requires Ollama running)
uv run python -m tests.code_benchmark --model qwen3:8b --n 20
uv run python -m tests.code_benchmark --model qwen3:32b --n 164 --think
Running Sanity Tests
uv run python -m tests.sanity
Benchmark Results
MATH-500 (qwen3:8b, 50 problems)
| Metric | Baseline | ReasonForge |
|---|---|---|
| Correct | 43/50 | 45/50 |
| Uniform Accuracy | 86.0% | 90.0% (▲ +4.0%) |
| Weighted Score | 144/176 | 154/176 |
| Weighted Accuracy | 81.8% | 87.5% (▲ +5.7%) |
- Delegation: 40.0% (20/50) of tasks used tools
- Avg Rounds: 1.5
- Avg Time: Baseline 46.3s vs ReasonForge 31.0s (Δ -15.2s)
By Difficulty
Level 1 5/5 100% ████████████████████
Level 2 7/7 100% ████████████████████
Level 3 8/9 89% █████████████████
Level 4 14/15 93% ██████████████████
Level 5 11/14 79% ███████████████ (+14%)
By Category
Algebra 10/12 83% ████████████████
Counting & Probability 4/4 100% ████████████████████
Geometry 4/4 100% ████████████████████
Intermediate Algebra 11/13 85% ████████████████ (+8%)
Number Theory 2/2 100% ████████████████████
Prealgebra 7/7 100% ████████████████████
Precalculus 7/8 88% █████████████████ (+12%)
HumanEval (Code: qwen3:8b, 160 problems)
| Metric | Baseline | ReasonForge |
|---|---|---|
| Pass@1 | 4/160 | 102/160 |
| Accuracy | 2.5% | 63.7% (▲ +61.2%) |
- Delegation: 31.2% (50/160) of tasks used tools
- Avg Rounds: 1.5
- Avg Time: Baseline 23.9s vs ReasonForge 24.8s (Δ +0.9s)
- Wins vs Losses: ReasonForge successfully solved 100 problems that the Baseline failed on, while only losing 2.
Key Takeaways
Testing the 8-billion-parameter Qwen3 model shows why deterministic tool delegation matters for smaller models:
- Math (MATH-500): Baseline accuracy was already high, but giving the model access to the SymPy backend cut the average solve time from 46.3s to 31.0s while adding roughly 5 points of weighted accuracy.
- Code (HumanEval): Without sandboxed execution, the 8B model collapsed on HumanEval, passing only 4/160 (2.5%) of problems. Adding the ReasonForge Python runtime tools let the same model hypothesize, test, and iteratively refine its code, lifting accuracy to 102/160 (63.7%), a 61.2-point improvement with zero fine-tuning.
Tech Stack
- LLM Backend: Ollama (local) or any OpenAI-compatible API
- Math Engine: SymPy — symbolic computation
- Math Grading: math-verify — deterministic LaTeX parser (Linux/Colab)
- Code Grading: Self-contained HumanEval harness (inspired by openai/human-eval)
- UI: Gradio — chat interface with LaTeX rendering
- Protocol: MCP (Model Context Protocol) compatible
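For reference, pointing a generic OpenAI-compatible client at a local Ollama instance might look like the sketch below; the base URL and model name are the usual Ollama defaults, not values taken from core.py.

```python
# Hedged example: talking to local Ollama through its OpenAI-compatible endpoint.
# core.py's actual client setup may differ; this only illustrates the idea.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Factor x^2 - 5x + 6."}],
)
print(reply.choices[0].message.content)
```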