ZeroFuse

An MCP server that enables agents to automate refusal direction removal from open-weight LLMs via Optuna-driven search, producing standard Hugging Face models with no inference overhead.


<div align="center">

⚡ ZeroFuse

Point it at a model. It conducts trials, identifies the refusal architecture, and abliterates it.

Automated, capability-preserving refusal removal for open-weight transformer LLMs — a direct weight edit that produces a standard Hugging Face model with zero inference-time overhead.

Created by osmAPI.com

License: MIT · Python 3.11+ · Built with PyTorch · 🤗 Transformers · Optuna · Agent-native: MCP · Status: v0.1.0

</div>


🇮🇳 osmAPI.com is the only provider in India offering abliterated models via API. ZeroFuse is the engine that powers them.

ZeroFuse turns guardrail removal into a one-command, fully automated optimization problem: no hand-picking layers, no guessing strengths, no retraining. It estimates the model's refusal direction, orthogonalizes it out of the residual-writing weights, and uses a two-objective search to preserve capability. The output is a standard Hugging Face checkpoint you can load, quantize, or serve like any other.

Why ZeroFuse

Most abliteration workflows are manual: you pick a layer, eyeball a strength coefficient, run the model, check whether it still refuses, and repeat — often degrading the model's general capabilities along the way. ZeroFuse replaces that loop with a principled, automated search.

  • Fully automatic — no hand-picking layers, directions, or strengths. You point it at a model and it conducts the trials.
  • Capability-preserving by design — KL divergence from the original model is an explicit optimization objective, co-minimized alongside refusals, not an afterthought.
  • A real weight edit, not a runtime adapter — orthogonalizes the refusal direction directly out of attention o_proj and MLP down_proj (W' = W − strength · r(rᵀW)). The saved model has zero inference-time overhead: no LoRA to load, no runtime hooks, no wrapper.
  • Pareto-front control — a two-objective Optuna TPE search hands you the full trade-off curve. Pick the point you want: fewest refusals, lowest KL, or the knee.
  • Grounded in published research — difference-of-means refusal direction (Arditi et al. 2024) with optional projected refinement (grimjim 2025) to reduce collateral damage.
  • Broad model support — dense models, MoE (including per-expert down_proj), and many multimodal nestings.
  • Resumable — Optuna studies are journaled to disk; re-run the same command to continue where you left off.
  • Agent-native — ships an MCP server for Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP client. Quiet by default, verbose on request.
  • One command, or one import: zerofuse --model <hf-id-or-path>, or from zerofuse import abliterate.

🚀 One Command

# Clone and install (editable)
git clone https://github.com/junainfinity/ZeroFuse.git
cd ZeroFuse
pip install -e .                # core
pip install -e ".[mcp]"         # + the agent/MCP server (optional)

# Point it at any Hugging Face model id or local path
zerofuse --model meta-llama/Llama-3.1-8B-Instruct

That's the whole loop. ZeroFuse:

  1. Loads the target model and captures residual-stream activations.
  2. Identifies the refusal direction via difference-of-means on harmful vs. harmless prompts.
  3. Conducts trials — a two-objective Optuna search over layers and strengths, co-minimizing refusals and KL divergence.
  4. Abliterates by orthogonalizing the chosen direction out of the weights.
  5. Writes a standard Hugging Face model directory you can load with from_pretrained; no special runtime required.

# Resume an interrupted run: same command, picks up the journaled study
zerofuse --model meta-llama/Llama-3.1-8B-Instruct

# Quiet (only high-level phases) — or fully verbose
zerofuse --model meta-llama/Llama-3.1-8B-Instruct --quiet
zerofuse --model meta-llama/Llama-3.1-8B-Instruct --verbose

> [!NOTE]
> ZeroFuse needs enough memory to load and run forward passes on the target model. Plan capacity for the model you point it at.

🔬 How It Works

ZeroFuse implements the published "refusal direction" line of research as a clean-room MIT build, wrapped in an automated optimizer.

1. Estimate the refusal direction

It captures residual-stream activations on a set of harmful and harmless prompts and takes the difference of means. The unit refusal direction is:

$$ r \;=\; \frac{\mu_{\text{harmful}} - \mu_{\text{harmless}}}{\lVert \mu_{\text{harmful}} - \mu_{\text{harmless}} \rVert} $$

where $\mu_{\text{harmful}}$ and $\mu_{\text{harmless}}$ are the mean residual-stream activations over harmful and harmless prompts respectively (Arditi et al., 2024). An optional projected refinement step (grimjim, 2025) sharpens the estimate to reduce collateral damage.
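As a minimal sketch of the formula above, the snippet below computes the unit difference-of-means vector with NumPy. The function name refusal_direction, the array shapes, and the synthetic data are assumptions made for illustration; random vectors stand in for captured residual-stream activations, and this is not taken from directions.py.

```python
import numpy as np


def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit difference-of-means vector from two (n_prompts, d_model) arrays."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("the two activation means coincide; the direction is undefined")
    return diff / norm


# Toy data (assumption): random vectors in place of real activations,
# with a synthetic shift along axis 0 added to the "harmful" set.
rng = np.random.default_rng(0)
d_model = 8
offset = np.zeros(d_model)
offset[0] = 3.0
harmless = rng.normal(size=(64, d_model))
harmful = rng.normal(size=(64, d_model)) + offset

r = refusal_direction(harmful, harmless)
print("norm of r:", np.linalg.norm(r))
print("largest component index:", int(np.argmax(np.abs(r))))
```

In this toy setup the recovered direction should point mostly along the axis that was shifted, which is the behavior the difference of means is meant to capture.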

2. Orthogonalize it out of the weights

Rather than subtract the direction at runtime, ZeroFuse edits the weights that write into the residual stream so they can no longer contribute along $r$:

$$ W' \;=\; W \;-\; \text{strength} \cdot r\,(r^{\top} W) $$

This is applied to the attention output projection (o_proj) and the MLP down-projection (down_proj), including MoE experts. The scalar strength controls how much of the $r$-component is removed: at strength = 1 this is a full orthogonal projection that removes the component entirely; smaller values remove it partially. Because the edit lives in the weights, the resulting model is indistinguishable in shape and speed from the original.
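A short check of the edit rule may help here. Multiplying the formula by $r^{\top}$ and using $r^{\top} r = 1$ gives $r^{\top} W' = (1 - \text{strength})\, r^{\top} W$, so the component along $r$ is scaled by $1 - \text{strength}$ and vanishes at strength = 1. The sketch below verifies this numerically on a random matrix. It assumes W has shape (d_model, d_in), matching the (out_features, in_features) layout of a PyTorch linear layer's weight; the helper name orthogonalize is hypothetical and not taken from model.py.

```python
import numpy as np


def orthogonalize(W: np.ndarray, r: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Return W - strength * r (r^T W) for W of shape (d_model, d_in)."""
    r = r / np.linalg.norm(r)           # make sure r is a unit vector
    return W - strength * np.outer(r, r @ W)


# Toy weights (assumption): a random matrix standing in for o_proj or down_proj.
rng = np.random.default_rng(0)
d_model, d_in = 6, 4
W = rng.normal(size=(d_model, d_in))
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)

W_full = orthogonalize(W, r, strength=1.0)
W_half = orthogonalize(W, r, strength=0.5)

# At strength 1 the output has no component along r.
full_removed = np.allclose(r @ W_full, 0.0)
# At strength 0.5 exactly half of the component remains.
half_removed = np.allclose(r @ W_half, 0.5 * (r @ W))
print(full_removed, half_removed, W_full.shape == W.shape)
```

The edited matrix keeps the shape of the original, which is why the saved model needs no extra runtime machinery.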

3. Search the Pareto front

Choosing layers and strengths by hand is the hard part — so ZeroFuse doesn't. It runs an Optuna TPE multi-objective search that co-minimizes two objectives:

$$ \min \;\big(\; N_{\text{refusals}}, \;\; D_{\mathrm{KL}}(P_{\text{orig}} \,\Vert\, P_{\text{edited}}) \;\big) $$

  • $N_{\text{refusals}}$ — how often the edited model still refuses, scored by the evaluator.
  • $D_{\mathrm{KL}}$ — how far the edited model's output distribution has drifted from the original, as a proxy for lost capability.
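To make the second objective concrete, here is a minimal sketch of $D_{\mathrm{KL}}(P_{\text{orig}} \,\Vert\, P_{\text{edited}})$ for a single next-token distribution, using only the standard library. The logits are made-up numbers, and how ZeroFuse aggregates KL over prompts and token positions is not stated in this README, so treat the helper names and the single-distribution setup as assumptions.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in nats for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Made-up logits (assumption) for the original and edited model on one token.
p_orig = softmax([2.0, 1.0, 0.1])
p_edited = softmax([1.8, 1.1, 0.3])

kl_same = kl_divergence(p_orig, p_orig)      # identical distributions
kl_drift = kl_divergence(p_orig, p_edited)   # slightly shifted distribution
print(kl_same, kl_drift)
```

A value of zero means the edited model reproduces the original distribution exactly; larger values indicate more drift.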

The result is a Pareto front of non-dominated configurations. You choose the operating point that fits your goal — fewest refusals, lowest KL, or the knee of the curve — and ZeroFuse materializes that exact weight edit.

refusals
  ^
  |  x
  |   x
  |     x  <- knee
  |        x x
  |            x x x
  +-------------------> KL divergence
   (each x = a non-dominated trial on the Pareto front)
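The selection step can be sketched in a few lines. The code below filters a list of (refusals, KL) pairs down to the non-dominated set and then picks a knee. The knee rule used here, the point closest to the ideal corner after min-max normalization, is one common heuristic and an assumption; the README does not say how ZeroFuse defines the knee, and the trial values are invented.

```python
from typing import List, Tuple

Trial = Tuple[float, float]  # (refusals, kl); both objectives are minimized


def pareto_front(trials: List[Trial]) -> List[Trial]:
    """Keep the trials that no other trial dominates, sorted by KL."""
    def dominates(a: Trial, b: Trial) -> bool:
        # a is no worse on both objectives and differs from b
        return a[0] <= b[0] and a[1] <= b[1] and a != b

    front = [t for t in trials if not any(dominates(o, t) for o in trials)]
    return sorted(set(front), key=lambda t: t[1])


def knee(front: List[Trial]) -> Trial:
    """Hypothetical knee rule: closest point to (0, 0) after min-max scaling."""
    r_lo, r_hi = min(t[0] for t in front), max(t[0] for t in front)
    k_lo, k_hi = min(t[1] for t in front), max(t[1] for t in front)

    def scale(v: float, lo: float, hi: float) -> float:
        return 0.0 if hi == lo else (v - lo) / (hi - lo)

    return min(
        front,
        key=lambda t: scale(t[0], r_lo, r_hi) ** 2 + scale(t[1], k_lo, k_hi) ** 2,
    )


# Invented trial results (assumption), not ZeroFuse output.
trials = [
    (40.0, 0.01), (25.0, 0.03), (10.0, 0.08), (8.0, 0.30),
    (5.0, 0.60), (30.0, 0.05), (12.0, 0.40),
]
front = pareto_front(trials)
best = knee(front)
print(front)
print(best)
```

The two trials dropped here, (30.0, 0.05) and (12.0, 0.40), are each beaten on both objectives by another trial, so they never appear on the front.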

⚖️ ZeroFuse vs. the Alternatives

| Capability | ZeroFuse | Manual abliteration | Fine-tuning |
| --- | --- | --- | --- |
| Setup effort | One command: point it at an HF id or path; layers and strengths are picked automatically | Hand-select target layers, directions, and strengths through trial and error | Assemble a dataset, configure a training run, and manage compute |
| Weights vs. runtime | Direct weight edit: orthogonalizes the refusal direction out of o_proj and down_proj | Also a weight edit, but applied manually with chosen parameters | Updates weights via gradient descent over a training corpus |
| Capability preservation | KL divergence from the original model is an explicit optimization objective | Depends on the operator's manual tuning; no built-in capability objective | Risk of catastrophic forgetting; mitigation depends on data and hyperparameters |
| Tuning the trade-off | Two-objective Optuna TPE search yields a Pareto front; pick fewest refusals, lowest KL, or the knee | Re-run by hand and eyeball results; no systematic Pareto search | Adjust data mix and hyperparameters and retrain to shift the trade-off |
| Inference-time overhead | None: output is a standard Hugging Face model | None if done as a weight edit; runtime adapters add overhead | None for a full fine-tune; LoRA adapters add overhead unless merged |
| Compute cost | Runs trials and a KL/refusal search; no gradient-based retraining | Low compute, but high human time per iteration | Highest: training compute proportional to model and dataset size |
| Resumability | Optuna studies journaled to disk; re-run the same command to continue | Manual: depends on your own bookkeeping | Checkpoint-based resume, depending on the training framework |
| Agent / automation | Ships an MCP server for Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP client | None built in | None built in |
| Output format | Standard Hugging Face model: load, quantize, or serve like any other | Modified model; format depends on the tooling used | Standard weights or a LoRA adapter, depending on method |
| Model support | Dense, MoE (per-expert down_proj), and many multimodal nestings; pure state-space out of scope | Whatever the operator manually implements support for | Broad, subject to framework support for the architecture |

🤖 Agent-native / MCP

ZeroFuse ships a built-in Model Context Protocol server, so an agent can drive the whole pipeline as a tool. It works in Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP-compatible client.

Install the optional dependency and add it to your MCP client config:

pip install -e ".[mcp]"

The zerofuse-mcp command is installed alongside the CLI by the pip install step above. The config itself must be plain JSON, so it carries no comments:

{
  "mcpServers": {
    "zerofuse": {
      "command": "zerofuse-mcp"
    }
  }
}

It exposes a single abliterate tool and is designed to be a well-behaved citizen of an agent's context window:

  • Quiet by default. The harness sees only high-level phases — identifying refusal architecture, conducting trials, abliterating — not a firehose of internals.
  • Opt-in detail. Per-trial metrics, layer choices, and KL traces are emitted at MCP debug log level and surface only if the harness opts in to debug logs.
  • Override when you want it. A verbose argument forces full detail regardless of log level.

This keeps long-running optimization runs legible to an agent instead of flooding it with token-heavy progress chatter. See docs/agents.html for per-harness setup.
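The quiet-by-default idea maps naturally onto log levels. The sketch below illustrates it with Python's standard logging module rather than the MCP logging mechanism the server actually uses, so it is an analogy under stated assumptions: the logger name, the make_reporter and run_demo helpers, and the messages are hypothetical and not taken from reporting.py.

```python
import io
import logging


def make_reporter(stream, verbose: bool = False) -> logging.Logger:
    """Build a logger that shows phases always and details only when verbose."""
    logger = logging.getLogger("zerofuse_demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    return logger


def run_demo(logger: logging.Logger) -> None:
    """Emit high-level phases at INFO and per-trial details at DEBUG."""
    logger.info("identifying refusal architecture")
    logger.info("conducting trials")
    for trial in range(3):
        logger.debug("trial %d finished", trial)
    logger.info("abliterating")


quiet_out, verbose_out = io.StringIO(), io.StringIO()
run_demo(make_reporter(quiet_out))                  # default: phases only
run_demo(make_reporter(verbose_out, verbose=True))  # verbose: phases + details

quiet_lines = quiet_out.getvalue().splitlines()
verbose_lines = verbose_out.getvalue().splitlines()
print(len(quiet_lines), len(verbose_lines))
```

The default run records only the three phase messages, while the verbose run also records one line per trial.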

🐍 Python API

Everything the CLI does is available as a library:

from zerofuse import abliterate

# One call: returns the saved HF model dir + the Pareto front to pick from.
result = abliterate("meta-llama/Llama-3.1-8B-Instruct", n_trials=100)
print(result.selected.refusals, result.selected.kl, result.output_dir)

Or build a full configuration explicitly:

from zerofuse import ZeroFuseConfig, run

config = ZeroFuseConfig.from_dict({
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "optimization": {"n_trials": 100},
})
result = run(config, selection="knee")

🧩 Supported Models

ZeroFuse is built to work on most open-weight transformer models you point it at:

| Architecture | Support |
| --- | --- |
| Dense transformer LLMs | ✅ Supported |
| Mixture-of-Experts (per-expert down_proj) | ✅ Supported |
| Multimodal nestings with a transformer LLM backbone | ✅ Many supported |
| Pure state-space models | ❌ Out of scope |

Because architectures vary, ZeroFuse is designed to generalize across these families but is not guaranteed to abliterate every model. It adapts to the residual-writing weights it finds.

📁 Project Structure

ZeroFuse/
├── src/zerofuse/
│   ├── config.py        # Run configuration & defaults (TOML + CLI)
│   ├── prompts.py       # Harmful / harmless prompt loading + batching
│   ├── directions.py    # Pure math: difference-of-means, projected refinement
│   ├── model.py         # Loading, activation capture, weight orthogonalization
│   ├── evaluator.py     # Scoring: refusal detection + KL divergence
│   ├── optimizer.py     # Optuna TPE search + Pareto-front selection
│   ├── pipeline.py      # End-to-end orchestration
│   ├── reporting.py     # Quiet-by-default progress (phases vs. details)
│   ├── cli.py           # `zerofuse` command-line entrypoint
│   └── mcp_server.py    # Model Context Protocol server (agent-native)
├── docs/                # Self-contained HTML documentation site
├── config/default.toml  # Fully-commented configuration template
└── tests/               # Unit tests for the pure-logic parts

Each module has a single responsibility. directions.py is pure math — no model objects, easy to test and audit. model.py is the only place weights are touched.

❓ FAQ

<details> <summary><strong>How does ZeroFuse remove refusals without retraining?</strong></summary>

It estimates the model's "refusal direction" via difference-of-means of residual-stream activations on harmful vs. harmless prompts (Arditi et al. 2024, arXiv:2406.11717), then orthogonalizes that direction out of the residual-writing weights — attention o_proj and MLP down_proj, including MoE experts — using W' = W − strength · r(rᵀW). No gradient-based training is involved; it's a direct edit to the existing weights. </details>

<details> <summary><strong>Will abliteration degrade the model's capabilities?</strong></summary>

ZeroFuse is built to minimize that. KL divergence from the original model is an explicit optimization objective alongside the number of refusals, and a two-objective Optuna TPE search produces a Pareto front so you can choose how to balance fewest refusals against lowest KL. There's also an optional projected refinement step (grimjim 2025) designed to reduce collateral damage. As a v0.1.0 project, these are design goals rather than independently benchmarked guarantees. </details>

<details> <summary><strong>What kinds of models does it work on?</strong></summary>

It's designed to work on most open-weight transformer models you point it at — dense models, MoE models (including per-expert down_proj), and many multimodal nestings. Pure state-space models are out of scope. </details>

<details> <summary><strong>What do I actually get as output, and is there any runtime cost?</strong></summary>

You get a standard Hugging Face model — the refusal behavior is edited into the weights themselves, not delivered as a runtime LoRA adapter. That means zero inference-time overhead: you can load, quantize, and serve it exactly like any other Hugging Face model. </details>

<details> <summary><strong>How do I run it, and how does the MCP / agent integration work?</strong></summary>

Install with pip install -e . and run zerofuse --model <hf-id-or-path>, or use the Python API with from zerofuse import abliterate. ZeroFuse also ships an MCP server (pip install -e ".[mcp]") that works in Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP client. It's quiet by default — the harness sees only high-level phases like "identifying refusal architecture," "conducting trials," and "abliterating," with internal details emitted at MCP debug log level and shown only if the harness opts in (a verbose flag overrides). Runs are resumable: Optuna studies are journaled to disk, so re-running the same command continues where you left off. </details>

<details> <summary><strong>What about licensing and responsible use?</strong></summary>

ZeroFuse is MIT-licensed. It is an independent clean-room build based on published papers and an Apache-2.0 reference implementation, and it copies no copyleft tool; citations are documented in NOTICE.md. The MIT license covers only the tool, not the models you produce. Because ZeroFuse reduces a model's guardrails, you are responsible for complying with the base model's license and acceptable-use policy, applicable law, and any platform terms that apply to the models you create and deploy. </details>

📌 Status

v0.1.0: new project. ZeroFuse is early. The method is grounded in published research and the implementation is built to preserve capability, but it has not yet been independently benchmarked at scale. Where this README says designed to or built to, the wording is deliberate: it describes what the implementation aims for by construction, not a result that third parties have verified. No benchmark numbers, star counts, or testimonials are presented here because there aren't any to honestly report yet. Issues and reproductions welcome.

🛡️ Responsible Use

ZeroFuse reduces or removes safety guardrails from model weights. That capability carries real responsibility.

  • You are responsible for compliance with the base model's license and acceptable-use policy, all applicable law, and the terms of any platform you deploy on.
  • The MIT license covers this tool only. It does not grant you any rights over the models you produce or process, and it does not relieve you of responsibility for them. Those models are governed by the original model's license.
  • Use it on models you are permitted to modify, for purposes you are permitted to pursue.

Removing guardrails does not remove accountability. Think before you point it at something.

📜 Provenance & License

ZeroFuse is created and maintained by osmAPI.com — the only provider in India offering abliterated models via API.

It is an independent, clean-room implementation built from published papers and an Apache-2.0 reference implementation. It does not copy, vendor, or derive from any copyleft tool. Citations and attributions are documented in NOTICE.md.

The tool is released under the MIT License. The MIT license covers the tool only — not the models you produce with it.

📚 Citations

  • Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). Refusal in Language Models Is Mediated by a Single Direction. NeurIPS 2024. arXiv:2406.11717
  • Jim Lai (grimjim) (2025). Projected & norm-preserving refinements of the refusal direction for reduced collateral damage. Hugging Face blog.

See NOTICE.md for the full reference list and attributions.

🤝 Contributing

PRs are welcome. Good first contributions: new model-family adapters, additional refusal evaluators, prompt-set improvements, and docs. Please keep directions.py pure and confine weight mutation to model.py.


<div align="center">

ZeroFuse · created by osmAPI.com · MIT · built with PyTorch · Optuna · 🤗 Transformers

<sub>Point it at a model. It does the rest.</sub>

</div>
