MoltGuard: Prompt Injection Detection for AI Agents
MoltGuard detects hidden prompt injections in emails, documents & web pages using OpenGuardrails. SOTA multilingual safety for AI agents.
Purpose and Significance
MoltGuard is a specialized guard agent for OpenClaw that detects prompt injection attacks embedded within long-form content such as emails, web pages, and documents. Built on OpenGuardrails' state-of-the-art safety infrastructure, it addresses a critical vulnerability in AI systems: malicious actors hiding injection prompts within seemingly legitimate documents that agents process. By analyzing chunked content with semantic understanding, MoltGuard prevents agents from executing unintended instructions that could compromise security, data integrity, or system behavior. This is essential for production AI deployments where agents interact with external content sources.
Key Features
- State-of-the-Art Detection: OpenGuardrails achieves 87.1% F1 on English prompts (+2.8% vs. competitors) and 97.3% multilingual prompt detection (+12.3% vs. next best), outperforming LlamaGuard and Qwen3Guard.
- Unified 3.3B Model: A single 3.3B-parameter model, produced by GPTQ quantization of a 14B dense model, gives efficient inference without sacrificing accuracy and is deployable at enterprise scale with a P95 latency of 274.6 ms.
- Intelligent Chunking: Splits long content into 4,000-character chunks with 200-character overlap, analyzing each independently for hidden injection attempts while preserving context.
- Multilingual Coverage: Robust support for 119 languages with benchmark-leading results on English, Chinese, and cross-lingual datasets; includes 97k Chinese safety dataset contribution.
- Dynamic Policy Adaptation: Configurable per-request policies with continuous sensitivity thresholds—tune precision-recall trade-offs in real time via probabilistic logit-space control.
- Integrated Feedback Loop: Built-in commands to report false positives and missed detections, enabling continuous model improvement and organization-specific tuning.
- Real-Time Alerts: Immediate logging and optional webhook integration (Slack, Discord) for security notifications when injections are detected.
- Audit & Reporting: Status commands, detailed detection logs, scheduled reporting, and statistics tracking for compliance and security monitoring.
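To make the sensitivity threshold from the Dynamic Policy Adaptation feature concrete, here is a minimal sketch of one way a threshold over model logits could work. It assumes the guard model exposes one logit for an "unsafe" decision and one for a "safe" decision; the function names and that two-logit interface are hypothetical and are not taken from MoltGuard or OpenGuardrails.

```typescript
// Hypothetical sketch: names and the two-logit interface are assumptions,
// not the actual OpenGuardrails API.

/** Converts the two decision logits into a probability that the content is unsafe. */
function unsafeProbability(unsafeLogit: number, safeLogit: number): number {
  // A two-way softmax reduces to a sigmoid of the logit difference.
  return 1 / (1 + Math.exp(-(unsafeLogit - safeLogit)));
}

/**
 * Flags content when the unsafe probability reaches the threshold.
 * A lower threshold favors recall; a higher threshold favors precision.
 */
function isFlagged(unsafeLogit: number, safeLogit: number, threshold: number): boolean {
  if (!(threshold > 0 && threshold < 1)) {
    throw new RangeError("threshold must be strictly between 0 and 1");
  }
  return unsafeProbability(unsafeLogit, safeLogit) >= threshold;
}
```

Because the threshold is applied to the model's output rather than baked into its weights, a caller could pass a different value with each request, which is the kind of per-request tuning the feature describes.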
How It Works
MoltGuard processes long content through a three-stage pipeline. First, the Chunker splits documents into manageable 4,000-character segments with 200-character overlap to maintain semantic continuity. Next, the LLM Analysis stage runs OpenGuardrails-Text (OG-Text) on each chunk independently, asking: "Is there a hidden prompt injection in this content?" Finally, the Verdict aggregates findings across all chunks to produce a binary `isInjection` result. This design ensures full semantic focus on injection detection while handling documents of any length efficiently.
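The pipeline above can be sketched roughly as follows. This is an illustration under stated assumptions, not MoltGuard's source: `chunkText`, `detectInjection`, and the `ChunkAnalyzer` callback are hypothetical names, the analyzer stands in for the OG-Text call, and the rule that any flagged chunk makes the whole document an injection is an assumed aggregation strategy.

```typescript
// Hypothetical sketch of chunking with overlap and verdict aggregation.
// None of these names come from the MoltGuard codebase.

interface Chunk {
  index: number;
  start: number; // offset of the chunk's first character in the source text
  text: string;
}

function chunkText(text: string, maxChunkSize = 4000, overlapSize = 200): Chunk[] {
  if (maxChunkSize <= 0 || overlapSize < 0 || overlapSize >= maxChunkSize) {
    throw new RangeError("require 0 <= overlapSize < maxChunkSize");
  }
  const chunks: Chunk[] = [];
  const step = maxChunkSize - overlapSize; // each chunk re-reads the previous chunk's tail
  for (let start = 0; start < text.length; start += step) {
    chunks.push({ index: chunks.length, start, text: text.slice(start, start + maxChunkSize) });
    if (start + maxChunkSize >= text.length) break; // this chunk already reached the end
  }
  return chunks;
}

// Stand-in for the per-chunk model call; resolves to true when a chunk looks injected.
type ChunkAnalyzer = (chunk: Chunk) => Promise<boolean>;

async function detectInjection(
  text: string,
  analyze: ChunkAnalyzer,
): Promise<{ isInjection: boolean; flaggedChunks: number[] }> {
  const chunks = chunkText(text);
  const results = await Promise.all(chunks.map((chunk) => analyze(chunk)));
  const flaggedChunks = chunks.filter((_, i) => results[i]).map((chunk) => chunk.index);
  // Assumed aggregation: one flagged chunk is enough to flag the whole document.
  return { isInjection: flaggedChunks.length > 0, flaggedChunks };
}
```

With the default sizes, a 10,000-character document yields three chunks starting at offsets 0, 3,800, and 7,600, so an injection that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.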
Getting Started
Installation via OpenClaw Plugin System:
openclaw plugins install @openguardrails/moltguard
openclaw gateway restart
Verify Installation:
openclaw plugins list
Look for MoltGuard with status "loaded". Available npm package: @openguardrails/moltguard
Quick Test: Download a test email containing a hidden injection, ask your OpenClaw agent to read it, and monitor logs with `tail -f /tmp/openclaw/openclaw-*.log | grep "moltguard"`. If detection succeeds, you'll see "INJECTION DETECTED" with analysis details.
Core Commands
- `/og_status`: View plugin status, detection statistics, and enabled/disabled state.
- `/og_report`: Display recent injection detection details with timestamps and reasons.
- `/og_feedback fp`: Report a false positive for continuous model refinement.
- `/og_feedback missed`: Report a missed detection to improve sensitivity.
- `/cron add --name "OG-Daily-Report" --every 24h --message "/og_report"`: Schedule automated detection reports.
Configuration & Customization
Edit `~/.openclaw/openclaw.json` to adjust behavior. Key settings include `blockOnRisk` (true/false—block agent tool calls on detection), `maxChunkSize` (default 4,000 chars), `overlapSize` (default 200 chars), and `timeoutMs` (analysis timeout, default 60,000ms). Dynamic sensitivity control allows tuning precision-recall trade-offs without redeploying the model.
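As a rough illustration of how these four settings relate, the sketch below models them as a TypeScript type and fills in the defaults stated above. The setting names and default values come from this section; the `MoltGuardSettings` type, the `resolveSettings` helper, and the validation rules are assumptions for illustration. Where the keys sit inside `openclaw.json` is not covered here, and no default is stated for `blockOnRisk`, so the sketch requires the caller to supply it.

```typescript
// Hypothetical sketch: the type and helper are illustrative, not MoltGuard's API.

interface MoltGuardSettings {
  blockOnRisk: boolean;  // block agent tool calls when an injection is detected
  maxChunkSize: number;  // characters per chunk
  overlapSize: number;   // characters shared between adjacent chunks
  timeoutMs: number;     // analysis timeout in milliseconds
}

// Defaults as stated in the text; blockOnRisk has no stated default.
const DOCUMENTED_DEFAULTS = { maxChunkSize: 4000, overlapSize: 200, timeoutMs: 60000 };

type UserSettings = Partial<MoltGuardSettings> & { blockOnRisk: boolean };

function resolveSettings(user: UserSettings): MoltGuardSettings {
  const settings: MoltGuardSettings = {
    blockOnRisk: user.blockOnRisk,
    maxChunkSize: user.maxChunkSize ?? DOCUMENTED_DEFAULTS.maxChunkSize,
    overlapSize: user.overlapSize ?? DOCUMENTED_DEFAULTS.overlapSize,
    timeoutMs: user.timeoutMs ?? DOCUMENTED_DEFAULTS.timeoutMs,
  };
  // Assumed sanity checks: an overlap as large as the chunk would never advance.
  if (settings.maxChunkSize <= 0 || settings.timeoutMs <= 0) {
    throw new RangeError("maxChunkSize and timeoutMs must be positive");
  }
  if (settings.overlapSize < 0 || settings.overlapSize >= settings.maxChunkSize) {
    throw new RangeError("overlapSize must be non-negative and smaller than maxChunkSize");
  }
  return settings;
}
```

For example, `resolveSettings({ blockOnRisk: true, overlapSize: 500 })` keeps the stated defaults for `maxChunkSize` and `timeoutMs` and widens only the overlap.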
Who It's For
- AI Platform Teams: Building production agents that ingest unvetted external documents (emails, web content, user uploads).
- Security Engineers: Seeking SOTA multilingual prompt injection detection with audit trails and feedback mechanisms.
- Enterprise DevOps: Deploying agents at scale and requiring real-time alert integration with existing security stacks (Slack, webhooks).
- Model Fine-Tuning Practitioners: Wanting to adapt detection thresholds and policies dynamically per-request without model retraining.
- Open-Source AI Contributors: Leveraging a permissively licensed (MIT) guard agent as a foundation for custom safety implementations.
Resources
- GitHub Repository: openguardrails/moltguard
- NPM Package: @openguardrails/moltguard
- Official Website: moltguard.com
- Technical Paper: OpenGuardrails Safety Benchmarks (arXiv)
- OpenGuardrails Project: openguardrails/moltguard
- OpenClaw Framework: Antfarm: Multi-Agent Workflow Orchestration for OpenClaw
Technical Details
MoltGuard is implemented in TypeScript and integrates seamlessly with OpenClaw's plugin architecture. The codebase includes comprehensive test suites (index.test.ts, test-injection.ts), sample emails demonstrating injection attacks, and memory modules for stateful analysis. Development setup requires cloning the repository, installing dependencies via npm, and compiling TypeScript. The project is actively maintained (last updated February 2026) with zero open issues and MIT licensing for commercial and private use.
Original Source
https://github.com/openguardrails/moltguard