RSafe: Incentivizing Proactive Reasoning to Build Robust and Adaptive LLM Safeguards

Tue, 02 Dec 2025 00:00:00 +0000

RSafe targets the brittleness of conventional guard models by making safety decisions through explicit reasoning over a user-specified policy, then reinforcing those reasoning traces to improve robustness against adversarial and out-of-distribution safety violations.

Model Alignment | Gelei Deng
