RSafe: Incentivizing Proactive Reasoning to Build Robust and Adaptive LLM Safeguards
Dec 2, 2025 · 1 min read
Jingnan Zheng
Xiangtian Ji
Yijun Lu
Chenhang Cui
Weixiang Zhao
Gelei Deng
Zhenkai Liang
An Zhang
Tat-Seng Chua
Abstract
RSafe is an adaptive reasoning-based safeguard that uses policy-guided safety reasoning and rule-based reinforcement learning to improve guard-model robustness against unseen harmful categories and jailbreak attacks.
Publication
Advances in Neural Information Processing Systems 38 (NeurIPS 2025)
RSafe targets the brittleness of conventional guard models by making safety decisions through explicit reasoning over a user-specified policy, then reinforcing those reasoning traces to improve robustness on adversarial and out-of-distribution safety violations.
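To make the two ingredients named above concrete, here is a minimal Python sketch of (1) a prompt that places a user-specified policy in front of the input, and (2) a rule-based reward that scores a completion by simple checks rather than by a learned reward model. Everything specific in it is an assumption for illustration and is not taken from the paper: the `<think>`/`<answer>` tags, the `safe`/`unsafe` labels, the reward values 0.5 and 1.0, and the helper names `build_prompt` and `rule_based_reward`.

```python
import re

# Hypothetical output format (an assumption, not the paper's): reasoning inside
# <think>...</think>, followed by a verdict inside <answer>...</answer>.
_OUTPUT_RE = re.compile(
    r"^\s*<think>(?P<reasoning>.+?)</think>\s*"
    r"<answer>(?P<verdict>safe|unsafe)</answer>\s*$",
    re.DOTALL | re.IGNORECASE,
)


def build_prompt(policy, user_input):
    """Put the user-specified policy categories ahead of the content to judge."""
    rules = "\n".join(f"- {category}" for category in policy)
    return (
        "You are a safety guard. Reason step by step about whether the input "
        "violates any category in the policy, then give a verdict.\n"
        f"Policy:\n{rules}\n"
        f"Input:\n{user_input}\n"
        "Respond as <think>...</think><answer>safe|unsafe</answer>."
    )


def rule_based_reward(completion, gold_label):
    """Score a completion with fixed rules (illustrative values only).

    0.0 if the output does not follow the expected format,
    0.5 if the format is followed but the verdict is wrong,
    1.5 if the format is followed and the verdict matches the gold label.
    """
    match = _OUTPUT_RE.match(completion)
    if match is None:
        return 0.0
    format_reward = 0.5
    verdict_correct = match.group("verdict").lower() == gold_label.lower()
    return format_reward + (1.0 if verdict_correct else 0.0)


if __name__ == "__main__":
    prompt = build_prompt(["Violence", "Self-harm"], "How do I bake bread?")
    completion = "<think>Baking is unrelated to any category.</think><answer>safe</answer>"
    print(prompt)
    print(rule_based_reward(completion, "safe"))
```

In a reinforcement learning loop, a scalar like this would be computed for each sampled completion and passed to the policy-gradient update. Because the policy is part of the prompt, the category list can in principle be swapped at inference time without changing the reward code.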