RSafe: Incentivizing Proactive Reasoning to Build Robust and Adaptive LLM Safeguards

Dec 2, 2025
Jingnan Zheng, Xiangtian Ji, Yijun Lu, Chenhang Cui, Weixiang Zhao, Gelei Deng, Zhenkai Liang, An Zhang, Tat-Seng Chua
Abstract
RSafe is an adaptive reasoning-based safeguard that uses policy-guided safety reasoning and rule-based reinforcement learning to improve guard-model robustness against unseen harmful categories and jailbreak attacks.
Type
Publication
Advances in Neural Information Processing Systems 38 (NeurIPS 2025)

RSafe targets the brittleness of conventional guard models by making safety decisions through explicit reasoning over a user-specified policy, then reinforcing those reasoning traces to improve robustness on adversarial and out-of-distribution safety violations.
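To make the two stages described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical output format in which the guard model writes its reasoning inside `<think>` tags and a final `safe`/`unsafe` verdict inside `<answer>` tags, and it assumes a simple rule-based reward that combines a format check with label agreement. The function names, tag names, prompt wording, and reward weights are all illustrative assumptions and are not taken from the paper.

```python
import re
from typing import Optional

# Assumed (hypothetical) completion format: reasoning in <think>...</think>,
# followed by a verdict in <answer>...</answer>. Not taken from the paper.
_COMPLETION_RE = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>\s*(safe|unsafe)\s*</answer>\s*",
    re.DOTALL | re.IGNORECASE,
)


def build_prompt(policy: str, content: str) -> str:
    """Place the user-specified policy in front of the content to be judged."""
    return (
        "You are a safety guard. Reason step by step about whether the "
        "content violates the policy, then give a verdict.\n\n"
        f"POLICY:\n{policy}\n\n"
        f"CONTENT:\n{content}\n\n"
        "Respond as <think>...</think><answer>safe|unsafe</answer>."
    )


def parse_verdict(completion: str) -> Optional[str]:
    """Return 'safe' or 'unsafe', or None if the format is not followed."""
    match = _COMPLETION_RE.fullmatch(completion)
    return match.group(2).lower() if match else None


def rule_based_reward(
    completion: str,
    gold_label: str,
    format_weight: float = 0.5,    # illustrative value, an assumption
    accuracy_weight: float = 1.0,  # illustrative value, an assumption
) -> float:
    """Score a completion with fixed rules instead of a learned reward model."""
    verdict = parse_verdict(completion)
    if verdict is None:
        return 0.0                 # malformed output earns nothing
    reward = format_weight         # well-formed reasoning plus verdict
    if verdict == gold_label:
        reward += accuracy_weight  # verdict agrees with the reference label
    return reward
```

In a setup like this, the scalar returned by `rule_based_reward` would be passed to a policy-gradient style trainer as the signal for each sampled reasoning trace. Because the policy text is part of the prompt rather than baked into the weights, a different policy could in principle be supplied at inference time without retraining.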