RSafe: Incentivizing Proactive Reasoning to Build Robust and Adaptive LLM Safeguards

Dec 2, 2025

Jingnan Zheng

Xiangtian Ji

Yijun Lu

Chenhang Cui

Weixiang Zhao

Gelei Deng

Zhenkai Liang

An Zhang

Tat-Seng Chua

Abstract

RSafe is an adaptive reasoning-based safeguard that uses policy-guided safety reasoning and rule-based reinforcement learning to improve guard-model robustness against unseen harmful categories and jailbreak attacks.

Type

Conference paper

Publication

Advances in Neural Information Processing Systems 38 (NeurIPS 2025)

RSafe targets the brittleness of conventional guard models by making safety decisions through explicit reasoning over a user-specified policy, then reinforcing those reasoning traces to improve robustness on adversarial and out-of-distribution safety violations.
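As a rough illustration of the two stages described above, the sketch below builds a policy-conditioned prompt and scores a model completion with a simple rule-based reward. It is a minimal sketch under stated assumptions, not the paper's implementation: the `<think>`/`<answer>` tag format, the function names `build_prompt` and `rule_based_reward`, and the reward values (0.5 for a well-formed output, plus 0.5 for a correct verdict) are all hypothetical choices made for this example.

```python
import re

# Assumed output format (not taken from the paper): free-form reasoning inside
# <think>...</think>, followed by a verdict inside <answer>...</answer>.
_OUTPUT_PATTERN = re.compile(
    r"^\s*<think>(?P<reasoning>.+?)</think>\s*"
    r"<answer>(?P<verdict>safe|unsafe)</answer>\s*$",
    re.DOTALL | re.IGNORECASE,
)


def build_prompt(policy: str, content: str) -> str:
    """Condition the guard model on a user-specified safety policy (hypothetical template)."""
    return (
        "You are a safety guard. Reason step by step about whether the content "
        "violates the policy, then give a verdict.\n\n"
        f"Policy:\n{policy}\n\n"
        f"Content:\n{content}\n\n"
        "Respond as <think>reasoning</think><answer>safe|unsafe</answer>."
    )


def rule_based_reward(completion: str, gold_label: str) -> float:
    """Score a completion with fixed rules instead of a learned reward model.

    Returns 0.0 for malformed output, 0.5 for well-formed output with the
    wrong verdict, and 1.0 for well-formed output with the correct verdict.
    These values are illustrative assumptions.
    """
    match = _OUTPUT_PATTERN.match(completion)
    if match is None:
        return 0.0  # missing tags, empty reasoning, or an unrecognized verdict
    reward = 0.5  # format reward: reasoning and verdict are both present
    if match.group("verdict").lower() == gold_label.lower():
        reward += 0.5  # accuracy reward: verdict agrees with the label
    return reward
```

In a reinforcement learning loop, a scalar of this kind would be computed for each sampled completion and passed to the policy-gradient update, so reasoning traces that end in correct, well-formed verdicts are reinforced.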

Last updated on Dec 2, 2025

Large Language Models, AI Safety, AI Security, Model Alignment
