Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models
Dec 2, 2025
Chenhang Cui
Gelei Deng
An Zhang
Jingnan Zheng
Yicong Li
Lianli Gao
Tianwei Zhang
Tat-Seng Chua
Abstract
This work shows that individually safe images can be combined with prompts to trigger unsafe behavior in large vision-language models. It introduces Safety Snowball Agent, an agent-based framework that uses model reasoning and tool use to generate or retrieve benign-looking visual context and progressively induce harmful outputs.
Type
Publication
Advances in Neural Information Processing Systems 38 (NeurIPS 2025)
This paper identifies a multimodal safety failure mode in which individually safe visual inputs snowball into unsafe model behavior when combined with additional safe images and prompts. Safety Snowball Agent operationalizes this observation as a tool-using jailbreak framework for evaluating the guardrails of large vision-language models (LVLMs).