Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models

Dec 2, 2025
Chenhang Cui, Gelei Deng, An Zhang, Jingnan Zheng, Yicong Li, Lianli Gao, Tianwei Zhang, Tat-Seng Chua
Abstract
This work shows that individually safe images can be combined with prompts to trigger unsafe behavior in large vision-language models. It introduces Safety Snowball Agent, an agent-based framework that uses model reasoning and tool use to generate or retrieve benign-looking visual context and progressively induce harmful outputs.
Type
Publication
Advances in Neural Information Processing Systems 38 (NeurIPS 2025)

This paper identifies a multimodal safety failure mode where safe visual inputs can snowball into unsafe model behavior when combined with additional safe images and prompts. Safety Snowball Agent operationalizes this observation as a tool-using jailbreak framework for evaluating the guardrails of large vision-language models (LVLMs).