Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models

Dec 2, 2025

Chenhang Cui

Gelei Deng

An Zhang

Jingnan Zheng

Yicong Li

Lianli Gao

Tianwei Zhang

Tat-Seng Chua

Abstract

This work shows that individually safe images can be combined with prompts to trigger unsafe behavior in large vision-language models (LVLMs). It introduces Safety Snowball Agent, an agent-based framework that uses model reasoning and tool use to generate or retrieve benign-looking visual context and progressively induce harmful outputs.

Type

Conference paper

Publication

Advances in Neural Information Processing Systems 38 (NeurIPS 2025)

This paper identifies a multimodal safety failure mode where safe visual inputs can snowball into unsafe model behavior when combined with additional safe images and prompts. Safety Snowball Agent operationalizes this observation as a tool-using jailbreak framework for evaluating LVLM guardrails.

Last updated on Dec 2, 2025