Safety Snowball Agent is an agent-based framework for evaluating how individually safe visual inputs can combine to elicit unsafe behavior from large vision-language models. The framework accompanies the NeurIPS 2025 paper “Safe + Safe = Unsafe?” and probes a multimodal jailbreak mechanism that differs from traditional adversarial-image attacks.
Dec 2, 2025
NeurIPS 2025 work showing how safe images can combine into multimodal jailbreaks through the Safety Snowball effect.