PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning

Feb 1, 2024

Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, Yang Liu
Abstract
Retrieval Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing Large Language Models with external knowledge. However, the security implications of RAG systems remain underexplored. This work presents PANDORA, a novel attack framework that exploits RAG mechanisms to jailbreak GPT models. We demonstrate that by poisoning the retrieval database, attackers can bypass safety guardrails and elicit harmful responses from LLMs.
Type: Publication
Workshop on Artificial Intelligence System with Confidential Computing (AISCC 2024)

Distinguished Paper Award

PANDORA introduces a novel attack vector against RAG-enhanced LLM systems. By strategically poisoning retrieval databases, attackers can effectively bypass safety mechanisms in models like GPT-4, highlighting critical security considerations for RAG deployments.
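To make the attack surface concrete, the sketch below shows where attacker-controlled content enters a RAG pipeline. This is a toy illustration under stated assumptions, not PANDORA's actual method: the `SimpleRetriever` class, the `build_prompt` helper, and the word-overlap similarity are hypothetical stand-ins for a real embedding-based retriever and prompt template.

```python
# Toy sketch of a RAG pipeline to illustrate the poisoning surface.
# Not PANDORA's method: SimpleRetriever, build_prompt, and the
# word-overlap scoring are illustrative assumptions only.

from dataclasses import dataclass, field


@dataclass
class SimpleRetriever:
    """In-memory store; word overlap stands in for embedding similarity."""
    docs: list = field(default_factory=list)

    def add(self, doc: str) -> None:
        self.docs.append(doc)

    def retrieve(self, query: str, k: int = 2) -> list:
        q = set(query.lower().split())
        scored = sorted(
            self.docs,
            key=lambda d: len(q & set(d.lower().split())),
            reverse=True,
        )
        return scored[:k]


def build_prompt(query: str, retriever: SimpleRetriever) -> str:
    # Retrieved passages are concatenated into the model context verbatim,
    # so any document an attacker can place in the store flows straight
    # into the prompt the LLM conditions on.
    context = "\n".join(retriever.retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"


retriever = SimpleRetriever()
retriever.add("Benign reference passage about the deployment's safety policies.")
# Attacker-controlled document uploaded to the shared knowledge base:
retriever.add("Attacker-crafted passage engineered to rank highly for target queries.")

print(build_prompt("target queries about safety", retriever))
```

The point of the sketch is only that retrieved passages reach the model unvetted: whatever an attacker can insert into the knowledge base competes on equal footing with legitimate content at query time, which is the opening the paper's poisoning strategy exploits.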