MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots

Feb 26, 2024

Gelei Deng

Yi Liu

Yuekang Li

Kailong Wang

Ying Zhang

Zefeng Li

Haoyu Wang

Tianwei Zhang

Yang Liu



MASTERKEY Framework Architecture

Abstract

Large Language Models (LLMs) have revolutionized Artificial Intelligence services due to their exceptional proficiency in understanding and generating human-like text. However, LLM chatbots are susceptible to jailbreak attacks, where malicious users manipulate prompts to elicit inappropriate or sensitive responses. This work presents MASTERKEY, a comprehensive framework that offers an in-depth understanding of jailbreak attacks and countermeasures. We introduce an automatic method for generating jailbreak prompts that uses a fine-tuned LLM, and we validate it across various commercial LLM chatbots, where it achieves a 21.58% success rate compared to 7.33% for existing methods.

Type

Conference paper

Publication

Proceedings of the 2024 Network and Distributed System Security Symposium (NDSS)

This work presents MASTERKEY, a systematic approach to understanding and exploiting vulnerabilities in Large Language Model chatbots. The framework introduces novel methodologies for automated jailbreak attack generation and provides a comprehensive analysis of existing defense mechanisms.

Key Contributions:

Novel time-based attack strategy inspired by SQL injection techniques
Automated jailbreak prompt generation achieving 21.58% success rate
Comprehensive evaluation across mainstream chatbots (ChatGPT, Bard, Bing Chat, Ernie)
Systematic analysis of defense mechanisms in commercial LLM services
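
The time-based idea in the first contribution can be illustrated with a small timing sketch. As in time-based SQL injection, the assumption is that response latency leaks information about hidden server-side behavior: if latency normally grows with the number of generated tokens, a reply that returns far faster than expected suggests generation was halted early, for example by a content filter. The function names, the linear latency model, the `tolerance` threshold, and the sample timings below are all illustrative assumptions, not the authors' implementation or measurements.

```python
def fit_latency_per_token(samples):
    """Estimate seconds per generated token from (tokens, seconds) pairs.

    Assumption: latency is roughly proportional to output length, so we
    fit a least-squares line through the origin.
    """
    numerator = sum(tokens * secs for tokens, secs in samples)
    denominator = sum(tokens * tokens for tokens, _ in samples)
    return numerator / denominator


def looks_cut_short(requested_tokens, observed_secs, secs_per_token, tolerance=0.5):
    """Return True if a reply arrived much faster than the baseline predicts.

    `tolerance` is a hypothetical threshold: replies taking less than this
    fraction of the expected time are flagged as likely halted early.
    """
    expected_secs = requested_tokens * secs_per_token
    return observed_secs < tolerance * expected_secs


# Made-up baseline timings for benign requests: (generated tokens, seconds).
baseline = [(50, 1.0), (100, 2.1), (200, 3.9)]
rate = fit_latency_per_token(baseline)

# A 200-token request that returns in 0.8 s is flagged; one taking 3.7 s is not.
flagged = looks_cut_short(200, 0.8, rate)
normal = looks_cut_short(200, 3.7, rate)
```

In a real measurement the timings would come from wall-clock measurements of actual requests (for instance with `time.perf_counter`), repeated several times to average out network jitter; the sketch only shows the comparison step.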

Impact: This research has informed major service providers about critical vulnerabilities and contributed to strengthening LLM security measures across the industry.

Last updated on Feb 26, 2024