MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots
Feb 26, 2024

Gelei Deng
Yi Liu
Yuekang Li
Kailong Wang
Ying Zhang
Zefeng Li
Haoyu Wang
Tianwei Zhang
Yang Liu

Abstract
Large Language Models (LLMs) have revolutionized Artificial Intelligence services due to their exceptional proficiency in understanding and generating human-like text. However, LLM chatbots are susceptible to jailbreak attacks, where malicious users manipulate prompts to elicit inappropriate or sensitive responses. This work presents MASTERKEY, a comprehensive framework that offers an in-depth understanding of jailbreak attacks and countermeasures. We introduce an automatic generation method for jailbreak prompts, leveraging a fine-tuned LLM to validate the potential of automated jailbreak generation across various commercial LLM chatbots, achieving a 21.58% success rate compared to 7.33% by existing methods.
Type
Publication
Proceedings of the 2024 Network and Distributed System Security Symposium (NDSS)
This work presents MASTERKEY, a systematic approach to understanding and exploiting vulnerabilities in Large Language Model chatbots. The framework introduces novel methodologies for automated jailbreak attack generation and provides a comprehensive analysis of existing defense mechanisms.
Key Contributions:
- Novel time-based strategy, inspired by time-based SQL injection techniques, for probing the defense mechanisms of chatbot services
- Automated jailbreak prompt generation achieving a 21.58% success rate, compared to 7.33% for existing methods
- Comprehensive evaluation across mainstream chatbots (ChatGPT, Bard, Bing Chat, Ernie)
- Systematic analysis of defense mechanisms in commercial LLM services
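The time-based idea in the first contribution can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: `SimulatedChatbot`, `SECONDS_PER_TOKEN`, and the two filter modes are hypothetical names invented for illustration, and no real chatbot API is called. It assumes that generation time grows with the number of tokens produced, so the time a service takes to refuse may hint at whether its content check runs during generation or only after it finishes.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed constant generation speed; a hypothetical value for the simulation.
SECONDS_PER_TOKEN = 0.05


@dataclass
class SimulatedChatbot:
    """Toy stand-in for a chatbot service (hypothetical, not a real API)."""

    filter_mode: str  # "streaming" or "post_generation"

    def latency(self, total_tokens: int, flagged_at: Optional[int]) -> float:
        """Return the simulated seconds until the service answers or refuses."""
        if flagged_at is not None and self.filter_mode == "streaming":
            # A streaming check halts as soon as the flagged token is produced.
            return flagged_at * SECONDS_PER_TOKEN
        # Otherwise the full answer is generated before any check runs.
        return total_tokens * SECONDS_PER_TOKEN


def infer_filter_mode(bot: SimulatedChatbot, total_tokens: int = 100) -> str:
    """Guess the filter mode by comparing two simulated refusal latencies."""
    early = bot.latency(total_tokens, flagged_at=10)  # flagged content early
    late = bot.latency(total_tokens, flagged_at=90)   # flagged content late
    # If the position of the flagged content does not change the latency,
    # the check most likely runs only after generation has completed.
    if abs(late - early) < SECONDS_PER_TOKEN:
        return "post_generation"
    return "streaming"


if __name__ == "__main__":
    for mode in ("streaming", "post_generation"):
        print(mode, "->", infer_filter_mode(SimulatedChatbot(mode)))
```

In this simulation the inference is exact because latency is deterministic. A measurement against a real service would be noisy and would need repeated trials and statistical comparison.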
Impact: This research has informed major service providers about critical vulnerabilities and contributed to strengthening LLM security measures across the industry.