Efficient Detection of Toxic Prompts in Large Language Models
Oct 27, 2024
Yi Liu
Junzhe Yu
Huijia Sun
Ling Shi
Gelei Deng
Yuqi Chen
Yang Liu
Abstract
This work introduces ToxicDetector, a lightweight greybox method for detecting toxic prompts in LLMs. It uses toxic concept prompts, embedding-based features, and an MLP classifier to support efficient real-time prompt screening.
Type
Publication
39th IEEE/ACM International Conference on Automated Software Engineering (ASE)
ToxicDetector targets practical, low-latency screening of toxic and jailbreak-style prompts, making prompt safety checks more deployable in interactive LLM applications.
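The abstract names three ingredients: toxic concept prompts, embedding-based features, and an MLP classifier. The sketch below shows one plausible way these pieces could fit together; it is not the authors' implementation. It assumes that features are cosine similarities between a prompt embedding and each concept-prompt embedding, that the classifier has a single hidden layer, and that flagging uses a 0.5 threshold. The random vectors stand in for embeddings that would come from the model's internals, and all function and variable names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def concept_features(prompt_emb, concept_embs):
    """One feature per concept prompt: similarity to the user prompt (assumed feature design)."""
    return [cosine(prompt_emb, c) for c in concept_embs]


def mlp_score(features, w1, b1, w2, b2):
    """Forward pass of a one-hidden-layer MLP with ReLU and a sigmoid output."""
    hidden = [
        max(0.0, sum(f * w for f, w in zip(features, unit_weights)) + bias)
        for unit_weights, bias in zip(w1, b1)
    ]
    logit = sum(h * w for h, w in zip(hidden, w2)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


# Stand-in data: random vectors replace real embeddings, random weights replace a trained MLP.
rng = random.Random(0)
dim, n_concepts, n_hidden = 16, 4, 8
concept_embs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_concepts)]
prompt_emb = [rng.gauss(0, 1) for _ in range(dim)]
w1 = [[rng.gauss(0, 1) for _ in range(n_concepts)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [rng.gauss(0, 1) for _ in range(n_hidden)]
b2 = 0.0

features = concept_features(prompt_emb, concept_embs)
score = mlp_score(features, w1, b1, w2, b2)
flagged = score >= 0.5  # assumed decision threshold
```

Because the per-prompt work in this sketch is a handful of dot products and a small forward pass, it illustrates why a classifier of this shape can be cheap enough for real-time screening once the embeddings are available.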