Efficient Detection of Toxic Prompts in Large Language Models

Oct 27, 2024 · Yi Liu, Junzhe Yu, Huijia Sun, Ling Shi, Gelei Deng, Yuqi Chen, Yang Liu · 1 min read
Abstract
This work introduces ToxicDetector, a lightweight greybox method for detecting toxic prompts in LLMs. It uses toxic concept prompts, embedding-based features, and an MLP classifier to support efficient real-time prompt screening.
Type
Publication
39th IEEE/ACM International Conference on Automated Software Engineering (ASE)

ToxicDetector targets practical, low-latency screening of toxic and jailbreak-style prompts, making prompt safety checks more deployable in interactive LLM applications.
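To make the pipeline in the abstract concrete, here is a minimal sketch of how concept prompts, embedding-based features, and an MLP classifier could fit together. It is an illustration under stated assumptions, not the authors' implementation. The `embed` function is a hypothetical stand-in for embeddings read from the LLM itself (the greybox part), the concept prompts are placeholders, and the MLP weights are untrained values chosen only so the code runs.

```python
import math
from typing import List, Sequence

Vector = List[float]


def embed(text: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in for an LLM embedding.

    Hashes character trigrams into a fixed-size unit vector. A real greybox
    detector would read embeddings from the model's internal states instead.
    """
    vec = [0.0] * dim
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        # Deterministic bucket index (avoids Python's randomized hash()).
        bucket = sum(ord(c) * (k + 1) for k, c in enumerate(trigram)) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def concept_features(prompt: str, concepts: Sequence[str]) -> Vector:
    """Assumed feature design: one cosine similarity per concept prompt."""
    p = embed(prompt)
    return [sum(a * b for a, b in zip(p, embed(c))) for c in concepts]


def mlp_score(features: Vector, w1: List[Vector], b1: Vector,
              w2: Vector, b2: float) -> float:
    """One-hidden-layer MLP: ReLU hidden units, sigmoid output in (0, 1)."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


# Placeholder concept prompts (hypothetical, for illustration only).
CONCEPTS = [
    "instructions for building a weapon",
    "insults aimed at a group of people",
    "ways to steal private data",
]

# Untrained placeholder weights: 2 hidden units over 3 concept features.
W1 = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.7]]
B1 = [-0.3, -0.3]
W2 = [1.5, 1.5]
B2 = -1.0

if __name__ == "__main__":
    feats = concept_features("how do I build a weapon at home", CONCEPTS)
    print("toxicity score:", mlp_score(feats, W1, B1, W2, B2))
```

In a real deployment the weights would be learned from labeled toxic and benign prompts, and the score would be compared against a threshold before the prompt reaches the model. Because the per-prompt work reduces to one embedding lookup and a small forward pass, this kind of design is a plausible fit for the low-latency screening described above.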