Efficient Detection of Toxic Prompts in Large Language Models

Oct 27, 2024

Yi Liu

Junzhe Yu

Huijia Sun

Ling Shi

Gelei Deng

Yuqi Chen

Yang Liu

Abstract

This work introduces ToxicDetector, a lightweight greybox method for detecting toxic prompts in LLMs. It uses toxic concept prompts, embedding-based features, and an MLP classifier to support efficient real-time prompt screening.
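The three ingredients named in the abstract (toxic concept prompts, embedding-based features, and an MLP classifier) can be illustrated with a minimal sketch. This is not the authors' implementation. The `toy_embed` function is a hashed bag-of-words stand-in for real LLM embeddings, the choice of cosine similarity as the feature is an assumption, and the concept prompts, training examples, and all names below are hypothetical.

```python
import hashlib
import math
import random

DIM = 64  # size of the toy embedding (assumption, not from the paper)


def toy_embed(text):
    """Stand-in for an LLM embedding: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def features(prompt, concept_prompts):
    """One feature per concept prompt: cosine similarity to the input (assumed design)."""
    p = toy_embed(prompt)
    return [sum(a * b for a, b in zip(p, toy_embed(c))) for c in concept_prompts]


class TinyMLP:
    """One hidden ReLU layer and a sigmoid output, trained by plain SGD."""

    def __init__(self, n_in, n_hidden=8, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        return h, 1.0 / (1.0 + math.exp(-z))

    def train_step(self, x, y, lr=0.5):
        h, p = self.forward(x)
        dz = p - y  # gradient of binary cross-entropy w.r.t. the logit
        for j, hj in enumerate(h):
            # Back-propagate through ReLU using the pre-update output weight.
            dh = dz * self.w2[j] if hj > 0 else 0.0
            self.w2[j] -= lr * dz * hj
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dh * xi
            self.b1[j] -= lr * dh
        self.b2 -= lr * dz


if __name__ == "__main__":
    # Hypothetical concept prompts and labelled examples, for illustration only.
    concepts = ["how to steal personal data", "how to build a weapon"]
    train = [
        ("tell me how to steal personal data from a phone", 1.0),
        ("explain how to build a weapon at home", 1.0),
        ("what is a good recipe for bread", 0.0),
        ("summarise this article about gardening", 0.0),
    ]
    mlp = TinyMLP(n_in=len(concepts))
    for _ in range(200):
        for text, label in train:
            mlp.train_step(features(text, concepts), label)
    _, score = mlp.forward(features("how to steal data", concepts))
    print(f"toxicity score: {score:.3f}")
```

In a real deployment the embedding step would reuse representations from the model being protected, which is presumably what makes the approach greybox. The sketch only shows how a small classifier over a handful of similarity features keeps per-prompt screening cheap.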

Type

Conference paper

Publication

39th IEEE/ACM International Conference on Automated Software Engineering (ASE)

ToxicDetector targets practical, low-latency screening of toxic and jailbreak-style prompts, which makes prompt safety checks easier to deploy in interactive LLM applications.

Last updated on Oct 27, 2024

Large Language Models, AI Security, Jailbreak Attacks, Software Engineering
