Digger: Detecting Copyright Content Mis-usage in Large Language Model Training

Jan 1, 2024
Haodong Li, Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, Yang Liu, Guowen Xu, Guoai Xu, Haoyu Wang
Abstract
Large Language Models are trained on massive datasets that may contain copyrighted content without proper authorization. This work presents Digger, a novel approach to detecting whether specific copyrighted content has been used in LLM training, addressing important legal and ethical concerns in AI development.
Publication
arXiv preprint arXiv:2401.00676

Digger introduces methods for detecting whether copyrighted content has been used to train Large Language Models, providing tools for addressing legal and ethical concerns in AI development.