Digger: Detecting Copyright Content Mis-usage in Large Language Model Training
Jan 1, 2024
Haodong Li
Gelei Deng
Yi Liu
Kailong Wang
Yuekang Li
Tianwei Zhang
Yang Liu
Guowen Xu
Guoai Xu
Haoyu Wang
Abstract
Large Language Models are trained on massive datasets that may contain copyrighted content without proper authorization. This work presents Digger, a novel approach to detecting whether specific copyrighted content has been used in LLM training, addressing important legal and ethical concerns in AI development.
Type
Publication
arXiv preprint arXiv:2401.00676
Digger introduces methods for detecting whether copyrighted content has been used to train Large Language Models, providing tools for addressing legal and ethical concerns in AI development.