Rethinking Literary Plagiarism in LLMs through the Lens of Copyright Laws
Proceedings of the 16th Asian Conference on Machine Learning, PMLR 260:1000-1015, 2025.
Abstract
The swift advancement of generative Artificial Intelligence (AI) has outstripped the development of corresponding laws and regulations, making copyright infringement of books a significant public concern and sparking numerous legal disputes. Although the fair use doctrine provides an exemption for using copyrighted materials in training datasets without the copyright holder’s permission, content generated by such AI systems may still violate copyright laws. Previous research on copyright infringement has primarily focused on character-level analysis, which is narrower in scope than what copyright law requires. To address this challenge, we developed an LLM-based similarity measurement mechanism. We guided the generative AI to produce relevant book content by employing carefully crafted prompts. Subsequently, we created datasets by comparing this generated content with the original texts from famous books. We conducted a range of experiments, including several similarity detection techniques and plot plagiarism detection. The experimental results show that the AI-generated content (AIGC) is 78.72% similar to the original text, confirming that generative AI has the potential to infringe upon copyrights. Moreover, our study examines copyright infringement issues related to content generated by generative AI and to other domains such as code, images, and licensing. Our research will provide valuable insights for refining laws and regulations governing generative AI.