
OpenAI's ChatGPT may face a copyright quagmire after 'memorizing' these books


This top-drawer AI tech has a major science-fiction habit
Boffins at the University of California, Berkeley, have delved into the undisclosed depths of OpenAI’s ChatGPT and the GPT-4 large language model at its heart, and found they’re trained on text from copyrighted books.
Academics Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman describe their work in a paper titled, «Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4.»
"We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web," the researchers explain in their paper.
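One way to get a rough sense of whether a chat model has memorized a given book is a cloze-style probe: mask a distinctive word in a passage and check whether the model can restore it, repeating the test over many passages. The sketch below is illustrative only; the prompt wording, exact-match scoring, and use of the openai Python SDK are assumptions rather than the authors' exact setup.

```python
# A minimal sketch of a cloze-style memorization probe, assuming the
# openai Python SDK (v1+) and an OPENAI_API_KEY in the environment.
# The masked passage, prompt wording, and exact-match scoring are
# illustrative assumptions, not the authors' published protocol.
from openai import OpenAI

client = OpenAI()

def cloze_probe(masked_passage: str, expected: str, model: str = "gpt-4") -> bool:
    """Ask the model to restore a masked name in a book passage and check its guess."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "Fill in [MASK] with the single missing proper name. Reply with the name only.",
            },
            {"role": "user", "content": masked_passage},
        ],
        temperature=0,
    )
    guess = (response.choices[0].message.content or "").strip()
    return guess.lower() == expected.lower()

# Hypothetical usage: a high hit rate across many passages from one book
# would suggest the model has seen that text during training.
print(cloze_probe('"Yer a wizard, [MASK]," said Hagrid.', expected="Harry"))
```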
The team published its code and data on GitHub, and the list of books identified can be found in this Google Docs file.
GPT-4 was found to have memorized titles such as the Harry Potter children’s books, Orwell’s Nineteen Eighty-Four, The Lord of the Rings trilogy, the Hunger Games books, Hitchhiker’s Guide to the Galaxy, Fahrenheit 451, A Game of Thrones, and Dune, among others.
The authors note that science fiction and fantasy books dominate the list, which they attribute to the popularity of those titles on the web. They also point out that memorizing specific titles has downstream effects: for example, the models answer prompts such as "What year was this passage published?" more accurately when they have memorized the book in question.
Another consequence of the model's familiarity with science fiction and fantasy is that ChatGPT exhibits less knowledge of works in other genres. As the paper observes, it knows "little about works of Global Anglophone texts, works in the Black Book Interactive Project and Black Caucus American Library Association award winners."
Via Twitter, David Bamman, one of the co-authors and an associate professor in the School of Information at UC Berkeley, summarized the paper thus: "Takeaways: open models are good; popular texts are probably not good barometers of model performance; with the bias toward sci-fi/fantasy, we should be thinking about whose narrative experiences are encoded in these models, and how that influences other behaviors."
