AI tools like ChatGPT require copyrighted material, according to OpenAI

AI firms face increasing scrutiny over training data content.

OpenAI has asserted that developing tools like its innovative chatbot ChatGPT would be impractical without access to copyrighted material. This statement comes as artificial intelligence firms face growing pressure regarding the content used to train their products.

Chatbots like ChatGPT and image generators such as Stable Diffusion are “trained” on vast datasets taken from the internet, much of which is protected by copyright, the legal protection against unauthorized use of someone’s work.

Last month, the New York Times filed a lawsuit against OpenAI and Microsoft, a major investor in OpenAI that uses its tools in its own products, alleging “unlawful use” of its work in the creation of their products.

In its submission to the House of Lords communications and digital select committee, OpenAI stated that it could not train large language models such as its GPT-4 model, which powers ChatGPT, without access to copyrighted content.

OpenAI said, “Because copyright today encompasses nearly every form of human expression – encompassing blog posts, photographs, forum contributions, snippets of software code, and government documents – training current top-tier AI models would be unattainable without utilizing copyrighted materials.” The submission was first reported by the Telegraph.

OpenAI argued that restricting training materials to out-of-copyright books and drawings would produce inadequate AI systems, stating, “Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”

In response to the lawsuit from the New York Times, OpenAI published a blogpost on its website, asserting, “We support journalism, partner with news organizations, and believe the New York Times lawsuit is without merit.” The company had previously emphasized its respect for “the rights of content creators and owners.” AI companies’ defense of their use of copyrighted material often hinges on the legal doctrine of “fair use,” which permits certain uses of content without seeking the owner’s permission. OpenAI, in its submission, stated its belief that “legally, copyright law does not forbid training.”

The New York Times lawsuit is the latest in a series of legal challenges against OpenAI. In September, 17 authors, including John Grisham, Jodi Picoult, and George RR Martin, filed a lawsuit against OpenAI, alleging “systematic theft on a mass scale.”

Getty Images, the owner of one of the world’s largest photo libraries, is suing Stability AI, the creator of Stable Diffusion, in both the US and England and Wales for alleged copyright infringements. In the US, a group of music publishers, including Universal Music, is suing Anthropic, the Amazon-backed company behind the Claude chatbot, accusing it of misusing “innumerable” copyrighted song lyrics to train its model.

In response to a question about AI safety in its House of Lords submission, OpenAI expressed support for independent analysis of its security measures. The submission indicated backing for “red-teaming” of AI systems, where third-party researchers test the safety of a product by simulating the behavior of rogue actors.

OpenAI is one of the companies that have agreed to collaborate with governments on safety testing their most powerful models before and after deployment, following an agreement reached at a global safety summit in the UK last year.