Journal of Intellectual Property Law & Practice, 2024, Vol. 00, No. 00
DOI: https://doi.org/10.1093/jiplp/jpae102
Article
Copyright and AI training data—transparency to the rescue?
Downloaded from https://academic.oup.com/jiplp/advance-article/doi/10.1093/jiplp/jpae102/7922541 by guest on 27 February 2025
Adam Buick*
Lecturer in Law, School of Law, Ulster University, Belfast, UK.
*Email: a.buick@ulster.ac.uk.
Abstract
• Generative Artificial Intelligence (AI) models must be trained on vast quantities of data, much of which is composed of copyrighted material. However, AI developers frequently use such content without seeking permission from rightsholders, leading to calls for requirements to disclose information on the contents of AI training data. These demands have won an early success through the inclusion of such requirements in the EU's AI Act.
• This article argues that such transparency requirements alone cannot rescue us from the difficult question of how best to respond to the fundamental challenges generative AI poses to copyright law. This is because the impact of transparency requirements is contingent on existing copyright laws; if these do not adequately address the challenges presented by generative AI, transparency will not provide a solution. This is exemplified by the transparency requirements of the AI Act, which are explicitly designed to facilitate the enforcement of the right to opt out of text and data mining under the Copyright in the Digital Single Market Directive. Because the transparency requirements do not sufficiently address the underlying flaws of this opt-out, they are unlikely to provide any meaningful improvement to the position of individual rightsholders.
• Transparency requirements are thus a necessary but not sufficient measure to achieve a fair and equitable balance between innovation and protection for rightsholders. Policymakers must therefore look beyond such requirements and consider further action to address the complex challenge presented to copyright law by generative AI.
1. Introduction
Since the debut of ChatGPT in late 2022, the attention of policy makers around the world has been captured by generative Artificial Intelligence (AI)—that is, AI models that are capable of creating data such as text, images, audio or video content.1 Widely viewed as a technology with transformative potential, generative AI is expected by many to add trillions of dollars to the global economy over the next decade,2 prompting numerous governments to declare increased innovation and investment in the technology as a key policy goal.3 In addition to the promised opportunities, however, generative AI also presents policymakers with significant challenges, such as how to respond to the technology's potential to replicate or amplify existing biases and prejudices, spread misinformation and threaten the livelihoods of human workers (especially in the creative industries).4 The law is therefore faced with the difficult question of how the harms of generative AI can be mitigated without stifling innovation.

Given its role in governing the ownership and use of creative works, it is inevitable that copyright is one of the areas of law at the forefront of this regulatory challenge. Generative AI raises profound questions regarding the foundational assumptions that underpin copyright law,5 one of the most pressing of
1 Adam Zewe, 'Explained: Generative AI' (MIT Schwarzman College of Computing, 9 November 2023). Available at https://computing.mit.edu/news/explained-generative-ai/ (accessed 14 October 2024).
2 Bloomberg Intelligence, Generative AI 2024 Report (Bloomberg 2024).
3 See eg The White House (USA), Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, 30 October 2023. Available at https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/ (accessed 1 November 2024); Ministère de l'Économie, des Finances et de la Relance (France), Stratégie Nationale pour l'Intelligence Artificielle, 22 May 2024. Available at https://www.economie.gouv.fr/strategie-nationale-intelligence-artificielle (accessed 1 November 2024); HM Government (UK), National AI Strategy, 18 December 2022. Available at https://www.gov.uk/government/publications/national-ai-strategy (accessed 1 November 2024).
4 There is already evidence that generative AI is behind a recent drop in demand for the services of some freelance workers—see further Ozge Demirci, Jonas Hannane and Xinrong Zhu, 'Who Is AI Replacing? The Impact of Generative AI on Online Freelancing Platforms' (2024) CESifo Working Paper 11276.
5 See further Mark Lemley, 'How Generative AI turns Copyright Upside Down' (2024) 25 Columbia Science & Technology Law Review 190.
© The Author(s) 2024. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (https://creativecommons.
org/licenses/by-nc-nd/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not
altered or transformed in any way, and that the work is properly cited. For commercial re-use, please contact reprints@oup.com for reprints and translation
rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site–for further
information please contact journals.permissions@oup.com.
which concerns the data used to train generative AI models. These models require immense quantities of data, with the largest training datasets comprising millions of text documents, images, audio samples, or other forms of content.6 Most of this material is protected by copyright, but AI developers have frequently made little or no effort to seek the permission of rightsholders for the use of their works. As a result, rightsholders have brought dozens of court cases against AI developers in multiple jurisdictions on the grounds that this unauthorized use of their works constitutes copyright infringement.7 The AI developers counter that this use is covered by the various exceptions to the copyright holder's otherwise exclusive right to authorize the reproduction of their work.8 If rightsholders are successful in even some of these cases, the resulting damages could be sufficient to bankrupt even the largest AI developers.9 Thus while generative AI serves as a 'stress test' for copyright law,10 copyright law, in turn, poses a potentially 'existential threat' to generative AI.11

In tandem with this increased attention regarding AI training data, developers have become markedly more secretive regarding the contents of such data—a trend that few believe is coincidental. As a result, organizations representing rightsholders across the world are now calling for AI developers to be required by law to be transparent regarding the contents of their training datasets, with the aim of enabling rightsholders to enforce their rights over their content.12 These calls for action have already led to significant policy developments. At the intergovernmental level, the Hiroshima AI Process Principles, agreed by the G7 nations in 2023, call for the implementation of appropriate measures to protect personal data and intellectual property, including through appropriate transparency of training datasets.13 Legislators have introduced bills that would mandate training data transparency.14 Most notably of all, training data transparency requirements are a feature of the EU's AI Act, which was adopted in August 2024.15

This article argues that while such transparency requirements are not without merit, they will not by themselves rescue us from the complex task of balancing the interests of rightsholders, AI developers and society as a whole. This is because, in the context of copyright law, transparency requirements simply facilitate the enforcement of the law as it currently stands. Given that the legality of using copyright works to train generative AI models varies widely between jurisdictions, the impact of transparency requirements will thus also vary. Furthermore, given the scale of the challenges generative AI poses to the law of copyright, it is unlikely that present copyright laws will, by default, adequately address these issues in many (perhaps most) jurisdictions. Both these points are illustrated by the transparency provisions of the EU's AI Act, which are explicitly designed to facilitate the enforcement of the widely criticized right to 'opt-out' of the text and data mining (TDM) exception under Article 4 of the Copyright in the Digital Single Market (CDSM) Directive.16 Because the transparency provisions do not address the fundamental problems with the opt-out, individual creators are unlikely to see any significant material benefit from these transparency provisions—which will nevertheless place additional burdens on AI developers.

Transparency requirements are therefore a necessary but insufficient condition to achieve a desirable outcome in this area. Policymakers should instead look beyond such requirements and engage with the difficult question of how to balance the competing interests of all relevant stakeholders. This is as much a question of social priorities as legal mechanisms, and the answers will depend on the specific legal, economic and cultural contexts of different jurisdictions.
The remainder of this article is structured as follows. Section 2 offers a concise overview of how generative AI models are trained using data, along with arguments in favour of training data transparency and the methods by which it might be achieved. Section 3 discusses the copyright implications of the unauthorized use of copyrighted works in AI training data, with a focus on how the legality of such use (and consequently the impact of transparency requirements) varies significantly between jurisdictions. Section 4 provides a detailed examination of the transparency requirements of the EU's AI Act and argues that these are unlikely to provide meaningful material benefits to individual authors. Section 5 concludes.

2. Generative AI and training data
To fully grasp the significance of the debate around copyright and training data, it is necessary to first understand how such data is used to train a modern generative AI model. While AI models that could be described as 'generative' in some sense have existed for decades, the current wave of popular generative AI models such as OpenAI's GPT series or Stability AI's image generators are based

6 Zewe (n 1).
7 For example, the Authors Guild's case against OpenAI in the USA, the case filed by photographer Robert Kneschke against LAION in Germany, and the claim brought by Getty Images against Stability AI in the UK; for more details on these cases and (many) other examples, see further Mishcon de Reya, 'Generative AI – Intellectual Property Cases and Policy Tracker' (Mishcon de Reya, 12 August 2024). Available at https://www.mishcon.com/generative-ai-intellectual-property-cases-and-policy-tracker (accessed 1 November 2024).
8 See eg Stability AI, 'Response to USCO Inquiry on Artificial Intelligence and Copyright' (October 2023), 8; Hugging Face, 'Hugging Face Response to the Copyright Office Notice of Inquiry on Artificial Intelligence and Copyright' (November 2023), 9; Anthropic, 'Notification of Inquiry Regarding Artificial Intelligence and Copyright Public Comments of Anthropic PBC' (October 2023), 3; Google, 'Artificial Intelligence and Copyright' (October 2023), 8–11.
9 Elizabeth Lopatto, 'OpenAI Searches for an Answer to its Copyright Problems' (The Verge, 30 August 2024). Available at https://www.theverge.com/2024/8/30/24230975/openai-publisher-deals-web-search (accessed 1 November 2024).
10 Daryl Lim, 'Generative AI and Copyright: Principles, Priorities and Practicalities' (2023) 18 Journal of Intellectual Property Law & Practice 841.
11 Pamela Samuelson, 'Generative AI Meets Copyright' (University of California, Berkeley, 26 April 2023). Available at https://news.berkeley.edu/2023/05/16/generative-ai-meets-copyright-law/ (accessed 1 November 2024).
12 See eg Professional Photographers of America, 'PPA's Comments on the Copyright's Office NOI on Generative Artificial Intelligence' (Professional Photographers of America, 17 November 2023). Available at https://www.ppa.com/articles/ppas-comments-on-the-copyrights-office-noi-on-generative-artificial-intelligence (accessed 1 November 2024); Staff, 'Global Principles on Artificial Intelligence (AI)' (News/Media Alliance, 6 September 2023). Available at https://www.newsmediaalliance.org/global-principles-on-artificial-intelligence-ai/ (accessed 1 November 2024); CISAC, 'Australian Creators Welcome Establishment of Copyright and AI Reference Group' (CISAC, 12 December 2023). Available at https://www.cisac.org/Newsroom/society-news/australian-creators-welcome-establishment-copyright-and-ai-reference-group (accessed 1 November 2024).
13 Hiroshima Process International Guiding Principles for Organizations Developing Advanced AI Systems (2023), 5.
14 See eg California Senate Bill 942 'California AI Transparency Act' and US House Bill 7913 'Generative AI Copyright Disclosure Act of 2024'.
15 Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonized rules on artificial intelligence (Artificial Intelligence Act) [hereafter 'AI Act'], art 53(1)(c) and (d).
16 See further Eleonora Rosati, 'Copyright as an Obstacle or an Enabler? A European Perspective on Text and Data Mining and its Role in the Development of AI Creativity' (2019) 27 Asia Pacific Law Review 198; Thomas Margoni and Martin Kretschmer, 'A Deeper Look into the EU Text and Data Mining Exceptions: Harmonisation, Data Ownership, and the Future of Technology' (2022) 71 GRUR International 685; Paul Keller and Zuzanna Warso, Defining Best Practices of Opting Out of ML Training (Open Future 2023); Gina Maria Ziaja, 'The Text and Data Mining Opt-Out in Article 4(3) CDSMD: Adequate Veto Right for Rightholders or a Suffocating Blanket for European Artificial Intelligence Innovations?' (2024) 10 Journal of Intellectual Property Law & Practice 453.
on a subtype of machine learning known as 'deep learning'.17 Like other forms of machine learning, deep learning makes use of 'neural networks'—that is, connected units of nodes inspired by the structure of the human brain. What distinguishes deep learning from other neural network-based approaches is that it makes use of multiple layers of nodes, referred to as 'deep' layers. When information is passed through these deep layers, it is processed at different levels of complexity, with early layers typically identifying simple patterns and subsequent layers building on this foundation to recognize patterns of increasing complexity. This enables AI models to 'learn' from large quantities of data.18 In a generative AI model based on deep learning, the patterns and rules identified during the training process can then be leveraged to create new content.19 The extent to which the final generative AI model retains the data it has been trained on is not entirely clear. While it is generally accepted that AI models encode the patterns derived from the data during the deep-learning process as numerical parameters rather than storing the entire training dataset,20 in some cases generative AI models can recreate identical or near-identical copies of material found within their training data—a phenomenon known as 'memorization'.21 The copyright implications of this uncertainty regarding the retention of training data are discussed further in Section 3 below.

Modern generative AI models require truly mind-boggling quantities of training data. The case of GPT-3, the breakthrough Large Language Model (LLM) announced by OpenAI in 2020, provides an illustration. In the initial paper describing the development and capabilities of GPT-3, the authors revealed that the data the model had been trained upon consisted of a refined version of the Common Crawl dataset,22 a smaller dataset of higher-quality web-based text called WebText2,23 two datasets of books and the English-language version of Wikipedia.24 In addition to this general training data, an AI model may then be further trained on a smaller, curated dataset in order to 'fine-tune' its capabilities to a particular domain.25 As a result, a generative AI model may have been trained on millions of individual works.

2.1 Training data transparency
As noted in the introduction, there are widespread calls for the developers of AI models to be transparent regarding the content of their vast training datasets. While in many cases, these demands are driven by concerns regarding the unauthorized use of copyrighted content, it is worth noting that there are also non-copyright arguments in favour of training data transparency. For example, calls for training data transparency are also motivated by concerns that the content of training data may lead to biased or otherwise inequitable results.26 While all AI developers claim to test their models for such bias prior to release, training data transparency allows external parties—including those with different perspectives and priorities from the developers—to examine the training data in ways that would be beyond the scope of any individual team.27 Relatedly, transparency may help to build public trust in AI models by reducing the information asymmetry between model providers and consumers, thereby avoiding a 'market for lemons' scenario in which demand for generative AI models collapses.28

Despite these arguments in favour of transparency, however, AI developers have become markedly more secretive regarding their training data in recent years. Many major AI developers have shifted from detailed explanations of the training data used to train a particular model to single-sentence descriptions.29 For example, while OpenAI disclosed the main sources of data for GPT-3, the paper introducing GPT-4 revealed only that the data on which the model had been trained was a mixture of 'publicly available data (such as internet data) and data licensed from third-party providers'.30 The motivations behind this move away from transparency have not been articulated in any particular detail by AI developers, who in many cases have given no explanation at all. For its part, OpenAI justified its decision not to release further details regarding GPT-4 on the basis of concerns regarding 'the competitive landscape and the safety implications of large-scale models', with no further explanation within the report.31 Some limited additional elaboration about both arguments was subsequently provided by Ilya Sutskever, then OpenAI's Chief Scientist, in an interview in March 2023. Sutskever clarified that OpenAI believes that sharing further details regarding their training data would facilitate the replication of their cutting-edge AI models, while releasing detailed information regarding training data would enable careless or malicious actors to develop their own powerful AI models more easily.32

While these lines of reasoning have some merit, there are also clear objections to both arguments from a public policy perspective. Preventing rivals from replicating innovative technology without investing comparable resources is a common goal for
17 All references to 'generative AI' throughout this article should be understood as meaning deep learning-based generative AI models unless otherwise stated.
18 Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning (MIT Press 2016) 6–8.
19 Zewe (n 1).
20 Pamela Samuelson, 'Generative AI Meets Copyright' (2023) 381 Science 158, 159; Matthew Sag, 'Copyright Safety for Generative AI' (2023) 61 Houston Law Review 295, 316–321.
21 See further Ivo Emanuilov and Thomas Margoni, 'Forget Me Not: Memorisation in Generative Sequence Models Training on Open Source Licensed Code' (2024) SSRN. Available at https://ssrn.com/abstract=4720990 (accessed 1 November 2024).
22 The Common Crawl database consists of 'web page data, metadata extracts, and text extracts' taken from the internet since 2008; Common Crawl, 'Overview' (Common Crawl, 2024). Available at https://commoncrawl.org/overview (accessed 1 November 2024).
23 The WebText2 dataset is composed of the text of popular outbound links from the social media site Reddit; The New York Times Company v OpenAI LP and Microsoft Corporation, Complaint, US District Court for the Southern District of New York, filed 27 December 2023, page 26. Available at https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf (accessed 1 November 2024).
24 Tom Brown and others, 'Language Models are Few-shot Learners' (2020) arXiv preprint arXiv:2005.14165, 9.
25 Dave Bergmann, 'What is Fine-tuning?' (IBM, 15 March 2024). Available at https://www.ibm.com/topics/fine-tuning (accessed 1 November 2024).
26 Mark Lemley and Bryan Casey, 'Fair Learning' (2021) 99 Texas Law Review 743, 757.
27 Yacine Jernite, 'Training Data Transparency in AI: Tools, Trends, and Policy Recommendations' (Hugging Face, 5 December 2023). Available at https://huggingface.co/blog/yjernite/data-transparency (accessed 1 November 2024).
28 For an in-depth discussion of the 'market for lemons' phenomenon, see George Akerlof, 'The Market for "Lemons": Quality Uncertainty and the Market Mechanism' (1970) 84 The Quarterly Journal of Economics 488. For a discussion of how the requirement to be transparent with safety and efficacy works to the benefit of companies developing new products in other contexts, such as the pharmaceutical industry, see further Ariel Katz, 'Pharmaceutical Lemons: Innovation and Regulation in the Drug Industry' (2007) 14 Michigan Telecommunications & Technology Law Review 1.
29 Jernite (n 27).
30 OpenAI, 'GPT-4 Technical Report' (2023) arXiv preprint arXiv:2303.08774, 2.
31 ibid.
32 Sutskever said 'On the competitive landscape front — it's competitive out there… GPT-4 is not easy to develop. It took pretty much all of OpenAI working together for a very long time to produce this thing. And there are many many companies who want to do the same thing, so from a competitive side, you can see this as a maturation of the field… On the safety side … [t]hese models are very potent and they're becoming more and more potent. At some point it will be quite easy, if one wanted, to cause a great deal of harm with those models. And as the capabilities get higher it makes sense that you don't want want [sic] to disclose them'; James Vincent, 'OpenAI co-founder on company's past approach to openly sharing research: "We were wrong"' (The Verge, 15 March 2023). Available at https://www.theverge.com/2023/3/15/23640180/openai-gpt-4-launch-closed-research-ilya-sutskever-interview (accessed 1 November 2024).
many firms, but one that is only occasionally in the public interest; furthermore, the absence of training data transparency requirements could itself facilitate anti-competitive practices, such as by enabling the largest AI developers to enter into preferential licensing agreements with entities with access to large pools of training data.33 And while the proliferation of potentially dangerous AI technology is a valid concern, this argument could justify any measure aimed at hindering the market entry of competitors. It is not clear why restricting access to training data would be an especially effective way to prevent dangerous AI tools from falling into the hands of bad actors, especially compared to restricting access to more sensitive information such as the weights of a particular AI model.34 Additionally, as noted above, failure to disclose details of AI training data also has the potential to cause harm by making it more difficult for regulators and third parties to identify potentially harmful or discriminatory behaviour that results from that data. In short, even if there are benefits to withholding information regarding training data, these come with obvious downsides; it would be inappropriate to leave the decision on how best to balance the competing concerns solely to AI companies, given that these companies have a vested interest in preventing the release of such data.

The official arguments that have been presented against transparency are therefore largely unconvincing; it is widely speculated that the primary motivation behind the increasing opacity with regard to training data is instead a desire by AI developers to avoid or minimize liability for infringement of copyright present in the training data. An investigation by the Washington Post in April 2023 reported that many companies involved in the development of AI do not even keep internal records of their training data because of fears that this could be used as evidence of copyright infringement or breach of data protection law.35 While the use of copyrighted content in training data is not necessarily infringement in many jurisdictions, as further discussed below, AI developers certainly have little to gain by inviting potential liability in this area.

2.2 Models of transparency
Training data transparency has multiple benefits, and the arguments against providing such transparency are not particularly persuasive. How, then, can such transparency be achieved? There are several different approaches, which provide varying levels of information regarding the contents of the data in question. The approach that achieves the highest degree of transparency is for AI developers to make the datasets used to train a particular AI model fully publicly accessible. Under this 'full access' approach, third parties (including rightsholders) can then view the training data themselves, and independently verify the content that has been used. Given that an AI developer must have access to the full dataset in order to complete the training process, this approach should always be possible in principle (assuming that the developer has not deleted the data once training is complete). Indeed, some developers have taken this approach, making full copies of their training data available online.36

Providing full access to the training data is unlikely to be workable for the majority of AI models, however. First, there are logistical challenges associated with hosting an accessible repository of a training dataset containing hundreds of thousands or millions of individual works. Secondly, such an approach is, ironically, likely to come into conflict with copyright law; even if a rightsholder has agreed for their work to be included in a dataset through a licensing deal, they are unlikely to be happy for their works to effectively be made freely available for other developers through the fully accessible dataset. Similar issues arise regarding any personal data that might be contained within the dataset. One way to avoid the copyright and personal data issues of the full access approach is to permit users to request access to restricted parts of a dataset, which can then be approved or declined by the dataset owner upon the provision and verification of credentials—this has been referred to as 'gated access'.37 Creating and maintaining the infrastructure necessary to manage gated access to a dataset, however, adds to the already significant logistical issues associated with full access.

Most discussions around increasing training data transparency focus on the provision of some kind of summary that provides key information on the dataset. This is much less burdensome than providing direct access to the content of the dataset—but its usefulness is heavily dependent on the information that is contained in the summary. In theory, such a summary could contain metadata on each item within a dataset—for example, the title, URL (if relevant), author, date of publication etc—thus allowing individual works to be identified. However, such data are often inaccurate or non-existent, especially in the case of data scraped directly from the internet. Providing or ensuring the accuracy of even basic information such as title or author for individual items would be resource intensive, potentially driving out smaller developers and thereby increasing market concentration.38 A number of frameworks for providing detailed summaries of training data without listing individual items have already been developed within the AI community—for example, The Dataset Nutrition Label and Datasheets for Datasets.39

From the perspective of a rightsholder concerned that their works may have been used without their permission, transparency is useful chiefly to the extent that it enables them to establish whether or not a particular work appears in a given dataset. If training data summaries do not identify individual works, rightsholders will at least want a clear explanation of the sources of the data used—for example, existing datasets, the domains of data scraped from the internet, etc—to assess the likelihood that their works were used to train a specific AI model. This assumes, however, that once the unauthorized use of a given work has been identified, bringing and succeeding with a copyright infringement claim will be straightforward. As discussed in the following section, in practice the situation is more complicated, and varies significantly between jurisdictions.
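To make the summary-based approach concrete, the following sketch shows how a per-item metadata manifest of the kind described above might be queried by a rightsholder. The manifest structure, field names and entries are hypothetical assumptions for illustration only; no existing disclosure framework is implied.

```python
# Hypothetical per-item training-data summary ("manifest") and a
# rightsholder-side membership check. All fields and data are
# illustrative assumptions, not any real disclosure format.

manifest = [
    {"title": "Example Poem", "author": "A. Author",
     "url": "https://example.org/poem"},
    # Scraped items often lack reliable metadata, as noted above.
    {"title": None, "author": None,
     "url": "https://example.com/scraped-page"},
]

def find_work(manifest, title=None, author=None, url=None):
    """Return entries matching a URL, or a title (optionally with author)."""
    matches = []
    for item in manifest:
        if url is not None and item.get("url") == url:
            matches.append(item)
        elif title is not None and item.get("title") == title and (
                author is None or item.get("author") == author):
            matches.append(item)
    return matches

# A rightsholder checks whether their work appears in the summary.
print(len(find_work(manifest, title="Example Poem", author="A. Author")))  # 1
print(len(find_work(manifest, title="Unlisted Novel")))                    # 0
```

As the second entry illustrates, scraped items frequently lack reliable title or author metadata, which is why summaries that identify only data sources leave rightsholders assessing the mere likelihood of inclusion rather than confirming it.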
33 Zuzanna Warso, Maximilian Gahntz and Paul Keller, Sufficiently Detailed? A Proposal for Implementing the AI Act's Training Data Transparency Requirements for GPAI (Open Future, 2024).
34 Lawrence Lessig, 'Not all AI Models should be Freely Available, Argues a Legal Scholar' (The Economist, 29 July 2024). Available at https://www.economist.com/by-invitation/2024/07/29/not-all-ai-models-should-be-freely-available-argues-a-legal-scholar (accessed 1 November 2024).
35 Kevin Schaul, Szu Yu Chen and Nitasha Tiku, 'Inside the Secret List of Websites that make AI like ChatGPT Sound Smart' (The Washington Post, 19 April 2023). Available at https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/ (accessed 1 November 2024).
36 See eg Zhengzhong Liu and others, 'LLM360: Towards Fully Transparent Open-Source LLMs' (2023) arXiv preprint arXiv:2312.06550.
37 Aleck Tarkowski and Zuzanna Warso, Commons-Based Data Set Governance for AI (Open Future, 2024), 10.
38 Katharina de la Durantaye, 'Garbage in, Garbage Out' (2023), 17. Available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4572952 (accessed 1 November 2024).
39 Sarah Holland and others, 'The Dataset Nutrition Label' (2018) 12 Data Protection and Privacy 1; Timnit Gebru and others, 'Datasheets for Datasets' (2021) 64 Communications of the ACM 86.
3. Training data and copyright

3.1 The use of copyright content in machine learning

As a starting point, it is certainly the case that a large majority of the content used to train most generative AI models will be protected by the law of copyright. This is an inevitable result of the fact that copyright arises automatically for any work that meets a minimal set of requirements,40 and that the term of this protection is lengthy—at least the life of the author plus 50 years.41 While alternatives to making use of works that raise copyright issues do exist, these are not a viable replacement for the use of copyright works, at least at the time of writing. For example, while there is a large body of public domain works for which copyright has expired, almost all of these works will date from before the 1950s—any AI trained exclusively on such works would therefore be hopelessly outdated.42 Some copyrighted material, such as Wikipedia, is licensed under Creative Commons (CC) or other 'copyleft' licences which may permit the use of that content for training generative AI models.43 However, since only a small fraction of copyright-protected content has been made available under such licences, it is unlikely that cutting-edge generative AI models could be trained exclusively on CC-licensed content.44 Additionally, the output of AI models trained using such content might be bound by the 'Share Alike' obligations CC licences often impose on derivative works—something that commercial AI developers would likely wish to avoid.45 It has also been suggested that at some point in the future some or all of the real-world data used to train AI systems could be replaced with 'synthetic data'—that is, AI-generated data that is intended to closely resemble real-world data and thus act as a replacement for it.46 However, AI experts are divided on whether synthetic data will ever be able to meaningfully replace real-world data, and if so, when this will be possible.47

The fact that most AI training data are protected by copyright raises a number of problems. As noted above, deep learning-based generative AI models do not typically store copies of their training data—rather, the patterns derived from the data are encoded as numerical parameters. For this reason, many academic commentators have concluded that, in most cases, generative AI models cannot be considered to infringe the copyright in any of the works they were trained upon by their mere existence.48 However, this view is complicated somewhat by the phenomenon of memorization, whereby generative AI models can sometimes reproduce verbatim or near-verbatim portions of their training data.49 It is therefore conceivable that a court might still find a generative AI model to 'contain' a work on the basis of its ability to reproduce that work, even if the data is not stored within the model's memory as it would be on a hard drive. Memorization also raises the issue of generative AI models infringing copyright through their output; if the output of a model directly reproduces some part of its training data, this could clearly be a potential infringement of the right of reproduction.50 However, even if a model's output is not a verbatim or near-verbatim copy of any part of its training data, it might nevertheless infringe copyright if it contains recognizable protected elements of a work, such as a fictional character.51 Beyond the reproduction right, model output might also infringe other exclusive rights, such as the right to authorize translations of a work, adaptations of a work, or to communicate a work to the public.52

For the purposes of this article, however, the most important issue is the fact that, in the majority of cases, the training data must be reproduced at least once as part of the training process.53 This is at least arguably a prima facie infringement of the right of reproduction. It should be noted that some commentators have argued that both acts of temporary electronic reproduction and 'non-expressive' uses of works should not fall within the scope of copyright protection at all; if this were the case, most (if not all) of the copying involved in the training of a generative AI model would simply fall outside the scope of copyright protection entirely.54 However, most of the current debate around the use of copyrighted materials in AI training data is based on the assumption that the reproductions which take place during the training of a generative AI model are indeed copyright-relevant acts, and therefore require the permission of the rightsholder unless a relevant exception applies—a position that has already been confirmed by official sources in both the UK and EU.55

Clearing the rights for the extremely large number of works used to train an AI model would be exceedingly difficult. Even setting aside the expense of paying some kind of licence fee for the use of each work, the transaction costs associated

40 Berne Convention for the Protection of Literary and Artistic Works (1886) art 5.
41 Agreement on Trade-Related Aspects of Intellectual Property (1994) art 12. In many developed countries, including the USA, UK and EU Member States, the term of protection is the life of the author plus 70 years.
42 Aside from the fact that they are less likely to have been digitized, such works will be ignorant of modern developments, will use outdated language and will be less likely to represent authors from marginalized backgrounds. Moreover, they will be significantly more likely to contain views that we would now rightly recognise as abhorrent. Sag (n 20), 338.
43 This depends on the conditions of a particular CC licence. In their online FAQ, the Creative Commons organization notes that 'If someone uses a CC-licensed work with any new or developing technology, and if copyright permission is required, then the CC license allows that use without the need to seek permission from the copyright owner so long as the license conditions are respected' [emphasis added]; Creative Commons, 'Frequently Asked Questions' (Creative Commons, 6 June 2024), available at https://creativecommons.org/faq/#what-are-the-limits-on-how-cc-licensed-works-can-be-used-in-the-development-of-new-technologies-such-as-training-of-artificial-intelligence-software (accessed 1 November 2024).
44 For an interesting attempt to overcome this problem, however, see Aaron Gokaslan et al, 'CommonCanvas: Open Diffusion Models Trained on Creative-Commons Image' (2024) Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 8250.
45 See further Kacper Szkalej and Martin Senftleben, 'Generative AI and Creative Commons Licences: The Application of Share Alike Obligation to Trained Models, Curated Datasets and AI Output' (2024), available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4872366 (accessed 1 November 2024). Note that as things currently stand, CC licences generally state that 'Share Alike' obligations do not apply if the work is used under a copyright exception; Szkalej and Senftleben (2024), 12.
46 James Jordan and others, 'Synthetic Data—What, Why and How?' (2022) arXiv preprint arXiv:2205.03257, 4.
47 ibid 36.
48 De la Durantaye (n 38), 4–6; Sag (n 20), 313–25; Szkalej and Senftleben (n 45), 8.
49 See further Nicholas Carlini et al, 'Quantifying Memorization Across Neural Language Models' (2022) arXiv preprint arXiv:2202.07646.
50 The right of reproduction is set out in the Berne Convention (1886) at art 9(1). Most (but not all) of the Berne Convention member states have gone on to sign the WIPO Copyright Treaty (1996), the Agreed Statements to which clarify that 'the storage of a protected work in digital form in an electronic medium constitutes a reproduction within the meaning of Article 9 of the Berne Convention'; WIPO, 'Agreed statements concerning the WIPO Copyright Treaty' (20 December 1996) TRT/WCT/002, 1.
51 Matthew Sag has dubbed this the 'Snoopy Problem'. Sag (n 20), 327.
52 The exclusive rights to authorise the translation of a work or any adaptations of a work are set out at art 8 and art 12, respectively, of the Berne Convention (1886), while the exclusive right to communicate a work to the public is set out at art 8 of the WIPO Copyright Treaty (1996).
53 Lemley and Casey (n 26), 753.
54 See eg Jenny Quang, 'Does training AI violate copyright law?' (2021) 36 Berkeley Technology Law Review 1407; Matthew Jockers, Matthew Sag and Jason Schultz, 'Don't let copyright block data mining' (2012) 490 Nature 29.
55 UK Intellectual Property Office, 'Consultation outcome: Artificial intelligence call for views: copyright and related rights' (Gov.uk, 23 March 2021), available at https://www.gov.uk/government/consultations/artificial-intelligence-and-intellectual-property-call-for-views/artificial-intelligence-call-for-views-copyright-and-related-rights (accessed 1 November 2024); as discussed below, this is also confirmed by the EU AI Act at Recital 105.
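The memorization phenomenon discussed in Section 3.1 is typically demonstrated by testing whether model output reproduces verbatim spans of known training text. The following is a deliberately crude illustration of such a check; the eight-word window and the function names are arbitrary choices for the sketch, not a method drawn from the cited literature.

```python
# Crude illustration (not from the cited literature): flag possible
# memorization by looking for any fixed-length word sequence that a
# model's output shares verbatim with a known training text.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of n-word sequences in a text (case-folded)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(output: str, training_text: str, n: int = 8) -> bool:
    """True if any n-word sequence in the output also appears verbatim
    in the training text -- a rough signal of memorization."""
    return bool(ngrams(output, n) & ngrams(training_text, n))
```

Real memorization studies are far more careful (about tokenization, near-verbatim matches and prompt construction), but even this toy check conveys why verbatim output is legally salient: it is directly comparable against the training corpus.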
with identifying and negotiating with individual rightsholders would be prohibitive.56 As already noted, AI developers have largely sidestepped this problem by simply making use of content without any meaningful effort to identify or seek permission from the rightsholders.57 This unauthorized reproduction of the copyrighted works forms the basis of the case against the AI developers in the majority of the ongoing training data litigation.58

3.2 Training data and copyright exceptions

As noted in the introduction, the defence of the AI developers to the accusations of mass infringement lies in the exceptions and limitations to copyright protection.59 However, as detailed below, the scope and application of these exceptions and limitations differ considerably between jurisdictions—which in turn profoundly influences the potential impact of introducing requirements for training data transparency.

3.2.1 The EU

Under EU law, Member States must provide a closed system of copyright exceptions—that is, the reproduction of a work without prior authorization will only be permitted if it falls within one of several specific exceptions. While a number of such exceptions are relevant to the process of training a generative AI model,60 the most significant of these comes from Articles 3 and 4 of the 2019 CDSM Directive, which permit 'reproductions and extractions of lawfully accessible works and other subject matter' for the purposes of text and data mining (TDM). While the CDSM Directive preceded the current hype around generative AI by several years, TDM is given a broad definition which covers most forms of machine learning,61 and the AI Act explicitly acknowledges the relevance of the TDM exceptions to the AI training process.62 Article 3 of the CDSM Directive permits TDM by 'research organizations and cultural heritage institutions' for the purposes of scientific research,63 while Article 4 permits TDM for any purpose—including by private companies for commercial reasons.64 Significantly, however, the latter of these exceptions is predicated on the condition that the works and other subject matter being used for the purposes of TDM have not been 'expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online' at Article 4(3).65 The preamble to the Directive specifies that machine-readable means are the only appropriate means of reserving rights for content made publicly available online.66

This effective 'veto' over the use of a work for commercial TDM purposes was included to strengthen the position of rightsholders, theoretically allowing them to negotiate access to their works.67 This has led to criticism that the opt-out will inhibit the AI industry in the EU by increasing the cost of developing new AI models.68 However, there is considerable doubt as to the effectiveness of the opt-out in achieving its goal in practice. Commentators have identified two major barriers to the use of the opt-out. Firstly, there are currently no generally recognized standards or protocols for a machine-readable means of opting out of the Article 4 exception; consequently, there is no means for rightsholders to consistently reserve their rights, particularly for online content.69 Secondly, and more fundamentally, unless rightsholders are aware that their work has been used for the purposes of TDM, they have no way of knowing whether or not their opt-out has been respected.70 This significantly limits the usefulness of the opt-out in practice, and consequently undermines the bargaining power that the provision was meant to provide to rightsholders. As discussed further in Section 4, the transparency provisions of the EU's AI Act are explicitly aimed at addressing this deficiency.

3.2.2 The USA

In the USA, in contrast, the copyright and training data question turns on the doctrine of fair use. Fair use is an open exception to copyright, meaning there is no predefined list of activities permitted by the defence; rather, the fairness of a particular use must be addressed on a case-by-case basis. US jurisprudence emphasizes four factors as especially important in making such a determination—these are the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used in relation to the copyrighted work as a whole, and the effect of the use upon the potential market for or value of the copyrighted work.71

It is not yet clear under what circumstances, if any, the use of copyrighted material for the purposes of training a generative AI system will constitute fair use under US law. However, jurisprudence over the past 20 years has shown that the 'purpose and character' factor is especially important; almost all high-profile fair use cases in that time period have turned on the question of whether the purpose and character of the use in question can be said to be 'transformative'.72 A use is transformative if it 'adds something new, with a further purpose or different character, altering the first with new expression, meaning, or message'.73 This would seem to support the argument that the use of a work to train a generative AI model is covered by the defence, given that the purpose of such training is to allow the model to generate new content. There is precedent for finding large-scale machine copying to be fair use based on its transformational character, most

56 Lemley and Casey (n 26), 759.
57 Lopatto (n 9).
58 Mishcon de Reya (n 7).
59 See eg Anthropic PBC, Notification of Inquiry Regarding Artificial Intelligence and Copyright, Public Comments of Anthropic PBC (Anthropic PBC, 2023), available at http://www.openfuture.eu/wp-content/uploads/2023/11/231111_copyright_offoce_noi_anthropic.pdf (accessed 1 November 2024).
60 For example, the exception for 'temporary acts of reproduction' under art 5(1) of Directive 2001/29/EC of the European Parliament and of the Council of 22 May 2001 on the harmonization of certain aspects of copyright and related rights in the information society [InfoSoc Directive].
61 The definition given states that text and data mining means 'any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations'; Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market [hereafter CDSM Directive] art 2(2).
62 AI Act Recital 105.
63 While generally better received than the art 4 exception, art 3 of the CDSM Directive has been criticised on the grounds that it prevents researchers not affiliated with a research organisation or cultural heritage institution from benefitting from the exception, even if they are operating in the same manner as their institutionally-affiliated peers; Christophe Geiger, Giancarlo Frosio and Oleksandr Bulayenko, 'Text and Data Mining: Articles 3 and 4 of the Directive 2019/790/EU' (2019) Centre for International Intellectual Property Studies Research Paper No 2019–08, 32, available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3470653 (accessed 1 November 2024).
64 CDSM Directive arts 3–4.
65 CDSM Directive art 4(3).
66 CDSM Directive Recital 18.
67 Ziaja (n 16), 454.
68 Artha Dermawan, 'Text and Data Mining Exceptions in the Development of Generative AI Models: What the EU Member States could Learn from the Japanese "Nonenjoyment" Purposes?' (2023) 27 The Journal of World Intellectual Property 44, 53.
69 However, this may change in future; see further Paul Keller and Zuzanna Warso, Defining Best Practices of Opting Out of ML Training (Open Future, 2023).
70 Ziaja (n 16), 456.
71 17 USC s 107.
72 Daniel Gervais, 'A Social Utility Conception of Fair Use' (2022) Vanderbilt Law Research Paper 22–35, 3.
73 Campbell v Acuff-Rose Music, Inc 510 US 569 (1994).
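The lack of a single recognized opt-out standard noted in Section 3.2.1 means that a crawler attempting to honour Article 4(3) reservations must today check a patchwork of candidate signals, for example a robots.txt exclusion rule or the 'tdm-reservation' response header proposed by the draft W3C TDM Reservation Protocol (TDMRep). The sketch below is illustrative only: the crawler name is hypothetical, HTTP header case-handling is simplified, and neither signal has formally mandated status under the CDSM Directive.

```python
from urllib.robotparser import RobotFileParser


def rights_reserved(robots_txt: str, headers: dict[str, str],
                    user_agent: str, url: str) -> bool:
    """Treat either of two candidate signals as a reservation of rights:
    1. a robots.txt rule barring this crawler from the URL; or
    2. a TDMRep-style 'tdm-reservation: 1' response header.
    (Simplified: real HTTP header names are case-insensitive.)"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        return True
    return headers.get("tdm-reservation", "0").strip() == "1"
```

Because each rightsholder may use a different signal (or none at all), a conscientious crawler has no single place to look; this fragmentation is precisely the standardization gap the article describes.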
notably the Authors Guild v HathiTrust74 and Authors Guild v Google75 cases. Of course, this must be set against the fact that the use of a work as part of the training data for a generative AI model is usually done for commercial reasons, involves that work being copied in its entirety, and may reduce the demand for the original work by enabling the production of similar, competing works. It is worth noting that in Google, the transformative nature of the use was seen as enough to outweigh the fact that Google is a for-profit company and that the entire works were copied,76 although the machine copying in that case did not raise the same concerns about market impact.

3.2.3 Other jurisdictions

The approach to AI-relevant copyright exceptions varies further amongst other jurisdictions. Some have introduced bespoke exceptions that enable machine learning from copyrighted materials, like the EU—although generally without the option of opting out. Japan, for example, introduced an exception in 2018 that permits the use of copyright materials without rightsholder permission for a broad range of computing-related uses, including TDM.77 Similarly, Singapore has recently introduced a new copyright exception that permits copying for 'computational data analysis'.78 A number of jurisdictions have open-ended copyright exceptions that may or may not apply to the training of AI models, as in the USA—significantly, this group includes China, the world's second most important hub for AI development.79 In some countries with such open exceptions, governments have moved to clarify that the use of copyright materials in training data is covered by existing exceptions—for example, the Israeli Ministry of Justice has issued an opinion that in most circumstances, the use of copyrighted materials is permitted under the existing fair use doctrines of Israeli copyright law.80 A third group of jurisdictions have a closed list of exceptions that either do not permit, or place significant limitations on, TDM or other machine-learning related uses of copyrighted materials. The UK, for example, permits TDM of copyrighted works only for non-commercial uses.81

3.2.4 Likelihood of future divergence

This range of approaches to the unauthorized use of copyrighted materials in training data is likely to diverge further, both because some governments will likely seek to introduce further exceptions to facilitate training in order to promote the AI industry domestically, and because challenges to pro-training copyright exceptions are probable. Particularly relevant to this second point is the fact that in the overwhelming majority of countries, any exception to the reproduction right must conform to the 'three-step test' under the Berne Convention; that is, the exception must be for a specific purpose, not conflict with the normal exploitation of the work, and not unreasonably harm the legitimate interests of the author.82 Generally speaking, a generative AI model will be capable, inter alia, of creating content that is similar to that on which it was trained; such content is a potential substitute for the original, and may therefore have a negative impact on its value. This means that exceptions that permit the unauthorized use of copyrighted materials for the purposes of training generative AI models may fall foul of the third element of the Berne three-step test, especially when the models are being trained for commercial purposes.83 The need to take account of the potential economic impact on the rightsholder is explicitly reflected in the relevant exception in some jurisdictions.84 However, copyright exceptions in all Berne Convention signatory countries are vulnerable to this line of argument. An interpretation of the relationship between exceptions relevant to AI training and the three-step test that leads to further harmonization may eventually emerge from an international body, such as a WTO Dispute Settlement Panel. Until then, however, the test will be interpreted and applied by national courts, likely leading to further divergence in national approaches to these exceptions.

3.3 The varied impact of training data transparency

The lawfulness of using copyrighted materials without prior authorization from the rightsholder therefore varies significantly between legal systems—in some, the reproductions involved in the training process may not even constitute copyright-relevant acts, while in others these will require the explicit permission of the rightsholder outside of limited exceptions, with most jurisdictions falling somewhere in between. Consequently, the impact of a legal requirement for training data transparency will also vary. For example, regardless of whether or not the use of works in training data is found to be covered by fair use under US law, transparency requirements would have a very different impact in the USA compared to the EU, since fair use does not allow rightsholders to opt out while the EU's Article 4(3) CDSM exception does.85 While transparency can offer a valuable tool for the scrutiny of the content used in training data, its impact is ultimately constrained by the underlying copyright framework within a particular jurisdiction. This exposes an apparent flaw in the reasoning of the pro-rightsholder organisations mentioned in the introduction, whose advocacy for training data transparency requirements across various jurisdictions appears to be based on the assumption that similar requirements will lead to similar (pro-rightsholder) outcomes, irrespective of the surrounding legal context.

4. Training data transparency in the AI Act

The training data provisions of the EU's recently passed AI Act exemplify both how the impact of transparency requirements is

74 Authors Guild v HathiTrust 755 F 3d 87 (2d Cir 2014).
75 Authors Guild v Google 804 F 3d 202 (2d Cir 2015).
76 ibid.
77 Japan, Amendment of the Copyright Act 2018, art 30–4. As Tatsuhiro Ueno notes, this exception is broader than that found in the CDSM Directive since it applies both to commercial and non-commercial uses, does not permit opt-outs from rightsholders, permits exploitation 'by any means', and does not require 'lawful access'. This is balanced by the fact that the exception does not apply if the exploitation 'would unreasonably prejudice the interests of the copyright owner.' Tatsuhiro Ueno, 'The flexible copyright exception for "non-enjoyment" purposes – recent amendments in Japan and its implication' (2021) 70 GRUR International 145.
78 Singapore, Copyright Act 2021, s 244(2)(a). See further David Tan, 'Designing a Future-Ready Copyright Regime in Singapore: Quick Wins and Missed Opportunities' (2021) 70 GRUR International 1131.
79 Yudong Chen, 'The Legality of Artificial Intelligence's Unauthorised Use of Copyrighted Materials under China and U.S. Law' (2023) 63 IDEA 241, 260.
80 Israel, Ministry of Justice, Opinion: Uses of Copyrighted Materials for Machine Learning (2022), 3.
81 UK, Copyright, Designs and Patents Act 1988, s 29A.
82 Berne Convention (1886) art 9(2). For further discussion on the Berne three-step test, especially as it applies to TDM provisions, see Eleonora Rosati, 'No Step-free Copyright Exceptions: The Role of the Three-step in Defining Permitted Uses of Protected Content (Including TDM for AI-Training Purposes)' (2024) 46 European Intellectual Property Review 262.
83 Saliltorn Thongmeensuk, 'Rethinking Copyright Exceptions in the Era of Generative AI: Balancing Innovation and Intellectual Property Protection' (2024) The Journal of World Intellectual Property 278, 284.
84 For example, under the fourth element of the fair use test in the USA. Similarly, the Japanese exception only applies if the use in question would not unreasonably prejudice the interests of the copyright owner; Amendment of the Copyright Act 2018, art 30–4.
85 de la Durantaye (n 38), 7.
determined by a jurisdiction's existing copyright laws and the limitations of implementing transparency requirements without also appropriately revisiting and, if necessary, revising those laws. As noted above, the transparency provisions contained within the AI Act are explicitly intended to facilitate the opt-out mechanism contained in Article 4(3) of the CDSM Directive. While these requirements will likely provide some useful information on the sources of the training data of AI models deployed in the EU, they are unlikely to meaningfully improve the position of rightsholders, due in part to the pre-existing flaws in the opt-out they are designed to give effect to.

The transparency requirements of the AI Act are only one small part of a very large piece of legislation that runs to 180 recitals and 113 articles, most of which is formulated as product safety/consumer protection legislation and does not engage with intellectual property law.86 Indeed, the initial proposal for the AI Act did not include provisions addressing copyright law or training data transparency at all.87 However, following the surge of public interest in generative AI that accompanied the release of ChatGPT in November 2022, groups representing the creative industries and other rightsholders demanded that measures to prevent the unauthorized use of their content to train generative AI be added to the AI Act.88 This resulted in the inclusion of a provision in the negotiating position adopted by the European Parliament in June 2023, which would have required providers of generative AI to 'document and make publicly available a sufficiently detailed summary of the use of training data protected under copyright law.'89

Critics, however, were quick to observe that providing even a 'summary' of the copyrighted materials used in a training dataset would be unworkable in practice, given the vast number of individual works used, the low requirements for copyright protection to arise, and the fact that most copyright works are not actively managed by their owners.90 It would appear that this criticism was also recognized by the drafters of the Act; in the final version, the provision on AI training data transparency and copyright has been split into two closely related provisions at Article 53(1)(c) and (d).

Article 53(1)(c) requires providers of general-purpose AI models to 'put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with, including through state-of-the-art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790'. Article 53(1)(d) requires providers of general-purpose AI models to 'draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model, according to a template provided by the AI Office.'

Along with the other obligations in Article 53, the training data transparency and copyright policy provisions apply to providers of general-purpose AI models. 'Providers' are defined elsewhere in the AI Act as '… a natural or legal person, public authority, agency or other body that develops an AI system or a general-purpose AI model or that has an AI system or a general-purpose AI model developed and places it on the market or puts the AI system into service under its own name or trade mark, whether for payment or free of charge.'91 A GPAI model is defined as one that 'displays significant generality and is capable of competently performing a wide range of distinct tasks regardless of the way the model is placed on the market and that can be integrated into a variety of downstream systems or applications.'92 It is worth noting that the AI Act distinguishes between AI models and AI systems; a GPAI model is an essential component of a GPAI system, but does not become a GPAI system without the addition of further components such as a user interface.93

4.1 Meaning of Article 53(1)(c) and (d)

While brief, the requirements of Article 53(1)(c) and (d) raise a number of important questions. Perhaps the most fundamental of these is exactly what information is required for a summary of the training content to be considered 'sufficiently detailed'. Some clarification is provided by the preamble, particularly Recital 107, which states in part that:

    While taking into due account the need to protect trade secrets and confidential information, this summary should be generally comprehensive in its scope instead of technically detailed to facilitate parties with legitimate interests, including copyright holders, to exercise and enforce their rights under Union law, for example by listing the main data collections or sets that went into training the model, such as large private or public databases or data archives, and by providing a narrative explanation about other data sources used.94

Despite this guidance, considerable uncertainty remains as to what is required of GPAI model providers. The summaries under Article 53(1)(d) are clearly intended to give a broad overview of the sources of the training data rather than a detailed breakdown of the specific works used, yet must also contain enough information to allow rightsholders, as well as other parties with legitimate interests (a term that is not defined), to exercise and enforce their rights under Union law. While the recital stresses the importance of protecting the trade secrets and other confidential information of AI developers, this protection should not be an overriding concern, reflecting an attempt by the drafters of the AI Act to balance the interests of rightsholders and others against those of AI developers. The extent to which this balance has been achieved will become clearer when the AI Office releases its template for the summary (which according to the Preamble should be 'simple, effective and allow the provider to provide the required summary in narrative form').95

The requirements of Article 53(1)(c) are clearer, as they specifically compel AI providers to respect the opt-out in Article 4(3) of the CDSM Directive. As noted at Section 3.2.1, the relevance of this exception to generative AI is confirmed by the preamble, which

86 Alexander Peukert, 'Copyright in the Artificial Intelligence Act – A Primer' (2024) 73 GRUR International 497.
87 European Commission, 'Proposal for a Regulation of the European Parliament and of the Council laying down harmonised rules on Artificial Intelligence (Artificial Intelligence Act) and amending certain Union legislative acts' COM/2021/206 final.
88 See eg Communia, 'Policy paper #15 on using copyrighted works for teaching the machine' (Communia, 26 April 2023), available at https://communia-association.org/policy-paper/policy-paper-15-on-using-copyrighted-works-for-teaching-the-machine/ (accessed 1 November 2024); Authors' Rights Initiative, 'Call for Safeguards Around Generative AI' (Authors' Rights Initiative, 19 April 2023), available at https://urheber.info/diskurs/call-for-safeguards-around-generative-ai (accessed 1 November 2024).
89 Amendments adopted by the European Parliament on 14 June 2023 on the proposal for a regulation of the European Parliament and of the Council on laying down harmonized rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts, art 28b(4)(a).
90 João Pedro Quintais, 'Generative AI, copyright and the AI Act' (Kluwer Copyright Blog, 9 May 2023), available at https://copyrightblog.kluweriplaw.com/2023/05/09/generative-ai-copyright-and-the-ai-act/ (accessed 1 November 2024); de la Durantaye (n 38), 16–17.
91 AI Act art 3(3).
92 ibid art 3(63).
93 AI Act Recital 97.
94 ibid 107.
95 ibid.
clarifies that the use of copyright-protected content for the purposes of training and development of GPAI models requires 'the authorisation of the rightsholder concerned unless relevant copyright exceptions and limitations apply'.96 The preamble further acknowledges that, while Directive 2019/790 introduces exceptions and limitations for the purpose of text and data mining, rightsholders can opt out of such TDM unless done for the purposes of scientific research, and that where the right to opt out has been exercised, providers of general-purpose AI models need to obtain rightsholder authorization.97

The mention of 'state-of-the-art' technologies in Article 53(1)(c) links to the CDSM Directive's objective that machine-readable means be used to express the opt-out. The use of the term 'state-of-the-art' suggests that GPAI model providers must continually update these means as the technology facilitating the identification of and compliance with the opt-out improves—although, as noted above, there is still no generally recognized standard for exercising the opt-out in Article 4(3) at the time of writing.

It is not clear, however, what precisely is meant by the references to 'Union law on copyright and related rights'98 and 'Union copyright law'.99 As Alexander Peukert observes, while the copyright laws of Member States have become increasingly harmonized over the last three decades, each Member State has its own national copyright regime. There is also no copyright equivalent of the unitary EU trade mark or Community design: the exclusive rights granted by a national copyright apply only within a country's territory.100 At present, it is ambiguous whether the obligation to comply with Union copyright law should be interpreted as referring to an obligation to respect the collective national copyright laws of each Member State, or to comply only with those elements of copyright law that have been harmonized by EU law. Whatever is meant by 'Union copyright law', there is nothing in Article 53(1)(c) or the preamble to suggest this only applies to input data (although this was clearly the chief concern of the drafters). As such, policies to comply with Union copyright law will presumably need to also consider infringement through output, as discussed at Section 3.1.101

4.2 Evaluating the effectiveness of Article 53(1)(c) and (d)

The AI Act's preamble clarifies that the goals of the transparency and copyright protection provisions are to protect the interests of rightsholders as well as other parties with a 'legitimate interest' in the contents of training data.102 Article 53 of the AI Act will apply 12 months after the date of the entry into force of the AI Act,103 although codes of practice covering the obligations in Article 53, including the 'adequate level of detail for the summary about the content used for training', will be ready no later than nine months after the Act's entry into force.104 The full implications of Article 53(1)(c) and (d) will only become apparent once this clarification of the requirements arrives.

However, even without knowing the full details of what the provisions will require, there are compelling grounds for scepticism as to whether Article 53(1)(c) and (d) will deliver meaningful material benefits to individual rightsholders. Three major obstacles to the effectiveness of the provisions in achieving their goals stand out: the lack of detail offered by the training data summaries, challenges enforcing the provisions in the case of AI models developed outside of the EU, and technical and logistical issues relating to the implementation of Article 4(3) of the CDSM Directive.

4.2.1 The lack of detail offered by the training data summaries

The requirement to provide summaries of training data at least nominally addresses one of the major issues with Article 4(3) of the CDSM Directive by assisting rightsholders in verifying whether their opt-outs have been respected. However, even without knowing the level of detail that will be required by the AI Office's template, there is reason to doubt how useful these summaries will be. As noted at Section 2.2, training data transparency benefits rightsholders largely to the extent that it enables them to establish whether or not a particular work appears in a given dataset. As the preamble specifies that summaries should not be so detailed as to identify individual works, it is unclear how rightsholders will be able to determine whether any reservation of rights on their part has been respected. An obvious solution would be for AI providers to retain detailed records of the training data used in addition to the public summaries, in case of challenges from regulators or rightsholders. However, this is complicated by the fact that under Article 4(2) of the CDSM Directive, reproductions and extractions of works made under Article 4 may only be retained 'for as long as is necessary for the purposes of text and data mining',105 meaning that the copies should be deleted once the training process is completed.106 The matter of how rightsholders will be able to prove whether or not their works have been used in a particular training dataset therefore remains outstanding.

96 ibid 105.
97 ibid.
98 AI Act art 53(1)(c).
99 AI Act Recital 104.
100 Peukert (n 86), 504.
101 ibid 507.
102 AI Act Recitals 104–108.
103 AI Act art 113(b).
104 AI Act art 56.
105 CDSM Directive art 4(2).
106 Rossana Ducato and Alain Strowel, 'Ensuring Text and Data Mining: Remaining Issues with the EU Copyright Exceptions and Possible Ways Out' (2021) 43 European Intellectual Property Review 322, 328.
107 The Economist, 'Europe, a Laggard in AI, Seizes the Lead in its Regulation' (The Economist, 10 December 2023), Available at https://www.economist.com/europe/2023/12/10/europe-a-laggard-in-ai-seizes-the-lead-in-its-regulation (accessed 1 November 2024).
108 AI Act Recital 106.
109 However, it is important to keep in mind the memorization issue discussed in s X, above. The CJEU has established that the reproduction of very short excerpts of a work (such as an 11-word headline in Infopaq C-5/08 and a two-second music clip in Pelham C-476/17) can still amount to copyright infringement. As such, if even minor remnants of the training data are somehow retained in the final AI model, this is likely to cause major issues under EU law; Szkalej and Senftleben (n 45), 9.

4.2.2 Challenges enforcing the provisions in the case of AI models developed outside of the EU

At present, nearly all major centres of the AI industry are located outside of the EU.107 This invites the question as to how Article 53(1)(c) and (d) can be effectively enforced for the vast majority of generative AI models that are developed somewhere other than the EU. The preamble attempts to address this by stating that the policy to comply with Union copyright laws and give effect to the reservation of rights per Article 4(3) CDSM Directive applies to any provider that places a general-purpose AI model on the EU market, regardless of where the training took place, in order to ensure a 'level playing field' and prevent providers gaining a competitive advantage by training their models outside of the EU.108 It is not clear, however, that this approach will produce the intended result. As has already been observed, copyright is territorial in nature, and AI models are generally understood not to contain the works they have been trained upon.109 As a result,
if the unauthorized reproductions of copyrighted material necessary for the training of an AI model were to be carried out entirely in a third country whose law permits such use without rightsholder permission, there would be no copyright infringement in either country or within the territory of an EU Member State.110 Under this conventional understanding of copyright law, an AI provider could therefore ensure compliance with Union copyright law by making sure that none of their training data had been collected from servers based within the EU, and that all training took place outside of the EU's borders.

Alternatively, some have interpreted the recital as meaning that GPAI models will be barred from entry into the EU market for failing to respect opt-outs under Article 4(3) of the CDSM Directive even when their training has taken place entirely in a jurisdiction which does not permit such a reservation of rights.111 In this scenario, an AI product could be excluded from the EU market in order to protect the interests of rightsholders, despite no actionable copyright infringement having ever occurred. This outcome would be highly unusual from the perspective of copyright theory. João Pedro Quintais reasonably points out that it would be problematic for something as radical as the de facto extraterritorial effect of copyright to be introduced 'through the back door' of a non-binding recital.112 If the first interpretation of the recital holds, however, one of the main impacts of Article 53(1)(c) is likely to be that it heavily incentivizes AI developers to ensure that the training of their models takes place outside of the EU.

4.2.3 Technical and logistical issues relating to the implementation of Article 4(3) of the CDSM Directive

Even if both of the issues discussed above can be overcome, however, the emergence of a viable market in which authors receive meaningful compensation for the use of their works in AI training data remains unlikely due to the inherent flaws of Article 4(3). A significant obstacle, of course, is the aforementioned lack of a widely accepted protocol for the reservation of rights under Article 4(3). Beyond this, however, a major and possibly insurmountable logistical barrier remains. As noted above, the transaction costs associated with negotiating a licence fee with rightsholders for such a large volume of works would be prohibitive. Given the sheer quantity of works involved, even a minor transaction cost per work is likely to be enough to render any approach based on the negotiation of licences with individual rightsholders entirely unfeasible.113

Some potential solutions to this problem have been offered. For example, it is possible that an automated licensing system could be developed, with rightsholders expressing the terms (e.g. payment of a particular fee) under which they would be prepared to waive their opt-out in machine-readable form, which bots deployed to acquire training data could detect and comply with.114 However, much of the content available online has not been posted by the legitimate rightsholder, and the creation of a market for the licensing of works would incentivize dishonest actors to impersonate legitimate rightsholders. It is not clear how this approach would address the critical issue of verifying whether or not the entity offering to waive the opt-out over a work is, in fact, the rightful rightsholder.

Another, potentially complementary, solution to these logistical challenges would be the establishment of a collective rights management (CRM) organization in order to manage the conditional waiver of the Article 4(3) opt-out.115 CRM has been a successful means of clearing the rights associated with large numbers of individual works with highly fragmented owners in other areas, most notably the music industry.116 A centralized CRM organization for the management of training data licences would also provide a means to verify the legitimate rightsholders of works. However, the number of works involved in training a large AI model far exceeds even the biggest repertoires of works currently managed through CRM.117 The sheer volume of content involved in the training of generative AI models also means that it is extremely difficult to devise any remuneration system that could yield significant payments to individual creators—particularly sums that would adequately compensate for the loss of business that many creatives fear will be the result of increasing use of generative AI.118 Furthermore, while existing CRM organizations tend to govern licences for one particular type of work—such as pieces of music—an organization dedicated to the CRM of training data would have to manage a very wide variety of works, from books to songs, photographs, videos and social media posts.119 Given the complexity and scale involved, such a CRM organization would likely need to be directly established by a government, or at least require substantial government backing.120 Neither appears to be in the offing.

110 Peukert (n 86), 505–06.
111 Lutz Riede, Oliver Talhoff and Matthias Hofer, 'The AI Act: Calling for Global Compliance with EU Copyright?' (Freshfields Bruckhaus Deringer Technology Quotient, 5 April 2024), Available at https://technologyquotient.freshfields.com/post/102j4jw/the-ai-act-calling-for-global-compliance-with-eu-copyright (accessed 1 November 2024); Maureen Daly and Sarah Power, 'European Council prepares for debate on copyright under AI Act' (Pinsent Masons Out-Law, 15 July 2024), Available at https://www.pinsentmasons.com/out-law/news/eu-council-prepares-for-debate-on-copyright-under-ai-act (accessed 1 November 2024); Christian Frank and Gregor Schmid, 'AI, the Artificial Intelligence Act & Copyright' (Taylor Wessing, 13 May 2024), Available at https://www.taylorwessing.com/en/insights-and-events/insights/2024/05/ai-act-und-copyright (accessed 1 November 2024).
112 João Pedro Quintais, 'Generative AI, Copyright and the AI Act' (2024), 13. SSRN: Available at https://ssrn.com/abstract=4912701 (accessed 1 November 2024).
113 Martin Senftleben, 'AI Act and Author Remuneration – A Model for Other Regions?' (2024), 10. SSRN: Available at https://ssrn.com/abstract=4740268 (accessed 1 November 2024).
114 ibid.
115 Stanley Besen, 'An Economic Analysis of the Artificial Intelligence-Copyright Nexus' (2023) TechREG Chronicle 3, 8.
116 See further Daniel Gervais (ed), Collective Management of Copyright and Related Rights (2nd edn Kluwer Law International Alphen aan den Rijn 2010).
117 Besen (n 115), 8.
118 de la Durantaye (n 38), 11.
119 Besen (n 115), 8.
120 ibid.
121 Javier Espinoza, 'Europe's Rushed Attempt to Set the Rules for AI' (Financial Times, 16 July 2024), Available at https://www.ft.com/content/6cc7847a-2fc5-4df0-b113-a435d6426c81 (accessed 1 September 2024).
122 Senftleben (n 113), 12.

4.2.4 The likely impact of Article 53(1)(c) and (d)

While the issues discussed above cast doubt on whether Article 53(1)(c) and (d) will provide any meaningful material benefit to individual authors, there is little doubt that complying with the provisions will impose additional costs on AI developers. This could further hamper the EU's already comparatively underpowered AI industry, an issue that ties into a broader concern that the AI Act will leave Europe's tech industry 'hiring lawyers while the rest of the world is hiring coders.'121 Moreover, the provisions may also discourage AI developers from launching AI products in the EU, as the required disclosures regarding training data sources could also assist rightsholders in bringing claims and negotiating licence fees in other jurisdictions.122 It is therefore possible that the transparency provisions of the AI Act could produce a 'lose-lose' scenario in which developers are deterred from launching
new AI models in the EU, AI development shifts outside of the bloc, and rightsholders receive no additional compensation.123

Such an extreme scenario seems unlikely, however. This is partly because AI developers have the option of concluding licences with organizations with large portfolios of works, such as publishers, which many already do.124 These agreements would simplify compliance with Article 53(1)(c) and (d); AI developers could cite the works included in the licensing agreement in their training data summaries, while Union copyright law would be respected through the terms of that agreement. Publishers and other gatekeeper organizations will be incentivized to impose standard contract terms requiring authors to waive their right to opt out of TDM in order to facilitate further such agreements; multiple commentators have noted that the asymmetry in bargaining power between authors and such organizations means it could become difficult for professional authors to refuse such terms.125 The most likely outcome of the transparency provisions of the AI Act may therefore be that the providers of generative AI models conclude licensing agreements with publishers and other organizations with access to large bodies of high-quality content in order to meet their obligations under Article 53(1)(c) and (d).126 It is unlikely that much of this licensing revenue will reach individual authors, especially given that the amounts of money involved in the deals struck between publishers and AI developers to date are relatively small given the large number of works covered.127

5. Conclusion—transparency to the rescue?

This article has demonstrated that, while requirements for training data transparency have a number of clear benefits, much of the impact of such requirements is dependent on local copyright law—leading to widely varying outcomes between different jurisdictions. Such requirements cannot, by themselves, resolve the complex challenges surrounding the use of copyrighted materials to train generative AI models.

This is clearly illustrated by the transparency requirements of the EU's AI Act. As noted, there are a number of outstanding questions regarding the meaning of these provisions which will be clarified in the coming months and years. However, it is already clear that reliance on transparency requirements, supplemented with a requirement for a policy to respect Union copyright law, is a misguided approach to the drafters' presumed goal of ensuring that individual authors are compensated for the use of their works in AI training data. Such requirements were never going to overcome the inherent logistical challenges posed by implementing the CDSMD opt-out. A better, although more challenging, approach to achieving this aim would have been instead to focus on creating new legal mechanisms that would avoid the issues associated with Article 4(3) CDSM Directive.128 Because they do not address the fundamental flaws in the existing framework for the use of copyrighted content by AI developers in the EU, the transparency provisions of the AI Act are unlikely to provide any meaningful improvement to the material condition of individual authors.

To be clear, none of this diminishes the advantages of requirements for training data transparency. However, policymakers around the world must now turn their attention beyond such requirements to the difficult task of how and to what extent the law of copyright should be amended to balance the interests of the various groups impacted by generative AI. There is no 'one-size-fits-all' solution here: how best to achieve this will vary depending on the specific legal, economic and cultural context of a given jurisdiction. In some cases, further measures to protect rightsholders may be appropriate. In others, particularly those at risk of falling behind in the global AI race, the priority may be to ensure that copyright law does not restrict the development of a domestic AI industry. Policymakers around the world should engage closely with key stakeholders to determine the most effective policies for their specific contexts. Given the rapid pace of AI development, these policies should be frequently reassessed to ensure that they remain relevant and effective. This must be managed amid both the sometimes-exaggerated hype regarding AI's economic potential and a growing backlash against AI technology from the public, particularly those employed in creative industries.

It has been suggested by some that the challenge of generative AI is so profound as to herald the end of copyright law.129 However, many new technologies—for example, radio, cassettes, home video, and especially the internet—have prompted premature predictions of copyright's demise. While the law of copyright will undoubtedly be roiled by the fundamental questions raised by generative AI for years to come, it is likely that it will ultimately adapt, just as it has with previous technological advances. This does not mean that policymakers should be complacent; rather, decisive action is needed now to ensure that the correct balance is struck between the incentivisation of innovation and the protection of rightsholders. Crafting effective policy to regulate a new technology during the early stages of its development is particularly difficult, as this is when the least is known about its societal impacts. Yet, as the technology develops and its consequences become clearer, it also becomes more socially and economically entrenched—with the result that implementing policies to control the technology becomes much more difficult.130 The issue of entrenchment is especially pertinent to generative AI, as major tech companies are rapidly integrating these systems into widely used applications. It is therefore vital that policymakers act thoughtfully but swiftly to ensure that copyright law develops in a way that balances the interests of all stakeholders fairly. While training data transparency is an important tool in this effort, it cannot rescue us from the difficult questions of how this balance should be achieved.

123 ibid.
124 Lopatto (n 9).
125 Ziaja (n 16), 455.
126 Quintais (n 90), 17.
127 Lopatto (n 9).
128 Alternatives to such mechanisms have been proposed. For example, Martin Senftleben has suggested that instead of requiring prior authorization from rightsholders or providing an opt-out, AI developers could instead be required to pay a compulsory levy for the use of copyrighted works—which would ensure that compensation was passed on to individual authors while avoiding the transaction costs associated with managing individual opt-outs and licences. See further Martin Senftleben, 'Generative AI and Author Remuneration' (2023) 54 International Review of Intellectual Property and Competition Law 1535.
129 Alex Reisner, 'Generative AI Is Challenging a 234-Year-Old Law' (The Atlantic, 29 February 2024), Available at https://www.theatlantic.com/technology/archive/2024/02/generative-ai-lawsuits-copyright-fair-use/677595 (accessed 1 November 2024).
130 See further David Collingridge, The Social Control of Technology (Francis Pinter London 1980).
© The Author(s) 2024. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (https://creativecommons.
org/licenses/by-nc-nd/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not
altered or transformed in any way, and that the work is properly cited. For commercial re-use, please contact reprints@oup.com for reprints and translation
rights for reprints. All other permissions can be obtained through our RightsLink service via the Permissions link on the article page on our site–for further
information please contact journals.permissions@oup.com.