Tell me in two minutes
The rise of generative artificial intelligence (GenAI) brings with it a range of fundamental copyright issues that profoundly impact the creative industries. Does training a model infringe copyright, and how are owners of copyrighted materials compensated for use of their material? Who owns the output of a GenAI system? Does use of an output infringe copyright? Who has IP rights in deepfakes that resemble artists?
Some of these questions involve difficult balances between the interests of competing groups. But tech companies and users can take, and are starting to take, practical steps to lessen the risk of copyright claims arising out of GenAI, such as data filtering and attribution processes and ‘indemnification promises’.
This article is part of KWM’s series on the risks of GenAI and examines the complex interplay between GenAI and copyright. Find the other articles here.
What are the key copyright issues?
As we have previously considered, when looking into this issue of copyright and GenAI, there are three primary issues to consider:
What is the issue / risk?
Training
GenAI models are created by training artificial neural networks (a particular type of machine learning model) on large volumes of data, such as text, images, videos and code. These models can then be used to generate new content.
There is a strong correlation between the amount of training data used to train neural networks and the performance of these models. As a result, developers of GenAI models require vast quantities of training data in order to train models that produce the best results in response to a user’s requests.
So how do operators of such systems obtain this sheer volume of materials?
Training data and materials for GenAI systems often include materials available on the Internet. Although developers often filter the training data to remove things such as spam and erotic content, many models have been trained without regard to the copyright status of the training materials, raising a number of copyright issues.
We are already seeing this issue play out in the courts. As we have previously considered, Getty Images is currently suing Stability AI, claiming that Stability AI has infringed the copyright in more than 12 million photographs, captions and metadata in training its Stable Diffusion and DreamStudio products. Similarly, in the United States, a class action has been brought against GitHub, Microsoft and OpenAI, with anonymous plaintiffs alleging that the defendants used their copyrighted materials to create Codex and Copilot, and claiming that the creation of the AI-powered coding assistant, GitHub Copilot, constitutes ‘software piracy on an unprecedented scale.’
Closer to home, no such copyright infringement lawsuits have yet been filed in Australia. However, Australia will be watching closely as cases like those mentioned above play out in the UK and US courts. Fundamentally, these cases argue that copying works protected by copyright for use as ‘training data’ may infringe copyright, because the materials are being used without the copyright owners’ permission. In response, it has been argued in the US litigation that the use of copyrighted materials as training data falls within the ‘fair use’ doctrine. We must wait to see how that argument fares. In any event, because Australia’s ‘fair dealing’ exceptions are much narrower than the US ‘fair use’ doctrine, a successful defence to copyright infringement in the US will not mean the same will be true in Australia. In fact, the differences in law between jurisdictions mean that the territorial reach of each jurisdiction’s legislation may become a significant factor in copyright disputes and in how developers train and deploy these models.
However, AI service providers and users are beginning to address this copyright risk. For example, Microsoft has commenced “filtering out” protected materials (i.e., those protected by copyright) from its training data. OpenAI has also begun to enter into licensing agreements with owners of copyright, as can be seen from the News Corp and OpenAI Global Partnership agreement, under which OpenAI is able to use current and archived news content from The Wall Street Journal, The New York Post, MarketWatch and Barron’s (among others).
But will we run out of data?
A big concern for tech companies is running out of data. While entering into licensing arrangements has the potential to solve copyright worries, the huge demand for data might make it impractical to obtain licenses for all the materials required, so it seems that licensing might not be a good solution after all.
To feed this ever-growing demand, some companies are turning to synthetic data, which is data produced by AI models. The use of synthetic data has issues of its own: opinion is divided as to whether synthetic data is useful or will eventually lead to “model collapse.” In fact, a recent study by researchers in the UK and Canada found that where systems are trained on model-generated content, their outputs become increasingly wrong and homogeneous, and that even in the best learning conditions, model collapse was inevitable.
Clearly, it is quite the balancing act to ensure a GenAI system has enough data to be able to train itself to produce effective responses and content, and to ensure that copyrighted material is not being used improperly.
Produced Materials in Outputs
A further consideration is whether copyright subsists in materials which are produced using GenAI systems. In Australian law, copyright materials must have originated from a human ‘author’ who has applied a sufficient amount of ‘independent intellectual effort’ to authoring the work. As GenAI systems are trained to produce outputs which reflect training data, such produced materials in output are arguably not novel or inventive. Thus, there is ambiguity about how this originality threshold is satisfied.
We expect courts will question the knowledge, complexity or skill used to “prompt” a GenAI system to generate a work, and the level of human intervention used to augment any output. This question is relevant not only for developers but also for organisations that utilise or monetise works created by GenAI. If copyright doesn’t subsist in a work, such organisations cannot adequately protect it or prevent others from copying it. As the use and capability of GenAI continue to advance, the question of whether copyright subsists in materials produced as outputs will become increasingly important.
Music deepfakes
A song purportedly by Drake and The Weeknd created a flurry of internet discussion and commentary when it was posted on TikTok and Spotify in April 2023. Within days of its posting, the song was removed from all platforms as a result of copyright claims by the artists’ record label. This rapid advancement of GenAI enabling the creation of music deepfakes also presents copyright risks.
Rights holders[1] believe that unauthorised datasets are being used to produce these imitations of artists. This imitation has the potential to directly dilute and damage the artist’s brand and livelihood. Not only is there this risk associated with the imitation, but more broadly, the ability to create these deepfakes also poses risk to the music industry. As deepfakes can be cheap and royalty-free, since no compensation needs to be paid to the writers, publishers, performers and record labels, music streaming platforms may be incentivised to allow deepfake music on their platforms. In their view, this has the real potential to diminish greatly the richness and diversity of Australian music available online.
Utilising Outputs
As seen in the New York Times case against OpenAI and Microsoft, in which the Times alleges that millions of its articles were used to train chatbots, there is also a risk that where a GenAI output contains a substantial part of an existing work in which copyright subsists, an end user could unknowingly infringe a third party’s copyright. Some protections already exist in Australian law, known as ‘fair dealing’ exceptions. There are also technical exceptions that may apply, which include the temporary ‘copying’ of works whilst one views them (for example, downloading a movie on Netflix to watch later). However, the applicability of such exceptions to GenAI is questionable.
To address this risk and the concerns of users and customers, companies like Microsoft, Adobe and Google are offering IP indemnities in relation to generated outputs. Essentially, this contractual promise applies where a user is challenged or sued for copyright infringement over an output: Microsoft, Adobe and Google will then assume responsibility, subject to certain conditions and limitations. These indemnities are intended to put end users at ease and are increasingly common.
So what?
As discussed above, the use of a GenAI system poses legal risks, not only for owners and operators of these systems but for users as well. As GenAI systems invariably need data to function, the risk of breaching copyright is potentially quite high. Developers need to consider the copyright status of their training data, and users need to consider their own use. Using newly generated content and outputs also carries a risk of breaching copyright.
Now what?
There are important considerations and actions that both users and operators of GenAI systems can take to mitigate potential copyright issues.
[1] See, for example, the AMCOS/APRA submission to the Supporting Responsible AI consultation here.
| USER | OPERATOR / DEVELOPER |
| --- | --- |
| Carefully consider the risk of using GenAI outputs, and review any copyright commitments including any limitations. | Be fully informed of where data is sourced from, and where required, seek permission from owners of materials protected by copyright. |
| Consider the impact that use of GenAI will have on your own IP rights in your own materials. | Consider filtering measures on training data to remove well-known copyright material, or allowing copyright owners to request that materials containing their IP are removed from the training data set (i.e., opt-out mechanisms). |
| Consider whether your own practices should take into account jurisdictional differences. For example, the availability of the broader fair use defence in the US. | Consider traceable links to the copyright owners. For example, GenAI systems could have known IP information coded into the data source and surfaced as a summary to the user. The user can then make a decision in relation to the content, or the GenAI system could prevent particular uses. |
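For operators and developers, the opt-out and traceability measures described above can be pictured with a short sketch. It is illustrative only and does not reflect how any particular provider works: the `Record` type, its fields and both function names are hypothetical, and it assumes each training item already carries a known rights holder, which in practice is often the hardest part.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Record:
    """One hypothetical training item with basic provenance metadata."""
    text: str
    source_url: str
    rights_holder: Optional[str]  # None when the owner is unknown


def filter_opted_out(records: Iterable[Record], opted_out: Set[str]) -> List[Record]:
    """Drop records whose known rights holder has asked to be excluded."""
    return [r for r in records if r.rights_holder not in opted_out]


def attribution_summary(records: Iterable[Record]) -> List[str]:
    """List the known rights holders behind a set of records, for display to a user."""
    return sorted({r.rights_holder for r in records if r.rights_holder is not None})


records = [
    Record("article text", "https://example.com/a", "Publisher A"),
    Record("photo caption", "https://example.com/b", "Agency B"),
    Record("forum post", "https://example.com/c", None),
]
kept = filter_opted_out(records, opted_out={"Agency B"})
holders = attribution_summary(kept)
```

In this sketch, records with an unknown owner pass the filter, which is itself a policy choice. A more cautious operator might exclude anything whose rights holder cannot be identified.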
Conclusion
The outcomes of current international litigation concerning GenAI models have the potential to significantly impact the way in which data is sourced and models are trained, and how GenAI systems are developed in Australia.
There is a real risk that where GenAI is not trained in Australia due to copyright laws, Australia will always be ‘a taker of technology’ from large overseas companies. This will make it difficult to develop sovereign large language models and other GenAI systems in Australia. As we await the outcomes of these international cases and potential law reform in Australia, tech companies, operators and users should ensure they are employing best practices to minimise the risk of copyright infringement.
Stay tuned for the next update in our risks in GenAI series, with a focus on accuracy and reliability. Subscribe to data and technology newsletters here.


