AI x IP (Intellectual Property): The Trial of AI - Ivo Emanuilov

Introduction

I'm super excited to be here today with you.

And what we're going to be talking about is AI and IP, whether AI loves IP. I can tell you it doesn't. But IP doesn't love AI either.

But what's interesting is that over the past, I think, two or three years, AI has actually been on trial. And I'm not joking. That's how we've come to refer to the ongoing litigation against many AI companies. It's really collectively referred to as the trial of AI.

Key Issues in AI and IP

What are we going to be talking about today? I want to touch upon four main issues. The first is: what does it mean to develop AI in the age of software regulation? I'll explain in a moment what I mean by that.

And then we're going to look at AI as input and as output in the sense of intellectual property law. So we're going to look at what it means to use training data to create models, what it means to generate artifacts, and where these stand in the system of intellectual property.

And finally, I'm going to briefly mention, briefly really, a few provisions of the AI Act that concern copyright. We can only do this briefly because I think we could dedicate a whole session of three or more hours to the AI Act.

AI in the Age of Software Regulation

So, AI in the age of software regulation. Software has traditionally not been subject to much regulation, to be honest.

Of course, we've had regulated software industries, things like avionics, for instance. We don't want planes falling out of the sky, so we have DO-178C, a standard that prescribes how avionics software should be developed. But besides safety-critical applications, software has really not been on the radar of legislation for a very long time, for the past 60 years, essentially.

And in the past decade, it has become so heavily regulated that I would confidently call it a regulated industry. The very process of software development is now under strict regulation.

Recent Litigation and Cases

So you may remember the Clearview AI case from a few years ago. They were basically building a biometric database that was used to identify people with a high degree of confidence.

Then we had AI suggesting possible new chemical weapons. That was followed by Google being hit with a class action lawsuit over data scraping, something many companies have been facing in the past few years.

This was followed by the famous Getty versus Stability AI case in London, which is currently at trial, and where, in fact, a colleague of mine filed the reply to Getty's claim just a week ago. I've been reading through it, and it's very interesting to see what they've argued in this case, but we don't have time to go into the details. I'd be happy to explain for those of you who are interested.

We then had the US Copyright Office just last year denying protection for AI-created images. People have been trying to obtain protection for AI output, and it hasn't really been working out.

Then The New York Times, the very famous Times case: they've been suing OpenAI and Microsoft over the use of copyrighted works. In fact, many people in the field believe this is going to be the copyright case that decides, at least in the US, where AI stands in the system of intellectual property, at least insofar as copyright is concerned.

And then we had the robocalls: people getting calls from robots pretending to be President Joe Biden. This prompted, by the way, for the first time in so many years, a complete ban on AI robocalls in some US states. So that was quite unusual, I would say.

And then we had Taylor Swift, of course, and the explicit deepfakes. And the one that actually had financial consequences just a few weeks ago: a finance worker paid out $25 million after a video call with a deepfaked chief financial officer.

So, you see, there are actual, real harms occurring, which has prompted the question: does AI exist in a legal vacuum? Does the fact that these things happen mean that we have no effective way of navigating or dealing with these harms? That's not the case. AI does not exist in a legal vacuum.

Existing AI Regulations

There is actually quite a bit of regulation that already exists. This is just a list of some of the most notable instruments in the field. The one that has gotten quite a bit of attention is, of course, the AI Act. The version shown here is from the 26th of January, and it is considered the final version of the AI Act, which is going to a plenary vote next week, actually, in Brussels.

So we expect it to enter into force at the latest by April, which means we are going to have an actual, effective regulation concerning AI. Strictly speaking, it's not really an AI Act, it's a Machine Learning Act, and there was a time when I was concerned it might become a Large Language Models Act. It didn't. It's more nuanced now. I'm actually a bit happier with the current version, but it's far from perfect.

But besides the AI Act, we have the Digital Services Act. By the way, the Digital Services Act regulates some of the concerns that Ivan talked about with his German clients. Yes, they had a reason, because the Digital Services Act mandates certain transparency. If you're not transparent about your interface or your recommender mechanisms, you may actually be subject to liability, and you need to really explain to your customers what's going on. There's even a mechanism under the Digital Services Act which allows a vetted list of researchers to get access to confidential data, interfaces, and code to scrutinize what is going on and to check that the claims made by companies actually make sense. Companies like Netflix are subject to this regulation. That's what I mean when I say that software development is really a regulated activity nowadays. And it's not just the GDPR, and it's not just cybersecurity legislation, or the Data Act, or the AI Act.

Speaking of intellectual property, do you know that this is the image that actually triggered the whole Getty versus Stability AI case? Looks quite similar, right?

In copyright, we call this pastiche. And it's the main argument that my colleague Toby Bond from Bird & Bird in London made in the defense of Stability AI. Pastiche is not a very well-defined concept in copyright law, but you could think of it as a stylistic imitation, basically.

The thing is that in the list of exceptions in copyright law, pastiche appears together with parody and caricature. And I don't think that was the main driver for Stability AI when it created its Stable Diffusion model.

And then that's how the creative industry sees this whole thing. And I have to admit, rightly so, in some sense.

Guide to Suing an AI Company

So I've created for you a guide on how to sue an AI company.

It's not made up; it actually comes from the Getty versus Stability AI litigation.

So you could start, for example, with claims of infringement: infringement of copyright subsisting in the works that have been used to create embeddings, all sorts of data, and infringement of database rights subsisting in the databases used for the creation of embeddings.

So all sorts of databases created by Getty over the years, yes. Infringement of trademarks and acts of passing off. Why trademarks? Well, because here you can see the Getty Images trademark. It's a bit distorted, but it still counts as passing off, if you ask me. Or it could.

You could go for requests for injunctions. An injunction basically means restraining somebody from doing something. So you could prevent them from downloading, storing, copying, communicating to the public, authorizing acts of reproduction, importing products or services, possessing products or services, extracting or utilizing a substantial part of the contents, providing images or digital imaging services, and so on and so forth. The list goes on.

And that's not all, by the way. Beyond the claims we've covered, there's the order for delivery up or destruction. Yes, it goes as far as asking for models to be destroyed. It's called model disgorgement. It's actually a policy tool that the Federal Trade Commission in the US has been leveraging recently. And there are good reasons to believe that under data protection law in Europe we also have model disgorgement as a mechanism for going after AI companies. And an inquiry as to damages, of course. You could ask for damages for all of this, in the amount of billions, even.

Key Questions for IP Lawyers

So, briefly, to the three key questions facing any practicing IP lawyer these days. The first question: when does reproducing copyrighted works as training data for generative AI constitute copyright infringement of those works? You take data from the internet, you basically scrape the entire web, you create embeddings out of it, and then you start training. Is that allowed? Is that infringement?

Second question: when, if at all, do works generated by AI constitute adaptations or arrangements? I will explain what this means, but many people have been arguing, especially in the creative industries: you've taken my intellectual property, the fruit of my labor, and you've used it to create a model that generates bullshit most of the time, and yet it deprives me of my well-earned, well-deserved revenue. So people have been very disappointed, really, and unhappy. The question is, is this really the case? Are these generated works, if they're even works, adaptations of the training data?

And then: who, if anyone, has copyright in these AI-generated objects, and are they even subject to copyright? My answer is, yes, they can be. And I think this whole debate about AI authorship is nonsense, really. But I'll get to that in a moment. And it has to do with things both Sergei and Ivan mentioned earlier.

So, I've tried to break down Article 4 of the Copyright Directive here.

The Copyright Directive was adopted in 2019, and as of a few months ago it has been transposed into our Bulgarian Copyright Act, so now it's actual law in Bulgaria. So we have this exception or limitation for text and data mining. Now, text and data mining sounds very much like 2010, I know, but it's the term the legislature has used. Here's the definition: any automated analytical technique that is aimed at analyzing text and data in digital form in order to generate information. So it's everything, really, that has to do with data.

So what is this exception or limitation? Why do exceptions and limitations exist? They exist because copyright is an exclusive right, which means you create something and you can prevent anybody else in the world from doing anything with your work unless they get an authorization from you, a license. Already in the 19th century, when the Berne Convention was adopted, legislators realized that this is not going to work in every case and doesn't strike a good balance between the author's exclusive rights and the public benefit: society should also benefit somehow from the creative talent of authors. That's why exceptions and limitations to copyright were created, which means that certain things are allowed to be done without asking the author, without getting an authorization or a license.

So now in European Union law we have this exception or limitation for text and data mining. And what does it say? It says that member states shall provide for an exception or limitation to certain rights. So it's not a blank check: it doesn't excuse each and every copyright-relevant act, just certain of these acts. The first is temporary or permanent reproduction by any means and in any form, in whole or in part. So that means you can actually download the data, you can use it to train models, and that's fine. You're excused under this provision.

Secondly, you have direct or indirect, temporary or permanent reproduction by any means and in any form, in whole or in part, of databases. So this is the Database Directive; it's over here, actually. Temporary or permanent reproduction is always excused. Then you have permanent or temporary reproduction, translation, and adaptation. So adaptation is also excused, but note that this only concerns computer programs, because it comes from the Computer Programs Directive. Now, this is a bit speculative, but many people believe this is the so-called Microsoft addition to the Copyright Directive. Why? Because around that time they released their Codex model, which in the early stages powered the GitHub Copilot service when it was first rolled out in beta. So they actually benefited from this. But the right of adaptation is not harmonized in the EU, which means the exception only applies to computer programs. And that obviously benefited Microsoft at the time.

But you can only benefit from the exception for adaptation or translation or arrangement for computer programs. You cannot do this with any other type of work. Now, I don't know if you noticed, but I've highlighted this: this is only allowed for lawfully accessible works. So you must have a way to prove that you had lawful access to these works. And what does that mean?

I did not mention that besides Article 4, there is also an Article 3, which concerns research organizations. For research organizations there is a similar exception, but it's targeted specifically at them. For them, lawful access is considered to include access based on open access policies, access based on contractual arrangements, which is essentially an agreement with a data provider, and then, obviously, other lawful access to the data, which can be through other lawful means or through content freely available online. And the freely-available-online part is interesting.

Can you download LibGen, the entire LibGen, and use it? Is that a freely available online source? Yes, if you ask me.

But the thing is that these types of allowed access only concern research organizations. The recital of the directive specifies that these conditions apply to research organizations, not to the Article 4 exception, which also covers commercial use.

So if you're building commercially, I'm not sure whether you can actually benefit from this freely-available-online ground. And in fact, I'm pretty sure that you must be able to prove not only that you had lawful access, meaning you did not circumvent any technical barriers, any security measures, but also that you had a license of some sort to use these works.

So for the first three, it's fine. For the fourth, I'm not so sure. Now, the Copyright Directive also provides that reproductions and extractions may be retained for as long as is necessary for the purposes of text and data mining.

Now, what does that mean? When it was first drafted, it meant something very simple: you download or curate your data set, you train your model, and then, presumably, you no longer need that data set.

Now you do, actually, because you have transparency obligations under the AI Act, so you have to be able to provide information about the provenance of your data. But at the time, that was the thinking. So they said: you can only keep these reproductions and extractions for so long as is necessary for the purposes of text and data mining. But let's not forget that the purposes of text and data mining also include the generation of information. So many people believe this also covers generative AI, for example, because what does it do? Well, it generates information. In effect, it's a perpetually running permission.

So for so long as you need your model to perform its task, you are going to be covered by this. So you will still need it for the purposes of text and data mining. Maybe that's not how the legislature intended it, but it's how it actually works from the literal reading of the provision.

So I would not be very concerned by this provision. Now, you can see here a table that colleagues of mine created, which looks at something else in the directive.

And that's that these exceptions or limitations only apply on condition that the use of works has not been expressly reserved by their right holders. Meaning, as an author, I can say, I opt out of this. I don't want you to use my works for training AI. And that's a valid choice.

Now, how is this going to be done? Presumably in a machine-readable way. We expect that there will be a standard at some point. The standard is not there yet.

You know, there have been discussions like, we need something like robots.txt for AI. This is in the direction of opt-out, of express reservations.
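To make this concrete, here is a sketch of what such a machine-readable reservation can look like in practice today. The GPTBot and Google-Extended tokens are real, documented crawler user agents, and the tdm-reservation header comes from the draft TDM Reservation Protocol; but, as I said, none of this is a settled legal standard yet, so treat it as illustration only.

    # robots.txt: opting out of known AI training crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    # Alternatively, an HTTP response header under the draft
    # TDM Reservation Protocol ("1" means rights are reserved):
    tdm-reservation: 1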

But this study here shows how many of the popular publishers, and then image provider companies, have actually considered this. You can see that most publishers have addressed text and data mining. They allow use for research purposes. They also allow storage. And in some cases, use must be through an API or within a secure network. So you can see it's a bit nuanced. I'll leave it to you to go into the details at your leisure.

Then for image providers, you can see things stand a bit differently. Most of them address text and data mining, and, again, most of them prohibit it. So that's an express reservation. And they do not allow use for any additional purposes, meaning: you cannot use our data, essentially.

Are Trained Models Protected?

Trained models, are they protected? There has been discussion about what a machine learning model actually is. Is it a computer program? Is it a database? Is it an abstract mathematical model? If you ask me, it's certainly not a computer program. The source code, to the extent we can even speak of the source code of a model, is nothing like the source code of a computer program.

It's not human-readable, which means it's also not copyrightable. Because for something to be copyrightable, a human must be able to perceive it. There's also a good reason why computer programs are actually under the heading of literary works in copyright law. That's why code is poetry and so on.

There is a good reason why code is a literary work. And if we accept that a model is a computer program, then we should be able to perceive it somehow. We cannot.

The source code of a model means nothing. Model parameters are also determined in an automated process. There's no human intervention. The meaning isn't known. It's not easy to change. And you cannot just go and start changing parameters.

And it's also doubtful, even if we liken a model to a program, whether it's copyrightable as the result of creative activity. For something to be copyrightable, it must be original, and to be original it needs to be the author's own intellectual creation, the result of the author's own free and creative choices. And in this case, at least for the model itself, I don't think that's the case.

So to me, this is, again, a discussion that's theoretically interesting but not really very practical. Models are not copyrightable.

The Issue of Memorization in AI

Now, a more interesting question is memorization. It's something that I've studied for quite some time. So what does it mean for a model to memorize? Here is a very good example.

So, the fast inverse square root. This is a very clever technique implemented by the developers of Quake III to solve a computational problem back in the 90s. And here's a prompt to an early version of Copilot: somebody prompted it with the simple phrase "fast inverse square root", and it generated the entire thing verbatim, without, however, acknowledging that the license was actually the GPL.

And, I don't know if you noticed, the license header that was generated was actually a traditional proprietary copyright notice. It was not the GPL, which was itself a violation of the GPL. You cannot just change the license on somebody else's code.
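For reference, here is the (in)famous snippet roughly as it appears in the GPL-licensed Quake III Arena source, with the comments paraphrased; this is the code Copilot reproduced near-verbatim:

    float Q_rsqrt(float number)
    {
        long i;
        float x2, y;
        const float threehalfs = 1.5F;

        x2 = number * 0.5F;
        y  = number;
        i  = * (long *) &y;                   // reinterpret the float's bits as an integer
                                              // (assumes 32-bit long, as on the original platform)
        i  = 0x5f3759df - (i >> 1);           // the famous magic constant: a cheap initial guess
        y  = * (float *) &i;
        y  = y * (threehalfs - (x2 * y * y)); // one iteration of Newton's method to refine it

        return y;
    }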

So the question with memorization is particularly relevant, because this is the only clear-cut case where you can argue infringement. If something is the same as the original work, then you clearly have a claim for infringement. There's no doubt about that.

The question is, obviously, the issue of the scope of Article 4 that I mentioned: whether it extends only to retrieval and the generation of information, or whether it also covers extraction and the generation of new objects. And keep in mind that none of these exceptions applies to distribution. That means if there are memorized data in your model and you start selling your model to somebody else, or you start providing public access to it, this is not covered by the text and data mining exception. Distribution and communication to the public are separate rights under copyright law, and they are not part of the exception or limitation.

There is a paper on memorization that I wrote with a colleague of mine. It's a lengthy, interdisciplinary paper, which actually focuses on the GitHub Copilot case. We came up with a categorization of the types of memorization that are relevant for copyright purposes, and we argued that there are certain cases of memorization that are necessary for generalization. We follow the long-tail theory of Vitaly Feldman, from Google if I'm not mistaken, to argue that in some cases memorization is actually necessary for your model to perform its task, which again brings me to the definition of text and data mining. For how long can you keep reproductions and extractions? For so long as is necessary for the purposes of text and data mining. And the purpose of text and data mining, after all, is to create a useful model. If your model needs a certain level of memorization to generalize well, then that memorization may potentially be excused. So this is a defense for models that have been shown to memorize and yet can be excused under the text and data mining exception. But check out the paper for the fine-grained details.

AI as Output

Now, AI as output, again, as I said, I don't think that's a useful discussion.

We know the case of Kristina Kashtanova. So it's a funny case, actually.

She applied to the US Copyright Office to obtain a registration. So in the US, you can register copyright. Copyright typically does not need registration: I write a poem, I get copyright over it.

But in the US, there is a registry maintained by the Library of Congress, and they issue registration certificates. So you can, for example, use it as a timestamp: I created this on this date. That's the purpose it serves.

Kristina Kashtanova got the certificate, actually. She got the registration, and she went on social media and said, hey, today I got copyright from the Copyright Office of the US on the AI-generated graphic novel. And then somebody from the US Copyright Office went on Instagram and saw that, and he thought, wow, something's off here. I should investigate this.

So on their own initiative they actually looked at the case, and they realized that the Copyright Office had erred in its evaluation. They reversed their decision, denying registration of Kashtanova's copyright, and came up with the following recommendation: authorship is always key. You need to have a human, you need to have human agency in what has been generated. If you say, I did not create this, this was created by a machine, then forget about it.

You're not going to get copyright. So the thing she could have done is simply say, I created this, and the case would have been closed without an issue. And then the Copyright Office said that in applications, the best practice is to always distinguish between the human-created and the AI-generated material. So say: I created this part of the image, for example. Or: I created the initial version using a large language model, and then I fine-tuned it in this and this way. It's as simple as that.

But the discussion, I don't think it's productive, really. It's a clear requirement: you need a human being, an author, to have copyright. Otherwise there's no copyright, and the output is probably in the public domain.

Copyright in the AI Act

What does the AI Act say about copyright? Well, the copyright provisions only concern general purpose AI models. And general purpose AI models are defined as something separate from AI systems. We have two definitions, at least, of these things in the AI Act.

If you ask me, these general purpose AI models are essentially the foundation models in disguise that we know from the 2021 Stanford paper by Bommasani and many others. Now, I have a problem with general purpose AI models, and that problem boils down to the way they are defined. They say a general purpose AI model is different from any other model because of its generality and its capability to competently perform a wide range of distinct tasks.

Well, I can show you many examples of ChatGPT being utterly incompetent to deal with a wide range of distinct tasks. Does that make it any less a general purpose AI model? Not really.

And they say, if you want to have a system, then you need to add some additional components. For example, a user interface, and then it becomes an AI system. Now, I have a question.

If I download an open source model on my computer and I run it with LM Studio, for example, it provides a user interface. Is that now a model? Or is it a system? What is it exactly?

So see, there are things that they really did not think through. And then there is a symbiotic relationship in the regulation between general purpose AI models and compute. What do I mean by that?

Well, here is a visual that I created for you.

So there are two types of GPAI models. One is conventional GPAI models, which can be proprietary or open source. And if they're open source, you don't have the transparency obligations; you only need to provide a disclosure of the training content and a copyright policy.

So, lightweight obligations. And then you have the so-called GPAI models with systemic risk. And there is a presumption: if more than 10^25 floating-point operations have been used to train your model, then the model is automatically considered to be a GPAI model with systemic risk. Why 10^25? Because GPT-3, I believe, was estimated to have required 10^24 floating-point operations, so they set the bar at 10^25. It's a very weird quantitative criterion for determining whether something creates systemic risk.
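To get a feel for the threshold, here is a minimal back-of-envelope sketch. It uses the common rule of thumb from the scaling-laws literature that training compute is roughly 6 x parameters x tokens; that heuristic, and the model size and token count below, are illustrative assumptions of mine, not anything the AI Act prescribes.

    #include <stdio.h>

    int main(void) {
        /* Rule-of-thumb estimate: training FLOPs ~= 6 * N * D,
           where N = parameter count and D = training tokens.
           The heuristic and the numbers are illustrative only. */
        double params    = 70e9;   /* hypothetical 70B-parameter model */
        double tokens    = 15e12;  /* hypothetical 15T training tokens */
        double flops     = 6.0 * params * tokens;   /* ~6.3e24 */
        double threshold = 1e25;   /* AI Act systemic-risk presumption */

        printf("Estimated training compute: %.2e FLOPs\n", flops);
        printf("Presumed systemic risk: %s\n",
               flops > threshold ? "yes" : "no");
        return 0;
    }

On these made-up numbers, a 70-billion-parameter model trained on 15 trillion tokens lands at about 6.3 x 10^24 FLOPs, just under the presumption.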

And by the way, what is a systemic risk? A systemic risk means a model that has high-impact capabilities, that presents an actual or reasonably foreseeable risk to fundamental rights and values, and that has the chance of propagating across the value chain. If you ask me, of these three criteria, the last one is the only one you can actually make sense of and somehow measure. Because if a model becomes quite popular, say an open source model becomes integrated into all sorts of applications across many different industries, then you can clearly have a risk of, for example, bias propagating across all these applications. So that's measurable.

That's reasonable. What I don't know is how, without guidance, we're going to measure the actual or reasonably foreseeable risk to fundamental rights. This is extremely difficult. Measuring risk to fundamental rights is a daunting task, let alone the high-impact capabilities that are linked to generality and the model's ability to competently perform tasks. So be my guest in explaining when exactly a model would be considered to present systemic risk.

But the interesting part is not really the quantitative criterion. It's the power vested in the European Commission to decide individually, on a case-by-case basis, whether a model is going to be considered a general purpose AI model with systemic risk. For example, there's going to be a panel of, again, vetted researchers who will have the power to issue qualified alerts to the Commission. Say somebody is scrutinizing a particular company, and they issue an alert to the Commission saying: hey, I think you should designate these guys' models as GPAI models with systemic risk. Now, researchers have been selected as an allegedly neutral group, but come on, let's be honest: much of the research community engaged in machine learning research also has a vested interest in the employers who are paying their salaries. So this can also be used as an anti-competitive tool. And I'm not sure how the Commission is really going to deal with this, because the Commission itself does not have the competence to assess it. They will really have to rely on the qualified alerts of the research community.

For general purpose AI model providers, let's assume you're one, what does it mean?

What would put you under an obligation to comply with the AI Act? Well, so long as you place on the market or put into service an AI system or a general purpose AI model, you have to comply with the obligations of the AI Act. It doesn't matter if your company is established outside the European Union: so long as you bring it to the European Union, so long as you sell it or offer it to customers in the European Union, you are under the obligations of the AI Act, and you need at the least to appoint an authorized representative, even if you do not have a legal entity in the European Union.

So that's one. There is a requirement for all models to disclose technical documentation and information for integrators, and to provide a copyright policy and a detailed summary of the content used for training. This detailed summary does not have to be a list of every data set and every piece of data you used; it really has to be a narrative explanation: I've used these publicly available sources, I've used these proprietary sources that were given to me under a license agreement, and so on and so forth.

Now, the first two, technical documentation and information for integrators, do not apply to general purpose AI models that have been made accessible to the public under a free and open source license. The problem here is that we don't know what an open source license for AI means. Open source licenses were created for software, and AI is so much more than software. For example, what does it mean to provide the source code in its preferred form for making modifications, as the GPL requires? I don't know.

Yet the Commission, for the first time actually in legislation, decided to refer to free and open source licenses without anybody having a definition of this. We have a process ongoing at the Open Source Initiative, which I've been contributing to, to come up with a definition of what Open Source AI should be and whether we need new sorts of licenses for Open Source AI or whether we can work with the old ones. So that's one. And second, the Commission, the Parliament, and the Council introduced the requirement that the models be made publicly available. But that's not a requirement of any open source license. You must provide the source code of your program to those who ask for it, but there is no obligation to provide it publicly to everybody. People have been uploading their source code to repositories, but that's purely voluntary; it's not the requirement of any license. You must provide it upon request, but you don't have an obligation to make it publicly available. And yet now, in the law, you're actually required to make it publicly available.

Then, the what. I said that for models with systemic risk there are some additional obligations. You must conduct model evaluations, including adversarial testing, assessment and mitigation of systemic risks, tracking and reporting of serious incidents, cybersecurity protection, and reliance on codes of practice. That, by the way, is going to be the main tool of compliance in the early days. And then the AI Office, which is a unit within the European Commission, is going to adopt a harmonized standard by which every AI company should abide.

And when should I panic? Well, you shouldn't, really. But the thing is, obligations for providers of general purpose AI models apply within 12 months from entry into force. That means by March 2025 these obligations will apply to everybody who is placing GPAI models on the market or putting them into service. There are some fines, obviously, but these fines will only be applied one year after the entry into application. Entry into application is 12 months from adoption, meaning March 2025, so the first fines will be issued at the earliest in March 2026. That gives you roughly two years of, I'll call it, a free license to do whatever you want, or sort of, while you get in line with the requirements.

Conclusion

So AI is on trial. The jury is still out, though, on what the outcome will be.

Many have argued that copyright law is the only viable legal branch that can bring AI to its knees. I don't think so, to be honest.

I think we're going to see more litigation from the data protection perspective, and litigation based on personality rights, for example over models that have been trained using somebody's voice. Some people are not happy with their voices being used for training, for example to create digital personas. Not everybody would be happy with that.

So we'll see cases coming up from very many different angles. And I'm not sure where we're going to end up in the end, but I'm sure of one thing, that litigation will continue.

I know it was too much probably for a Tuesday evening, but I'm happy to take any questions during the discussion.
