15.1 Prelude: Risks and Challenges of Generative AI
Now that the initial hype around generative AI in the form of large language models and image generators has subsided, legal issues are coming to the fore. In addition to discussions about generative AI and copyright, there is an increasing focus on the friction between generative models and the requirements of data protection law. In the United States, several lawsuits are underway against Google and OpenAI regarding potential privacy violations by generative models.Footnote 1 Regulators are currently active in the European UnionFootnote 2 and the European Data Protection Board (EDPB) has set up a task force to deal with ChatGPT.Footnote 3 The Italian data protection authority Garante had already opened a case against OpenAI in 2023, which led to a temporary national ban on ChatGPT.Footnote 4 After the proceedings were concluded, the authority found violations of the General Data Protection Regulation (GDPR).Footnote 5 Investigations into data protection violations are also underway in Poland.Footnote 6 Other countries such as Germany have issued requests for information,Footnote 7 and the French data protection authority has developed an action plan.Footnote 8 In the case of Maximilian Schrems, the data protection non-governmental organization (NGO) NOYB filed a complaint with the Austrian data protection authority in April 2024,Footnote 9 centred on incorrect information about an individual provided by ChatGPT, which OpenAI neither corrected nor responded to the related request for information about what data was processed. These cases make it clear that data protection authorities are already AI regulators and that generative AI is a core issue for data protection.Footnote 10
From a legal perspective, generative models introduce a range of distinct issues, which are well documented across various scholarly sources.Footnote 11 In particular, the foundation models on which the popular large language models (LLMs) are built pose new security risks and vulnerabilities that need to be addressed. This then gives rise to the need for a socio-technical assessment, including legal and ethical aspects, to understand these risks and the necessary safety mechanisms. Understanding the risks posed by LLMs requires a contextual approach: normative rules, like law, always operate in context.
A major concern is the protection of personal data and privacy. Several experiments have shown that it is possible to extract personal and sensitive information about individuals from LLMs.Footnote 12 Researchers have demonstrated that LLMs memorise training data, either through overfitting – where models with abundant parameters are fitted to small datasets, reducing their capacity to generalise to new data – or because optimising for generalisation on long-tailed data distributions involves memorising rare examples.Footnote 13 Although this phenomenon most often occurs where duplicates exist in the training data, it still appears where training data has been partially deduplicated. Larger models with more parameters ‘remember’ more data than smaller models.Footnote 14 Violations of people’s privacy and right to data protection result both from incorrect information and from correct information that people do not want published.Footnote 15 These risks are exacerbated by unregulated and therefore uncontrolled secondary downstream use of the models. In the case of popular LLMs operated by global technology companies, commercial resale seems remote, as the companies have no interest in giving up their exclusive option for commercial exploitation. The situation is different for smaller, but in some cases no less risky, models: Mixtral 8x7B competes with and in some respects surpasses GPT-3.5, thanks to a mixture-of-experts architecture that combines eight expert models, and has recently been made open source.Footnote 16 This only highlights the need for an overview of the purposes for which these models are used and a categorisation that enables a context-based risk assessment.Footnote 17
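To make the memorisation risk concrete, the following minimal sketch illustrates the kind of prefix-continuation probe used in published extraction experiments. It assumes the openly available Hugging Face transformers library and the small GPT-2 model as stand-ins, and the ‘record’ shown is purely hypothetical; it illustrates the technique rather than describing how any particular provider’s model behaves.

```python
# Minimal sketch of a prefix-continuation extraction probe, in the spirit of
# published memorisation experiments. Assumptions: the Hugging Face
# 'transformers' library is installed and the small, openly available GPT-2
# model is used as a stand-in; the candidate record below is hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical record suspected to be in the training data: the prefix is fed
# to the model, and we check whether it reproduces the suffix verbatim.
prefix = "Contact Jane Example by email at jane.example@"
expected_suffix = "mail.example.org"

inputs = tokenizer(prefix, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,                      # greedy decoding: most probable continuation
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated padding token
)
continuation = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Verbatim reproduction of the suffix is one indicator of memorisation.
print("continuation:", continuation)
print("verbatim reproduction?", expected_suffix in continuation)
```

If the model completes the prefix with the exact suffix, that verbatim reproduction is one indicator – though not proof – that the record was memorised during training.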
Data protection law gives rise to its own particular frictions, stemming from the general function and technical specificities of big data applications on the one hand and the particularities of generative models on the other. Generative models are used in different contexts for different purposes, to generate text, code, video, images, audio, and so on. In this chapter, I will focus on LLMs, which generate text by calculating the probability of each next word given the preceding sequence. Data that is expressed in language – that is, can be understood by human recipients – can clearly also contain personal data as covered by data protection law. For this reason, LLMs are a good example of the problems of how data protection law works in relation to AI-generated content.
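For readers less familiar with the underlying mechanics, the following toy sketch (with invented bigram counts) shows what ‘calculating the probability of the next word’ means in its simplest form; an LLM does the same thing at vastly greater scale, conditioning on the entire preceding context rather than on a single previous word.

```python
# Toy next-word model with invented bigram counts. A real LLM conditions on the
# whole preceding context and uses learned parameters, but the principle is the
# same: assign a probability to each candidate next word, then pick one.
import random

bigram_counts = {
    "data": {"protection": 7, "subject": 2, "minimisation": 1},
    "protection": {"law": 5, "authority": 3},
}

def next_word_distribution(prev_word):
    counts = bigram_counts[prev_word]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def sample_next(prev_word):
    dist = next_word_distribution(prev_word)
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

print(next_word_distribution("data"))  # e.g. {'protection': 0.7, 'subject': 0.2, 'minimisation': 0.1}
print(sample_next("data"))             # a probabilistically chosen continuation
```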
This chapter first outlines the overarching lines of conflict between data protection law and generative AI (Section 15.2). It then goes into the specific legal issues of the GDPR: the scope of application and the legal basis for the different steps of data processing by generative AI, together with the principles of data processing (Section 15.3), the rights of data subjects (Section 15.4), and questions of responsibility (Section 15.5). Section 15.6 discusses the transferability of the argument to models that create images, audio, and video. The chapter concludes with an outlook (Section 15.7).
15.2 Structural Challenges of Generative AI for Data Protection Law
Data protection law in the EU is primarily set out in the GDPR.Footnote 18 The current system of the GDPR is rooted in the Data Protection Directive adopted in 1995,Footnote 19 the right to data protection (Article 8 Charter of Fundamental Rights of the European Union, CFR), the right to privacy (Article 7 CFR),Footnote 20 and the primary legal foundation in Article 16 Treaty on the Functioning of the European Union (TFEU).Footnote 21 Article 1 GDPR defines the subject matter and scope as the processing of personal data, with the objective of protecting the fundamental rights and freedoms of natural persons. Correspondingly, the understanding of ‘processing’ of personal data is very broad and takes an all-encompassing approach that covers practically any interaction with such data.Footnote 22 For this reason, when personal data is involved, all stages in the lifecycle of an AI model may fall within the scope of the GDPR.
From a regulatory perspective, the various steps of data processing in the lifecycle of an AI model are therefore important, and for generative models can be differentiated as follows.Footnote 23 The first step is the collection of training data, made up of many data points. These may comprise personal or non-personal information. In certain instances, this process utilises extremely large datasets, making it challenging, if not impossible, to differentiate between various categories of data. For instance, ChatGPT was developed using copious amounts of data freely available on the internet. The second step is the actual training of the model using the collected data, resulting in a configured model. The third step is model application, meaning that the trained model is applied to specific cases or individuals, making the model a tool that computes a specific output in response to input data.Footnote 24 This breadth of data and the nature of the training process mean that model output can contain information about specific cases or individuals, as well as about ‘third parties’ who were not part of the training data.
15.2.1 Quantity
The first problem area relates to how the training of powerful AI models works in relation to the amount of data processed.Footnote 25 The sheer quantity of data processed by AI models is the core, as yet unresolved, problem of AI and data protection.Footnote 26 Generative AI models typically have billions, if not hundreds of billions, of parameters and require correspondingly large amounts of training data and computing power.Footnote 27 Data protection law, on the other hand, is based on the idea that the individual steps of data processing and the data processed can be identified. This concept applies the idea of individual control to empower individuals by allowing them to manage their own personal information.Footnote 28 But models trained on unprecedentedly large datasets make it impossible to manually identify or even review whether data processing complies with legal requirements, and thus harbour potential for privacy and data protection violations.Footnote 29 Furthermore, this approach conflicts with the principle of data minimisation laid down in Article 5(1)(c) GDPR. This mode of operation reveals the problems of governance arising from the systematic design of the GDPR, which, for example, envisages individual consent as a basis for authorisation and presupposes the identification of individual data subjects and the data to be attributed to them.
15.2.2 Purposes
Privacy and data protection also seem to be at odds with the general concept of generative AI when it comes to the relevance of purposes. Data protection is highly contextual, and its level of protection depends on the type of data processed, by whom, in which settings, and for which purposes (Article 5(1)(b) GDPR). LLMs, on the other hand, cover a wide range of purposes, applications, and operating environments. According to Article 3(63) of the new regulation on artificial intelligenceFootnote 30 (AI Act), a general-purpose AI model is an AI model, including one trained with a large amount of data using self-supervision at scale, that displays significant generality, is capable of competently performing a wide range of distinct tasks regardless of how the model is placed on the market, and can be integrated into a variety of downstream systems or applications. It does not include AI models used for research, development, or prototyping activities before they are placed on the market. This definition is a good description of the current market situation; OpenAI, for example, now offers a wide variety of GPTs for specific tasks: Laundry Buddy for questions about stains and laundry settings, Sous Chef, which provides users with recipes, or The Negotiator, which helps users argue in their favour.Footnote 31 These downstream applications will gain more relevance, as it can be expected that foundation models will not continue to be used primarily as isolated applications, as has been the case to date, but will be integrated into other systems as modular building blocks. This will increase both desirable and undesirable effects due to the possible scaling of model output. Here, even the basic design of LLMs is difficult to reconcile with the legislation, as it seemingly conflicts with the GDPR’s purpose limitation principle. In particular, when models are made available to numerous third parties via an interface, ensuring that the model and its data remain compatible with the purposes for which the data was originally collected (Article 6(4) GDPR) becomes difficult, if not impossible.Footnote 32
15.3 Scope of the GDPR and Legal Basis
The GDPR is applicable in both material and territorial terms – that is, it extends to the processing of personal data in the context of activities within the EU, even when that processing takes place elsewhere (Article 3(1) GDPR), and where goods or services are offered to data subjects within the Union (Article 3(2) GDPR). It therefore applies to all generative models in use in the Union.
15.3.1 Scope of Application
15.3.1.1 Personal Data
The GDPR applies to the processing of personal data (Article 2(1)) if none of the exceptions in paragraphs 2–3 apply. This processing includes both the collection of training data and the training of the models, as well as the storage and use or sale of the model to generate output based on user requests.
The processing of personal data begins with step one, the collection of vast amounts of data with which to train an LLM. As the effectiveness of LLMs is directly linked to the breadth and variety of their datasets, this data is obtained by scraping content from numerous websites. Inevitably this often includes personal data (Article 4(1) GDPR) such as names, dates of birth, or other identifying information.Footnote 33 As personal data also includes incomplete or indirect details which may result in an individual being identified through additional information,Footnote 34 this processing is covered by the GDPR, even before the model is trained or released.
In the second step of data processing, the training of the model, identifying personal data becomes more challenging, as the final trained model may differ substantially from its training data. An artificial neural network is represented by large matrices of numbers – its weights and other parameters such as activation thresholds – determined during training.Footnote 35 While the training data may include personal information, the data in the model does not necessarily retain that characteristic: personal data may be anonymised where advanced techniques such as differential privacy and federated machine learning are used during the training process to prevent the model from retaining identifiable references to the training data.Footnote 36
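As an illustration of what such privacy-preserving training involves, the following sketch shows the core step of differentially private stochastic gradient descent: each example’s gradient is clipped and calibrated noise is added before the update, so that no single training record can dominate – and thus be reconstructed from – the learned parameters. The numbers and gradients are hypothetical; production systems rely on vetted libraries and a formal privacy accountant.

```python
# Sketch of the core step of differentially private training (DP-SGD style):
# per-example gradient clipping followed by calibrated Gaussian noise.
# Illustrative only; real systems use vetted libraries (e.g. Opacus or
# TensorFlow Privacy) together with a formal privacy accountant.
import numpy as np

rng = np.random.default_rng(0)

def dp_average_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Clip each example's gradient so no single record dominates the update.
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    summed = np.sum(clipped, axis=0)
    # Add noise scaled to the clipping norm to mask any individual record's contribution.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    return (summed + noise) / len(per_example_grads)

# Hypothetical per-example gradients for one training batch.
grads = [rng.normal(size=4) for _ in range(8)]
print(dp_average_gradient(grads))
```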
A trained model resulting from such anonymisation that makes the reconstruction of training data impossible or highly unlikely is not considered to constitute personal data. However, the current popular large language models tend to persist in producing identifying information, whether by design or by accident. It cannot therefore always be assumed that model data has been fully anonymised: research into this ‘remembering’ phenomenon is ongoing. This is critical from the GDPR standpoint as the storage of the model also constitutes data processing under the GDPR if the model data is not properly anonymised. In addition, many authorsFootnote 37 argue that the anonymising of personal data itself is also a processing operation that requires justification under the GDPR.Footnote 38
In the third processing step, the production of output, the models or the applications using them can produce personal data. Whether the information provided is correct or not is immaterial: when an LLM produces outputs that contain the names and biographical information of real people, it is processing personal data. Additionally, individuals can often be easily identified from the context of the text prompt or text output, or by using search engines. LLMs linked to search engines may also facilitate identification. Particularly in the case of public LLMs, it is likely that many data subjects can be identified for the reasons mentioned above.Footnote 39 It is important to note that the people in the training data are not necessarily the same as those referred to in the output data, even where they have the same name, as LLMs can also generate the names of existing people, for example by producing information that users can then assign to individuals.
15.3.1.2 Territorial Scope of Application
Article 3(1) of the GDPR states that the Regulation applies ‘to the processing of personal data in the context of the activities of an establishment of a controller or a processor in the Union, regardless of whether the processing takes place in the Union or not’. Thus, the processing of personal data does not have to take place in the European Union itself, but can be performed on servers based, for example, in the United States or other third countries. As mentioned above, lex loci solutionis (Article 3(2)) means the requirements also apply if the data processor offers its services to data subjects in the Union, even where the processor is not located in the EU. This brings global technologies, including LLMs and other AI models such as ChatGPT, Bard, and Gemini, squarely under the GDPR where these are accessible from the European Union.
15.3.2 Legal Basis for Data Processing
All processing of personal data within the scope of the GDPR requires a legal basis (Article 6(1) GDPR). The question of the legal basis for data processing across the life cycle of a generative AI system poses different problems, as it depends on the stage of the data processing. As argued before, it is essential to distinguish between the different steps of data processing when analysing AI and data protection.Footnote 40
15.3.2.1 Collection of Training Data
The first step in the life cycle of a generative model is the collection of training data. In the case of LLMs like GPT-4 or Bard, this step consists of scraping data from the internet. The indiscriminate scouring of almost the entire internet logically excludes the legal basis of consent (Article 6(1)(a)). In the absence of legal obligations or contractual relationships between the operators of LLMs and all internet users worldwide, the scraping of training data can only rely on the legal basis of legitimate interest provided in Article 6(1)(f) GDPR.Footnote 41
Article 6(1)(f) GDPR states that data processing is lawful if it is necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection.Footnote 42 The ECJ has clarified that this provision lays down three cumulative conditions: (1) the pursuit of a legitimate interest by the controller or by a third party; (2) the processing of personal data must be necessary to pursue that legitimate interest; and (3) the legitimate interests of the controller or of a third party must not be outweighed by the interests or fundamental freedoms and rights of the data subject.Footnote 43
The fact that this constitutes the only plausible legal basis exposes the structural problem of data protection law in relation to data-intensive technologies, not least because whether Article 6(1)(f) provides a sufficient legal basis must be determined on a case-by-case basis.Footnote 44 There are indications that the legitimate interest in processing may outweigh the data subjects’ interests, or that this can be assumed, where data subjects could reasonably expect their data to be processed for training purposes. However, the nature of mass data scraping makes it almost impossible to identify individual interests, so a case-by-case assessment cannot provide satisfactory answers within current legal doctrine and legal systems.
15.3.2.2 Legitimate Interest
The term ‘legitimate interests’ is deliberately broad, encompassing legal, economic, or idealistic interests and excluding only hypothetical and public interests. The collection of data to train a generative model for commercial use is initially a legitimate economic interest protected by the freedom to conduct a business under Article 16 CFR. The European Court of Justice (ECJ) also cited freedom of information (Article 11(2) CFR) as a legitimate interest; the argument that this is transferable to generative model trainingFootnote 45 from the Google Spain caseFootnote 46 does not apply to models that are only accessible for a fee. Furthermore, search engines and generative models operate differently, and are therefore not comparable. Sources referenced in search engines can be deleted or corrected, whereas LLMs generate a unique new text for each question, with new probabilities calculated each time. If the output text is incorrect, it cannot simply be corrected for future outputs.
15.3.2.3 Necessity
The necessity test under Article 6(1)(f) requires that the processing of personal data be a proportionate means of achieving the legitimate interest. Processing is considered necessary if the processing of personal data is essential to achieve the objective of the processor’s legitimate interest – in this case, training an AI model and putting it on the market – and cannot reasonably be achieved by less intrusive means. In rare cases, where anonymised data suffices to train the model, the training may not require personal data at all. However, anonymised data alone is generally not adequate for training generative models, even if such anonymisation were possible in the training phase.
15.3.2.4 Balancing of Interests
This balancing of interests between processor and data subject must also consider the rights of data subjects under Articles 7 and 8 CFR. Their interests are particularly affected when AI collects, combines, and contextualises personal data available on the internet in response to user queries.
There is a valid argument that the processor has an interest in using a large amount of training data to ensure powerful generative models able to generate word sequences that correspond to human language. Nevertheless, it is not absolutely necessary to scrape data on a scale that covers almost all publicly accessible resources on the internet to develop generative models: datasets can be assembled in other ways, such as through data donations, effective consent solutions, or data collection by the data controller itself. However, none of these alternative options would be able to create the required breadth of data. The question is therefore which specific interest of the processor is worth protecting. Meta, for example, has publicly admitted that the main obstacle that acquiring licences for copyrighted material would have posed to the development of its generative models was the expense involved. The same argument applies to collecting training data in a manner compliant with privacy regulations: the approach would have required considerable resources. However, it is unlikely that cost savings can constitute a legitimate interest, and in any case, an interest based on structural infringements has considerably less protective value.
In the Meta case,Footnote 47 the ECJ also ruled that the personalisation of content – Meta’s core business model – was not necessary for the operation of a social network. The ECJ went on to state that legitimate interests do not adequately justify Meta’s practices of tracking and profiling individuals for the purpose of conducting its behavioural advertising business across its social platforms:
it is important to note that, despite the fact that the services of an online social network such as Facebook are free of charge, the user of that network cannot reasonably expect that the operator of the social network will process that user’s personal data, without his or her consent, for the purposes of personalised advertising. In those circumstances, it must be held that the interests and fundamental rights of such a user override the interest of that operator in financing its activity through personalised advertising, with the result that the processing by that operator for such purposes cannot fall within the scope of point (f) of the first subparagraph of Article 6(1) of the GDPR.Footnote 48
This raises substantial doubts about the ability of companies like OpenAI to defend the processing of vast amounts of personal data to establish a commercial generative AI enterprise, particularly given that such tools pose numerous emerging risks to identified individuals, including issues like disinformation, defamation, identity theft, and fraud.
Context is therefore critical in safeguarding privacy and data protection. The public accessibility of data on the internet, even where disclosed by the data subjects themselves, does not completely negate their legitimate interest in its protection. As Recital 47 states, the interests and fundamental rights of the data subject may in particular override the interest of the data controller where personal data is processed in circumstances or in ways that data subjects do not reasonably expect. Although it is now public knowledge that data posted on the internet may be processed in ways other than initially thought, it is also a question of the specific purposes of the processing. For example, the legitimate expectation of privacy means that decades-old or deleted posts, personal websites, and entries cannot be used in perpetuity to train commercial models. It is reasonable to suggest that the typical internet user does not expect, or intend, their data to be utilised as training material for LLMs for the financial gain of others. Therefore, the use of the data for training these models represents a secondary purpose. In most instances, it is unlikely that a data subject made their data publicly available to serve as a dataset for the financial gain of LLM providers, making the use of such publicly available data an infringement on contextual privacy.Footnote 49
Moreover, a legitimate interest must be determined within the broader European and national regulatory context. The broad scope of scraping also means that an unmanageable number of people are affected, which opens the claim of legitimacy to questions of proportionality. According to the German constitutional doctrine of the Federal Constitutional Court, a particularly large number of people being affected without cause can impact the claim of legitimacy. This impact is referred to as ‘scatter width’, and is a line of argumentation used by the ECJ.Footnote 50 Such effects also arise in the case of universal data processing, as almost all internet users are affected.
Additionally, the legitimate interest must also be lawful, meaning it should conform to all applicable laws and regulations, including the principles and other provisions of data protection law. This includes ensuring that processing aligns with the expectations of the data subject based on their relationship with the controller, adheres to the principles of data minimisation, and implements appropriate safeguards (Recital 50 of the GDPR). In the case of broad-scale internet scraping, individual interests are difficult to identify. However, there were concerns about the legality of scraping from the outset, including with regard to potential copyright infringements.Footnote 51 An interest pursued through structural infringements cannot be legitimate.
Compatibility with the principles of data protection law under Article 5 GDPR also plays a role in the balancing of interests. This requires that legitimate interests be evaluated in terms of the fairness of processing (Article 5(1)(a)), purpose limitation (Article 5(1)(b)), data minimisation (Article 5(1)(c)), and data accuracy (Article 5(1)(d)).
Therefore, legitimate interests cannot be assumed across all training data. The matter is made more complex given it is extremely difficult, if not impossible, to comprehensively exclude personal data pertaining to minors or special categories of personal data according to Article 9(1) from training data. To complicate this matter further, the point at which the processing of personal data ‘reveals’ special categories of personal data under Article 9(1) GDPR has not yet been conclusively clarified.
15.3.2.5 Training of the Model
It is worth undertaking a chronological review of the various data processing operations involved in training the model. A key consideration is whether data anonymisation occurs during model training. If the data is anonymised in the course of training, further processing of the anonymous data falls outside the scope of the GDPR. The prevailing view is that a legal basis under Article 6(1) is required for the anonymisation of data itself.Footnote 52 In this context, anonymisation should be understood normatively rather than purely technically, in line with ECJ case law on when data counts as anonymised: the ECJ considers data to be anonymised even where identification by the controller, using the means available to it including additional information, remains technically possible but is unlikely.Footnote 53 According to the court, data is considered anonymous under the GDPR if re-identification is illegal.Footnote 54
In principle, the anonymisation of personal data is generally easy to justify under Article 6 GDPR. The practice aligns with the principle of data minimisation and storage limitation. Effective and permanent anonymisation can serve the interests of both data subjects and data controllers: the former are protected from unauthorised interference with their fundamental data protection rights, while the latter are freed from some of the perceived burdens of complying with the stringent requirements of data protection law.Footnote 55 However, this argument struggles to hold in light of the volume of data, as effective consent from the data subjects pursuant to Articles 6(1)(a) and 7 GDPR cannot be obtained in practice.
While it may be possible to institute legal obligations to anonymise training data under Article 6(1)(c), this is not yet relevant in practice. This means that the legal basis of legitimate interest in Article 6(1)(f) may also apply to anonymisation. Generally speaking, this provision may produce adequate results, as anonymisation will typically be in the interest of the data subjects themselves, so that at the very least a conflicting interest is improbable. In the case of larger LLMs, it is equally unlikely that a data subject will have an individual interest in non-anonymisation and, moreover, even where an individual data subject does have such an interest, it is unlikely to outweigh other relevant interests, such as those of the other data subjects. As a result, anonymisation is permissible.
Assessing the processing of special categories of personal data under Article 9 is more difficult: anonymisation would require a case to be made under Article 9(2). As described above, the processing of special categories of data cannot be ruled out for LLMs. Although the hurdles of Article 9(2) are high, the ‘made public’ provision of Article 9(2)(e) can also be considered here. Others argue for a teleological reduction of Article 9(1) for anonymisation.Footnote 56 Neither variant constitutes an infringement of data subjects’ rights where training data has been anonymised. These complex considerations alone show that there are gaps between the individual-based approach of the GDPR and the tools required for adequately regulating generative AI.
15.3.2.6 Generating Output
The output of generative language models may constitute the processing of personal data. Here, a distinction must be made between the processing of scraped training data and the processing of user data in the form of prompts entered while using the model. There is no legitimate interest in processing user data, for example in the context of input prompts when using LLMs. Instead, effective consent pursuant to Article 6(1)(a) must be obtained – an instrument whose effectiveness in the digital space must be viewed critically.Footnote 57 OpenAI had to update its privacy policy for EU users after being investigated by the Italian data protection authority. It now states: ‘We use the content you provide to improve our services, such as to train the models that run our services. Read our instructions on how to opt out of the use of your content to train our models.’Footnote 58 However, consent is only a valid basis for prompts containing personal information about the user themselves. If users create prompts that include personal data of other persons, they cannot validly consent on those persons’ behalf.Footnote 59
When a generative model is capable of producing output which includes personal data, the issue of how training data was collected remains relevant throughout its lifecycle. If there was no legal basis for collecting the training data, there is no legal basis for using it to generate output. Theoretically, legitimate interest could also be considered here under Article 6(1)(f), but must be assessed on a case-by-case basis according to the criteria described above. However, LLMs make individual assessments difficult because of the quantity of data they process. In addition, generative models are scalable in terms of their output, which means false information can be disseminated to a large number of users and third parties.
Output processing is also problematic in cases where models infer or disclose special categories of personal data under Article 9(1) GDPR. It has been shown that models can memorise and reproduce private and personal information such as phone numbers, addresses, and medical documents.Footnote 60 In the age of big data, it is now potentially possible to infer sensitive information from almost any data, especially if one includes the boundless category of political opinions, covered by Article 9(1) GDPR. This means that ‘normal’ personal data can reveal special categories of personal data covered by Article 9(1), although the criteria for distinguishing between general and sensitive data remain contested.Footnote 61 One proposed criterion goes to the intention behind the data processing. Scenarios involving context-specific information could lead to the generation of sensitive data depending on the purpose of evaluation.Footnote 62 Court rulings tend to support this assumption: the ECJ seemed to interpret ‘revealing’ broadly in the Meta case,Footnote 63 and in another ruling, the Court decided that the disclosure of a spouse, partner, or cohabitee’s name could potentially indicate the sexual orientation of the applicant.Footnote 64 The Court has established minimal criteria for what constitutes the ‘revealing’ of sensitive data: the act of an ‘intellectual operation involving comparison or deduction’ is deemed sufficient to extend the special protection regime meant for sensitive data to personal data that is not inherently sensitive. However, this judgment was not directly related to big data, leaving the distinction somewhat ambiguous.Footnote 65
Consequently, in many instances involving big data, merely being able to potentially infer sensitive information may subject processes such as AI training to the provisions of Article 9, and there is little likelihood that LLMs satisfy the exceptions in Article 9(2). For instance, the research exemption under Article 9(2)(j) is restricted to the development of models for research purposes and does not permit their commercial exploitation, as indicated in Recitals 159 and 162.Footnote 66
Another important distinction is whether LLM output can be used to infer sensitive information about individuals that they have not made public themselves. Even if certain indicators, for example of political orientation, are available on the internet, LLM output may aggregate this information. As such, Article 9(2)(e) does not constitute a legal basis for this type of derivation.Footnote 67
Data accuracy requirements (Article 5(1)(d) GDPR) also apply to LLM output. Language models have been shown to ‘hallucinate’ and produce incorrect information, including incorrect personal data.Footnote 68 Under the GDPR, operators are responsible for ensuring data accuracy (Articles 5(2), 24, 25(1) GDPR).Footnote 69 Although all popular applications provide disclaimers to inform users that the models may not always be correct, the effect of such notifications is questionable given automation bias.Footnote 70 Even if the current error rate of LLMsFootnote 71 does not justify generally prohibiting such applications on the basis of ensuring data accuracy, it does affect data subjects’ rights. The right to data accuracy becomes even more significant if the right to rectification or erasure cannot be effectively enforced.
15.4 Data Subjects’ Rights
As with other areas of data-intensive technology application, there are problems with the enforcement of data subjects’ rights in the case of generative models.Footnote 72 In general, many data-driven AI technologies are developed, promoted, sold, and used by a handful of big tech companies, which establishes an informational power asymmetry between these powerful processors and users. As a result, privacy rights alone are insufficient to address the issue of data disempowerment.Footnote 73 Individuals are typically not able to fully manage their personal data, as there is a fundamental limit to the control they can exert. While rights can afford a modest degree of influence in certain isolated cases, this influence is too sporadic and disjointed to significantly safeguard privacy. Ultimately, rights function primarily as a minor element within a broader framework.Footnote 74
The sheer quantity of the data processed from various sources seems to make it impossible to identify and inform individuals of the processing, or of the processor, to enable data subjects to assert their rights with regard to the processing of their data. Therefore, the data quantity realistically rules out compliance with the data subject’s right to information. In practice, reporting indicates that companies such as OpenAI and Midjourney have not responded to requests for information from people who found themselves in the training data.Footnote 75
The prerequisite for the exercise of the data subjects’ rights under the GDPR provided in Articles 13–22 is, first and foremost, that the data subject is aware of the data processing. Users providing input into an AI model in the form of prompts are covered by Article 13 GDPR. However, Article 14 GDPR also comes into play where data has not been collected from the data subjects themselves. According to both standards, data subjects must be informed about who has processed which data (categories of data), for what purposes, on what legal basis, and whether this data has been disclosed to third parties. These transparency provisions have the specific purpose of enabling data subjects to exercise their other rights, such as the right to erasure or rectification. Article 14(5) states the exception that the transparency obligation does not apply where and insofar as the provision of such information proves impossible or would involve a ‘disproportionate effort, in particular for processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes … In such cases the controller shall take appropriate measures to protect the data subject’s rights and freedoms and legitimate interests, including making the information publicly available’.
Again, it depends on the individual case, although it seems doubtful whether LLM operators can invoke unreasonableness if they already knew before the model was developed that individual requests for information could not be complied with. In any case, the principle of accountability in Article 5(2) GDPR means a failure to respond to such requests, or a blanket reference to impossibility, is not sufficient.
Practice has shown that LLMs and possibly other generative AI models that produce content operate almost universally, not just at an individual level. This near-universal infringement reflects the profound mismatch between data-intensive models and the individual rights approach to data protection taken by data protection laws. As a result of this universality, other rights of data subjects such as the right to rectification (Article 16 GDPR)Footnote 76 and the right to erasure (Article 17 GDPR) exist on paper, but become unenforceable in practice.Footnote 77 Furthermore, removal requests from an individual data subject cannot produce the intended outcome, particularly in cases where the same information has been disseminated by multiple users interacting with the LLM.Footnote 78 In essence, simply deleting data from a training dataset offers only a superficial remedy, as it does not guarantee the elimination of the ability to retrieve that data or extract related information embedded within the model’s parameters.Footnote 79 As the output of certain machine learning models is shaped by the data used during the training phase, the original training data or information related to removed data may be deduced or ‘leaked’.Footnote 80
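The point that deletion from the dataset does not equal deletion from the model can be illustrated with a deliberately simple example: a linear model fitted by least squares. The data is synthetic and the model trivial compared with an LLM, but the structural problem is the same – the removed record’s influence persists in the fitted parameters until the model is retrained or otherwise ‘unlearned’.

```python
# Toy illustration: removing a record from the stored dataset does not change an
# already-trained model; only retraining (or genuine 'unlearning') removes its
# influence. Synthetic data; ordinary least squares stands in for a neural network.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
true_weights = np.array([1.5, -2.0, 0.5])
y = X @ true_weights + rng.normal(scale=0.1, size=100)

# 'Train' on the full dataset.
weights_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Delete one record from the dataset: the trained parameters are untouched.
X_wo, y_wo = np.delete(X, 0, axis=0), np.delete(y, 0)
weights_after_deletion = weights_full  # the model itself has not changed

# Only retraining without the record actually removes its influence.
weights_retrained, *_ = np.linalg.lstsq(X_wo, y_wo, rcond=None)

print(np.allclose(weights_full, weights_after_deletion))  # True: deletion alone is cosmetic
print(np.linalg.norm(weights_full - weights_retrained))   # small but non-zero difference
```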
15.5 Responsibility
In addition to there being various steps of data processing in generative models, multiple parties could potentially be considered data controllers under the GDPR due to their differing levels of involvement. The GDPR establishes three categories of responsibility for data processing in relation to the data subject: controller, processor, and third parties.
The data controller is primarily responsible for compliance with the provisions of the GDPR (Article 5(2)). As defined by Article 4(7) GDPR, the data controller is a ‘natural or legal person … which, alone or jointly with others, determines the purposes and means of the processing of personal data’. Article 4(8) goes on to define the processor as a ‘natural or legal person … which processes personal data on behalf of the controller’. Third parties, on the other hand, are actors other than the data subject, controller, or processor (Article 4(10)).
Prima facie, the companies that develop and deploy generative models are data controllers. However, a more differentiated picture emerges across the various steps of data processing. Indisputably, companies like OpenAI and Google act as data controllers in relation to the processing steps of foundational training – establishing the model’s parameters – and storing the model, given that they exclusively determine the modalities of data processing, such as the decision to release a freely accessible LLM. However, in terms of output production, generative models process data based on prompts from their users. Whether this establishes providers and users as joint controllers within the meaning of Article 26 GDPR remains an open question.
Joint controllership under Article 26 GDPR refers to the situation where two or more controllers jointly determine the purposes and means of data processing. In contrast, the relationship between a controller and a processor (Articles 4(7), 4(8) and 28 GDPR) is different, as in this constellation, the processor processes data on behalf of, and subject to the instructions of, the controller. Joint controllership is thus a relationship of equality, whereas the data processor operates as a contractor, following instructions issued by the data controller. Whether this structure can be transferred to the relationship between providers of generative models and those who use them is questionable.
Users are not considered processors within the meaning of Article 28 GDPR: although there is a contract between them and the providers, they do not bear the obligations of a processor, especially those imposed by Article 28(3) GDPR, since they may generate prompts at will and do not process data on the controller’s instructions. The purpose of generative models is to enable users to freely use the model for their own defined purposes, for example to formulate letters, find cooking recipes, or revise texts, free from instructions from the provider.
Users and providers could therefore be joint controllers, but this would require them to jointly define the purposes of data processing and to set out transparent mutual obligations. This classification is supported by the fact that users and providers both influence the purposes of data processing: the providers of generative models set the basic framework within which their models are used, while users specify the purposes according to their individual needs. Consequently, users and providers are interdependent and have a reciprocal effect on data processing. However, this is contradicted by the fact that users also tend to be data subjects and that, according to Article 26(3) GDPR, data subjects have the right to bring claims against any of the joint controllers. Although the law does not require joint controllers to hold the same level of responsibility, mere contributory causation without cooperative action is not sufficient for joint responsibility.Footnote 81 Additionally, users’ limited influence over data processing means they cannot effectively be held responsible towards third parties, as users have no means of granting access to, or deleting personal data from, the training data.
The relationship between users and providers of generative AI therefore presents a special case that cannot be seamlessly subsumed under the categories of the GDPR. On the one hand, users are more than just data subjects, as their active inputs are required to generate and shape the model’s output. On the other hand, they are neither data processors nor joint controllers, as they have no influence over the fundamental modalities of data processing. For instance, providers are able to simply deactivate models or make them subject to fees (as in the case of ChatGPT). The ECJ considers the extent to which data controllers participate in the joint data processing and the specific processing phases within which this occurs to be crucial.Footnote 82 For LLMs, users only participate in output generation, which is significantly dependent on the previous steps, such as training. The purpose and aim of the regulations concerning joint controllership is to counteract a diffusion of responsibility among multiple participants. Affected individuals should be able to clearly identify who is collecting their personal data, and for what purpose (Recital 58). Therefore, while providers may be responsible for user-generated content in the case of generative models, the inverse does not apply. This follows from the reasoning and fundamental rights protection of the GDPR provisions regarding responsibility and also corresponds to technical and economic reality.Footnote 83
15.6 Images, Audio, and Videos as Personal Data
The considerations for LLMs are not always transferable to generative models that produce audio, images, and video. This is because the aim of these models is not to generate information, which may be incorrect, but to generate new audio or visual material. This primary aim of generating new content means that many of the issues arising in these cases concern copyright.Footnote 84 Images and videos can nevertheless also constitute personal data if they can be used to identify a person, something easily achieved today through image searches.
A major problem is the significant increase in deepfakes in the digital context, which now affects not only public figures but also the general population. Women in particular are often victims of deepfake pornography, where explicit images and videos are generated using their images, without their consent.Footnote 85 This is an unlawful processing of personal data that violates the GDPR and, in many cases, national regulations. The photographed image of a person in these cases constitutes personal data if the person is still alive, regardless of whether the data is fake or not. The purpose of deepfakes is to disparage or discredit a specific individual, thus fulfilling the decisive characteristic of Article 4(1) GDPR, namely that the person is identified or identifiable. Voices may also constitute personal data, if the person is identifiable; visual or acoustic identification methods recorded using pattern recognition, such as facial or voice recognition (speaker recognition), can even be considered biometric data under Article 4(14) GDPR.Footnote 86
The latest addition to the digital legislation cavalry, the AI Act, only imposes a labelling obligation on deepfakes (Article 50(4)), leaving considerable doubt as to whether there is an adequate level of legal protection at European level.Footnote 87
15.7 Conclusion and Outlook
The popular use cases of generative AI models show that data protection law is reaching its limits when it comes to regulating data-intensive technologies. In addition to the problems highlighted here, further questions arise regarding the principle of purpose limitation for data processing for data-intensive models and their downstream applications. The use of LLMs in decision-making situations raises questions about the scope of the prohibition in Article 22 GDPR.Footnote 88
Structural problems of almost universal reach exist in the tension between the GDPR’s focus on individual protection and the volume of data processed for training purposes, as well as in a structural enforcement deficit, particularly regarding data protection principles and data subjects’ rights.
As important as the structure of data protection law is for the protection of fundamental rights, new solutions are needed for the structural challenges posed by generative AI and other data-intensive technologies. These may also lie outside data protection law. To address these challenges, it is important to recognise the structural dimension of AI as a socio-technical development.Footnote 89 As a result, there is a need for structural solutions that go beyond the enforcement of individual rights.Footnote 90 Unfortunately, these issues remain unaddressed, as the AI Act does not contribute solutions to remedy the structural and specific challenges posed by the GDPR’s individual rights focus. Despite its stated goal of protecting fundamental rights including data protection, the AI Act follows the parameters of product safety law and as such takes an approach fundamentally different from legal frameworks aimed at protecting fundamental rights, such as the GDPR. As part of a digital legislative framework that takes a variety of approaches to protecting against the risks posed by AI, the AI Act does establish certain obligations for general-purpose AI models under Articles 51–54, such as technical documentation (Article 53(1)(a)), a policy to comply with Union copyright law (Article 53(1)(c)) and a summary of the content used for training (Article 53(1)(d)), but these provisions do not address the privacy and data protection of users. Additionally, there are broad exceptions for open-source models (Articles 2(12) and 53(2)). This still leaves a need for legal regulation.