The ICO has recently concluded its third call for evidence on generative AI, which focused on how the accuracy principle of data protection law applies to the outputs of generative AI models. The first and second calls for evidence focused on the lawful basis for web scraping to train generative AI models and on the application of the data protection principle of purpose limitation at the different stages of the generative AI lifecycle.
Focusing in particular on Large Language Models (LLMs), a type of generative AI model perhaps most commonly associated with applications such as OpenAI's ChatGPT, this blog discusses the initial key takeaways from the third instalment in the ICO's consultation series on generative AI.
The accuracy principle in data protection
The accuracy principle is one of the six data protection principles set out in the UK GDPR. It requires that any personal data collected about individuals must be accurate and, where necessary in the context of the purpose of processing, kept up to date.
In practice, the application of this principle protects individuals from having false information processed about them, shielding them from the negative impact on their privacy that can follow. The accuracy principle also helps ensure that decisions made about people using generative AI models are not based on inaccurate data.
The vast capabilities (and potential dangers) of generative AI models
LLMs have brought years of research on generative AI by companies such as OpenAI, Meta and Google to the forefront of global public interest. In essence, LLMs learn to understand and process text much as humans do. They are trained on vast amounts of data, and the outcome is an ability to generate coherent responses to prompts, answer general conversational questions and even assist users with creative writing or recipe creation.
While, on the face of it, this seems fairly benign, to take the specific example of ChatGPT, applications based on LLMs are usually trained on large amounts of readily available text data from the internet, including books, websites, articles and social media, as well as the queries and prompts inputted by users of the applications. This does not mean, however, that the text such models generate is necessarily accurate. LLMs are prone to what have been termed "hallucinations": fluent but incorrect output, which arises because these models are trained to generate plausible wording that stays on the topic of whatever they are prompted with, rather than to verify facts. This raises real concerns about the accuracy of the information disseminated through these models and the harm that it can cause.
For instance, a law professor in the US was incorrectly included in a list of legal scholars accused of sexual harassment that was generated through ChatGPT, which cited a non-existent Washington Post article. In March of last year, a mayor in Australia threatened OpenAI with a defamation action after inaccurate information generated through ChatGPT falsely identified him as being involved in a foreign bribery scandal.
The ICO’s conclusions from its initial analysis of the relationship between the accuracy principle and generative AI
At the training stage of generative AI models, the ICO expects developers to understand the accuracy of the training data being used to develop these models and specifically recommends that developers:
- Know whether the training data relating to individuals being used in these models is made up of accurate and up-to-date information, or whether it is based on historical information, opinions or inferences;
- Understand how the accuracy of this training data affects the outputs of these models;
- Consider whether the statistical accuracy of the output of these models is sufficient for their specified use, and what impact this has on accuracy from a data protection perspective; and
- Ensure that there is clear, transparent and concise communication of the recommendations above to deployers and end-users of generative AI models.
Our interpretation of the ICO’s analysis is that it reinforces the importance of ‘baking’ data protection into the development of generative AI models. Applying the GDPR requirement to keep personal data accurate and up to date is not straightforward. This is particularly so, given that some level of data inaccuracy may be tolerated if statistical accuracy is proven sufficient and that not all personal data needs to be kept up to date, for example, in the context of historical records, news articles or a snapshot in time.
This creates some grey areas and undermines the reliability of AI-generated output. We recommend that organisations consider taking the following steps to mitigate the risks associated with data accuracy:
- Assess and document the accuracy of training data as part of your data protection impact assessment, using the ICO’s AI toolkit as a guide;
- Where data inaccuracies are identified, assess the potential risk to individuals and whether those risks can be mitigated ahead of deployment to end users. For example, test the mathematical formula the model will rely upon to ensure the output is sufficiently statistically accurate for your purposes;
- Clearly and transparently communicate the statistical accuracy of the output generated by any AI model, and its intended use, to end users. In the cases of the Australian mayor and the American law professor, it can be argued that OpenAI should perhaps have provided more information about the reliability and statistical accuracy of ChatGPT’s output, which could have reduced the harm caused to these individuals by the false information disseminated about them; and
- Monitor the accuracy of the output of AI models on an ongoing basis to demonstrate compliance with the accuracy principle of the GDPR and to mitigate risk to individuals; a minimal illustration of what this kind of monitoring might look like in practice is sketched below.
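To make the idea of measuring and monitoring statistical accuracy more concrete, the short Python sketch below shows one way an organisation might score a generative model's answers against a small set of reference facts and record the results over time. It is purely illustrative: the `query_model` stub, the evaluation set and the accuracy threshold are our own assumptions, not part of the ICO's guidance or any particular vendor's API.

```python
"""Illustrative sketch only: a minimal accuracy-monitoring harness for a
generative AI model, assuming a hypothetical query_model() function and a
hand-curated evaluation set of prompts with known-correct answers."""

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class EvalCase:
    prompt: str
    expected: str  # the answer known to be factually correct


def query_model(prompt: str) -> str:
    """Placeholder for a call to whichever model or API is being assessed."""
    return "stub answer"  # replace with a real model call


def run_accuracy_check(cases: list[EvalCase], threshold: float = 0.9) -> dict:
    """Score the model against the evaluation set and flag any breach of the threshold."""
    correct = 0
    failures = []
    for case in cases:
        answer = query_model(case.prompt)
        # Naive substring match; real checks would be more nuanced
        # (e.g. human review or semantic similarity scoring).
        if case.expected.lower() in answer.lower():
            correct += 1
        else:
            failures.append({"prompt": case.prompt, "answer": answer})
    accuracy = correct / len(cases) if cases else 0.0
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "accuracy": accuracy,
        "meets_threshold": accuracy >= threshold,
        "failures": failures,  # candidates for review and training-data fixes
    }


if __name__ == "__main__":
    evaluation_set = [
        EvalCase("Which regulator enforces the UK GDPR?",
                 "Information Commissioner's Office"),
        EvalCase("How many data protection principles does the UK GDPR set out?",
                 "six"),
    ]
    report = run_accuracy_check(evaluation_set)
    print(report)  # in practice, persist this alongside DPIA and monitoring records
```

In practice, the evaluation set, the matching logic and the acceptable accuracy threshold would all depend on the context and the risk to individuals; the point is simply that accuracy checks can be repeated, documented and retained as evidence of ongoing compliance.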
What happens next?
The fourth and final call for evidence in the ICO’s consultation, focusing on how developers enable individuals to exercise their rights during the training and fine-tuning of generative AI models, will conclude on the 10th of June. The input received from the four calls for evidence during this consultation series will in turn assist the ICO in updating its guidance on generative AI, providing more clarity for organisations wishing to develop or deploy such models.
In the meantime, if you have any questions regarding generative AI and its data protection implications, please contact our specialist Data Protection Team on 03330 430350.