In the current and rapidly evolving age of artificial intelligence (AI), large language models (LLMs) have become a popular tool because of their ability to analyze and generate human-like text at a previously unseen scale. These models, usually trained on huge datasets, use attention modules within their architectures to process and understand text input, enabling the development of applications such as chatbots, translation tools, and content generation.
However, despite their immense potential, their use raises ethical concerns, particularly with respect to Personally Identifiable Information (PII), which is often sensitive. In this blog post, we explore some of the challenges associated with protecting user information in the era of LLMs and the different approaches being developed to protect this information when LLMs are used.
LLMs are now the state of the art across a range of applications, from understanding and analyzing information to generating coherent, context-aware text. Because of their versatility, these models have changed the way many applications are built, as is evident in their use for virtual assistants, translation, sentiment analysis, and content generation.
However, alongside the positive impact these models have on technological development, there are a number of ethical concerns about their responsible use and handling of data, especially during training. The models require vast amounts of training data, and seemingly non-sensitive data can become Personally Identifiable Information (PII) at inference time if users share confidential information with an LLM. Privacy protection is therefore a crucial consideration when using LLMs in order to mitigate the associated risks.
Here, Personally Identifiable Information (PII), according to the General Data Protection Regulation (GDPR), is “any information relating to an identified or identifiable natural person (‘data subject’)..., in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”
Common examples of PII include full names, home addresses, email addresses, phone numbers, identification numbers such as passport or social security numbers, and financial account details. If it falls into the wrong hands, this information could expose individuals to risks such as identity theft, financial fraud, or unauthorized access to accounts.
A recent experiment by scientists at Google DeepMind highlighted this risk: when they asked ChatGPT to repeat the word “poem” endlessly, the request bypassed the application’s guardrails and caused the chatbot to reveal parts of its training data, which contained personally identifiable information such as email addresses and phone numbers. This was not the first time that malicious prompting had bypassed the guardrails of AI chatbots; researchers at the University of California, Santa Barbara had previously bypassed safeguards by providing an application with examples of harmful content.
Given these vulnerabilities and the sheer number of users of generative AI chatbots, it is crucial that steps are taken to protect PII when using LLMs. While regulation and guidance can support this, there are also technical solutions that can be implemented.
For example, a number of technical strategies can be applied to mitigate personal data leakage, particularly anonymization during dataset curation. Here, personal information is removed or modified so that it can no longer be directly associated with specific individuals.
For example, let’s imagine that our original information is:
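“My name is John Smith, my email is john.smith@email.com, and my phone number is (555) 123-4567.” (a fictitious, illustrative example)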
This can be anonymized by masking it with any symbol, such as:
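“My name is **********, my email is ********************, and my phone number is **************.”

A minimal sketch of automating this kind of masking with regular expressions, assuming the fictitious text above (an illustration rather than production-ready code), could look like this:

```python
# Minimal sketch: masking structured identifiers (e-mails, phone numbers) with regular expressions.
import re

text = "My name is John Smith, my email is john.smith@email.com, and my phone number is (555) 123-4567."  # fictitious

masked = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "*****", text)        # mask e-mail addresses
masked = re.sub(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}", "*****", masked)  # mask US-style phone numbers
print(masked)
# My name is John Smith, my email is *****, and my phone number is *****
```

Note that free-form identifiers such as names are difficult to catch with patterns alone, which is where the Named Entity Recognition tools discussed later come in.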
Or by using entity labels, such as those used in the pii-masking-200k dataset:
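“My name is [FIRSTNAME] [LASTNAME], my email is [EMAIL], and my phone number is [PHONENUMBER].” (labels shown in the style of pii-masking-200k; the dataset’s exact label set may differ)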
Or by encryption, which encodes personal data to add a robust layer of security:
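As a minimal sketch of this idea, assuming the open-source `cryptography` package and the same fictitious record (an illustration rather than a prescribed implementation):

```python
# Minimal sketch: symmetric encryption of a record containing PII with the
# open-source `cryptography` package (pip install cryptography).
from cryptography.fernet import Fernet

# In practice the key is generated once and stored securely (e.g. in a key management service).
key = Fernet.generate_key()
fernet = Fernet(key)

record = "My name is John Smith, my email is john.smith@email.com, and my phone number is (555) 123-4567."  # fictitious

ciphertext = fernet.encrypt(record.encode("utf-8"))
print(ciphertext)  # opaque token that can be stored or shared without exposing the raw PII

# Only holders of the key can recover the original record.
assert fernet.decrypt(ciphertext).decode("utf-8") == record
```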
Furthermore, pre-trained LLMs themselves are being used to implement these strategies, detecting and removing PII from datasets or public data through prompt engineering.
For example:
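“Anonymize the following text by replacing any personally identifiable information with entity labels such as [NAME], [EMAIL], and [PHONE]: My name is John Smith, my email is john.smith@email.com, and my phone number is (555) 123-4567.” (an illustrative prompt)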
Model’s output:
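“My name is [NAME], my email is [EMAIL], and my phone number is [PHONE].” (an illustrative output)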
More information about data anonymization and encryption can be found in this paper, which expands on the concept of masking commercial information.
Beyond manual anonymization, a number of tools have been developed to help programmers anonymize PII. These include spaCy’s well-known Named Entity Recognition (NER) capability, which identifies words referring to names, locations, dates, and so on. Another useful tool is Microsoft’s Presidio SDK, whose Analyzer and Anonymizer modules support not only the identification but also the anonymization of private entities in text using different recognizers. Finally, Amazon Comprehend, available on AWS, also supports detecting and redacting PII in a given text.
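To make this concrete, here is a brief sketch of how spaCy and Presidio might be combined to detect and anonymize PII; the package names, the `en_core_web_lg` model, and the sample text are assumptions for illustration rather than a definitive setup:

```python
# Illustrative sketch: entity detection with spaCy and PII anonymization with Microsoft Presidio.
# Assumes: pip install spacy presidio-analyzer presidio-anonymizer
#          python -m spacy download en_core_web_lg
import spacy
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is John Smith and you can reach me at john.smith@email.com."  # fictitious

# 1. Named Entity Recognition with spaCy: flags names, locations, dates, etc.
nlp = spacy.load("en_core_web_lg")
for ent in nlp(text).ents:
    print(ent.text, ent.label_)        # e.g. "John Smith" PERSON

# 2. Presidio: detects PII entities and replaces them with placeholder labels.
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")
anonymized = AnonymizerEngine().anonymize(text=text, analyzer_results=results)
print(anonymized.text)                 # e.g. "My name is <PERSON> and you can reach me at <EMAIL_ADDRESS>."
```

Amazon Comprehend exposes comparable functionality through the AWS SDK, for example via its detect_pii_entities API.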
Finally, there are a number of legal resources that set out guidance and legislation governing PII, introducing requirements such as consent and placing limitations on how personal data can be collected and used.
These include the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), which provide frameworks regulating how businesses collect and use PII and giving users control over their personal data. There are also PII laws that target specific sectors, such as the Health Insurance Portability and Accountability Act (HIPAA) and the Fair Credit Reporting Act (FCRA), which protect the privacy and confidentiality of health and credit information, respectively. Finally, the National Institute of Standards and Technology (NIST) in the US published its widely known “Guide to Protecting the Confidentiality of Personally Identifiable Information (PII)” in 2010, which assists organizations in protecting personal information.
Personally identifiable information presents significant challenges for the use of AI that must be addressed to uphold the safety of users and the general public. Accordingly, efforts must be made to balance technological advances with the protection of user privacy, and doing so requires a combination of legal and technical expertise.
Schedule a demo to find out how Holistic AI’s Safeguard can help you secure your data and govern your generative AI.
DISCLAIMER: This blog article is for informational purposes only. This blog article is not intended to, and does not, provide legal advice or a legal opinion. It is not a do-it-yourself guide to resolving legal issues or handling litigation. This blog article is not a substitute for experienced legal counsel and does not provide legal advice regarding any situation or employer.