
3 - Privacy Identification of Human–Generative AI Interaction

Published online by Cambridge University Press:  19 September 2025

Dan Wu
Affiliation:
Wuhan University, China
Shaobo Liang
Affiliation:
Wuhan University, China

Summary

Generative AI based on large language models (LLMs) currently faces serious privacy leakage issues due to its enormous parameter scale and diverse data sources. When using generative AI, users inevitably share data with the system. Personal data collected by generative AI may be used for model training and leaked in future outputs. The risk of private information leakage is closely tied to the inherent operating mechanism of generative AI, and such indirect leakage is difficult for users to detect because of the high complexity of that mechanism. By focusing on the private information exchanged during interactions between users and generative AI, we identify the privacy dimensions involved and develop a model of privacy types in human–generative AI interactions. This model can serve as a reference for generative AI to avoid training on private data and help it provide clear explanations of relevant content for the types of privacy users are concerned about.

Information

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2025

3 Privacy Identification of Human–Generative AI Interaction

3.1 Introduction

At the end of 2022, OpenAI released ChatGPT, a chatbot built on the GPT-3.5 large language model, sparking heated public discussion about generative artificial intelligence (GenAI). In February 2023, Microsoft updated Bing to integrate the AI technology behind ChatGPT, and the craze continued: users could now pose questions in the search box and receive targeted answers generated by the new technology, bringing the discussion about generative AI to a peak. Driven by the surge in large language models and the growth of computing power, more and more generative AI products have appeared on the market, including Google's Bard, Baidu's Wenxin Yiyan, and Anthropic's Claude. Generative AI represents a new generation of AI driven by large language models, allowing humans to direct AI to create text, images, videos, and more.

It has demonstrated remarkable capabilities, for example, passing college-level exams (Choi et al., Reference Choi, Hickman, Monahan and Schwarcz2021), and has achieved notable results even in areas long considered unsuitable for machines, such as creativity (Chen et al., Reference Chen, Sun and Han2023). Generative AI based on large language models has been widely used in various scenarios thanks to its powerful capabilities, including, for example, educational work (Baidoo-anu & Ansah, Reference Baidoo-anu and Ansah2023), code writing (Dwivedi et al., Reference Dwivedi, Kshetri, Hughes, Slade, Jeyaraj, Kar, Baabdullah, Koohang, Raghavan, Ahuja, Albanna, Albashrawi, Al-Busaidi, Balakrishnan, Barlette, Basu, Bose, Brooks, Buhalis and Wright2023), and medical health (Cascella et al., Reference Cascella, Montomoli, Bellini and Bignami2023). It can be easily accessed through a web interface, which has led to its widespread adoption, and the interaction between humans and generative AI has steadily increased and deepened. The most obvious evidence is the user base of ChatGPT, the fastest product in history to reach 100 million users, which is still growing rapidly (Porter, Reference Porter2023). It should be noted that, although such transformative tools have brought great help to human work and life, critical and unavoidable privacy issues have also emerged.

The essence of generative AI is to imitate human capabilities and produce human-related content through large-scale learning from training data. With the widespread application of this technology, the problem of privacy leakage has become increasingly prominent and urgently needs to be solved, because interaction with humans is built on the collection and processing of human data, which itself carries the risk of privacy leakage. On the one hand, these large language models rely heavily on massive datasets for training, which usually come from public Internet resources, social media, and even private communication records (Brown et al., Reference Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Wu, Winter and Amodei2020). Because of the diversity and breadth of data sources, protecting data security and privacy has become extremely complicated. On the other hand, when users use generative AI, they share various types of data with the system, such as text, voice, and images. Once collected, these data may continue to be used for model training to further improve the system's performance, and this training process may cause personal data to be accidentally leaked in future outputs. Specifically, the risk of privacy leakage is closely related to the inherent operating mechanism of generative AI: during content generation, users' sensitive information may be inadvertently embedded in the generated results, posing a threat to user privacy (Carlini et al., Reference Carlini, Tramèr, Wallace, Jagielski, Herbert-Voss, Lee, Roberts, Brown, Song, Erlingsson, Oprea and Raffel2021).

To address this, both industry and academia have proposed methods to mitigate privacy leakage in generative AI, chiefly by constraining the training data and the generated content. On the data side, researchers are expected to ensure that training data contains no sensitive information, using technical means such as data screening and cleaning to minimize privacy risks. On the output side, generative AI developers can deploy content filters, sensitive information identifiers, and similar safeguards so that generated results do not contain users' personal data. These measures alleviate the risk of privacy leakage to a certain extent, but significant shortcomings remain.
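
To make the idea of an output-side safeguard concrete, the sketch below shows a minimal, hypothetical content filter that redacts common personal identifiers (emails, phone numbers, card numbers) from generated text before it is returned to the user. The patterns and placeholder format are illustrative assumptions, not any vendor's actual implementation.

```python
import re

# Illustrative post-generation content filter (a sketch, not a production system):
# redact common personal identifiers before the generated text reaches the user.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b(?:\+?\d{1,3}[-\s]?)?(?:\d{3,4}[-\s]?){2,3}\d{3,4}\b"),
    "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(generated_text: str) -> str:
    """Replace each matched identifier with a labelled placeholder."""
    for label, pattern in PATTERNS.items():
        generated_text = pattern.sub(f"[{label} removed]", generated_text)
    return generated_text

# Example: identifiers are replaced while the rest of the answer is preserved.
print(redact("You can reach Alice at alice@example.com or 555-123-4567."))
```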

Therefore, solving the problem of privacy leakage of generative AI in the future requires the joint efforts of developers and users. On the one hand, the developers of generative AI should adhere to the principle of privacy protection during the development and training process, exclude sensitive content during data training, and also perform privacy protection processing in the generated results. On the other hand, users should be given clear guidance to let them understand the operating mechanism of generative AI to improve their awareness of privacy-related issues. In the process of interaction between generative AI and users, they should know what types of information will be obtained and how they should act to protect privacy.

In this chapter, we focus on identifying the privacy involved and establish a privacy type model for human–generative AI interaction by examining the exchange of private information during interactions between users and generative AI. We start with a theoretical review to explore the issues and existing methods of privacy protection in generative AI. Subsequently, we establish a privacy type model that describes in detail the privacy types and scenarios that may be involved in human–generative AI interaction; through the classification and analysis of these privacy types, we can understand the privacy risks of generative AI in different application scenarios. Finally, we explore the value and importance of the privacy type model in practical applications. Generative AI has broad application potential in knowledge sharing, search, management of health information, scientific discovery, and more, and the privacy type model can be applied in these fields to better protect user privacy and improve the credibility of generative AI and user satisfaction. The research in this chapter lays a theoretical foundation for subsequent privacy research in human–generative AI interaction and provides specific references for the practical application of privacy protection measures. Specifically, this chapter systematically identifies privacy in the interaction between humans and generative AI and proposes a scientific and effective privacy type model, hoping to provide reference and guidance for users and developers, ensure user privacy security, and promote the healthy development of generative AI technology, thereby achieving a more secure and reliable human–AI interaction environment.

3.2 Related Concepts

3.2.1 Privacy Leakage in Generative AI

The privacy leakage problem in generative AI has become a research hotspot in academia and industry. This section will explore the privacy leakage problem in generative AI from two aspects: data source and operation mechanism, combining existing research and practice.

Data Source

Generative AI models are large-scale and involve massive amounts of data from various sources, much of which may be related to private information. The diversity and breadth of data are the basis for generative AI to generate high-quality content, but they also create the problem of privacy leakage. Data collection and processing link generative AI to privacy issues, and the root of these issues is that the source of the data directly affects the security of user privacy. This section explores how data sources create privacy issues in the interaction between people and generative AI. The vast majority of generative AI systems are trained on data extracted from the Internet, social media platforms, and other public domain sources, which makes it difficult to determine whether any form of sensitive information is present in the training data.

Internet data is one of the most widely used data sources in the training of generative AI models, and a large part of it is created by Internet users. For example, GPT-3 is trained on data from the wider Internet, including Wikipedia, news articles, and more (Brown et al., Reference Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Wu, Winter and Amodei2020). Such data may contain sensitive personal information, such as names; unless this information is filtered in advance, the model may inadvertently absorb it during training. Social media data is a key type of Internet data that enriches the data sources of generative AI. User-generated content on social media is very rich, ranging from public blog posts to comments, and is often reused without permission. Direct use of this data in model training can leak private information (Carlini et al., Reference Carlini, Tramèr, Wallace, Jagielski, Herbert-Voss, Lee, Roberts, Brown, Song, Erlingsson, Oprea and Raffel2021).

In addition, another important source of generative AI training data is public datasets, which are published by scientific research institutions, governments or enterprises and open to academic research and technology development. But, at the same time, publicly released datasets also contain some private information. For example, in the widely used ImageNet dataset, some images present personal information, including facial expressions and geographic locations (Deng et al., Reference Deng, Dong, Socher, Li, Li and Fei-Fei2009). Since the dataset is already in the public domain, the risk of privacy leakage increases during data sharing and reuse.

Another important data source is the user input that generative AI collects in real time during interaction. These data are used not only to provide customized services but also to continuously enhance and optimize the model. Human contact with generative AI is information-intensive, including chat records, voice interactions, and search records. If not protected, this type of information may be maliciously exploited or leaked, posing serious privacy risks. To reduce data leakage, data related to user interactions must be stored and transmitted under strict security controls; however, most generative AI systems are not well protected in this regard. For example, if users' voice data is not protected during transmission, it may be intercepted by a man-in-the-middle attack. In addition, when data is transmitted to a cloud server without proper access control, hackers may exploit weaknesses to obtain sensitive user information. In short, although public datasets, crawled online data, and user interaction data provide valuable training material, these data sources also expose generative AI systems to new privacy challenges.

Operation Mechanism

Generative AI is increasingly being used in fields such as healthcare (Leonard, Reference Leonard2023), finance (Estrada, Reference Estrada2023), and emotional counseling (Kimmel, Reference Kimmel2023). When using generative AI in these fields, individuals often disclose sensitive information, such as their medical data, financial status, or emotional harm. Users expose more private information when using generative AI due to three characteristics of its working mechanism, namely content integration, question–answer interactivity, and tool anthropomorphism (Wu & Sun, Reference Wu and Sun2023). These three characteristics distinguish privacy issues in user–generative AI interactions from privacy issues in other environments.

In terms of content integration, generative AI can receive structured text input from users and generate integrated answers in a unified format, showing the value association between semantic parts, thereby guiding users to include various private information when creating content. In terms of question–answer interaction, generative AI can have multiple rounds of dialogue with users and provide answers that are closer to the dialogue context and user needs based on the retrieval context and intent. This feature greatly increases users’ trust in the output of generative AI, but the immediacy and continuity of this association also make privacy issues complex and changeable. In terms of tool anthropomorphism, generative AI often adds anthropomorphic and emotional expressions to conversational discourse. By imitating the naturalness of human discourse, generative AI can often help users let down their psychological defenses and make it easier to input private information (Lu et al., Reference Lu, McDonald, Kelleher, Lee, Chung, Mueller, Vielledent and Yue2022).

The Impact of Privacy Leakage

Diverse data sources and operational processes make it easier and more complex for humans to leak privacy when interacting with generative AI, exposing users to a series of new privacy issues (Peris et al., Reference Peris, Dupuy, Majmudar, Parikh, Smaili, Zemel and Gupta2023). The new privacy challenges in user interactions with generative AI mainly stem from two factors. On the one hand, there are classic privacy issues such as data leakage and the use or sale of personal information (Kshetri, Reference Kshetri2023). Due to computing resource limitations and content review requirements, most popular generative AI based on large language models are run using cloud services. Cyber attackers can exploit vulnerabilities in developers’ systems to access users’ personal accounts and conversation history data (Carlini et al., Reference Carlini, Tramèr, Wallace, Jagielski, Herbert-Voss, Lee, Roberts, Brown, Song, Erlingsson, Oprea and Raffel2021). For example, ChatGPT was exposed to the leakage of some users’ names, email addresses, payment addresses, and credit card numbers (Golda et al., Reference Golda, Mekonen, Pandey, Singh, Hassija, Chamola and Sikdar2024).

On the other hand, previous studies have found that large language models remember information in training data and leak this information in response to specific prompts (C. Zhang et al., Reference Zhang, Ippolito, Lee, Jagielski, Tramer and Carlini2023). Since generative AI based on large language models currently uses user data to train models regularly, there is a risk that generative AI will output user privacy information to others. This means that private information entered by a specific user may be remembered by the model and leaked when responding to prompts from others. Although language models are used in classical AI, such as Siri, their applications are more limited, such as smart home control (Shalaby et al., Reference Shalaby, Arantes, GonzalezDiaz and Gupta2020). The openness of generative AI allows users to leak more private information, and the scale and intensity of such harms are likely to increase further compared to classical AI. For example, the knowledge provided by researchers to a large language model may be incorporated into the model, and generative AI can provide it to others without verifying the original source (Van Dis et al., Reference Van Dis, Bollen, Zuidema, van Rooij and Bockting2023). Existing research has shown that the training data of the model can be inferred from the generated text (Song et al., Reference Song, Ristenpart and Shmatikov2017). This attack method highlights how generative models leak training data while generating content.

Privacy leakage in generative AI not only threatens personal privacy and leads to the abuse of personal information but may also have adverse effects on enterprises and society. If corporate personnel fail to protect corporate-related privacy when using generative AI, technical or trade secrets may be leaked, resulting in economic losses to the company (Maddison, Reference Maddison2023). Large-scale privacy leaks may cause the public to lose trust in generative AI and hinder the development and use of new technologies. Therefore, understanding how to deal with new privacy challenges in the interaction between users and generative AI is crucial for developers to enhance the interaction experience between people and generative AI from the perspective of human–computer interaction.

3.2.2 Privacy Protection Methods

In Section 3.2.1, we discussed in detail the privacy leakage issues in generative AI, emphasizing the privacy risks and challenges humans face when interacting with such systems. To deal with these issues effectively, researchers and engineers have developed a variety of privacy protection methods. This section reviews the existing methods, mainly involving restriction of data and privacy by design, and discusses their advantages, disadvantages, and practical effects.

Restriction of Data

In the training of large-scale language models, data restriction aims to reduce the potential risk of privacy leakage by improving how data is selected and processed. It is an important means of ensuring privacy protection for generative AI models during training.

Data anonymization is a common privacy protection strategy that protects privacy by removing or blurring personal identity information. However, anonymization cannot completely guarantee that data will not be re-identified. For example, studies have shown that, even when social media users use pseudonyms or usernames, attackers can recover their real names or other personal information through natural language processing techniques (Fire et al., Reference Fire, Goldschmidt and Elovici2014). Individual identities can also be re-identified by combining multiple de-identified datasets (Narayanan & Shmatikov, Reference Narayanan and Shmatikov2008). Such de-anonymization techniques show that, although data may be processed before release, attackers can still extract sensitive information from anonymized data through re-identification. Relying solely on anonymization therefore makes it difficult to fully protect user privacy.
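
As a toy illustration of the linkage risk behind such re-identification, the sketch below joins two hypothetical "anonymized" releases on shared quasi-identifiers (ZIP code, birth year, gender). The data, column names, and file structure are invented for demonstration only.

```python
import pandas as pd

# Two releases, each "anonymized" on its own, but sharing quasi-identifiers.
health_release = pd.DataFrame({
    "zip": ["10001", "94110"], "birth_year": [1985, 1990],
    "gender": ["F", "M"], "diagnosis": ["asthma", "diabetes"],
})
public_roll = pd.DataFrame({
    "zip": ["10001", "94110"], "birth_year": [1985, 1990],
    "gender": ["F", "M"], "name": ["Alice", "Bob"],
})

# Joining on the quasi-identifiers re-attaches names to diagnoses,
# even though neither table alone contains both fields.
reidentified = health_release.merge(public_roll, on=["zip", "birth_year", "gender"])
print(reidentified[["name", "diagnosis"]])
```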

To further improve the privacy protection of training data, differential privacy has been introduced into data processing. Differential privacy prevents over-reliance on, and potential leakage of, training data by injecting noise during model training, ensuring that personal information cannot be recovered from the processed data even if an attacker holds external information (Abadi et al., Reference Abadi, Chu, Goodfellow, McMahan, Mironov, Talwar and Zhang2016). For example, some studies have enhanced privacy through differentially private stochastic gradient descent while still allowing the generated data to remain useful for multiple visual tasks (Dockhorn et al., Reference Dockhorn, Cao, Vahdat and Kreis2023). Although differential privacy provides strong protection in theory, in practice it still faces challenges such as high computational complexity and the trade-off between noise and data analysis accuracy. Its effectiveness depends on how the noise is introduced and on the specific characteristics of the data; if noise is introduced improperly, it may undermine the validity of the data and the performance of the model (Dwork & Roth, Reference Dwork and Roth2014).
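
To make the mechanism concrete, here is a minimal NumPy sketch of the core step of differentially private stochastic gradient descent in the spirit of Abadi et al.: per-example gradients are clipped, averaged, and perturbed with calibrated Gaussian noise before the parameter update. The clipping norm, noise multiplier, and learning rate are illustrative values, and a real implementation would also track the cumulative privacy budget.

```python
import numpy as np

def dp_sgd_step(per_example_grads, params, clip_norm=1.0, noise_multiplier=1.1, lr=0.01):
    """One DP-SGD step: clip each example's gradient, average,
    add Gaussian noise scaled to the clipping norm, then update."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    avg_grad = np.mean(clipped, axis=0)
    noise_std = noise_multiplier * clip_norm / len(per_example_grads)
    noisy_grad = avg_grad + np.random.normal(0.0, noise_std, size=avg_grad.shape)
    return params - lr * noisy_grad

# Toy usage with random per-example gradients for a 4-parameter model.
params = np.zeros(4)
grads = [np.random.randn(4) for _ in range(32)]
params = dp_sgd_step(grads, params)
```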

In order to prevent generative AI from leaking user privacy information in generated content, methods to restrict generated content have also emerged. These methods mainly reduce the risk of sensitive information leakage by introducing constraints or adopting special generation strategies during the generation process. Generative adversarial networks are widely used in content generation for generative AI. In order to protect privacy, researchers have proposed privacy-enhanced generative adversarial networks. For example, a study used a diffusion autoencoder to generate semantically meaningful perturbations, thereby facilitating the protected face to be recognized as another person (Liu et al., Reference Liu, Lau and Chellappa2023). However, the introduction of privacy-preserving constraints may also affect the quality and diversity of generated content. How to strike a balance between protecting privacy and maintaining content quality remains an important research topic.

Privacy by Design

The lack of transparency in the working mechanism of generative AI is particularly prominent compared to traditional information acquisition tools. The list of links provided by traditional search engines allows users to choose trusted sources of results, while the training data and working principles of large language models are little known (Stokel-Walker, Reference Stokel-Walker2023). This opacity makes it difficult for users to prevent privacy leaks (Li et al., Reference Li, Cao, Lin, Hou, Zhu and El Ali2024). Therefore, it is particularly important to let users understand how the private information they input is collected, processed and disseminated for privacy protection.

In this context, the concept of privacy by design has been applied in some human–computer interaction research (Wong & Mulligan, Reference Wong and Mulligan2019). Privacy by design refers to embedding privacy protections into products from the initial design stage, and it has been incorporated into the EU's General Data Protection Regulation and the US Federal Trade Commission's policy recommendations. Traditional legal and regulatory means usually rely on ex-post penalties to implement privacy protection, while privacy by design provides a proactive approach that builds privacy protection into the design process.

One of the main purposes of privacy by design is to provide information and support for privacy decisions. System design can help users make privacy-related choices and take privacy-related actions during use. To this end, existing research has worked on improving the design of privacy statements, from visual design to text content to presentation timing (Kelley et al., Reference Kelley, Cesca, Bresee and Cranor2010). In addition, some studies have explored the design of user privacy controls, visual and interaction design, and architecture (Jancke et al., Reference Jancke, Venolia, Grudin, Cadiz and Gupta2001), or encouraged users to engage in privacy behaviors through the design of privacy prompts (Chang et al., Reference Chang, Krupka, Adar and Acquisti2016), thereby supporting user decision-making.

In these designs, privacy issues are conceptualized as information problems or insufficient tools for users. Therefore, the design of informing and supporting privacy decisions focuses on providing users with relevant information to encourage them to make more privacy-protecting decisions or providing tools and methods that enable them to more easily resolve privacy issues in practice. This implicitly assumes that, if users receive the right information or have the right tools, they will be able to participate in human–AI interactions in a more privacy-protecting way.

The Shortcomings of Privacy Protection Methods

Although there are many privacy protection methods, such as data anonymization, differential privacy, and privacy by design, these methods still have many shortcomings in practical applications in addition to their inherent flaws. First, although they can cope to a certain extent with privacy leakage in static data analysis and model training, there is an urgent need for real-time privacy identification and protection in the dynamic scenarios in which users interact with generative AI. There is currently a lack of mechanisms for identifying private information during user input and interaction, making it difficult for generative AI systems to effectively avoid the leakage of sensitive information.

Secondly, different types of private information differ significantly in sensitivity and protection needs. Existing privacy protection methods are often generic and lack differentiated protection for different types of private information. For example, the General Data Protection Regulation imposes stricter protection requirements on specific categories such as health data and financial data, indicating that these data are far more sensitive than general types of data and that uniform protection measures may leave them under-protected. Existing privacy protection methods therefore need to be further refined to address the protection needs of different types of private information.

Third, transparency of privacy protection measures is the basis for user trust. Although existing privacy protection methods cover certain user participation, they often lack transparency, making it impossible for users to understand how the system handles their data. This not only increases the risk of privacy leaks, but also reduces users’ trust in the system. By increasing the transparency of privacy protection measures and allowing users to understand how the system handles and protects their data, it can enhance user trust and promote the widespread use of the system.

Given the shortcomings of existing privacy protection methods, we aim to perform more granular type identification of private information in generative AI systems. This can provide a basis for developing more sophisticated and effective privacy protection methods in the future. By identifying different types of private information and understanding which private information is more sensitive to users, more targeted and effective protection measures can be designed. For example, stronger encryption technology and strict access control can be used for certain types of data, while relatively loose protection measures can be used for other types of data. This can protect sensitive data while ensuring that the use of non-sensitive data is not subject to excessive restrictions. The model can better balance privacy protection and user experience, thereby enhancing the user experience.
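
As a minimal sketch of such differentiated protection, a system could consult a policy table that maps each privacy type to handling rules. The types, fields, and values below are purely illustrative assumptions, not a prescribed policy.

```python
# Hypothetical policy table: more sensitive privacy types get stricter handling.
PROTECTION_POLICY = {
    "health information":               {"encrypt_at_rest": True,  "use_for_training": False, "retention_days": 30},
    "financial consumption information": {"encrypt_at_rest": True,  "use_for_training": False, "retention_days": 30},
    "personal identity information":    {"encrypt_at_rest": True,  "use_for_training": False, "retention_days": 90},
    "preference information":           {"encrypt_at_rest": False, "use_for_training": True,  "retention_days": 180},
    "online activity information":      {"encrypt_at_rest": False, "use_for_training": True,  "retention_days": 180},
}

def handling_rules(privacy_type: str) -> dict:
    """Look up rules for a detected privacy type; default to the strictest
    rules when the type is unrecognized."""
    return PROTECTION_POLICY.get(
        privacy_type,
        {"encrypt_at_rest": True, "use_for_training": False, "retention_days": 30},
    )

print(handling_rules("health information"))
```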

In addition, different types of private information have different requirements in laws and regulations. Identifying privacy types helps ensure that generative AI systems comply with corresponding legal and regulatory requirements when processing data and avoid legal risks. At the same time, by clearly classifying privacy types and protecting them in a targeted manner, the transparency of generative AI systems can be improved, giving users a clearer understanding of what information is collected and how it is protected. This transparency can enhance users’ trust in the system and promote wider acceptance and use.

3.2.3 Context Integrity Theory

In recent years, norm-based privacy theories have received increasing attention. Among them, contextual integrity theory has been widely used to identify privacy violations in various contexts. This theory recognizes that people interact in different contexts, and each context carries specific expectations about what types of information are shared with whom and under what circumstances. Its core idea is that protecting privacy depends not only on protecting the data itself but also on the rules governing how data flows and is used in specific contexts (Nissenbaum, Reference Nissenbaum2004). Contextual integrity theory emphasizes that privacy is not merely the nondisclosure or secrecy of data but the flow of data in appropriate contexts according to appropriate norms and expectations. Privacy management thus becomes a process of negotiating the social norms and expectations held by individuals. This theory provides a new perspective for our research: identifying the types of privacy involved in the interaction between humans and generative AI by considering how private data is disseminated and used in different contexts.

Specifically, the contextual integrity theory points out that different contexts shape the information flow norms in different scenarios, and these norms determine what kind of information use behavior is appropriate. Context-related information norms are composed of three categories: information subject, information type, and transmission principle (Nissenbaum, Reference Nissenbaum2004). The appropriateness of information flow depends on the examination of five elements in the three categories of norms: information sender, information receiver, subject involved in the information, information type, and flow conditions or constraints.

Contextual integrity theory is often used to explore privacy norms in different environments, such as the Internet of Things (Apthorpe et al., Reference Apthorpe, Varghese and Feamster2019) and education (Shvartzshnaider et al., Reference Shvartzshnaider, Tong, Wies, Kift, Nissenbaum, Subramanian and Mittal2016). The five elements of the theory provide a method for studying the subtle effects of information flows on privacy in a specific context: a privacy violation occurs when one or more of these elements do not meet individuals' norms and expectations. Specifically, contexts (such as interaction contexts) are social spaces of privacy expectations involving multiple participants, including information senders, receivers, and the individuals or groups who are information subjects; the type of shared information (such as location information) and the transmission principles (rules for how information is transmitted between participants) also need to be considered.

This study applies the contextual integrity theory to the interaction between humans and generative AI and maps five elements to specific elements in the context: the sender of information is the individual user who interacts with generative AI; the receiver of information includes the large language model of generative AI (direct receiver), its developers, operators, and other users (indirect receiver); the subjects involved in the information include individual users, other individuals, or groups; the information type is the various types of information input by the user, which is reflected in the context of privacy disclosure as the privacy type disclosed by the user; and the transmission principle is the condition or constraint in the flow of private information.
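
A minimal sketch of this mapping is shown below, with illustrative field values: each disclosure event in a human–generative AI interaction can be represented by the five contextual integrity parameters, and a flow can be flagged when any parameter departs from the norms expected in that context. The norm-checking logic is a simplification for demonstration.

```python
from dataclasses import dataclass

@dataclass
class InformationFlow:
    """Five contextual-integrity parameters of one disclosure event."""
    sender: str                  # e.g., the individual user
    receiver: str                # e.g., the LLM, its developer/operator, other users
    subject: str                 # whom the disclosed information is about
    information_type: str        # e.g., "health information"
    transmission_principle: str  # condition or constraint on the flow

def violates_norm(flow: InformationFlow, context_norms: dict) -> bool:
    """A flow violates contextual integrity when any parameter departs from
    the norms expected in that context (simplified check)."""
    return any(context_norms.get(k) not in (None, v) for k, v in vars(flow).items())

flow = InformationFlow(
    sender="user",
    receiver="generative AI model (and its operator)",
    subject="user",
    information_type="health information",
    transmission_principle="used only to answer the current query, not retained for training",
)
norms = {"transmission_principle": "used only to answer the current query, not retained for training"}
print(violates_norm(flow, norms))  # False: the flow matches the stated norm
```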

Through these detailed contextual elements, this study evaluates the privacy disclosure in the interaction between humans and generative AI, focusing on the possible application scenarios of the interaction, and using this as a theoretical basis to identify privacy types.

3.3 Construction of Privacy Type Model

3.3.1 Construction Principles of Privacy Type Model

In the process of building the privacy type model, this chapter is based on multiple dimensions and methods and follows a series of detailed and scientific principles. These principles are not only reasonable in theory but also highly operational and adaptable in practical applications. The following are the main principles followed in this study when building the privacy type model.

Contextual Principle

The contextual principle emphasizes that the way humans interact with generative AI and the types of private information involved may differ significantly across contexts. Therefore, the privacy type model must be built with reference to different interaction scenarios so that it can guide the identification of privacy types. This principle ensures that the privacy type model is suitable for the complex and varied practical applications of generative AI. The concept of context has been widely used in privacy research. Based on contextual integrity theory, this chapter identified common interaction contexts for generative AI through content analysis and a literature review and explored the common types of private information in each context. To comprehensively understand privacy needs in different contexts, this study analyzed a variety of interaction contexts: in an educational context, private information may include student grades; in a medical context, it may involve diagnostic records; and so on. This classification helps the study analyze and identify privacy in each specific context when building the privacy type model.

Developmental Principle

The developmental principle advocates continuous evaluation and optimization. To ensure the dynamic adaptability and continuous improvement of the model, this study adopted an iterative approach during construction. At each iteration, detailed data analysis and evaluation were conducted, with various data compared and analyzed to ensure the accuracy and applicability of the model, and the privacy classification was continuously refined. In addition, user interviews at certain stages allowed the study to adjust and optimize the privacy classification based on users' real feedback. For example, LDA topic modeling was used to verify the rationality of the privacy classification and ultimately determine the privacy types.
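
As an illustration of that verification step, the sketch below fits an LDA topic model (scikit-learn) to a toy corpus of privacy-related user turns and prints the top terms per topic, which could then be compared against the candidate privacy types. The corpus, number of topics, and preprocessing here are assumptions for demonstration, not the chapter's actual configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative corpus: user turns already flagged as privacy-related.
docs = [
    "my blood pressure and weight for a diet plan",
    "our household income and retirement investment budget",
    "write an email to my colleague about the internship application",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Number of topics chosen to be comparable with the candidate privacy types.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}: {top_terms}")
```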

Openness Principle

The openness principle emphasizes openness to user feedback in the process of classifying private information. Studies have shown that users' privacy needs and expectations change over time and across situations, so the model needs a certain degree of flexibility (Wong & Mulligan, Reference Wong and Mulligan2019). This study allowed users to label information according to the initially identified privacy type model; users could not only apply the privacy types provided by the model but also label other types of private information they perceived. In follow-up open interviews, this study further explored users' views on and expectations of the privacy classification and adjusted and optimized the model accordingly. This process ensured that the privacy classification model remained open to various types of data and feedback throughout its construction.

3.3.2 Data Sampling Strategy

Dataset

This study used ShareGPT-related datasets in the process of building the privacy type model. ShareGPT is a Chrome extension that allows users to share their chat records with ChatGPT, a typical representative of generative artificial intelligence. The ShareGPT-CN dataset used in this study is public data collected through the ShareGPT extension. As a large-scale dataset containing 38,558 chat records between users and ChatGPT, it provides an excellent basis for identifying the privacy involved in the interaction between humans and generative AI. Since the original ShareGPT chat records involve multiple languages, and the follow-up of this study involves inviting Chinese generative AI users to annotate data, the ShareGPT-CN dataset selected in this study is a version in which the original chat records have been translated into Chinese to facilitate user participation.

This dataset includes both user questions and ChatGPT answers. Each chat record between a user and ChatGPT may involve one round of conversation or multiple rounds, but each chat record is identified by a unique ID, which facilitates this study to identify a specific conversation set. It is worth mentioning that ShareGPT-related datasets have been widely used for model training and fine-tuning in previous studies, and their availability and quality have been well-proven in academic research (Mu et al., Reference Mu, Zhang, Hu, Wang, Ding, Jin, Wang, Dai, Qiao and Luo2023; Ouyang et al., Reference Ouyang, Wang, Liu, Zhong, Jiao, Iter, Pryzant, Zhu, Ji and Han2023; The Vicuna Team, 2023; H. Zhang et al., Reference Zhang, Ippolito, Lee, Jagielski, Tramer and Carlini2023; Z. Zhang et al., Reference Zhang, Jia, Lee, Yao, Das, Lerner, Wang and Li2024). The conversations in the ShareGPT-CN dataset cover a variety of life scenarios and information needs, providing a rich corpus for the privacy identification task of this study.
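
For concreteness, a loading sketch is given below. The field names ("id", "conversations", "from", "value") and the file name are assumptions based on common ShareGPT-style exports and are not confirmed by this chapter; the point is simply that each conversation is grouped under its unique ID and the user turns can be extracted for analysis.

```python
import json
from collections import defaultdict

# Sketch of loading a ShareGPT-style export and grouping turns by the
# unique conversation ID mentioned in the text (field names are assumed).
with open("sharegpt_cn.json", encoding="utf-8") as f:
    records = json.load(f)

user_turns_by_id = defaultdict(list)
for record in records:
    for turn in record.get("conversations", []):
        if turn.get("from") == "human":  # keep only user inputs
            user_turns_by_id[record["id"]].append(turn["value"])

print(len(user_turns_by_id), "conversations loaded")
```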

Preliminary Judgment of Large Language Model

In the process of building the privacy type model, the data needs to be preliminarily screened to identify conversations containing private information. To achieve this, this study used the Llama-2-7b model to conduct a preliminary analysis of the user–ChatGPT conversations in the ShareGPT-CN dataset. The main purpose of this step is to use the analytical capabilities of a large language model to screen the conversations and find data that may contain private information; this not only lays the foundation for subsequent analysis but also improves the efficiency of private information identification. The Llama 2 series of large language models was released by Meta in 2023 as a set of pretrained and fine-tuned generative text models trained on 2 trillion tokens; Llama-2-7b has 7 billion parameters. The model is based on the Transformer architecture and is designed and trained to understand and generate natural language. The Llama models perform well on multiple natural language processing tasks, including text generation, text classification, and question answering, which makes them a suitable choice for identifying private information.

In this study, the Llama-2-7b model was used to determine whether each user–ChatGPT conversation in the ShareGPT-CN dataset involves privacy. These conversations contain diverse and complex interactive content. By making a preliminary judgment on these data, the Llama-2-7b model can identify texts containing potential private information, laying the foundation for subsequent detailed analysis. Based on the privacy identification instructions and the features of private information learned during training, the model analyzes each record in the ShareGPT-CN dataset and automatically identifies and marks data that may contain private information. According to the model's results, the dataset is divided into two categories: data judged to contain private information and data judged not to contain it. Through this preliminary judgment, the study obtained a conversation dataset containing private information, comprising 3,460 chat records between users and ChatGPT; the remaining records in the ShareGPT-CN dataset were classified as not containing private information.
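
The chapter does not publish its exact screening instructions, so the sketch below shows only one hypothetical way to run such a binary screen with Llama-2-7b through the Hugging Face transformers library; the checkpoint name, prompt wording, and YES/NO convention are all assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # assumed (access-gated) checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def contains_privacy(user_turn: str) -> bool:
    """Ask the model for a YES/NO judgment on one user turn (hypothetical prompt)."""
    prompt = (
        "Decide whether the following user message discloses private or "
        "personally identifiable information. Answer only YES or NO.\n\n"
        f"Message: {user_turn}\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=3, do_sample=False)
    answer = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return "YES" in answer.upper()
```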

Although the Llama-2-7b model showed strong capability in recognizing private information at this preliminary stage, the process still has limitations. First, the model's judgment depends on its training data and algorithm, so it may fail to recognize certain types of private information accurately. Second, automated judgment lacks human review and may produce misjudgments. The preliminary results therefore need to be further verified and corrected in subsequent research; in other words, the Llama-2-7b judgments at this stage are only an initial screening and cannot serve as the final result, and the data marked as possibly containing private information require further human review and confirmation. This study preliminarily treats the 3,460 flagged chat records as privacy-related conversations between users and ChatGPT and analyzes them further in subsequent steps. These preliminary results reduce the workload of subsequent experiments, making processes such as user labeling more targeted and efficient. The following subsections describe in detail how these data were further analyzed to build the privacy type model. The overall research steps are shown in Figure 3.1.

Figure 3.1 Overall research steps. In the data preparation stage, the 38,558 conversation records undergo the LLM's preliminary privacy judgment, yielding 3,460 privacy-related records and 35,098 records not involving privacy. In the preliminary construction stage, 160 privacy-related records are sampled to identify privacy-related contexts and compared with a review of existing privacy classifications to derive privacy type labels for annotation, while 540 randomly selected non-privacy records are added to the annotation dataset. In the optimization stage, users are recruited to annotate the 3,460 privacy-related records together with the annotation dataset, 3,494 privacy-related records are identified, and the model is refined through user interviews and LDA topic modeling.

3.3.3 Privacy-related Contexts Identification

When building a privacy type model, identifying the interaction context between humans and generative AI is a crucial step. This process not only helps to understand more comprehensively the privacy needs of users in different contexts but also provides basic support for the refined design of the privacy type model. Since this study aims to analyze the contexts in which private information is disclosed, it sampled from the records judged by the Llama-2-7b model to contain private information. In each round, 40 chat records were randomly selected from the 3,460 privacy-related records and qualitatively coded. By the fourth batch, no new typical contexts appeared; therefore, a total of 160 chat records were coded, at which point we believe the coding had reached saturation.

Qualitative coding is guided by the contextual integrity theory. When identifying the contexts involving privacy in the interaction between humans and generative AI, it focuses on the sender, subject, and type of information involved in the chat records to better judge the context of the conversation. Through the analysis of chat record data, this study extracted six types of privacy-related situations in the interaction between humans and generative AI. These situations cover the main interactions between users and generative AI in life, study, work, and other fields. The following will quote relevant texts in the data and explain the six types of situations in combination with existing research. In order to minimize the potential harm to the privacy of relevant individuals or groups in the dataset, we deleted all private information in the cited data and replaced it with the specific information type corresponding to the text.

Study and Education Consultation

Study and education consultation is one of the common scenarios in which users use generative AI. Existing studies have shown that users often use generative AI for complex learning or research that traditional search engines cannot help with, including learning new languages (Wolf & Maier, Reference Wolf and Maier2024). In this scenario, the senders of information are mainly students, who may use generative AI to assist them in improving their learning methods, acquiring academic resources, and writing relevant applications.

For example, a user proposed in a conversation with ChatGPT:

I am looking for an internship in *** (work direction) this summer. I hope to rewrite this introduction into a better summary of myself: As a *** (school name) *** (grade) *** (major) student … I am interested in using *** (expertise) to focus on *** (tool name) for *** (skills).

Daily Life Consultation

In daily life consultation scenarios, users interact with generative AI in a wide range of areas, including various matters in daily life, such as cooking suggestions, time management, travel recommendations, etc. (Dwivedi et al., Reference Dwivedi, Kshetri, Hughes, Slade, Jeyaraj, Kar, Baabdullah, Koohang, Raghavan, Ahuja, Albanna, Albashrawi, Al-Busaidi, Balakrishnan, Barlette, Basu, Bose, Brooks, Buhalis and Wright2023).

For example, a user asked in a conversation with ChatGPT:

Today, I need to pick up my child from school at *** (specific time), arrange and make a meal plan, buy groceries, write an article, shoot a YouTube video, mow the lawn, pick up my child from school at *** (specific time), and have dinner with friends at *** (specific time). Make a schedule for me. Show it in a table.

Another user asked:

Can you help me plan the activities to do during a week-long vacation in *** (specific location)? We will land in *** (specific location) at *** (specific time) and depart from *** (specific location) again at *** (specific time). We will bring *** (pets) and rent *** (rental items). We will spend most of the nights in a house in *** (specific location) and want to spend one night in *** (specific location). During this week, we want to spend time with *** (pets) on the beach and visit *** (specific location).

Financial Consumption Consultation

Financial consumption consultation mainly involves the interaction between users and generative AI in financial management and consumption decisions. Users may seek advice or information from generative AI in financial-related scenarios, such as financial management advice, budget planning, and consumption advice (Dwivedi et al., Reference Dwivedi, Kshetri, Hughes, Slade, Jeyaraj, Kar, Baabdullah, Koohang, Raghavan, Ahuja, Albanna, Albashrawi, Al-Busaidi, Balakrishnan, Barlette, Basu, Bose, Brooks, Buhalis and Wright2023).

For example, a user asked in a conversation with ChatGPT:

*** (another person in a relationship) and I are both nearly *** (age). We want to invest for our retirement. We live in *** (specific area). We have about *** (specific amount) of income each year. Is there any good investment guide? How much money do we need to invest and what methods do we choose to have enough money in retirement?

Another user asked:

I have a debt that is due in *** (specific month) for *** (specific amount). If I pay it in advance, I only have to pay *** (specific amount) today. Considering the *** (country name) financial market, which is more worthwhile? Pay in advance to get a discount? Or invest the money so that it generates income before the due date?

Health and Medical Consultation

Health and medical consultation is another important context for users to interact with generative AI. Users may consult generative AI about health issues, disease prevention, treatment plans, etc., in health and medical related contexts.

For example, a user asked in a conversation with ChatGPT:

Make me a muscle-building diet plan with *** (specific number) calories and sufficient protein. I am *** (specific height), weigh *** (specific weight), work out *** (specific time)/day in the gym, and hope to gain weight to *** (specific weight). I hope this diet plan can be applied to pre-prepared meals, which means it can be kept in the refrigerator/freezer for a week.

Another user asked:

My *** (body index) has always been above *** (specific number). Except for *** (food name), I have completely quit *** (food name). I don’t feel *** (symptoms), and my *** (body index) is usually *** (condition). But I suspect that I may not drink enough water every day. My height is *** (specific height), weight is *** (specific weight), and blood pressure is usually *** (blood pressure range). Considering all this, why does my *** (body index) still remain above *** (specific number)?

Social and Emotional Consultation

Users often seek advice or support from generative AI in social and emotional contexts. Studies have shown that users may use ChatGPT to brainstorm when giving gifts to others, or consult ChatGPT when preparing emails or greeting cards for them (Wolf & Maier, Reference Wolf and Maier2024). In such contexts, users interact with generative AI on interpersonal relationships, emotional distress, and other aspects.

For example, a user asked in a conversation with ChatGPT:

What should I buy for my lab colleague as a secret Christmas gift? The gift should be relatively cheap. He was born in *** (city name) and grew up in *** (city name). He received *** (degree) in *** (school name) and is now studying *** (major) *** (degree) in *** (school name). His research involves *** (research direction). He likes *** (music genre), especially *** (album name). He recently lost *** (item name) at the airport on his way back from *** (conference name). He likes *** (specific hobby).

Another user made a request:

Write an email in *** (language) to my *** (relationship with the user) *** (other person’s name). I am *** (own name). Explain to her in a calm and matter-of-fact way that I will no longer contact her. I will only contact her if something very serious happens to *** (relationship with the user).

Work Affairs Consultation

Users may seek advice or information in the context of career development and work task management. For example, a user asked in a conversation with ChatGPT:

My name is *** (my name), currently living in *** (city name) as *** (occupation). My best friend and business partner is *** (other person’s name), a *** (occupation). We both run a *** (group direction) group called *** (group name), working with *** (occupation) *** (other person’s name). Can you help me write a thoughtful and compassionate email to an investor interested in a startup in *** (location)?

Another user requested:

Please help write an opening statement for *** (city name) *** (court name) *** (case number). I represent the plaintiff *** (other person’s name). *** (other person’s name) was injured in *** (specific incident) and is now suing the company where *** (occupation) works.

3.3.4 Preliminary Construction of Privacy Type Model

Existing Privacy Classifications

The classification of privacy in laws, regulations, and relevant literature is an important reference for building the privacy type model. When analyzing privacy issues in the process of human interaction with generative artificial intelligence, understanding and following existing privacy classification policies can not only ensure the legality and compliance of the privacy type model, but also make it better to meet social needs.

Some laws and regulations provide specific interpretations of and provisions on different types of private information. These laws and regulations are usually formulated by governments or institutions within a region to regulate the collection, storage, use, and sharing of data. The EU's General Data Protection Regulation is one of the most stringent data protection laws in the world and sets detailed requirements for the processing of personal data; it defines personal data as "any information related to an identified or identifiable (directly or indirectly) natural person." China's Personal Information Protection Law also contains clear provisions on the processing of personal information, with special emphasis on the protection of sensitive personal information; it defines personal information as "various information related to an identified or identifiable natural person recorded electronically or otherwise." The California Privacy Rights Act of the United States sets high transparency and accountability requirements for the collection and processing of personal data to ensure that users' privacy rights are protected; it describes personal information as information that identifies, relates to, describes, or can reasonably be associated or linked, directly or indirectly, with a particular consumer or household. Other countries' laws characterize private information similarly: Switzerland's Federal Act on Data Protection and Brazil's General Data Protection Law both state that personal information is information related to an identified or identifiable natural person. These laws and regulations, as well as the US Department of Homeland Security's Handbook on the Protection of Sensitive Personally Identifiable Information and the US National Institute of Standards and Technology's Guide to Protecting the Confidentiality of Personally Identifiable Information, enumerate or describe in more detail the types of information involved in privacy. In addition, a large number of academic studies have classified and analyzed privacy types in detail, providing theoretical support for the construction of the privacy type model. Drawing on these laws, regulations, and studies, this study summarizes the existing privacy classifications (see Table 3.1).

Table 3.1 Summary of existing privacy classifications

Federal Act on Data Protection (Switzerland, 2023): sensitive information such as religious belief information, health information, ethnicity information, genetic information, biometric information, legal information, social assistance information, etc.

China's Personal Information Protection Law (2021): sensitive information such as biometrics, religious beliefs, specific identities, medical health, financial accounts, whereabouts, and personal information of minors under the age of fourteen.

California Privacy Rights Act (2020): identity information, biometric information, consumption information, work information, location data, online activity information, education information, personal information related to inferred individual traits, psychological tendencies, preferences, beliefs, and abilities, as well as sensitive information such as health information.

Brazilian General Data Protection Law (2020): sensitive information such as racial information, religious belief information, social identity information, health information, genetic information, etc.

General Data Protection Regulation (2018): personal identifiers, physiological and psychological information, genetic information, economic information, cultural information, social identity information, etc.

Handbook on the Protection of Sensitive Personally Identifiable Information (2017): personal identification number, financial account number, biometrics, citizenship or immigration status, medical information, ethnicity or religious beliefs, personal email, address, account passwords, date of birth, criminal record, mother's maiden name.

Guide to Protecting the Confidentiality of Personally Identifiable Information (2010): name, personal identification number, address information, electronic asset information, telephone number, personal characteristics, information identifying an individual's property, personal information related or linkable to any of the above.

Chua et al. (2021): life behavior information, socioeconomic information, whereabouts information, financial information, authentication information, medical and health information.

Rumbold and Pierscionek (2018): information about human–computer interactions, demographic data, behavior, thoughts and opinions, overt individual characteristics, medical or health care data.

Milne et al. (2017): basic demographic data, personal preferences, contact information, social interactions, financial information, security identifiers.

Robinson (2017): contact information, payment information, life history, financial or medical information, work-related information, online account information.

Finn et al. (2013): privacy of personal identity; privacy of behavior and actions; privacy of personal communications; privacy of data and images; privacy of thoughts and feelings; privacy of location and space; privacy of interactions, including group privacy.

Leon et al. (2013): browsing information, computer information, demographic information, location information, personally identifiable information.

Smith et al. (2011): employment information, identity information, consumer information, medical information, financial information, behavioral information.

Phelps et al. (2000): demographic characteristics, lifestyle characteristics (including media habits), shopping or purchasing habits, financial data, personal identifiers.

It can be seen that privacy classifications in different fields and contexts have their own focus and protection needs. When constructing a privacy type model, these classification perspectives and protection requirements must be considered together to ensure the comprehensiveness and applicability of the model. On this basis, this study merged the same or similar privacy types that appear across the laws, regulations, and literature in the existing classifications. Privacy types that appeared in three or more sources were treated as generally recognized, and the types were then organized according to the information content covered by the merged items. In the end, this study identified nine privacy types: online activity information, education information, health information, preference opinion information, location information, financial consumption information, personal identity information, social relationship information, and work information.

Comparison of Existing Classifications and Identified Contexts

The construction of the privacy type model needs to refer to the existing privacy classification and combine it with specific interaction scenarios to ensure that the model can cover all common types of privacy information. This study has identified nine privacy types through legal regulations and literature analysis. Next, these privacy types will be compared with the identified actual interaction scenarios to determine whether the identified privacy types cover the common privacy information in these scenarios.

In terms of learning and education consultation, users use generative artificial intelligence to seek advice or information in learning and education-related scenarios, which may involve privacy information such as the user’s or related individuals’ educational background (such as school name, major), learning needs (such as specific learning goals, academic interests), and academic performance (such as test scores, course grades). The privacy protection requirements for this scenario are relatively high, because educational information is usually highly personal and directly related to an individual’s learning performance and potential future career development. Among the nine identified privacy types, educational information and personal identity information can better cover the privacy information involved in this scenario.

In terms of daily life consultation, this scenario may involve individual living habits (such as daily schedules), location information (such as the individual's geographical location and frequently visited places), and personal preferences (such as dietary or travel preferences). If such information is leaked, the user's or a related individual's daily life may be disturbed, and security risks may even arise. Among the nine identified privacy types, location information and preference opinion information can better cover the privacy information involved in this scenario.

In terms of financial consumption consultation, this scenario may involve the user’s financial status (such as bank accounts, investment status), consumption records (such as shopping lists, transaction records), financial planning (such as monthly budgets, long-term financial goals) information, etc. The information related to financial consumption scenarios is extremely sensitive, and any information leakage may lead to economic losses and security risks. Therefore, privacy protection in this scenario is crucial. Among the nine identified privacy types, financial consumption information and online activity information can better cover the privacy information involved in this scenario.

In terms of health and medical consultation, this scenario may involve the user’s health status (such as medical history, diagnosis results), medical records (visit records, drug use), health habits (such as exercise plans), etc. The privacy protection requirements for health information are very strict, because the disclosure of such information may have a serious impact on the privacy and mental health of users, and also involve legal and ethical issues. Among the nine identified privacy types, health information and personal identity information can better cover the privacy information involved in this context.

In terms of social-emotional counseling, this type of context may involve the user's social relationships (such as friends and family), emotional state (such as mental health and emotional problems), and private interactions (such as private communication with others). The privacy information in this context is usually sensitive and often involves the personal identity information of other individuals with whom the user has a social relationship; its disclosure may affect the social and emotional life of the user or the people around them. Among the nine identified privacy types, social relationship information and preference opinion information can better cover the privacy information involved in this context.

In terms of work affairs consultation, this type of scenario may involve the user's career information (such as work experience and career planning), project details (such as project progress and task allocation), and company information (such as company strategy and team structure). Protecting work-related information matters for both the user's career development and the company's interests, and its leakage may create occupational risks and expose commercial secrets. Among the nine identified privacy types, work information can better encompass the privacy information involved in this scenario.

It can be seen that the nine privacy types determined based on the existing privacy classification basically cover the privacy information involved in different interaction scenarios. Therefore, this study takes these nine privacy types as the core, preliminarily constructs a privacy type model, and identifies the common types of privacy information in each scenario. In the next step, the privacy type model will be optimized and verified through user labeling experiments to ensure its accuracy and effectiveness in practical applications.

3.3.5 Optimization of Privacy Type Model

User Annotation Experiment

Based on the privacy judgment results of the Llama-2-7b model on the data set, this study conducted a user labeling experiment, allowing users to browse specific conversation data and label the privacy types contained in it according to the initially constructed privacy type model. The user labeling experiment is a method of identifying private information through manual review and subjective judgment, which can make up for the shortcomings of large models in the identification of complex private information. It is a further confirmation and refinement of the preliminary judgment results of the large model, and also a verification and optimization of the initially constructed privacy type model. Through the labeling and classification of data by actual users, this study can more accurately identify and classify private information. The work at this stage will combine the subjective judgment of users and the automated analysis of large language models to further improve the accuracy and practicality of the privacy type model.

Specifically, in terms of the selection of labeled data, this study selected all data that the model judged to contain private information from the judgment results of the Llama-2-7b model, involving the chat records of 3,460 users with ChatGPT. At the same time, this study randomly selected 540 pieces of data from the remaining data that the model judged not to contain private information and integrated them to form 4,000 pieces of data. Subsequently, this study used the Easydata data annotation service platform provided by Baidu to recruit users to annotate these data, asking users to mark the text in each piece of data that they believed to be private information and to classify the marked text into ten categories using labels. These ten categories include the nine privacy categories in the privacy type model initially constructed in this study, plus a tenth label, other privacy-related information, added so that information users consider private but that does not fall into the nine categories can still be captured.
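A minimal sketch of this data selection and ten-label scheme is shown below. The record fields (for example, the `model_says_private` flag) and the pool-building function are illustrative assumptions, not the study's actual data format.

```python
import random

# The ten annotation labels: the nine initial privacy types plus a catch-all category.
LABELS = [
    "online activity information",
    "education information",
    "health information",
    "preference opinion information",
    "location information",
    "financial consumption information",
    "personal identity information",
    "social relationship information",
    "work information",
    "other privacy-related information",
]

def build_annotation_pool(records, n_negative=540, seed=42):
    """Combine all records the model flagged as containing privacy with a random
    sample of records it flagged as non-private (4,000 items in total here)."""
    positives = [r for r in records if r["model_says_private"]]      # 3,460 in this study
    negatives = [r for r in records if not r["model_says_private"]]
    random.seed(seed)
    pool = positives + random.sample(negatives, n_negative)
    random.shuffle(pool)
    return pool
```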

This study recruited twelve college students with experience in using generative artificial intelligence. These users came from different majors and grades, covering bachelor's, master's, and doctoral students, and thus brought diverse professional backgrounds and perspectives. They use generative artificial intelligence almost every day, indicating that they have relatively rich experience with it and are a group likely to disclose privacy when interacting with it. Before the annotation experiment officially began, the participating users received systematic training: this study introduced the annotation platform, explained the specific annotation process, and provided detailed privacy type classification instructions and examples, so that participants understood the definition of privacy and the meaning of the ten labels and applied a consistent judgment throughout the annotation process.

Subsequently, this study divided the twelve users into three groups of four and assigned the same 4,000 items to each group. Within each group, every user was responsible for annotating 1,000 items, so each item was independently annotated by three users, one from each group. This allocation enabled cross-validation of the annotation results across multiple annotators and improved their accuracy and reliability. During the annotation process, users reviewed and annotated the assigned data one by one. The annotation work involved two steps: first, determining whether a piece of data contained privacy information; second, if it did, annotating the specific type of privacy according to the preset labels.

Annotation Result Analysis

After completing the user labeling experiment, this study analyzed the labeling results. Combining the user labels with the judgment results of the large model provides a basis for optimizing the privacy type model. First, this study collated the labels of the three users who annotated the same data, recording whether each item contained private information and which privacy categories it involved. These labels were then compared with the judgment of the large model and their consistency was calculated. If three or more of the four results (the large model's judgment and the three users' labels) indicated that an item involved private information, it was determined to contain privacy; 3,494 items met this criterion. Subsequently, for the privacy-containing data, the specific privacy categories involved and their distribution were counted to understand the proportion and frequency of different categories in the data set. For example, work information and personal identity information appeared most frequently, while health information and financial consumption information were relatively rare. This study also analyzed the consistency of the users' labels across categories. Consistency was high for work information and health information, but low for preference opinion information and online activity information, indicating that these categories of privacy information are more subjective and complex.
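The three-of-four agreement rule and the category counting can be sketched as follows; the field names (`model_says_private`, `annotations`, `is_private`, `categories`) are assumed for illustration and do not reflect the study's actual data structures.

```python
from collections import Counter

def aggregate(item):
    """Apply the 3-of-4 rule: an item counts as containing privacy if at least
    three of the four judgments (the model's plus three annotators') say so."""
    votes = [item["model_says_private"]] + [a["is_private"] for a in item["annotations"]]
    contains_privacy = sum(votes) >= 3
    # Collect the privacy categories assigned by annotators who marked it private.
    categories = [c for a in item["annotations"] if a["is_private"] for c in a["categories"]]
    return contains_privacy, categories

def summarize(items):
    """Count the number of privacy-containing items and how often each
    privacy category appears among them (each category counted once per item)."""
    distribution = Counter()
    n_private = 0
    for item in items:
        contains_privacy, categories = aggregate(item)
        if contains_privacy:
            n_private += 1
            distribution.update(set(categories))
    return n_private, distribution
```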

After completing the systematic analysis of the annotation results, this study entered the verification and adjustment stage of the privacy type model. The main goal of this stage is to optimize and refine the privacy categories based on user feedback and LDA topic modeling analysis, so as to build a more accurate, comprehensive, and practical privacy type model. Based on the analysis of the annotation data, this study first conducted user interviews to gain an in-depth understanding of the users' specific experiences, the problems they encountered, and their suggestions during the annotation process. The purpose of this step is to further improve and optimize the privacy type model through users' subjective feedback.

Through the integration and analysis of user feedback, this study identified the following aspects that needed adjustment. First, some users suggested rewording the names and definitions of certain privacy categories. For example, the word "opinion" in the category of preference opinion information does not adequately cover future plans, personal beliefs, and similar content; likewise, the personal belongings mentioned in some data are closely related to financial consumption information but are difficult to subsume under that category name. Second, users pointed out privacy information that was not covered by the nine labels and had to be classified as other privacy-related information, such as descriptions of the human or natural environment around them. Based on this feedback, this study adjusted the nine privacy categories in the privacy type model to work information, location and environment information, health information, preference and thought information, online activity information, social relationship information, personal identity information, property and consumption information, and education information.

In order to further verify and optimize the privacy classification model, this study introduced Latent Dirichlet Allocation (LDA) topic modeling. LDA is a commonly used text analysis technique that can effectively classify and summarize large-scale text data by identifying latent topics in text. First, this study selected the 3,494 items finally determined to contain privacy information and performed data preprocessing, including text cleaning and stop word removal. Subsequently, this study trained LDA topic models on the preprocessed data to determine the optimal number of topics.

Through coherence analysis, this study determined that the optimal number of topics was nine, as shown in Figure 3.2. Finally, this study interpreted the topics output by the LDA model to identify the main content of each topic. The main words in the nine topic categories output by the LDA model are shown in Table 3.2, and Figure 3.3 shows an example visualization of high-frequency words in Topic 1.
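A minimal sketch of this topic-number selection step using the gensim library is shown below. It assumes the texts have already been tokenized and cleaned; the candidate range, number of passes, and the c_v coherence measure are illustrative choices rather than the study's reported settings.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

def select_topic_number(texts, candidate_range=range(5, 16), passes=10, seed=42):
    """Train an LDA model for each candidate topic number and return coherence scores.
    `texts` is a list of tokenized, cleaned documents (stop words already removed)."""
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    scores = {}
    for k in candidate_range:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       passes=passes, random_state=seed)
        coherence = CoherenceModel(model=lda, texts=texts,
                                   dictionary=dictionary, coherence="c_v")
        scores[k] = coherence.get_coherence()
    return scores

# The topic number with the highest coherence score (nine in this study) is selected,
# after which the top words of each topic can be inspected, e.g. lda.show_topic(0, topn=10).
```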

Figure 3.2 Relationship between coherence score and number of topics modeled by LDA. The coherence score fluctuates between roughly 0.40 and 0.46 as the number of topics varies from 5 to 15, peaking at approximately 0.459 when the number of topics is nine.

Table 3.2 LDA topic modeling results

Topic | High-frequency words | Corresponding privacy type
Topic 1 | Website, email, campaign, page, test, tip, platform, marketing, tool, database | Online activity information
Topic 2 | Students, courses, learning, research, conferences, attending, professional, community, education, teachers | Education information
Topic 3 | Exercise, doctor, health, pain, muscle, suffer, treatment, care, calories, medicine | Health information
Topic 4 | Idea, video, ad, like, model, campaign, title, feature, type, language | Preference and thought information
Topic 5 | Travel, miles, attractions, solar, path, meals, immigration, company, history, menu | Location and environment information
Topic 6 | Pay, assess, fees, services, items, legal, categories, check, style, unique | Property and consumption information
Topic 7 | Name, industry, scene, weekly, birthday, weight, children, establishment, Muslim, email | Personal identity information
Topic 8 | Media, social, client, account, form, Instagram, friends, post, script, designer | Social relationship information
Topic 9 | Work, company, team, project, product, position, sales, manager, management, profession | Work information
Figure 3.3 Visualization results of LDA model for privacy text. Panel (a) shows the intertopic distance map (via multidimensional scaling) of the nine topics on the first two principal components, together with the marginal topic distribution. Panel (b) shows the thirty most relevant terms for Topic 1, comparing overall term frequency with estimated term frequency within the topic; "work" has the highest frequency (about 800) and "interest" the lowest (about 70), with the relevance metric set to 0.8.

It can be seen that the results of the LDA topic modeling analysis are consistent with the privacy types determined in this study, both in terms of quantity and the main content involved in each topic. Through this series of verification steps, this study finally determined the optimized privacy type model, as shown in Figure 3.4, which provides a solid foundation for privacy protection in the interaction between humans and generative artificial intelligence.

Figure 3.4 Privacy type model. The model comprises nine privacy types: social relationship information, preference and thought information, location and environment information, property and consumption information, online activity information, health information, personal identity information, education information, and work information. Social and emotional consultation covers social relationship information and preference and thought information; daily life consultation covers preference and thought information and location and environment information; financial consumption consultation covers online activity information and property and consumption information; health and medical consultation covers health information and personal identity information; study and education consultation covers education information and personal identity information; and work affairs consultation covers work information.

3.4 Application Value of Privacy Type Model

The application value of the privacy type model in the interaction between humans and generative AI is reflected in many aspects, especially in the fields of knowledge sharing, search, misinformation processing, health information management, and scientific discovery. With the widespread application of generative AI in these fields, privacy protection issues have become increasingly important. The privacy type model can not only help developers and users better understand and deal with privacy risks but also improve the credibility and user satisfaction of generative AI. This section will discuss in detail the potential application value of the privacy type model in the discussed fields.

3.4.1 Application Value in the Field of Knowledge Sharing

On knowledge sharing platforms, generative AI is often used for personalized knowledge recommendation, providing customized knowledge content by analyzing user behavior data and interest preferences. Knowledge sharing platforms bring together a large amount of user-generated content and interaction records, which are very important for providing personalized services and improving platform functions. However, these data also contain users’ personal information and learning habits, which may lead to privacy leakage if not handled properly. The privacy type model can help platforms identify and classify these data, take different protection measures according to their sensitivity, and ensure user privacy security. For example, the model can distinguish between general behavior data and data containing personal identity information and take more stringent protection measures for the latter.
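A minimal sketch of this kind of tiered routing is shown below. The regular-expression patterns, tier names, and protection measures in the comments are illustrative assumptions; a deployed platform would rely on a trained classifier or the full privacy type model rather than simple keyword rules.

```python
import re

# Illustrative patterns for two of the privacy types; not the study's detection method.
PATTERNS = {
    "personal identity information": re.compile(r"\b(my name is|passport|id number)\b", re.I),
    "health information": re.compile(r"\b(diagnos|symptom|medication|therapy)\w*\b", re.I),
}

def protection_tier(text):
    """Return a protection tier based on the privacy types detected in the text."""
    detected = {ptype for ptype, pattern in PATTERNS.items() if pattern.search(text)}
    if "personal identity information" in detected or "health information" in detected:
        return "strict"      # e.g. encryption at rest, restricted access, excluded from training
    if detected:
        return "standard"    # e.g. pseudonymization before analytics
    return "general"         # ordinary behavior data

print(protection_tier("My name is Alice and I need study tips"))  # -> "strict"
```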

In the intelligent question-answering system, users may ask questions involving personal privacy, such as career planning and health issues. The privacy type model can monitor and identify this sensitive information in real-time, remind users in time, and take protective measures to avoid privacy leakage. At the same time, the platform can formulate corresponding privacy policies and user education content based on the privacy type model to enhance users’ privacy protection awareness.

In addition, AI-supported crowdsourcing platforms promote knowledge sharing by gathering the wisdom of the masses. However, when participants contribute knowledge, they often need to provide personal information, which may lead to privacy leakage. The privacy type model can help identify and classify different types of privacy data to ensure that the privacy of participants is effectively protected during the knowledge sharing process. For example, in a crowdsourcing platform, the privacy type model can help automatically identify sensitive information, such as personal identity information and preference information, and take corresponding protective measures. At the same time, motivation plays an important role in crowdsourcing platforms, driving participants to actively contribute knowledge. The privacy type model can help understand and manage the privacy motivations of participants. For example, some participants may want to remain anonymous when sharing knowledge, while others may be willing to disclose their identities to obtain more reputation rewards. By applying the privacy type model, crowdsourcing platforms can provide personalized privacy protection measures based on the privacy preferences of different participants, thereby enhancing the enthusiasm and participation of participants.

3.4.2 Application Value in the Field of Information Search

The application of generative AI in search engines can provide personalized search results based on the user’s search history and behavior data. These data usually contain the user’s search preferences and interests, which may lead to privacy leakage if not handled properly.

Specifically, search engines collect a large amount of user search behavior data, which is very important for optimizing search algorithms and improving search quality. However, these data also contain a large amount of personal privacy information. The privacy type model can help search engines identify and classify these data, and take different protection measures according to their sensitivity to ensure user privacy security. For example, the model can help distinguish between general search data and data containing personal preferences. For search behavior data involving personal preferences, stricter data encryption and access control measures can be adopted.

The query terms entered by users on search engines may sometimes contain sensitive information, such as medical conditions, financial information, etc. The privacy type model can help identify these sensitive query terms and take corresponding privacy protection measures in the search results. For example, for query terms involving medical conditions, anonymized search results can be provided to avoid exposing users’ privacy information.
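How such query screening might work is sketched below; the sensitive-term list and the placeholder string are assumptions for illustration, and in practice the privacy type model's health and financial categories would drive the detection.

```python
import re

# A small, illustrative list of sensitive terms (not an exhaustive or official list).
SENSITIVE_TERMS = ["diabetes", "depression", "hiv", "bankruptcy", "credit card debt"]
SENSITIVE_RE = re.compile("|".join(re.escape(t) for t in SENSITIVE_TERMS), re.I)

def sanitize_query(query: str) -> tuple[str, bool]:
    """Replace sensitive terms with a placeholder and report whether any were found,
    so logging and personalization can fall back to anonymized handling."""
    sanitized, n_hits = SENSITIVE_RE.subn("[sensitive term]", query)
    return sanitized, n_hits > 0

safe_query, is_sensitive = sanitize_query("best treatment for depression near me")
# is_sensitive -> True; safe_query -> "best treatment for [sensitive term] near me"
```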

3.4.3 Application Value in the Field of Misinformation Processing

The application of generative AI in misinformation detection can help identify and filter false information. However, during the detection process, it may be necessary to analyze a large amount of user data, including user-published content and interaction records. These data usually contain users’ personal information and social relationships, which may lead to privacy leakage if not handled properly. Privacy type models can help identify and classify these data to ensure that sensitive information is properly protected. For example, the model can help distinguish between general published content and content containing personally identifiable information and take stricter protection measures for the latter.

When tracing the chain of false information propagation, it may be necessary to analyze interaction records and propagation paths involving multiple users. These data usually contain users’ social relationships and interactive behaviors. Privacy type models can help identify and classify these data, and take different protection measures according to their sensitivity to ensure user privacy security. For example, for data involving social relationships, stricter data encryption and access control measures can be adopted.

In the process of correcting misinformation, generative AI may need to interact with users to collect user feedback and opinions. These interactive data usually contain users’ personal information and opinions, which may lead to privacy leakage if not handled properly. Privacy type models can help distinguish between general feedback data and data containing personally identifiable information and take stricter protection measures for the latter.

In addition, on social media, AI can identify and handle false information through sentiment analysis. However, sentiment analysis usually requires the collection and analysis of a large amount of user data, which brings privacy risks. The privacy type model can help AI systems protect user privacy when performing sentiment analysis. For example, it can identify users' sensitive emotional expressions and anonymize these expressions to prevent privacy leaks.

3.4.4 Application Value in the Field of Health Information Management

The application of generative AI in the field of health information includes online health consultation, disease prediction, etc. These applications involve a large amount of patient privacy data, such as medical history, diagnosis results, and treatment plans. Privacy type models can help identify and classify this data.

Specifically, in telemedicine services, doctors and patients may communicate in real-time through generative AI to provide medical advice and consulting services. In this case, patients may inadvertently disclose private information during the conversation. Privacy type models can analyze the content of the conversation in real-time, identify potential privacy risks, and warn or take protective measures when necessary to protect the privacy of patients.

When providing nursing services, caregivers usually need to obtain patients’ health information. Generative AI can help optimize nursing services by analyzing the information behavior of caregivers. However, this also means that the operation behavior of caregivers and the health data of patients may be collected and analyzed, bringing privacy risks. Privacy type models can help identify and manage privacy data in the information behavior of caregivers, ensuring that user privacy is effectively protected during the information behavior analysis process.

In addition, medical institutions and research institutions usually collect a large amount of health data for scientific research and public health monitoring. However, these data also contain a large amount of personal privacy information, which may lead to privacy leakage if not handled properly. The privacy type model can help identify and classify this data, and take different protection measures according to its sensitivity to ensure patient privacy. For example, for health data involving personally identifiable information, stricter data encryption and access control measures can be adopted.

3.4.5 Application Value in the Field of Scientific Discovery

The application of generative AI in scientific research includes data analysis, model prediction, and so on. These applications usually involve a large amount of research data and experimental records, which may contain personal privacy information of research participants. Specifically, scientific research usually requires sharing research data to promote cooperation and verification of results. However, these shared data also contain a large amount of personal privacy information, which may lead to privacy leakage if not handled properly. Privacy type models can help identify and classify these data and take different protection measures according to their sensitivity. For example, for shared data involving personal identity information, data anonymization and strict access control measures can be adopted. When disseminating scientific research results, researchers may need to display and discuss data and results involving personal privacy information. Privacy type models can help identify and classify these data to ensure that sensitive information is properly protected. For example, the model can help researchers shield or anonymize personal privacy information during the presentation process to ensure the privacy security of research participants.

In addition, when conducting scientific research collaboration, researchers usually need to share and communicate scientific research data. AI can help optimize the scientific research collaboration process by analyzing scientific research collaboration behavior. However, this also means that the collaborative behavior and scientific research data of researchers may be collected and analyzed, bringing privacy risks. Privacy type models can help identify and manage privacy data in scientific research collaboration and ensure that privacy is effectively protected during the collaboration process.

In addition to its application value in specific fields, privacy type models also have general application directions, including personalized privacy protection and cross-domain privacy protection. With the development of AI technology, personalized privacy protection has become possible. Privacy type models can help understand and manage users’ privacy preferences and provide personalized privacy protection measures. For example, privacy type models can dynamically adjust the encryption level and anonymization degree of data according to users’ privacy preferences to provide personalized privacy protection services. With the widespread application of AI in different fields, cross-domain privacy protection has become increasingly important. Privacy type models can help identify and manage data privacy in different fields and provide a unified privacy protection framework. For example, privacy type models can formulate general privacy protection strategies and technical means based on the characteristics of data in different fields to ensure that data is effectively protected during cross-domain applications.

In summary, the application value of privacy type models in the interaction between humans and generative AI is reflected in many aspects. By applying privacy type models in fields such as knowledge sharing, information search, misinformation processing, health information management, and scientific discovery, different types of data can be effectively identified and classified. As generative AI technology continues to develop, privacy type models can be applied and promoted in more fields, providing users with safer and more reliable services.

3.5 Summary

This chapter explores the privacy issues in the interaction between humans and generative AI and proposes a systematic privacy type model, which aims to provide guidance for developers and users to ensure user privacy security. By analyzing the sources and operating mechanisms of privacy leakage in detail, the key issues in privacy protection are identified, and the existing privacy protection methods and their shortcomings are introduced. On this basis, the establishment of the privacy type model provides a theoretical basis and practical guidance for the privacy protection of generative AI in different application fields.

The application value of the privacy type model is reflected in many aspects, including knowledge sharing, information search, misinformation processing, health information management and scientific discovery. In these fields, the privacy type model can help identify and classify different types of data and provide reference for taking corresponding privacy protection measures, thereby reducing the risk of privacy leakage and improving the credibility and user satisfaction of generative AI. The privacy type model not only provides a systematic privacy protection framework for developers and users, but also lays a solid foundation for the healthy development of generative AI.

However, there are still many challenges in privacy protection. First of all, privacy protection needs to balance the relationship between data utilization and user privacy. Finding a balance between providing personalized services and protecting user privacy is a key issue in privacy protection. Secondly, with the development of technology, privacy protection technology also needs to be continuously updated and improved to cope with new privacy threats. Finally, privacy protection requires the joint efforts of all sectors of society, including the participation of policymakers, technology developers, users, etc., to jointly promote the development of privacy protection.

Future research can further explore the specific implementation methods and effect evaluation of the privacy type model in different application scenarios, explore the combination of the privacy type model with other privacy protection technologies, and improve the overall level of privacy protection. At the same time, it is also necessary to strengthen the cultivation of user privacy protection awareness, guide users to actively participate in privacy protection, and jointly build a safe and reliable generative AI application environment. In short, the proposal and application of the privacy type model further clarify the key information types involved in privacy protection, provide a reference for privacy protection in the interaction between humans and generative artificial intelligence, and provide strong support for the healthy development of generative AI.

References

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016). Deep Learning with Differential Privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 308–318. https://doi.org/10.1145/2976749.2978318
Apthorpe, N., Varghese, S., & Feamster, N. (2019). Evaluating the Contextual Integrity of Privacy Regulation: Parents' IoT Toy Privacy Norms Versus COPPA. 28th USENIX Security Symposium, 123–140. www.usenix.org/conference/usenixsecurity19/presentation/apthorpe
Baidoo-anu, D., & Ansah, L. O. (2023). Education in the Era of Generative Artificial Intelligence (AI): Understanding the Potential Benefits of ChatGPT in Promoting Teaching and Learning. Journal of AI, 7(1), 52–62. https://doi.org/10.61969/jai.1337500
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33, 1877–1901. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
Carlini, N., Tramèr, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, Ú., Oprea, A., & Raffel, C. (2021). Extracting Training Data from Large Language Models. 30th USENIX Security Symposium, 2633–2650. www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting
Cascella, M., Montomoli, J., Bellini, V., & Bignami, E. (2023). Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios. Journal of Medical Systems, 47(1), 33. https://doi.org/10.1007/s10916-023-01925-4
Chang, D., Krupka, E. L., Adar, E., & Acquisti, A. (2016). Engineering Information Disclosure: Norm Shaping Designs. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 587–597. https://doi.org/10.1145/2858036.2858346
Chen, L., Sun, L., & Han, J. (2023). A Comparison Study of Human and Machine-Generated Creativity. Journal of Computing and Information Science in Engineering, 23(051012). https://doi.org/10.1115/1.4062232
Choi, J. H., Hickman, K. E., Monahan, A. B., & Schwarcz, D. (2021). ChatGPT Goes to Law School. Journal of Legal Education, 71, 387.
Chua, H. N., Ooi, J. S., & Herbland, A. (2021). The Effects of Different Personal Data Categories on Information Privacy Concern and Disclosure. Computers & Security, 110, 102453. https://doi.org/10.1016/j.cose.2021.102453
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A Large-scale Hierarchical Image Database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. https://doi.org/10.1109/CVPR.2009.5206848
Dockhorn, T., Cao, T., Vahdat, A., & Kreis, K. (2023). Differentially Private Diffusion Models [arXiv preprint]. arXiv:2210.09929. https://doi.org/10.48550/arXiv.2210.09929
Dwivedi, Y. K., Kshetri, N., Hughes, L., Slade, E. L., Jeyaraj, A., Kar, A. K., Baabdullah, A. M., Koohang, A., Raghavan, V., Ahuja, M., Albanna, H., Albashrawi, M. A., Al-Busaidi, A. S., Balakrishnan, J., Barlette, Y., Basu, S., Bose, I., Brooks, L., Buhalis, D., … Wright, R. (2023). Opinion Paper: "So What if ChatGPT Wrote It?" Multidisciplinary Perspectives on Opportunities, Challenges and Implications of Generative Conversational AI for Research, Practice and Policy. International Journal of Information Management, 71, 102642. https://doi.org/10.1016/j.ijinfomgt.2023.102642
Dwork, C., & Roth, A. (2014). The Algorithmic Foundations of Differential Privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4), 211–407. https://doi.org/10.1561/0400000042
Estrada, S. (2023, March 1). A Startup CFO used ChatGPT to Build an FP&A tool: Here's How It Went. Fortune. https://fortune.com/2023/03/01/startup-cfo-chatgpt-finance-tool/
Finn, R. L., Wright, D., & Friedewald, M. (2013). Seven Types of Privacy. In Gutwirth, S., Leenes, R., de Hert, P., & Poullet, Y. (eds.), European Data Protection: Coming of Age (pp. 3–32). Springer Netherlands. https://doi.org/10.1007/978-94-007-5170-5_1
Fire, M., Goldschmidt, R., & Elovici, Y. (2014). Online Social Networks: Threats and Solutions. IEEE Communications Surveys & Tutorials, 16(4), 2019–2036. https://doi.org/10.1109/COMST.2014.2321628
Golda, A., Mekonen, K., Pandey, A., Singh, A., Hassija, V., Chamola, V., & Sikdar, B. (2024). Privacy and Security Concerns in Generative AI: A Comprehensive Survey. IEEE Access, 12, 48126–48144. https://doi.org/10.1109/ACCESS.2024.3381611
Jancke, G., Venolia, G. D., Grudin, J., Cadiz, J. J., & Gupta, A. (2001). Linking Public Spaces: Technical and Social Issues. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 530–537. https://doi.org/10.1145/365024.365352
Kelley, P. G., Cesca, L., Bresee, J., & Cranor, L. F. (2010). Standardizing Privacy Notices: An Online Study of the Nutrition Label Approach. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1573–1582. https://doi.org/10.1145/1753326.1753561
Kimmel, D. (2023, May 16). ChatGPT Therapy Is Good, But It Misses What Makes Us Human. Columbia University Department of Psychiatry. www.columbiapsychiatry.org/news/chatgpt-therapy-is-good-but-it-misses-what-makes-us-human
Kshetri, N. (2023). Cybercrime and Privacy Threats of Large Language Models. IT Professional, 25(3), 9–13. https://doi.org/10.1109/MITP.2023.3275489
Leon, P. G., Ur, B., Wang, Y., Sleeper, M., Balebako, R., Shay, R., Bauer, L., Christodorescu, M., & Cranor, L. F. (2013). What Matters to Users? Factors That Affect Users' Willingness to Share Information with Online Advertisers. Proceedings of the Ninth Symposium on Usable Privacy and Security, 1–12. https://doi.org/10.1145/2501604.2501611
Leonard, A. (2023, September 16). "Dr. Google" Meets Its Match in Dr. ChatGPT. NPR. www.npr.org/sections/health-shots/2023/09/16/1199924303/chatgpt-ai-medical-advice
Li, J., Cao, H., Lin, L., Hou, Y., Zhu, R., & El Ali, A. (2024). User Experience Design Professionals' Perceptions of Generative Artificial Intelligence. Proceedings of the CHI Conference on Human Factors in Computing Systems, 1–18. https://doi.org/10.1145/3613904.3642114
Liu, J., Lau, C. P., & Chellappa, R. (2023). DiffProtect: Generate Adversarial Examples with Diffusion Models for Facial Privacy Protection [arXiv preprint]. arXiv:2305.13625. https://doi.org/10.48550/arXiv.2305.13625
Lu, L., McDonald, C., Kelleher, T., Lee, S., Chung, Y. J., Mueller, S., Vielledent, M., & Yue, C. A. (2022). Measuring Consumer-perceived Humanness of Online Organizational Agents. Computers in Human Behavior, 128, 107092. https://doi.org/10.1016/j.chb.2021.107092
Maddison, L. (2023, April 4). Samsung Workers Made a Major Error by Using ChatGPT. TechRadar. www.techradar.com/news/samsung-workers-leaked-company-secrets-by-using-chatgpt
Milne, G. R., Pettinico, G., Hajjat, F. M., & Markos, E. (2017). Information Sensitivity Typology: Mapping the Degree and Type of Risk Consumers Perceive in Personal Data Sharing. Journal of Consumer Affairs, 51(1), 133–161. https://doi.org/10.1111/joca.12111
Mu, Y., Zhang, Q., Hu, M., Wang, W., Ding, M., Jin, J., Wang, B., Dai, J., Qiao, Y., & Luo, P. (2023). EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought. Advances in Neural Information Processing Systems, 36, 25081–25094. https://proceedings.neurips.cc/paper_files/paper/2023/hash/4ec43957eda1126ad4887995d05fae3b-Abstract-Conference.html
Narayanan, A., & Shmatikov, V. (2008). Robust De-anonymization of Large Sparse Datasets. 2008 IEEE Symposium on Security and Privacy (SP 2008), 111–125. https://doi.org/10.1109/SP.2008.33
Nissenbaum, H. (2004). Privacy as Contextual Integrity. Washington Law Review, 79(1), 119–158.
Ouyang, S., Wang, S., Liu, Y., Zhong, M., Jiao, Y., Iter, D., Pryzant, R., Zhu, C., Ji, H., & Han, J. (2023). The Shifted and The Overlooked: A Task-oriented Investigation of User-GPT Interactions [arXiv preprint]. arXiv:2310.12418. https://doi.org/10.48550/arXiv.2310.12418
Peris, C., Dupuy, C., Majmudar, J., Parikh, R., Smaili, S., Zemel, R., & Gupta, R. (2023). Privacy in the Time of Language Models. Proceedings of the 16th ACM International Conference on Web Search and Data Mining, 1291–1292. https://doi.org/10.1145/3539597.3575792
Phelps, J., Nowak, G., & Ferrell, E. (2000). Privacy Concerns and Consumer Willingness to Provide Personal Information. Journal of Public Policy & Marketing, 19(1), 27–41. https://doi.org/10.1509/jppm.19.1.27.16941
Porter, J. (2023, November 6). ChatGPT Continues to Be One of the Fastest-Growing Services Ever. The Verge. www.theverge.com/2023/11/6/23948386/chatgpt-active-user-count-openai-developer-conference
Robinson, C. (2017). Disclosure of Personal Data in Ecommerce: A Cross-national Comparison of Estonia and the United States. Telematics and Informatics, 34(2), 569–582. https://doi.org/10.1016/j.tele.2016.09.006
Rumbold, J. M. M., & Pierscionek, B. K. (2018). What Are Data? A Categorization of the Data Sensitivity Spectrum. Big Data Research, 12, 49–59. https://doi.org/10.1016/j.bdr.2017.11.001
Shalaby, W., Arantes, A., GonzalezDiaz, T., & Gupta, C. (2020). Building Chatbots from Large Scale Domain-specific Knowledge Bases: Challenges and Opportunities. 2020 IEEE International Conference on Prognostics and Health Management (ICPHM), 1–8. https://doi.org/10.1109/ICPHM49022.2020.9187036
Shvartzshnaider, Y., Tong, S., Wies, T., Kift, P., Nissenbaum, H., Subramanian, L., & Mittal, P. (2016). Learning Privacy Expectations by Crowdsourcing Contextual Informational Norms. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 4, 209–218. https://doi.org/10.1609/hcomp.v4i1.13271
Smith, H. J., Dinev, T., & Xu, H. (2011). Information Privacy Research: An Interdisciplinary Review. MIS Quarterly, 35(4), 989–1015. https://doi.org/10.2307/41409970
Song, C., Ristenpart, T., & Shmatikov, V. (2017). Machine Learning Models that Remember Too Much. Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 587–601. https://doi.org/10.1145/3133956.3134077
Stokel-Walker, C. (2023). AI Chatbots Are Coming to Search Engines: Can You Trust the Results? Nature. https://doi.org/10.1038/d41586-023-00423-4
Van Dis, E. A. M., Bollen, J., Zuidema, W., van Rooij, R., & Bockting, C. L. (2023). ChatGPT: Five Priorities for Research. Nature, 614(7947), 224–226. https://doi.org/10.1038/d41586-023-00288-7
The Vicuna Team. (2023, March 30). Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. LMSYS Org. https://lmsys.org/blog/2023-03-30-vicuna
Wolf, V., & Maier, C. (2024). ChatGPT Usage in Everyday Life: A Motivation-Theoretic Mixed-methods Study. International Journal of Information Management, 79, 102821. https://doi.org/10.1016/j.ijinfomgt.2024.102821
Wong, R. Y., & Mulligan, D. K. (2019). Bringing Design to the Privacy Table: Broadening "Design" in "Privacy by Design" Through the Lens of HCI. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–17. https://doi.org/10.1145/3290605.3300492
Wu, D., & Sun, G. (2023). The Credibility of the Results of Generative Intelligent Search. Journal of Library Science in China, 49(6), 51–67. https://doi.org/10.13530/j.cnki.jlis.2023048
Zhang, C., Ippolito, D., Lee, K., Jagielski, M., Tramer, F., & Carlini, N. (2023). Counterfactual Memorization in Neural Language Models. Advances in Neural Information Processing Systems, 36, 39321–39362.
Zhang, H., Chen, J., Jiang, F., Yu, F., Chen, Z., Li, J., Chen, G., Wu, X., Zhang, Z., Xiao, Q., Wan, X., Wang, B., & Li, H. (2023). HuatuoGPT, towards Taming Language Model to Be a Doctor [arXiv preprint]. arXiv:2305.15075. https://doi.org/10.48550/arXiv.2305.15075
Zhang, Z., Jia, M., Lee, H.-P. (Hank), Yao, B., Das, S., Lerner, A., Wang, D., & Li, T. (2024). "It's a Fair Game," or Is It? Examining How Users Navigate Disclosure Risks and Benefits When Using LLM-Based Conversational Agents. Proceedings of the CHI Conference on Human Factors in Computing Systems, 1–26. https://doi.org/10.1145/3613904.3642385
