9.1 Introduction
Patients living with serious and/or chronic diseases such as cancer or Alzheimer’s disease and related dementias (ADRD) require extensive care, often by family members (Alzheimer’s Association, 2022; Kent et al., Reference Kent, Rowland, Northouse, Litzelman, Chou, Shelburne, Timura, O’Mara and Huss2016). This can be challenging and, thus, stressful for family caregivers because they do not have formal training in patient care (Reid & O’Brien, Reference Reid and O’Brien2021). Although they can seek advice from healthcare professionals during visits, this is often not sufficient, as caregiving questions and needs can arise at any time (Peterson et al., Reference Peterson, Hahn, Lee, Madison and Atri2016; Soong et al., Reference Soong, Au, Kyaw, Theng and Car2020), and current forms of advice are static and not tailored to caregivers’ specific needs (González-Fraile et al., Reference González-Fraile, Ballesteros, Rueda, Santos-Zorrozúa, Solà and McCleery2021).
Unprepared caregivers often seek information and support online to fulfill their changing needs (Reifegerste et al., Reference Reifegerste, Meyer, Zwitserlood and Ullman2021). As caregivers increasingly take on more active roles in health care, the Internet has become a prominent source of health information to guide their decision-making and self-management (Zhao et al., Reference Zhao, Zhao and Song2022). Yet, despite the great potential of the Internet, about 63 percent of users who seek health information on the web have reported feeling overwhelmed by the vast amount of unfiltered information and unqualified to determine its quality, veracity, and relevance (Ferraris et al., Reference Ferraris, Monzani, Coppini, Conti, Pizzoli, Grasso and Pravettoni2023). One reason for this finding is that searching for health information differs from general information seeking: it requires caregivers to master certain domain-specific knowledge, especially when they encounter resources full of domain-specific terminology (Chi et al., Reference Chi, He and Jeng2020). Although online health communities can serve as an important online source of personal support and social engagement for caregivers, there are issues with the interactions and the quality of information that caregivers may gain from them (Chi et al., Reference Chi, Thaker, He, Hui, Donovan, Brusilovsky and Lee2022).
Recently, Artificial Intelligence (AI) systems have been developed to be equipped with some level of relevant knowledge on diseases (Hui et al., Reference Hui, Wang, Kunsuk, Donovan, Brusilovsky, He, Lee, Strudwick, Hardiker, Rees, Cook and Lee2024; Thaker et al., Reference Thaker, Rahadari, Hui, Luo, Wang, Brusilovsky, He, Donovan, Lee, Strudwick, Hardiker, Rees, Cook and Lee2024; Y. Wang et al., Reference Wang, Thaker, Hui, Brusilovsky, He, Donovan, Lee, Strudwick, Hardiker, Rees, Cook and Lee2024; Z. Wang et al., Reference Wang, Zou, Xie, Luo, He, Hilsabeck, Aguirre, Toeppe, Yan and Chu2021), such that dynamic, tailored information can be provided to caregivers for better support. Within such AI systems, the critical first step toward a tailored response is to accurately recognize and classify a caregiver’s expressed needs (Z. Wang et al., Reference Wang, Zou, Xie, Luo, He, Hilsabeck, Aguirre, Toeppe, Yan and Chu2021).
Identifying the category of the information needs expressed in a text is a classic text classification problem in AI and natural language processing. Both traditional statistical and more recent deep learning models have been employed to perform this task (Z. Wang et al., Reference Wang, Zou, Xie, Luo, He, Hilsabeck, Aguirre, Toeppe, Yan and Chu2021; Zou, Ji et al., Reference Zou, Thaker and He2023; Zou, Thaker, & He, Reference Zou, Thaker and He2023), but an adequate amount of annotated training data is required so that supervised models that provide highly accurate classification performance can be developed (Z. Wang et al., Reference Wang, Zou, Xie, Luo, He, Hilsabeck, Aguirre, Toeppe, Yan and Chu2021).
However, caregiving as a healthcare topic is still not sufficiently studied, and caregivers’ needs have been studied even less. For example, in a systematic literature review, Xie et al. (Reference Xie, Berkley, Kwak, Fleischmann, Champion and Koltai2018) found that caregivers’ information preferences are typically not assessed before or during the implementation of interventions; no intervention had assessed caregivers’ preferences for what information they want to receive or when or how they want to receive it. Shin and Habermann (Reference Shin and Habermann2022) also found that ADRD caregivers reported a pronounced lack of support from healthcare professionals and unmet needs for knowledge and resources.
This poses two important problems for the development of AI systems to classify caregivers’ needs:
(1) AI systems rely on domain experts to pass on domain knowledge via organized classes and descriptions of caregivers’ needs and annotated data for each category of needs, but domain experts such as clinicians often lack comprehensive knowledge of caregiving and the issues associated with diseases such as ADRD. It is therefore difficult for domain experts to provide AI systems with comprehensive and representative knowledge of caregivers’ needs.
(2) AI systems could help domain experts improve their knowledge of caregivers’ needs if they could reliably identify relevant information for each category of needs, but AI systems are not yet trained well enough to complete such work accurately, so they are not yet useful to domain experts.
These two issues are intertwined. Domain experts can provide AI systems with better domain knowledge if they can get help from AI systems, and AI systems can better help domain experts if they have more domain knowledge from domain experts. Consequently, it is natural to study how human experts and AI systems can collaborate to address this task.
In this chapter, we present our work in classifying caregivers’ needs through collaboration between human experts and AI systems. We use ADRD caregiving as our example domain. Dementia has become a major public health challenge (Shu & Woo, Reference Shu and Woo2021), and persons with dementia require extensive care, often provided by family members, who frequently report their challenges and stress (Reid & O’Brien, Reference Reid and O’Brien2021). AI can support caregivers, but research is needed to understand the ways in which AI might be effective (Ji et al., Reference Ji, Zou, Xie, He and Wang2022).
We first proposed and developed this use of AI in our earlier research (Z. Wang et al., Reference Wang, Zou, Xie, Luo, He, Hilsabeck, Aguirre, Toeppe, Yan and Chu2021), in which ADRD domain experts worked with an interactive machine learning system to develop and revise a framework to address the health information wants (HIWs) – the types of care-related information that caregivers of those with ADRD wish to have. This ADRD framework represents caregivers’ needs. Our methods can significantly reduce the amount of annotated data typically required for traditional ML models.
Large language models (LLMs) such as ChatGPT (Wu, He et al., Reference Wu, He, Liu, Sun, Liu, Han and Tang2023) and GPT-4 (OpenAI et al., Reference Achiam, Adler, Agarwal, Ahmad, Akkaya, Aleman, Almeida, Altenschmidt, Altman, Anadkat, Avila, Babuschkin, Balaji, Balcom, Baltescu, Bao, Bavarian, Begum and Zoph2023) can comprehend and interpret users’ needs, and they can generate outputs that meet users’ requirements (Brown et al., Reference Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Wu, Winter, Amodei, Larochelle, Ranzato, Hadsell, Balcan and Lin2020; Moor et al., Reference Moor, Banerjee, Abad, Krumholz, Leskovec, Topol and Rajpurkar2023). Compared with traditional few-shot learning methods in clinical natural language processing (Z. Li et al., Reference Li, Ma, Zhuang, Gu, Su and Chen2023), LLMs can generate high-quality responses to dementia caregivers’ questions to help them overcome the challenges that they face (Aguirre et al., Reference Aguirre, Hilsabeck, Smith, Xie, He, Wang and Zou2024).
In this chapter, we present our human–AI collaboration approach, in which an LLM is employed to generate simulated text, which is then validated by human experts. The validated data are subsequently treated as gold examples in in-context learning to enhance the LLM’s classification performance. Through this human–AI collaboration, human experts can leverage the assistance of LLMs to gain deeper insights into caregivers’ HIWs and provide more informed and tailored suggestions for them.
We explore the following research questions:
RQ1: At which stages of the HIW classification process can human experts and AI collaborate?
RQ2: What instructions (prompts) can we use to gain the best possible classification results from LLMs?
RQ3: What machine learning concepts developed in the past can be applied to further improve LLMs’ performance?
To address these questions, we first review relevant literature on human–AI collaboration in healthcare and the application of AI including LLM technology in healthcare, with implications for our work. Then, we consider the background of ADRD-HIWs and our earlier exploration of expert–machine co-development. We outline our experimental design and present results from various classification experiments. Given these findings, we analyze and discuss our human–AI collaboration’s effectiveness and suggest potential directions for future research.
9.2 Literature Review
9.2.1 Human–AI Collaboration in Healthcare
Human–AI collaboration can be defined as AI systems working jointly with humans as teammates or partners to solve problems (Lai et al., Reference Lai, Kankanhalli, Ong and Bui2021). As D. Wang et al. (Reference Wang, Churchill, Maes, Fan, Shneiderman, Shi and Wang2020) have noted, human–AI collaboration is not a new concept. Symbiotic computing, proposed by J. C. R. Licklider (Reference Licklider1960) in his seminal article “Man–Computer Symbiosis,” clearly presented the partnership of humans and machines. Different from the design of fully automatic AI systems, which are “black boxes” to humans, human-centered design is rooted in the design of algorithms and the implementation of applications with human–AI collaboration as a paradigm (D. Wang et al., Reference Wang, Churchill, Maes, Fan, Shneiderman, Shi and Wang2020). This is the basis of human–computer interaction (Shneiderman & Plaisant, Reference Shneiderman and Plaisant2009).
Healthcare is an important domain for the use of AI (Jiang et al., Reference Jiang, Jiang, Zhi, Dong, Li, Ma, Wang, Dong, Shen and Wang2017; Shaheen, Reference Shaheen2021) because AI can mitigate the shortage of qualified healthcare workers, assist overworked medical professionals, and improve the quality of healthcare services (Lai et al., Reference Lai, Kankanhalli, Ong and Bui2021). Healthcare is therefore a critical context for human–AI collaboration (Markus et al., Reference Markus, Kors and Rijnbeek2021).
Researchers working on human–AI collaboration have long viewed it as a socio-technical ensemble in which the respective strengths of humans and AI can extend each other’s capability limits, resulting in superior task outcomes (Bansal et al., Reference Bansal, Nushi, Kamar, Horvitz and Weld2021). This is particularly true in healthcare, which requires deep domain knowledge and whose outcomes are critical to human wellbeing.
Lai et al. (Reference Lai, Kankanhalli, Ong and Bui2021) searched five major databases for publications in disciplines such as computer science, information systems, health informatics, and medicine in order to review the literature on human–AI collaboration in healthcare. Their initial searches yielded 1,019 publications, but, after several rounds of filtering and full-text examination, only 28 relevant studies remained. They found that research on human–AI collaboration in healthcare was increasing but the number of studies was still limited; most articles concerned generic use cases, with cancer and dementia as common disease cases; most research focused on treatment, surgery, and diagnosis; and the studies involved healthcare professionals, patients, and clinical researchers working with AI systems.
Hemmer and his colleagues (Reference Hemmer, Schemmer, Riefle, Rosellen, Vössing and Kühl2022) conducted semi-structured interviews and employed inductive coding to examine the factors influencing the adoption of human–AI collaboration in clinical decision-making. They found six factors, which can be summarized in one sentence: “Professionals state the need for a complementary AI that communicates its insights transparently and adapts to the users to enable mutual learning and time-efficient work with a final human agency” (Hemmer et al., Reference Hemmer, Schemmer, Riefle, Rosellen, Vössing and Kühl2022, p. 2).
9.2.2 AI in Health Information Need Identification
With the huge amount of health information available on the internet, health consumers such as patients and caregivers actively use the internet to satisfy their health information needs (Chi et al., Reference Chi, He and Jeng2020). Despite the benefits of confidentiality and anonymity, and the emotional and social support gained by engaging others on online social platforms, seeking health information online can also be challenging due to the inherent complexity and uniqueness of health issues (Chi et al., Reference Chi, He and Jeng2020). Besides patients’ health information needs, caregivers’ needs have also been examined (Zou, Thaker, & He, Reference Zou, Thaker and He2023), showing that those needs likewise evolve throughout the disease trajectory.
With rapid developments in computational capabilities, many AI techniques, such as statistical machine learning models (Habehh & Gohel, Reference Habehh and Gohel2021) and deep learning models (Esteva et al., Reference Esteva, Robicquet, Ramsundar, Kuleshov, DePristo, Chou and Dean2019), have been applied to healthcare. Besides common applications in clinical decision-making (Z. Li et al., Reference Li, Zhao, Dang, Yan, Gao, Wang and Xiao2024; Magrabi et al., Reference Magrabi, Ammenwerth, McNair, De Keizer, Hyppönen, Nykänen and Georgiou2019) and X-ray imaging (Adams et al., Reference Adams, Henderson, Yi and Babyn2021), recommender systems have also been developed to support patients in satisfying their needs (Thaker et al., Reference Thaker, Rahadari, Hui, Luo, Wang, Brusilovsky, He, Donovan, Lee, Strudwick, Hardiker, Rees, Cook and Lee2024).
Online social platforms provide attractive places for patients and caregivers seeking answers to their health information needs, where the support is not only information oriented but can also be emotional and social (Zou, Thaker, & He, Reference Zou, Thaker and He2023). Xie et al. (Reference Xie, Wang, Zou, Luo, Hilsabeck and Aguirre2020) examined ADRD caregivers’ information needs on Reddit. Zou, Thaker, and He (Reference Zou, Thaker and He2023) provided a systematic review of studies that apply AI to processing ADRD caregivers’ social media posts. Their initial search of the ACM Digital Library, IEEE Xplore Digital Library, and PubMed generated 324 articles, but, after three rounds of screening, only 18 papers were selected for analysis. Their results show that research in this area is still in its infancy: much work still focuses on characterizing ADRD caregivers’ behaviors on social media, and predicting ADRD-related activities was the focus of AI use in the majority of these studies.
9.2.3 LLMs in Healthcare
Training LLMs on general-domain data has produced remarkable performance improvements, often making domain-specific pretraining unnecessary. For instance, Kung et al. (Reference Kung, Cheatham, Medenilla, Sillos, De Leon, Elepaño and Tseng2023) evaluated GPT-3.5’s responses to United States Medical Licensing Exam (USMLE) questions; it performed at or near the passing threshold even though it was not specifically trained in related fields. Similarly, Nori et al. (Reference Nori, King, McKinney, Carignan and Horvitz2023) showed that GPT-4 exceeded the USMLE passing score by more than twenty points using simple 5-shot prompting. Additionally, DeID-GPT (Liu et al., Reference Liu, Huang, Yu, Zhang, Wu, Cao and Li2023) utilized high-quality prompts to ensure privacy and summarize essential information in medical data, outperforming relevant baseline methods. Sivarajkumar and Wang (Reference Sivarajkumar and Wang2023) proposed HealthPrompt, a prompt-based clinical natural language processing framework that improved performance in clinical NLP tasks by exploring different prompt templates without requiring extra training data.
Furthermore, several studies have focused on text classification tasks within health-related fields. Guo et al. (Reference Guo, Ovadje, Al-Garadi and Sarker2024) employed data augmentation using GPT-4 alongside relatively small human-annotated datasets to train lightweight supervised classification models, achieving superior results compared to using human-annotated data alone. C. Wu et al. (Reference Wu, He, Liu, Sun, Liu, Han and Tang2023) explored the complex task of medical Chinese text categorization by leveraging the complementary strengths of three different sub-models; their experiments, using a voting mechanism, demonstrated that the proposed method could achieve an accuracy of 92 percent. De Santis et al. (Reference De Santis, Martino, Ronci and Rizzi2024) compared LLMs like Mistral and GPT-4 with traditional methods such as BERT and SVM. Their findings indicated that Mistral-7B was the optimal choice as an algorithm within a decision support system for monitoring users in sensitive medical discourses. GPT-4 also showed strong in-context learning capabilities, particularly in zero-shot settings, making it a viable option for medical text classification. Additionally, Song et al. (Reference Song, Zhang, Tian, Yang, Huang and Li2024) proposed an LLM-based privacy data augmentation method for medical text classification, which outperformed other baselines across different dataset sizes.
In summary, human–AI collaboration holds great potential in the healthcare domain, but there is still a gap in studying how such collaboration can be designed to help caregivers, particularly how the latest AI technologies, such as LLMs, can be used to carry it out.
9.3 Background
9.3.1 From HIWs to ADRD-HIWs
The shared decision-making model holds that patients (and, in the case of ADRD patients, their caregivers) should be expected and encouraged to stay informed and collaborate with medical professionals to make decisions (Charles et al., Reference Charles, Gafni and Whelan1999; Epstein et al., Reference Epstein, Alper and Quill2004). This paradigm has generated much interest in individual preferences for health information and participation in decision-making (Benbassat et al., Reference Benbassat, Pilpel and Tidhar1998). However, prior research has typically measured only a limited range of individual preferences for information and decision-making that reflects what physicians think their patients need (Beisecker & Beisecker, Reference Beisecker and Beisecker1990; Ende et al., Reference Ende, Kazis, Ash and Moskowitz1989), despite poor correlations between what physicians think their patients need and what patients really want to know (Xie, Reference Xie2009). Thus, the validity of prior research is questionable (Hubbard, Reference Hubbard2008). Existing interventions, which typically base their content on theoretical constructs from behavior change theories (Harrington & Noar, Reference Harrington and Noar2012), do have their merits, but they are top-down, driven by what researchers think participants need to know rather than what participants may want to know.
To address these limitations, Xie (Reference Xie2009) developed the Health Information Wants (HIW) framework, which emphasizes “health information that one would like to have and use to make important health decisions that may or may not be directly related to diagnosis or standard treatment” (p. 510). The HIW framework promotes an understanding of preferences from the health consumer’s perspective, rather than the provider’s, by emphasizing the wide range of information and decision-making autonomy that health consumers want to have (Xie, Reference Xie2009). Guided by the HIW framework, Xie and her colleagues developed the HIW Questionnaire (Xie et al., Reference Xie, Wang, Feldman and Zhou2010), which was validated among older and younger Americans with excellent validity and reliability. The HIW framework is highly adaptable to specific populations’ unique circumstances, evidenced by its successful adaptations to and validations among individuals with diabetes (Nie et al., Reference Nie, Xie, Yang and Shan2016) and cancer patients and their families (Xie et al., Reference Xie, Su, Liu, Wang and Zhang2017). More recently, Zou, Thaker, and He (Reference Zou, Thaker and He2023) examined HIWs for caregivers of ovarian cancer patients, and Tang and colleagues (Reference Tang, Kwak, Xiao, Xie, Lahiri, Flynn and Murugadass2023) studied HIWs of ADRD caregivers expressed on social media.
Given the generalizability of the HIW framework, we developed the ADRD-HIW framework (Z. Wang et al., Reference Wang, Zou, Xie, Luo, He, Hilsabeck, Aguirre, Toeppe, Yan and Chu2021). Specifically, we adapted the HIW framework to ADRD caregiving scenarios through three rounds of development and revision. The final ADRD-HIW framework has seven general categories covering the types of health information typically wanted by ADRD caregivers: (1) treatment/medication/prevention, (2) characteristics of/experience with the health condition/diagnostic procedures, (3) daily care for a patient at home/caregiver self-care, (4) practical information about care transition and coordination and end-of-life care, (5) psychosocial aspects of caregiving, (6) resources/advocacy/scientific updates/research participation, and (7) legal, financial, or insurance related information.
In our earlier work examining information exchange within online dementia care communities (Ji et al., Reference Ji, Zou, Xie, He and Wang2022; Z. Wang et al., Reference Wang, Zou, Xie, Luo, He, Hilsabeck, Aguirre, Toeppe, Yan and Chu2021; Zou, Thaker, & He, Reference Zou, Thaker and He2023), we identified that information related to the HIW category “daily care for patient at home/caregiver self-care” is among the categories most sought by ADRD caregivers. Similarly, reviews of past decades of ADRD caregiving research have underscored that the majority of ADRD caregivers want support with daily care while balancing their caregiving role and their own personal well-being (Bressan et al., Reference Bressan, Visintini and Palese2020). Therefore, the clinicians on our team developed a daily care framework that organizes this specific type of ADRD-HIW into seven categories, with a total of thirty-two subcategories (for details, see Table 9.1).
Table 9.1 The daily care framework: categories and subcategories

Daily care category | Subcategories |
---|---|
1. Personal care | 1.1 Bathing 1.2 Dressing 1.3 Toileting/incontinence 1.4 Transferring 1.5 Feeding/nutrition 1.6 Dental care |
2. Household management | 2.1 Meal preparation 2.2 Housekeeping chores 2.3 Laundry |
3. Safety | 3.1 Driving safety 3.2 Medication safety 3.3 Financial safety 3.4 Level of supervision 3.5 Self-neglect 3.6 Safe phone use |
4. Mood and behavior management | 4.1 Aggression and anger 4.2 Anxiety and agitation 4.3 Depression 4.4 Hallucinations 4.5 Memory loss and confusion 4.6 Repetition 4.7 Sleep issues 4.8 Suspicions and delusions 4.9 Wandering |
5. Activities | 5.1 Planning daily activities 5.2 Choosing daily activities 5.3 Conducting daily activities |
6. Communication strategies | 6.1 Communication with a person with dementia 6.2 Communication with other family members |
7. Working with healthcare providers | 7.1 What to do prior to a visit 7.2 What to do during a visit 7.3 What to do after a visit |
9.3.2 Our Earlier Effort in Expert–Machine Co-development
As part of developing our HIW-ADRD framework and analyzing caregivers’ online posts, we developed the expert–machine co-development process (Z. Wang et al., Reference Wang, Zou, Xie, Luo, He, Hilsabeck, Aguirre, Toeppe, Yan and Chu2021).
As shown in Figure 9.1, this process aims to enable human experts to collaborate with machine learning algorithms in developing the HIW framework while simultaneously improving the automatic classification of online posts. Drawing on their own domain knowledge (the HIW framework and clinical knowledge), human experts generate the initial data to train the machine algorithms. The machine algorithms then provide data-driven feedback to the human experts, who learn from the feedback, update their domain knowledge, and revise the HIW framework with data-driven evidence. Changes to the HIW framework, in turn, prompt the experts to revise the data fed to the machine algorithms. This co-development process iterates between the experts and the machine algorithms so that the experts’ clinical knowledge and the HIW framework, as well as the algorithms’ classification effectiveness, can be continuously improved. This process utilizes interactive machine learning.

Figure 9.1 The expert–machine co-development (EMC) process (extracted from Z. Wang et al., Reference Wang, Zou, Xie, Luo, He, Hilsabeck, Aguirre, Toeppe, Yan and Chu2021)
Figure 9.1 Long description
The model diagram has a human figure and an illustration of a machine algorithm at either end. A rightward arrow from the human figure pointing to the machine algorithm reads: Data-driven feedback to human experts. A leftward arrow from the machine algorithm to the human figure reads: Modeling to train the machine. To the left of the human figure are two sets of cyclic arrows labeled the HIW framework and Clinical Knowledge. Five steps are listed below. Step 1. Human experts’ prior knowledge, that is, the HIW framework and clinical knowledge, is used to generate the initial modeling to train the machine. Step 2. Machine algorithms provide data-driven feedback to human experts. Step 3. Human experts learn from the feedback and revise the modeling. Step 4. Repeat Steps 2 and 3. Step 5. Human experts improve the HIW framework and clinical knowledge, leading to improved health outcomes.
9.4 Human–AI Collaboration for HIW Identifications
In our study, we rely on real posts from online social media, such as Reddit, to surface caregivers’ health information wants (HIWs). The classification task therefore involves categorizing a real post into one category of the HIW framework. This task imposes a heavy workload on clinicians due to the large volume of posts. While AI systems such as LLMs can assist, they struggle to classify posts accurately using plain prompts alone.
To overcome this challenge and facilitate collaboration between professional clinicians and the LLM, we expanded our expert–machine co-development process into human–AI collaboration: human experts bring general domain knowledge to the collaboration, so they are good at differentiating on-topic posts for a given HIW from off-topic ones, whereas capable AI systems such as LLMs can generate simulated text for given requirements or instructions, even in large quantities, though they may hallucinate incorrect information. Consequently, rather than having clinicians manually annotate the massive dataset of real posts, we instructed the LLM with the definition of each HIW as part of the prompt to generate five simulated posts for each HIW category. These simulated posts were then reviewed and validated by clinicians, who ensured their accuracy. The verified posts were used as examples within the prompts (thus performing few-shot in-context learning (Z. Li et al., Reference Li, Ma, Zhuang, Gu, Su and Chen2023)) to guide the LLM in categorizing real posts. This method significantly reduced the manual annotation workload for clinicians, while the LLM benefited from the expert-validated examples, enhancing its performance through in-context learning. Our human–AI collaboration allows both parties to benefit from each other, resulting in improved classification accuracy.
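As a concrete illustration, the following is a minimal sketch of the generation step (Steps 1–2 in Figure 9.2), assuming the OpenAI Python SDK; the prompt wording, helper names, and the definition passed in are illustrative rather than the exact prompts used in our study.

```python
# A minimal sketch of Steps 1-2 in Figure 9.2: asking the LLM to draft
# simulated posts for one HIW category. The prompt wording and helper names
# are illustrative, not the exact prompts used in our study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_simulated_posts(category: str, definition: str, n: int = 5) -> str:
    """Ask the LLM for n simulated caregiver posts matching one HIW category."""
    prompt = (
        "Caregivers of persons living with dementia post their challenges on "
        f"Reddit. Write {n} simulated Reddit posts (each with a title and "
        f"content) that a caregiver seeking '{category}' might write, where "
        f"'{category}' is defined as: {definition}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.8,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Drafts are reviewed by clinicians (Step 3); only verified posts are kept
# as few-shot examples for the classification prompt (Step 4).
drafts = generate_simulated_posts(
    "3.1 Driving safety",
    "information about how to manage driving, e.g., when and how to get a "
    "person living with dementia to stop driving",
)
```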
Furthermore, we found it common for multiple possible categories to be returned after the initial classification, which led us to introduce a second round of classification to determine the most likely category based on the initial set. In the second round of classification, the LLM is prompted to select the most probable category based on the set of categories identified in the first round.
Our approach offers a new way of combining human expertise with AI to optimize task outcomes, and the results confirm the effectiveness of this collaborative method. Figure 9.2 explains the overall human–AI collaboration process between our clinicians and the LLM. The details of Step 4, the classification, are explained in Figure 9.3.

Figure 9.2 The human–AI collaboration in our classification task
Figure 9.2 Long description
The model diagram begins with a group of doctors and nurses asking the LLM to generate simulated posts; the LLM either returns the simulated posts or provides an output based on the provided examples and definitions. The output reads: The most likely category this post belongs to is HIW x.x. When the LLM returns the simulated posts, they undergo verification and are used in the classification prompt, which is then fed back to the LLM to generate the output. The steps are as follows. Step 1. Clinicians ask the LLM to generate simulated posts for each HIW category. Step 2. The LLM returns the generated simulated posts to the clinicians. Step 3. Clinicians review and select the verified simulated posts. Step 4. Verified simulated posts are concatenated with the real post in one classification prompt, and the LLM is asked for an answer. Step 5. The LLM generates the answer based on the classification prompt and predicts the HIW category the real post may belong to.

Figure 9.3 The two-round classification framework
Figure 9.3 Long description
In the first round of classification, clinicians ask GPT a series of yes/no questions, each N times: whether this real post belongs to HIW 1.1, whether this real post belongs to HIW 1.2, and so on through HIW x.x. Majority voting over the answers yields the category set, from which the LLM selects the most suitable category. The output reads: Based on the category set, the most likely category this post belongs to is HIW x.x, in the second round of classification.
9.5 Experiments
To demonstrate and examine our proposed human–AI collaboration for HIW classification, we conducted a series of experiments. In the remainder of this section, we will cover the datasets, LLMs, and methods.
9.5.1 Datasets
We collected real posts from a social media platform, Reddit. Reddit was chosen due to its prevalence among individuals sharing dementia-related challenges (Tang et al., Reference Tang, Kwak, Xiao, Xie, Lahiri, Flynn and Murugadass2023; Zou, Thaker, & He, Reference Zou, Thaker and He2023). The abundance of relevant posts, accessible through the Reddit API, has enabled earlier research to identify and examine caregivers’ health information wants (HIWs). In total, 14,884 posts were collected from five subreddits: “AgingParents,” “Alzheimers,” “dementia,” “AlzheimersCanada,” and “AlzheimersSupport,” covering the years 2010 to 2022. Two clinicians, each with at least fifteen years of experience with persons living with dementia and their family caregivers, iteratively identified typical posts as seeds, located similar posts using a snowball sampling technique (Goodman, Reference Goodman1961), and finally annotated twenty posts in each of three HIW categories: “3.1 driving safety,” “4.1 aggression and anger,” and “4.5 memory loss and confusion.” Consensus was reached through discussion. These three HIWs were selected because they represent commonly asked questions from ADRD caregivers.
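For readers who wish to replicate the collection step, the sketch below uses PRAW, a common Python wrapper for the Reddit API; the chapter states only that the Reddit API was used, so PRAW, the placeholder credentials, and the listing call are assumptions. Note that Reddit’s listing endpoints return only a limited window of recent posts, so reaching back to 2010 typically requires a historical archive as well.

```python
# A sketch of the collection step using PRAW (an assumption; the chapter
# states only that the Reddit API was used). Credentials are placeholders.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="adrd-hiw-research",
)

SUBREDDITS = ["AgingParents", "Alzheimers", "dementia",
              "AlzheimersCanada", "AlzheimersSupport"]

posts = []
for name in SUBREDDITS:
    for submission in reddit.subreddit(name).new(limit=None):
        posts.append({
            "subreddit": name,
            "title": submission.title,
            "content": submission.selftext,
            "created_utc": submission.created_utc,  # to filter 2010-2022
        })
```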
9.5.2 LLMs
In our experiments, we exclusively used LLMs developed by OpenAI because of their strong in-context learning capabilities compared to other LLMs, which makes them strong options for medical text classification (De Santis et al., Reference De Santis, Martino, Ronci and Rizzi2024). Specifically, we used GPT-4o to conduct all experiments. For parameter settings, we set the temperature to 0.8 to obtain more varied outputs during the majority voting process; all other parameters remained at their default values per the official documentation.
9.5.3 Experiment Design
To answer the RQs raised in Section 9.1, we designed a set of experiments to establish baselines, explore model parameters, and conduct final assessments. In all the experiments, we model the task as the following scenario: a caregiver describes their caregiving situation and asks one or more questions expressing their needs, and the AI system (an LLM) automatically classifies the post into one of the HIWs in our ADRD-HIW framework.
Baseline Experiments
We established two baselines in our experiments. The first treats the task as flat multi-class classification: GPT-4o is used as the classifier, the input post is classified into one of the thirty-two HIW categories, and each category is represented by its definition. This baseline represents the simplest and most straightforward classification setup. It provides a foundational comparison for subsequent experiments, in which we explore more optimized methods for improving the classification accuracy of the LLM in identifying caregivers’ HIWs.
Since thirty-two categories is a large number for multi-class classification, we explored another baseline in which the class organization is modeled hierarchically rather than flat. For example, suppose we aim to classify a post whose true category is “3.1 driving safety,” shown in Table 9.1. First, we attempt to classify the post into one of the seven general HIWs. If the post is correctly labeled under the “Daily Care” category, we proceed to classify it into the categories within “Daily Care.” Next, if the post is classified into the appropriate category, “3. Safety,” within “Daily Care,” we continue classifying it into the final specific subcategory, choosing among subcategories 3.1 to 3.6 within “3. Safety.” In this setting, only the posts correctly classified at the previous level of the hierarchy are fed into the next level of classification. We include this hierarchical classification as one baseline and explore further refinements to improve overall accuracy.
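The cascade can be summarized in a short sketch, assuming each level is itself a flat multi-class prompt over the candidate definitions; the catalogs GENERAL_HIWS, DAILY_CARE, and SUBCATEGORIES are illustrative stand-ins for the categories in Section 9.3.1 and Table 9.1, not our exact prompts or data structures.

```python
# A minimal sketch of the hierarchical baseline, under the assumption that
# each level is a flat multi-class prompt over the candidate definitions.
from openai import OpenAI

client = OpenAI()

GENERAL_HIWS = {
    "Daily care": "daily care for a patient at home / caregiver self-care",
    # ... the other six general HIW categories and their definitions ...
}
DAILY_CARE = {
    "3. Safety": "safety concerns such as driving, medication, and finances",
    # ... the other six daily care categories ...
}
SUBCATEGORIES = {
    "3. Safety": {
        "3.1 Driving safety": "how to manage driving, e.g., when and how to "
                              "get a person living with dementia to stop driving",
        # ... subcategories 3.2 to 3.6 ...
    },
    # ... subcategories for the other daily care categories ...
}

def classify_multiclass(post: str, categories: dict[str, str]) -> str:
    """Prompt GPT-4o to pick exactly one category from the given definitions."""
    options = "\n".join(f"- {name}: {desc}" for name, desc in categories.items())
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.8,
        messages=[{"role": "user", "content": (
            "Classify the following caregiver post into exactly one of these "
            f"categories; reply with the category name only.\n{options}\n\n"
            f"Post: {post}")}],
    )
    return response.choices[0].message.content.strip()

def hierarchical_classify(post: str) -> str:
    """Cascade: 7 general HIWs -> 7 daily care categories -> subcategories."""
    general = classify_multiclass(post, GENERAL_HIWS)
    if general != "Daily care":
        return general  # misrouted posts stop here in our evaluation
    category = classify_multiclass(post, DAILY_CARE)
    return classify_multiclass(post, SUBCATEGORIES[category])
```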
Experiments on Exploring Classification Techniques
In designing our classifier, we learned two things from the baselines presented in the previous subsection. First, the definition of each HIW category alone did not work well; we should also give GPT-4o example posts belonging to each HIW category, that is, perform few-shot in-context learning on LLMs (Dong et al., Reference Dong, Li, Dai, Zheng, Ma, Li, Xia, Xu, Wu, Chang, Sun, Li and Sui2022). Second, the baselines showed that multi-class classification over thirty-two HIW categories at once may be too challenging for GPT-4o. We therefore converted the classification task into a series of binary classifications. Specifically, GPT-4o was asked a yes–no question for a given post: whether it belongs to a particular HIW category (one of the thirty-two). This binary classification is repeated thirty-two times to cover all categories, identifying all applicable categories for each post. To reduce complexity while exploring parameters, we performed binary classification only on the three target categories in our dataset (3.1, 4.1, and 4.5).
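A minimal sketch of this binary decomposition follows; HIW_CATALOG, the helper names, and the prompt wording are assumptions for illustration, not the exact implementation.

```python
# A sketch of the binary decomposition: one yes/no prompt per HIW category.
# HIW_CATALOG is an assumed mapping from category name to its definition and
# its verified example posts.
from openai import OpenAI

client = OpenAI()

HIW_CATALOG: dict[str, tuple[str, list[str]]] = {
    # "3.1 Driving safety": (definition, [five verified simulated posts]), ...
}

def binary_classify(post: str, category: str, definition: str,
                    examples: list[str]) -> bool:
    """Ask GPT-4o whether the post belongs to one HIW category (YES or NO)."""
    shots = "\n\n".join(f"Post {i + 1}:\n{p}" for i, p in enumerate(examples))
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.8,
        messages=[{"role": "user", "content": (
            f"The category '{category}' is defined as: {definition}\n"
            f"Representative posts for this category:\n{shots}\n\n"
            "Does the following new post belong to this same category? "
            f"Answer YES or NO.\n\n{post}")}],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

def candidate_categories(post: str) -> list[str]:
    """Repeat the yes/no question for all thirty-two HIW categories."""
    return [name for name, (definition, examples) in HIW_CATALOG.items()
            if binary_classify(post, name, definition, examples)]
```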
The first parameter we explored was real posts versus simulated posts. In one experiment, we selected three real posts from each target category to perform the three binary classifications with few-shot in-context learning on GPT-4o. We then replaced the real posts with simulated posts for few-shot in-context learning.
The simulated posts were generated through our human–AI collaboration. First, we asked GPT-4o to generate five simulated posts for each category, ensuring they aligned with the category definitions. Then, these simulated posts were reviewed and verified by professional clinicians, who selected the most representative ones. This process was repeated if the required number of simulated posts had not been selected after verification.
The second parameter we explored was majority voting. Majority voting is an important technique in ensemble learning, and Huang et al. (2022) demonstrated that it works for LLMs. During our experiments, we also observed that GPT-4o can produce different answers even when asked the same question at different times. To ensure more reliable and consistent final answers, we adopted the majority voting method: by asking the LLM the same question multiple times, we can gather responses and select the final answer based on the most frequent outcome. This majority voting approach leverages the self-consistency of the LLM, helping to increase confidence in the final answer. In our model, we set the number of times the same question is asked for majority voting to N = 5.
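Majority voting reduces to a few lines; the sketch below wraps the hypothetical binary_classify helper from the previous sketch.

```python
# A minimal sketch of majority voting: ask the same question N times and
# keep the most frequent answer.
from collections import Counter

def majority_vote(ask_once, n: int = 5):
    """Call ask_once n times; return the most frequent answer."""
    answers = [ask_once() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Example usage with the binary classifier sketched above:
# belongs = majority_vote(
#     lambda: binary_classify(post, "3.1 Driving safety", definition, examples),
#     n=5,
# )
```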
The third parameter we explored was the system prompt. Previous research has indicated that different system prompts, which define the role an LLM takes, can influence LLM behavior (Z. Li et al., Reference Li, Zhao, Dang, Yan, Gao, Wang and Xiao2024; Zheng et al., Reference Zheng, Pei and Jurgens2023). We changed the system prompt from the default “You are a helpful assistant.” to “You are an experienced clinician specialized in dementia care.” We assumed that a more specific ADRD-related role might help the LLM draw on relevant background knowledge and enhance the results.
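In API terms, only the first message of the conversation changes between the two conditions; in the snippet below, client is the OpenAI client from the earlier sketches and user_prompt stands for the classification prompt presented next.

```python
# Only the system message differs between the two conditions; user_prompt is
# assumed to hold the classification prompt shown below.
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.8,
    messages=[
        {"role": "system",
         "content": "You are an experienced clinician specialized in dementia care."},
        {"role": "user", "content": user_prompt},
    ],
)
```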
We present one specific prompt for HIW “3.1 driving safety” here:
System prompt: You are an experienced clinician specialized in dementia care.
I’m doing a research study to understand the challenges experienced by caregivers when caring for family members living with dementia and the types of information caregivers may want that can help them manage those challenges. I have categorized one of those types of information that caregivers might want as “driving safety” and defined it as “information about how to manage driving, e.g., when, and how to get a person living with dementia to stop driving.”
Meanwhile, I recognize that many dementia caregivers have posted their challenges on Reddit. Their posts often indicate what types of information they want to have that may help them manage those challenges. I have already composed five representative posts for this category of “driving safety.” Now, I will give you these five representative posts; your task is to first learn the patterns in these representative posts and then to decide whether a new post that I give you belongs to this same category (YES or NO) and why. The five representative posts are:
Post 1:
Title: Need Advice on Managing Driving for Mom with Early Dementia
Content: Hi everyone, I’m feeling overwhelmed about the driving situation with my mom. She’s been diagnosed with early-stage dementia and I’m not sure how to approach the topic of her stopping driving. She loves her independence but has had a couple of minor incidents recently. Does anyone have tips or resources on how to handle this gently?
Post 2:
Title: When Is It Time to Stop Driving?
Content: My dad’s dementia is progressing and I’m starting to worry about his driving. He hasn’t had any accidents yet, but he gets confused easily. How did you all decide it was time for your loved one to give up the keys? Looking for guidance and personal experiences that could help.
Post 3:
Title: Resources for Driving Assessments
Content: Hello, does anyone know of any reliable assessments for driving ability in seniors with cognitive impairments? My aunt has been showing signs of forgetfulness and it’s getting concerning. I think a professional evaluation might help, but I don’t know where to start. Any recommendations?
Post 4:
Title: Strategies for Discussing Driving with a Dementia Patient
Content: Need some advice here. My grandmother has moderate dementia and it’s clear she shouldn’t be driving anymore. Every time we bring it up, she gets very defensive and upset. Has anyone else dealt with this? What strategies worked for you in having this difficult conversation?
Post 5:
Title: Alternatives to Driving for Dementia Patients
Content: My spouse has been advised to stop driving due to his dementia. He’s struggling with losing his independence and I want to make this transition as smooth as possible. What alternatives have worked for your loved ones? Any tips on introducing new modes of transport?
The new post that I need you to decide if it belongs to this same category is:
Title: {Filled with real post title}
Content: {Filled with real post content}
Optimized Experiment on Thirty-Two HIW Classifications
Finally, after exploring the parameters of our model, we conducted experiments classifying against all thirty-two categories to assess our classification model. Due to the large number of categories and the complexity of real posts, we observed that our classifier often answered “yes” for multiple categories. To address this and ensure that the most likely category is returned, we implemented a second round of classification. The details of this two-round classification framework are illustrated in Figure 9.3.
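Putting the pieces together, the following is a minimal sketch of the two-round framework in Figure 9.3, reusing the hypothetical helpers from the earlier sketches (HIW_CATALOG, binary_classify, majority_vote, and client); the round-two prompt wording is illustrative.

```python
# A sketch of the two-round framework in Figure 9.3. Round one gathers every
# category whose majority-voted yes/no answer is YES; round two asks the LLM
# for the single most likely category from that set.
def two_round_classify(post: str) -> str | None:
    # Round 1: thirty-two majority-voted binary classifications.
    candidates = [
        name for name, (definition, examples) in HIW_CATALOG.items()
        if majority_vote(
            lambda: binary_classify(post, name, definition, examples), n=5)
    ]
    if not candidates:
        return None          # no category claimed the post
    if len(candidates) == 1:
        return candidates[0]
    # Round 2: select the most likely category from the candidate set.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.8,
        messages=[{"role": "user", "content": (
            f"A dementia caregiver wrote the following post:\n{post}\n\n"
            f"Candidate categories: {', '.join(candidates)}.\n"
            "Which single category does this post most likely belong to? "
            "Reply with the category name only.")}],
    )
    return response.choices[0].message.content.strip()
```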
9.6 Results
In this section, we will present classification results corresponding to the experiments designed in Section 9.5.
9.6.1 Baseline Results
The results of the two baseline experiments are presented in Tables 9.2 and 9.3, respectively. Table 9.3 also shows the ratio of correctly classified posts after each level of hierarchical classification. Only the posts that survive the current level are fed into the next level. For example, only six of the twenty posts with the true label “3.1 driving safety” are correctly classified into “Daily Care,” one of the seven general HIWs. Then only five of these six posts are classified into “3. Safety,” the category within the general HIW “Daily Care” to which “3.1 driving safety” belongs (see Table 9.1).
Table 9.2 Results of the flat multi-class classification baseline

| Correct Prediction | Accuracy |
---|---|---|
3.1 Driving safety | 16/20 | 0.80 |
4.1 Aggression and anger | 14/20 | 0.70 |
4.5 Memory loss and confusion | 9/20 | 0.45 |
Total | 39/60 | 0.65 |
Table 9.3 Results of the hierarchical multi-class classification baseline

| 7 General HIWs | 7 Categories in “Daily Care” | 32 Subcategories in “Daily Care” | Overall Correct Prediction (Accuracy) |
---|---|---|---|---|
3.1 Driving safety | 6/20 (0.30) | 5/6 (0.83) | 5/5 (1.0) | 5/20 (0.25) |
4.1 Aggression and anger | 3/20 (0.15) | 2/3 (0.67) | 2/2 (1.0) | 2/20 (0.10) |
4.5 Memory loss and confusion | 5/20 (0.25) | 3/5 (0.60) | 0/3 (0) | 0/20 (0) |
Total | 14/60 (0.23) | 10/14 (0.71) | 7/10 (0.7) | 7/60 (0.12) |
The results in Tables 9.2 and 9.3 show that the flat multi-class classification baseline performed better than the hierarchical multi-class classification baseline (0.65 vs 0.12). Although fewer classes are considered at each step of hierarchical classification, the errors accumulated across steps still have a large impact on the final result: across all sixty posts, the per-level accuracies of 14/60, 10/14, and 7/10 compound to an overall accuracy of only 7/60 ≈ 0.12. Thus, dividing the thirty-two classes into smaller hierarchical multi-class problems is not an appropriate solution for this task.
9.6.2 Results of Human–AI Collaboration and Different Classification Techniques
Table 9.4 presents the outcomes of our human–AI collaboration method and different parameters in our classification model.

Table 9.4 Results of our human–AI collaboration method and different classification techniques (correct prediction, accuracy)

| 3.1 Driving safety | 4.1 Aggression and anger | 4.5 Memory loss and confusion | Total |
---|---|---|---|---|
Real posts | 17/20 (0.85) | 15/20 (0.75) | 14/20 (0.70) | 46/60 (0.77) |
Simulated posts | 18/20 (0.90) | 18/20 (0.90) | 17/20 (0.85) | 53/60 (0.88) |
+ Majority voting | 19/20 (0.95) | 19/20 (0.95) | 20/20 (1.00) | 58/60 (0.97) |
+ System prompt | 19/20 (0.95) | 18/20 (0.90) | 20/20 (1.00) | 57/60 (0.95) |
First, the results show that adding sample posts on top of the definitions of HIW categories helps the LLM classify posts more effectively. Classification with real posts improved the average accuracy for the three categories by 12 percentage points over the baseline in Table 9.2 (0.77 vs 0.65), and employing simulated posts improved the average accuracy by another 11 points (0.88 vs 0.77). We hypothesized that simulated posts generated by GPT-4o typically focus on content more relevant to the HIW category and contain less irrelevant information than real posts; our clinicians’ review of the simulated and real posts confirmed this hypothesis. Consequently, our approach of using LLMs to generate simulated posts provides an important insight into obtaining sample data for few-shot in-context learning with LLMs: simulated and verified data are better examples for LLMs and are much cheaper and less time-consuming to obtain. Human–AI collaboration can be very helpful here.
Second, adding the majority voting method greatly improved the performance. The accuracy nearly reaches 100 percent across all three categories. Notably, the improvement in category 4.5, “Memory loss and confusion,” is significant, achieving 100 percent accuracy. This is particularly remarkable because category 4.5 consistently had the lowest accuracy in earlier experiments, making it the most challenging category to classify. The success of the majority voting method highlights its ability to reduce randomness and leverage the self-consistency of LLMs to refine the final decision.
Third, the last row of Table 9.4 shows that accuracy with a task-specific role in the system prompt dropped slightly compared to the default role (0.95 vs 0.97). Nonetheless, it would be premature to conclude that system prompts have no impact on classification tasks. The extremely high accuracy already achieved in the previous experiment may leave little room for further improvement. More extensive studies are needed to better understand the impact of system prompts on related tasks.
9.6.3 Optimized Results of Thirty-Two Multiclass Classifications
Finally, Table 9.5 presents the practical scenario in which we used all thirty-two category definitions as input and conducted the two-round classification, incorporating simulated posts and majority voting. This setting is significantly more difficult than our previous experiments in Table 9.4, which focused on just three target categories.
Table 9.5 Results of the two-round classification over all thirty-two categories

| Correct Prediction | Accuracy |
---|---|---|
3.1 Driving safety | 20/20 | 1.00 |
4.1 Aggression and anger | 19/20 | 0.95 |
4.5 Memory loss and confusion | 14/20 | 0.70 |
Total | 53/60 | 0.88 |
First, Table 9.5 shows that the parameters we selected based on experiments with three HIW categories still have a significant impact when classification is performed across all thirty-two categories: the overall accuracy of the thirty-two independent binary classifications with simulated posts and majority voting improved greatly over the baseline in Table 9.2 (0.88 vs 0.65), particularly for “3.1 Driving safety” (20/20) and “4.1 Aggression and anger” (19/20). We also performed statistical tests on these three categories and obtained p-values of 0.033, 0.021, and 0.047, indicating statistically significant differences between the results in Tables 9.2 and 9.5. This confirms that breaking down a complex multiclass classification task into multiple independent binary classifications, using simulated posts as in-context learning examples, and applying majority voting are effective classification techniques.
Notably, these results are comparable even to those of the simpler three-category classification experiments, which demonstrates the effectiveness of our method.
However, we observed a decrease in accuracy for “4.5 Memory loss and confusion” (14/20) compared to prior results. Upon careful review, we found that, while the target category was consistently included in the initial set, the second round of classification often failed to correctly select it as the most likely category. A potential explanation is that “4.5 Memory loss and confusion” is inherently a challenging category, as many ADRD posts contain varying degrees of information related to memory issues, making it difficult to distinguish this category. Future studies could explore advanced techniques to improve the second round of classification, particularly for complex categories like this one.
9.7 Discussion
9.7.1 Insights
Our experiments demonstrated the importance of having data (such as posts) that describe caregivers’ needs, especially for few-shot in-context learning with LLMs. Even though domain experts can provide an accurate definition for each HIW category, having posts that describe caregivers’ situations and needs can greatly help the classification.
However, our study showed that real posts from caregivers can be problematic to obtain and use. First, while real posts are easy to obtain for common HIW categories, some HIW categories are rare, and real posts matching those categories can be difficult to find. Generating simulated posts with LLMs via human–AI collaboration is an effective solution to this problem: our study showed that simulated posts are much cheaper and less time-consuming to obtain, even though human experts still need to verify the accuracy of their content.
Second, our experiments also show that simulated posts can be better examples for few-shot in-context learning when LLMs perform HIW classification, because verified simulated posts typically focus on content more relevant to the HIW category and contain less irrelevant information than real posts. This characteristic of simulated posts again highlights the importance of human–AI collaboration for HIW classification.
Third, given the complexity of the categories we defined, asking the LLM to perform direct multiclass classification on a large number of categories is challenging. Our work offers an alternative approach by breaking down the single multiclass classification task into multiple independent binary classification tasks. Although this method may take more time to run and gather predictions from each task, binary classification tasks are more manageable for LLMs and often result in better overall performance.
Our final insight centers on randomness and self-consistency in the outputs of LLMs. The literature reports that LLMs can generate different outcomes from the same input, yet show a certain degree of consistency among the variations of their outcomes (Huang et al., 2022). The majority voting idea implemented in our model takes advantage of this feature of LLMs, turning it into a technique that further improves our HIW classification. Our experiments demonstrated the effectiveness of this technique, consistent with the literature.
9.7.2 Limitations and Future Work
Although our method has demonstrated effectiveness in the ADRD-HIW classification task, it has several limitations. First, because of the significant cost of manual annotation, our dataset covers only three HIWs with twenty annotated posts per category. Due to the limited dataset size, the evaluation is less conclusive than it would have been with a larger dataset. Future research would greatly benefit from a larger sample of posts covering a broader range of HIWs that caregivers may encounter. Second, we explored only limited variations in majority voting (N = 5) and system prompts. Future work could investigate adjusting the number of majority votes and experimenting with different system prompts. Third, the classification performance on certain categories, like “4.5 Memory loss and confusion” in Table 9.5, was unsatisfactory. More advanced classification techniques could be incorporated into the second round of classification to improve performance.
9.8 Conclusions
In this chapter, we introduce a human–AI collaboration approach for identifying caregivers’ Health Information Wants (HIWs) in the context of dementia care. We establish an interactive collaboration pipeline that allows human experts to leverage the capabilities of advanced AI techniques. Specifically, we employ GPT-4o as the LLM to extract domain-specific health knowledge for constructing an HIW classification system. A key innovation of our work is the use of simulated posts generated by GPT-4o, which are then verified by professional clinicians to create example templates. This novel approach to human–AI collaboration enables our experts to benefit from high-quality AI-generated content, significantly reducing the manual effort required for annotation, while, in return, the verified posts help the LLM improve its classification accuracy. Additionally, we explore methods such as in-context learning, majority voting, customized system prompts, and dividing a complex multiclass classification task into multiple independent binary classifications. Through a series of experiments, we demonstrate the effectiveness of these combined methods in improving classification performance. Our findings provide valuable insights into human–AI collaboration, particularly in health-related domains, offering a promising framework for future research in this area.