The emergence of large language models has significantly expanded the use of natural language processing (NLP), even as it has heightened exposure to adversarial threats. We present an overview of adversarial NLP with an emphasis on challenges, policy implications, emerging areas, and future directions. First, we review attack methods and evaluate the vulnerabilities of popular NLP models. Then, we review defense strategies that include adversarial training. We describe major policy implications, identify key trends, and suggest future directions, such as the use of Bayesian methods to improve the security and robustness of NLP systems.
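To make the flavor of such attacks concrete, the sketch below shows a word-level synonym-substitution attack, one of the attack families surveys of this kind cover. Everything here is illustrative: the keyword-counting classifier and the tiny synonym table stand in for a real model and lexicon.

```python
# Toy synonym-substitution attack: greedily swap words until the label flips.
SYNONYMS = {"good": ["decent", "fine"], "great": ["solid", "notable"],
            "terrible": ["poor", "weak"]}

def toy_sentiment(text: str) -> str:
    """Stand-in classifier: counts a few polarity keywords."""
    words = text.lower().split()
    pos = sum(w in ("good", "great") for w in words)
    neg = sum(w in ("terrible", "bad") for w in words)
    return "positive" if pos > neg else "negative"

def attack(text: str) -> str | None:
    """Greedily swap words for synonyms until the prediction changes."""
    original = toy_sentiment(text)
    words = text.split()
    for i in range(len(words)):
        for alt in SYNONYMS.get(words[i].lower(), []):
            trial = words[:i] + [alt] + words[i + 1:]
            if toy_sentiment(" ".join(trial)) != original:
                return " ".join(trial)  # label flipped: adversarial example
            words = trial  # keep the swap and continue greedily
            break
    return None

print(attack("a good and great film"))  # -> 'a decent and solid film'
```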
Codebooks—documents that operationalize concepts and outline annotation procedures—are used almost universally by social scientists when coding political texts. To code these texts automatically, researchers are increasingly turning to generative large language models (LLMs). However, there is limited empirical evidence on whether “off-the-shelf” LLMs faithfully follow real-world codebook operationalizations and measure complex political constructs with sufficient accuracy. To address this, we gather and curate three real-world political science codebooks—covering protest events, political violence, and manifestos—along with their unstructured texts and human-coded labels. We also propose a five-stage framework for codebook-LLM measurement: preparing a codebook for both humans and LLMs, testing LLMs’ basic capabilities on a codebook, evaluating zero-shot measurement accuracy (i.e., off-the-shelf performance), analyzing errors, and further (parameter-efficient) supervised training of LLMs. We provide an empirical demonstration of this framework using our three codebook datasets and several pre-trained open-weight LLMs with 7–12 billion parameters. We find that current open-weight LLMs have limitations in following codebooks zero-shot, but that supervised instruction-tuning can substantially improve performance. Rather than suggesting the “best” LLM, our contribution lies in our codebook datasets, evaluation framework, and guidance for applied researchers who wish to implement their own codebook-LLM measurement projects.
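As a rough illustration of the zero-shot measurement stage, the sketch below formats a miniature codebook into a prompt for an open-weight instruct model via Hugging Face transformers. The model name, codebook entries, and prompt format are illustrative assumptions, not the paper's exact setup.

```python
# Zero-shot codebook classification sketch (model and codebook are placeholders).
from transformers import pipeline

CODEBOOK = {
    "protest": "A gathering of people making a public political demand.",
    "riot": "A violent public disturbance by a crowd.",
    "none": "No codable event is described.",
}

generator = pipeline("text-generation",
                     model="mistralai/Mistral-7B-Instruct-v0.2")  # placeholder

def classify(text: str) -> str:
    labels = "\n".join(f"- {k}: {v}" for k, v in CODEBOOK.items())
    prompt = (f"Classify the text using exactly one label from this codebook:\n"
              f"{labels}\nText: {text}\nLabel:")
    out = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    answer = out[len(prompt):].strip().lower()
    # Keep only a valid codebook label; fall back to 'none' otherwise.
    return next((k for k in CODEBOOK if answer.startswith(k)), "none")
```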
Do oral arguments influence state supreme courts, and if so, how? Focusing on a “thinking-fast” framework, this study analyzes 2014–2021 New York Court of Appeals oral arguments to test whether non-traditional factors such as expressed emotion can shape decisions. Empirical analysis drawn from textual data shows that oral arguments can explain decision-making, and that justices’ emotion during arguments likely plays a role. The findings challenge normatively rational models of judicial behavior by underscoring affective, real-time influences and highlight oral arguments as a consequential stage in subnational adjudication. This is the first evidence of their meaningful role in state supreme courts.
Abstract screening, a labor-intensive aspect of systematic review, is increasingly challenging due to the rising volume of scientific publications. Recent advances suggest that generative large language models like generative pre-trained transformer (GPT) could aid this process by classifying references into study types such as randomized controlled trials (RCTs) or animal studies prior to abstract screening. However, it is unknown how these GPT models perform in classifying such scientific study types in the biomedical field. Additionally, their performance has not been directly compared with earlier transformer-based models like bidirectional encoder representations from transformers (BERT). To address this, we developed a human-annotated corpus of 2,645 PubMed titles and abstracts, annotated for 14 study types, including different types of RCTs and animal studies, systematic reviews, study protocols, case reports, as well as in vitro studies. Using this corpus, we compared the performance of GPT-3.5 and GPT-4 in automatically classifying these study types against established BERT models. Our results show that fine-tuned pretrained BERT models consistently outperformed GPT models, achieving F1-scores above 0.8, compared to approximately 0.6 for GPT models. Advanced prompting strategies did not substantially boost GPT performance. In conclusion, these findings highlight that, even though GPT models benefit from advanced capabilities and extensive training data, their performance in niche tasks like scientific multi-class study classification is inferior to smaller fine-tuned models. Nevertheless, the use of automated methods remains promising for reducing the volume of records, making the screening of large reference libraries more feasible. Our corpus is openly available and can be used to develop and evaluate other natural language processing (NLP) approaches.
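For readers who want to reproduce a baseline of this kind, the following is a minimal sketch of fine-tuning a pretrained BERT model for 14-way study-type classification and scoring macro F1. The checkpoint, file names, and column names ("abstract", "label") are assumptions.

```python
# Fine-tune a BERT-style encoder for multi-class study-type classification.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
ds = ds.map(lambda b: tok(b["abstract"], truncation=True), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=14)

def metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"macro_f1": f1_score(p.label_ids, preds, average="macro")}

trainer = Trainer(model=model,
                  args=TrainingArguments("out", num_train_epochs=3),
                  train_dataset=ds["train"], eval_dataset=ds["test"],
                  tokenizer=tok, compute_metrics=metrics)
trainer.train()
print(trainer.evaluate())
```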
Word processing during reading is known to be influenced by lexical features, especially word length, frequency, and predictability. This study examined the relative importance of these features in word processing during second language (L2) English reading. We used data from an eye-tracking corpus and applied a machine-learning approach to model word-level eye-tracking measures and identify key predictors. Predictors comprised several lexical features, including length, frequency, and predictability (e.g., surprisal). Additionally, sentence, passage, and reader characteristics were considered for comparison. The analysis found that word length was the most important variable across several eye-tracking measures. However, for certain measures, word frequency and predictability were more important than length, and in some cases, reader characteristics such as proficiency were more significant than lexical features. These findings highlight the complexity of word processing during reading, the shared processes between first language (L1) and L2 reading, and their potential to refine models of eye-movement control.
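One common way to rank predictors in such a setup is permutation importance over a tree ensemble; the sketch below illustrates this under assumed file and column names (the study's exact model and feature set may differ).

```python
# Rank word-level predictors of an eye-tracking measure by permutation importance.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("eyetracking_words.csv")  # assumed: one row per word token
features = ["length", "log_frequency", "surprisal", "proficiency"]
X_tr, X_te, y_tr, y_te = train_test_split(df[features], df["gaze_duration"],
                                          random_state=0)

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
imp = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for name, score in sorted(zip(features, imp.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```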
Technical standards provide order and consistency in application domains; however, standards development organizations produce large families of related documents containing significant amounts of information that can be difficult to access, evaluate, and produce consistently. We describe standards as linguistically, socially, and conceptually dynamic constructs using theory drawn from systems engineering and linguistics to create a model of standards documents that can be updated, evaluated, and queried to retrieve information reliably. We describe the theoretical basis for this model from multiple perspectives and explain broadly how it can be used to retrieve relevant information from standards.
Cannabis use is elevated in youth with depression and attention-deficit/hyperactivity disorder (ADHD), but drivers of this increase remain underexplored. The self-medication hypothesis suggests cannabis is used by patients for mood regulation, a common difficulty in ADHD and depression. This study aimed to examine associations between mood instability and cannabis use in a large, representative clinical cohort of adolescents diagnosed with ADHD and/or depression.
Methods
Natural language processing (NLP) approaches were utilised to identify references to mood instability and cannabis use in the electronic health records of adolescents (aged 11–18 years) with primary diagnoses of ADHD (n = 7,985) or depression (n = 5,738). Logistic regression was used to examine mood instability as the main exposure for cannabis use in models stratified by ADHD and depression.
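A minimal sketch of this kind of stratified logistic regression, assuming a dataframe of binary indicators with illustrative variable names, is shown below; exponentiating the coefficients gives adjusted odds ratios of the kind reported next.

```python
# Stratified logistic regression for cannabis use (variables are illustrative).
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("ehr_cohort.csv")  # one row per adolescent, binary indicators
adhd = df[df["diagnosis"] == "ADHD"]

X = sm.add_constant(adhd[["mood_instability", "age", "sex", "deprivation"]])
res = sm.Logit(adhd["cannabis_use"], X).fit()

# Adjusted odds ratios with 95% CIs (exponentiated coefficients).
ors = pd.concat([np.exp(res.params), np.exp(res.conf_int())], axis=1)
ors.columns = ["aOR", "2.5%", "97.5%"]
print(ors.loc["mood_instability"])
```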
Results
Mood instability was associated with a 25% higher probability of cannabis use in adolescents with ADHD compared to those with depression. Following adjustment for available sociodemographic and clinical covariates, mood instability was associated with increased cannabis use in both ADHD (aOR: 1.61 [95% CI: 1.41–1.84]) and depression (aOR: 1.38 [95% CI: 1.21–1.57]) groups.
Conclusions
This was the first study to explore the differential impact of mood instability on adolescent cannabis use across distinct diagnostic profiles. NLP analysis proved an efficient tool for examining large populations of adolescents accessing psychiatric services and provided preliminary evidence of a link between mood instability and cannabis use in ADHD and depression. Longitudinal studies using direct measures or tailored NLP techniques can further establish the directionality of these associations.
Electronic Health Record (EHR) data are critical for advancing translational research and AI technologies. The ENACT network offers access to structured EHR data across 57 CTSA hubs. However, substantial information is contained in clinical narratives, requiring natural language processing (NLP) for research. The ENACT NLP Working Group was formed to make NLP-derived clinical information accessible and queryable across the network.
Methods:
We established the ENACT NLP Working Group with 13 sites selected based on criteria including clinical notes access, IT infrastructure, NLP expertise, and institutional support. We divided sites into five focus groups targeting clinical tasks within disease contexts. Each focus group consisted of two development sites and two validation sites. We extended the ENACT ontology to standardize NLP-derived data and conducted multisite evaluations using the Open Health Natural Language Processing (OHNLP) Toolkit.
Results:
The working group achieved 100% site retention and deployed NLP infrastructure across all sites. We developed and validated NLP algorithms for rare disease phenotyping, social determinants of health, opioid use disorder, sleep phenotyping, and delirium phenotyping. Performance varied across sites (F1 scores 0.53–0.96), highlighting the impact of data heterogeneity. We extended the ENACT common data model and ontology to incorporate NLP-derived data while maintaining Shared Health Research Informatics NEtwork (SHRINE) compatibility.
Conclusion:
This work demonstrates the feasibility of deploying NLP infrastructure across large, federated networks. The focus group approach proved more practical than general-purpose approaches. Key lessons include the challenge of data heterogeneity and the importance of collaborative governance. It also provides a foundation that other networks can build on to implement NLP capabilities for translational research.
This article develops the first dynamic method for systematically estimating the ideologies and other traits of nearly the entire federal judiciary. The Jurist-Derived Judicial Ideology Scores (JuDJIS) method derives from computational text analysis of over 20,000 written evaluations by a representative sample of tens of thousands of jurists as part of an ongoing, systematic survey initiative begun in 1985. The resulting data constitute not only the first such comprehensive federal-court measure that is dynamic, but also the only such measure that is based on judging, and the only such measure that is potentially multi-dimensional. The results of empirical validity tests reflect these advantages. Validation on a set of several thousand appellate decisions indicates that the ideology estimates predict outcomes significantly more accurately than the existing appellate measures, such as the Judicial Common Space. In addition to informing theoretical debates about the nature of judicial ideology and decision-making, the JuDJIS initiative might lead courts scholars to revisit some of the lower-court research findings of the last two decades, which are generally based on static, non-judicial models. Perhaps most importantly, this method could foster breakthroughs in courts research that, until now, were impossible due to data limitations.
Protest event analysis (PEA) is the core method to understand spatial patterns and temporal dynamics of protest. We show how large language models (LLMs) can be used to automate the classification of protest events, and of political event data more broadly, with levels of accuracy comparable to humans, while reducing the necessary annotation time by several orders of magnitude. We propose a modular pipeline for the automation of PEA (PAPEA) based on fine-tuned LLMs and provide publicly available models and tools that can be easily adapted and extended. PAPEA makes it possible to go from newspaper articles to PEA datasets with high precision and without human intervention. A use case based on a large German news corpus illustrates the potential of PAPEA.
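The modular design can be pictured as a chain of swappable components; the toy sketch below shows the shape of such a pipeline, with trivial placeholders where PAPEA would plug in its fine-tuned LLMs.

```python
# Conceptual shape of a modular PEA pipeline (all components are placeholders).
from dataclasses import dataclass

@dataclass
class ProtestEvent:
    date: str
    location: str
    issue: str

def is_protest_report(article: str) -> bool:
    """Stage 1: relevance filter (placeholder for a fine-tuned classifier)."""
    return "protest" in article.lower() or "demonstration" in article.lower()

def extract_event(article: str) -> ProtestEvent:
    """Stage 2: attribute extraction (placeholder for a fine-tuned LLM)."""
    return ProtestEvent(date="unknown", location="unknown", issue="unknown")

def run_pipeline(articles: list[str]) -> list[ProtestEvent]:
    """Chain the stages: filter articles, then extract structured events."""
    return [extract_event(a) for a in articles if is_protest_report(a)]
```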
Improving media adherence to World Health Organization (WHO) guidelines is crucial for preventing suicidal behaviors in the general population. However, there is currently no valid, rapid, and effective method of evaluating adherence to these guidelines.
Methods
This comparative effectiveness study (January–August 2024) evaluated the ability of two artificial intelligence (AI) models (Claude Opus 3 and GPT-4o) to assess the adherence of media reports to WHO suicide-reporting guidelines. A total of 120 suicide-related articles (40 in English, 40 in Hebrew, and 40 in French) published within the past 5 years were sourced from prominent newspapers. Six trained human raters (two per language) independently evaluated articles based on a WHO guideline-based questionnaire addressing aspects such as prominence, sensationalism, and prevention. The same articles were also processed using AI models. Intraclass correlation coefficients (ICCs) and Spearman correlations were calculated to assess agreement between human raters and AI models.
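For reference, agreement statistics of this kind can be computed from long-format ratings along the following lines; the file and column names, and the pivot to one human/model pair, are illustrative.

```python
# Agreement between human raters and AI models: ICC plus a pairwise Spearman check.
import pandas as pd
import pingouin as pg
from scipy.stats import spearmanr

ratings = pd.read_csv("ratings_long.csv")  # assumed columns: article, rater, score

icc = pg.intraclass_corr(data=ratings, targets="article",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])

# Pairwise check for one human rater against GPT-4o (names illustrative).
wide = ratings.pivot(index="article", columns="rater", values="score")
rho, p = spearmanr(wide["human_1"], wide["gpt4o"])
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```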
Results
Overall adherence to WHO guidelines was ~50% across all languages. Both AI models demonstrated strong agreement with human raters, with GPT-4O showing the highest agreement (ICC = 0.793 [0.702; 0.855]). The combined evaluations of GPT-4O and Claude Opus 3 yielded the highest reliability (ICC = 0.812 [0.731; 0.869]).
Conclusions
AI models can replicate human judgment in evaluating media adherence to WHO guidelines. However, they have limitations and should be used alongside human oversight. These findings may suggest that AI tools have the potential to enhance and promote responsible reporting practices among journalists and, thus, may support suicide prevention efforts globally.
The covert administration of medicines is associated with multiple legal and ethical issues. We aimed to develop a natural language processing (NLP) methodology to identify instances of covert administration from electronic mental health records. We used this NLP method to pilot an audit of the use of covert administration.
Results
We developed a method that was able to identify covert administration through free-text searching with a precision of 72%. Pilot audit results showed that 95% of patients receiving covert administration (n = 41/43) had evidence of a completed mental capacity assessment and best interests meeting. Pharmacy was contacted for information about administration for 77% of patients.
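A minimal sketch of such a free-text search, assuming simple regular-expression patterns over note text (the study's actual search terms are not reproduced here):

```python
# Flag notes suggesting covert administration, then estimate precision
# against a manually reviewed sample of flagged notes.
import re

PATTERNS = [r"covert(ly)?\s+administ", r"medication\s+in\s+food",
            r"disguised\s+in\s+(food|drink)"]
regex = re.compile("|".join(PATTERNS), flags=re.IGNORECASE)

def flag_note(note: str) -> bool:
    return bool(regex.search(note))

def precision(reviewed: list[tuple[str, bool]]) -> float:
    """reviewed: (note, manually confirmed True/False) pairs."""
    flagged = [truth for note, truth in reviewed if flag_note(note)]
    return sum(flagged) / len(flagged) if flagged else float("nan")
```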
Clinical implications
We demonstrate a simple, readily deployable NLP method with potential wider applicability to other areas. The method could also be applied to real-time health record processing to prompt and facilitate active monitoring of the covert administration of medicines.
A critical challenge for biomedical investigators is the delay between research and its adoption, yet there are few tools that use bibliometrics and artificial intelligence to address this translational gap. We built a tool to quantify translation of clinical investigation using novel approaches to identify themes in published clinical trials from PubMed and their appearance in the natural language elements of the electronic health record (EHR).
Methods:
As a use case, we selected the translation of known health effects of exercise for heart disease, as found in published clinical trials, with the appearance of these themes in the EHR of heart disease patients seen in an emergency department (ED). We present a self-supervised framework that quantifies semantic similarity of themes within the EHR.
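A minimal sketch of the semantic-similarity step, assuming a general-purpose sentence encoder from the sentence-transformers library; the model name and the example theme and treatment plans are illustrative.

```python
# Score EHR treatment-plan text against a clinical-trial theme by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

theme = "aerobic exercise or strength training is recommended for heart disease"
plans = ["encouraged 30 minutes of brisk walking five days a week",
         "prescribed beta blocker, follow up in two weeks"]

theme_vec = model.encode(theme, convert_to_tensor=True)
plan_vecs = model.encode(plans, convert_to_tensor=True)
scores = util.cos_sim(theme_vec, plan_vecs)[0]

for plan, score in zip(plans, scores):
    print(f"{score:.2f}  {plan}")
```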
Results:
We found that 12.7% of the clinical trial abstracts dataset recommended aerobic exercise or strength training. Of the ED treatment plans, 19.2% related to heart disease. Of these, the treatment plans that included heart disease identified aerobic exercise or strength training only 0.34% of the time. Treatment plans from the overall ED dataset mentioned aerobic exercise or strength training less than 5% of the time.
Conclusions:
Having access to publicly available clinical research and associated EHR data, including clinician notes and after-visit summaries, provided a unique opportunity to assess the adoption of clinical research in medical practice. This approach can be used for a variety of clinical conditions, and if assessed over time could measure implementation effectiveness of quality improvement strategies and clinical guidelines.
The transformation in the purposes, instruments, and conditions for the deployment of coercion was a central aspect of the modernization of Western European states during the long nineteenth century. Nowhere is this transformation as evident as in the emergence and diffusion of public, specialized, and professional police forces at the time. In this article, we employ automated text analysis to explore legislative debates on policing in the United Kingdom from 1803 to 1945. We identify three distinct periods in which policing was highly salient in Parliament, each of them related to more general processes driving the modernization of the British state. The first period (1830s–1850s) was marked by the institutionalization of modern police forces and their spread across Great Britain. The second period (1880s–1890s) was dominated by Irish MPs denouncing police abuses against their constituents. The third period (1900s–1940s) was characterized by discussions around working conditions for the police in the context of mounting social pressures and war-related police activities. Whereas the first and third periods have attracted much scholarly interest as they culminated in concrete police reforms, the second period has not been as central to historical research on the British police. We show, however, that policing became a major issue in the legislative agenda of the 1880s and 1890s, as it highlighted the tensions within a modernizing British state, torn between the professionalization of domestic police forces under control of local authorities and the persistence of imperial practices in its colonial territories.
One of the most significant challenges in research related to nutritional epidemiology is the achievement of high accuracy and validity of dietary data to establish an adequate link between dietary exposure and health outcomes. Recently, the emergence of artificial intelligence (AI) in various fields has filled this gap with advanced statistical models and techniques for nutrient and food analysis. We aimed to systematically review available evidence regarding the validity and accuracy of AI-based dietary intake assessment methods (AI-DIA). In accordance with PRISMA guidelines, an exhaustive search of the EMBASE, PubMed, Scopus and Web of Science databases was conducted to identify relevant publications from their inception to 1 December 2024. Thirteen studies that met the inclusion criteria were included in this analysis. Of the studies identified, 61·5 % were conducted in preclinical settings. Likewise, 46·2 % used AI techniques based on deep learning and 15·3 % on machine learning. Correlation coefficients of over 0·7 were reported in six articles concerning the estimation of calories between the AI and traditional assessment methods. Similarly, six studies obtained a correlation above 0·7 for macronutrients. In the case of micronutrients, four studies achieved the correlation mentioned above. A moderate risk of bias was observed in 61·5 % (n 8) of the articles analysed, with confounding bias being the most frequently observed. AI-DIA methods are promising, reliable and valid alternatives for nutrient and food estimations. However, more research comparing different populations is needed, as well as larger sample sizes, to ensure the validity of the experimental designs.
This study examines the impact of temperature on human well-being using approximately 80 million geo-tagged tweets from Argentina spanning 2017–2022. Employing text mining techniques, we derive two quantitative estimators: sentiments and a social media aggression index. The Hedonometer Index measures overall sentiment, distinguishing positive and negative ones, while social media aggressive behavior is assessed through profanity frequency. Non-linear fixed effects panel regressions reveal a notable negative causal association between extreme heat and the overall sentiment index, with a weaker relationship found for extreme cold. Our results highlight that, while heat strongly influences negative sentiments, it has no significant effect on positive ones. Consequently, the overall impact of extremely high temperatures on sentiment is predominantly driven by heightened negative feelings in hot conditions. Moreover, our profanity index exhibits a similar pattern to that observed for negative sentiments.
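The profanity-based aggression index can be illustrated as a simple per-tweet rate aggregated to a region-day panel, as in the sketch below; the two-word lexicon and field names are placeholders, and the fixed-effects regressions themselves are omitted.

```python
# Build a region-day profanity index from tokenized tweets.
import pandas as pd

PROFANITY = {"damn", "hell"}  # placeholder lexicon

def profanity_rate(tokens: list[str]) -> float:
    return sum(t.lower() in PROFANITY for t in tokens) / max(len(tokens), 1)

tweets = pd.DataFrame({
    "region": ["BA", "BA", "CBA"],
    "date": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "tokens": [["so", "damn", "hot"], ["lovely", "day"], ["hell", "no"]],
})
tweets["profanity"] = tweets["tokens"].map(profanity_rate)

# Aggregate to the region-day panel used in the regressions.
panel = tweets.groupby(["region", "date"])["profanity"].mean()
print(panel)
```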
As the use of guided digitally delivered cognitive-behavioral therapy (GdCBT) grows, pragmatic analytic tools are needed to evaluate coaches’ implementation fidelity.
Aims
We evaluated how natural language processing (NLP) and machine learning (ML) methods might automate the monitoring of coaches’ implementation fidelity to GdCBT delivered as part of a randomized controlled trial.
Method
Coaches served as guides to 6-month GdCBT with 3,381 assigned users with or at risk for anxiety, depression, or eating disorders. CBT-trained and supervised human coders used a rubric to rate the implementation fidelity of 13,529 coach-to-user messages. NLP methods abstracted data from text-based coach-to-user messages, and 11 ML models predicting coach implementation fidelity were evaluated.
Results
Inter-rater agreement by human coders was excellent (intra-class correlation coefficient = .980–.992). Coaches achieved behavioral targets at the start of the GdCBT and maintained strong fidelity throughout most subsequent messages. Coaches also avoided prohibited actions (e.g. reinforcing users’ avoidance). Sentiment analyses generally indicated a higher frequency of coach-delivered positive than negative sentiment words and predicted coach implementation fidelity with acceptable performance metrics (e.g. area under the receiver operating characteristic curve [AUC] = 74.48%). The final best-performing ML algorithms that included a more comprehensive set of NLP features performed well (e.g. AUC = 76.06%).
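As an illustration of the model family involved, the sketch below trains one text-based fidelity classifier and reports AUC; the TF-IDF plus logistic-regression pipeline and the file and column names ("messages.csv", "text", "fidelity") are assumptions, not the study's best-performing configuration.

```python
# One text-based fidelity classifier: TF-IDF features + logistic regression.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("messages.csv")  # assumed columns: text, fidelity (0/1)
X_tr, X_te, y_tr, y_te = train_test_split(df["text"], df["fidelity"],
                                          random_state=0)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=5),
                    LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC = {auc:.2%}")
```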
Conclusions
NLP and ML tools could help clinical supervisors automate monitoring of coaches’ implementation fidelity to GdCBT. These tools could maximize allocation of scarce resources by reducing the personnel time needed to measure fidelity, potentially freeing up more time for high-quality clinical care.
Machine learning has exhibited substantial success in the field of natural language processing (NLP). For example, large language models have empirically proven capable of producing text of high complexity and cohesion. At the same time, however, they are prone to inaccuracies and hallucinations. As these systems are increasingly integrated into real-world applications, ensuring their safety and reliability becomes a primary concern. There are safety-critical contexts in which such models must be robust to variability or attack and give guarantees over their output. Computer vision pioneered the use of formal verification of neural networks for such scenarios and developed common verification standards and pipelines, leveraging precise formal reasoning about geometric properties of data manifolds. In contrast, NLP verification methods have only recently appeared in the literature. While presenting sophisticated algorithms in their own right, these papers have not yet crystallised into a common methodology. They are often light on the pragmatic issues of NLP verification, and the area remains fragmented. In this paper, we attempt to distil and evaluate the general components of an NLP verification pipeline that emerges from the progress in the field to date. Our contributions are twofold. First, we propose a general methodology for analysing the effect of the embedding gap, the discrepancy between the verification of geometric subspaces and the semantic meaning of the sentences those subspaces are supposed to represent, and we propose a number of practical NLP methods that can help quantify its effects. Second, we give a general method for training and verification of neural networks that leverages a more precise geometric estimation of the semantic similarity of sentences in the embedding space and helps to overcome the effects of the embedding gap in practice.
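To make the embedding-gap idea tangible, the toy probe below samples perturbations of a fixed norm around a sentence embedding and checks label stability. Note this is an empirical robustness check, not formal verification; the linear classifier and the random "embedding" are stand-ins.

```python
# Empirical robustness probe over an embedding-space ball (not a formal proof).
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=384)  # stand-in linear classifier weights

def predict(x: np.ndarray) -> int:
    return int(x @ w > 0)

def locally_stable(x: np.ndarray, eps: float, n: int = 1000) -> bool:
    """True if no sampled perturbation of norm eps flips the predicted label."""
    base = predict(x)
    noise = rng.normal(size=(n, x.size))
    noise *= eps / np.linalg.norm(noise, axis=1, keepdims=True)
    return all(predict(x + d) == base for d in noise)

x = rng.normal(size=384)  # stand-in sentence embedding
print(locally_stable(x, eps=0.1))
```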
Understanding and tracking societal discourse around essential governance challenges of our times is crucial. One possible heuristic is to conceptualize discourse as a network of actors and policy beliefs.
Here, we present an exemplary and widely applicable automated approach to extract discourse networks from large volumes of media data, as a bipartite graph of organizations and beliefs connected by stance edges. Our approach leverages various natural language processing techniques, alongside qualitative content analysis. We combine named entity recognition, named entity linking, supervised text classification informed by close reading, and a novel stance detection procedure based on large language models.
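Two of these stages can be sketched compactly: organization extraction with spaCy NER and the assembly of a stance-detection prompt for an LLM. The prompt wording is an illustrative assumption, and the model call itself is left abstract.

```python
# Extract organizations with spaCy NER, then build a stance-detection prompt.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_orgs(paragraph: str) -> list[str]:
    return [ent.text for ent in nlp(paragraph).ents if ent.label_ == "ORG"]

def stance_prompt(org: str, belief: str, paragraph: str) -> str:
    return (f"Paragraph: {paragraph}\n"
            f"Does {org} support, oppose, or not mention the belief "
            f"'{belief}'? Answer with one word.")

para = "The VCS opposes new parking garages in the city centre."
for org in extract_orgs(para):
    print(stance_prompt(org, "expand car infrastructure", para))
```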
We demonstrate our approach in an empirical application tracing urban sustainable transport discourse networks in the Swiss urban area of Zürich over 12 years, based on more than one million paragraphs extracted from slightly less than two million newspaper articles.
We test the internal validity of our approach. Based on evaluations against manually annotated data, we find support for what we call the window validity hypothesis of automated discourse network data gathering: internal validity increases when inferences are combined over sliding time windows.
Our results show that, by leveraging data redundancy and stance inertia through windowed aggregation, automated methods can recover the basic structure and higher-level structurally descriptive metrics of discourse networks well. They also demonstrate the necessity of high-quality test sets and close reading, and that the effort invested in automation should be weighed carefully.
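As a concrete reading of windowed aggregation, the sketch below pools stance edges over a sliding window of years and resolves each organization-belief pair by majority vote; the edge format and the majority-vote rule are simplifying assumptions.

```python
# Aggregate stance edges (org, belief, stance, year) over a sliding time window.
from collections import Counter

def window_network(edges, center_year, width=3):
    """Pool edges within +/- width//2 years and resolve stances by majority."""
    lo, hi = center_year - width // 2, center_year + width // 2
    votes = {}
    for org, belief, stance, year in edges:
        if lo <= year <= hi:
            votes.setdefault((org, belief), Counter())[stance] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in votes.items()}

edges = [("VCS", "car reduction", "support", 2014),
         ("VCS", "car reduction", "support", 2015),
         ("TCS", "car reduction", "oppose", 2015)]
print(window_network(edges, center_year=2015))
```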