The emergence of large language models has significantly expanded the use of natural language processing (NLP), even as it has heightened exposure to adversarial threats. We present an overview of adversarial NLP with an emphasis on challenges, policy implications, emerging areas, and future directions. First, we review attack methods and evaluate the vulnerabilities of popular NLP models. Then, we review defense strategies that include adversarial training. We describe major policy implications, identify key trends, and suggest future directions, such as the use of Bayesian methods to improve the security and robustness of NLP systems.
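To make the flavor of such attacks concrete, the sketch below shows a word-level synonym-substitution attack, one of the attack families surveys of this kind cover. Everything here is illustrative: the keyword-counting classifier and the tiny synonym table stand in for a real model and lexicon.

```python
# Toy synonym-substitution attack: greedily swap words until the label flips.
SYNONYMS = {"good": ["decent", "fine"], "great": ["solid", "notable"],
            "terrible": ["poor", "weak"]}

def toy_sentiment(text: str) -> str:
    """Stand-in classifier: counts a few polarity keywords."""
    words = text.lower().split()
    pos = sum(w in ("good", "great") for w in words)
    neg = sum(w in ("terrible", "bad") for w in words)
    return "positive" if pos > neg else "negative"

def attack(text: str) -> str | None:
    """Greedily swap words for synonyms until the prediction changes."""
    original = toy_sentiment(text)
    words = text.split()
    for i in range(len(words)):
        for alt in SYNONYMS.get(words[i].lower(), []):
            trial = words[:i] + [alt] + words[i + 1:]
            if toy_sentiment(" ".join(trial)) != original:
                return " ".join(trial)  # label flipped: adversarial example
            words = trial  # keep the swap and continue greedily
            break
    return None

print(attack("a good and great film"))  # -> 'a decent and solid film'
```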
Codebooks—documents that operationalize concepts and outline annotation procedures—are used almost universally by social scientists when coding political texts. To code these texts automatically, researchers are increasingly turning to generative large language models (LLMs). However, there is limited empirical evidence on whether “off-the-shelf” LLMs faithfully follow real-world codebook operationalizations and measure complex political constructs with sufficient accuracy. To address this, we gather and curate three real-world political science codebooks—covering protest events, political violence, and manifestos—along with their unstructured texts and human-coded labels. We also propose a five-stage framework for codebook-LLM measurement: preparing a codebook for both humans and LLMs, testing LLMs’ basic capabilities on a codebook, evaluating zero-shot measurement accuracy (i.e., off-the-shelf performance), analyzing errors, and further (parameter-efficient) supervised training of LLMs. We provide an empirical demonstration of this framework using our three codebook datasets and several pre-trained open-weight LLMs with 7–12 billion parameters. We find that current open-weight LLMs have limitations in following codebooks zero-shot, but that supervised instruction-tuning can substantially improve performance. Rather than suggesting the “best” LLM, our contribution lies in our codebook datasets, evaluation framework, and guidance for applied researchers who wish to implement their own codebook-LLM measurement projects.
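As a rough illustration of the zero-shot measurement stage, the sketch below formats a miniature codebook into a prompt for an open-weight instruct model via Hugging Face transformers. The model name, codebook entries, and prompt format are illustrative assumptions, not the paper's exact setup.

```python
# Zero-shot codebook classification sketch (model and codebook are placeholders).
from transformers import pipeline

CODEBOOK = {
    "protest": "A gathering of people making a public political demand.",
    "riot": "A violent public disturbance by a crowd.",
    "none": "No codable event is described.",
}

generator = pipeline("text-generation",
                     model="mistralai/Mistral-7B-Instruct-v0.2")  # placeholder

def classify(text: str) -> str:
    labels = "\n".join(f"- {k}: {v}" for k, v in CODEBOOK.items())
    prompt = (f"Classify the text using exactly one label from this codebook:\n"
              f"{labels}\nText: {text}\nLabel:")
    out = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    answer = out[len(prompt):].strip().lower()
    # Keep only a valid codebook label; fall back to 'none' otherwise.
    return next((k for k in CODEBOOK if answer.startswith(k)), "none")
```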
Do oral arguments influence state supreme courts, and if so, how? Focusing on a “thinking-fast” framework, this study analyzes 2014–2021 New York Court of Appeals oral arguments to test whether non-traditional factors such as expressed emotion can shape decisions. Empirical analysis drawn from textual data shows that oral arguments can explain decision-making, and that justices’ emotion during arguments likely plays a role. The findings challenge normatively rational models of judicial behavior by underscoring affective, real-time influences and highlight oral arguments as a consequential stage in subnational adjudication. This is the first evidence of their meaningful role in state supreme courts.
Abstract screening, a labor-intensive aspect of systematic review, is increasingly challenging due to the rising volume of scientific publications. Recent advances suggest that generative large language models like generative pre-trained transformer (GPT) could aid this process by classifying references into study types such as randomized controlled trials (RCTs) or animal studies prior to abstract screening. However, it is unknown how these GPT models perform in classifying such scientific study types in the biomedical field. Additionally, their performance has not been directly compared with earlier transformer-based models like bidirectional encoder representations from transformers (BERT). To address this, we developed a human-annotated corpus of 2,645 PubMed titles and abstracts, annotated for 14 study types, including different types of RCTs and animal studies, systematic reviews, study protocols, case reports, as well as in vitro studies. Using this corpus, we compared the performance of GPT-3.5 and GPT-4 in automatically classifying these study types against established BERT models. Our results show that fine-tuned pretrained BERT models consistently outperformed GPT models, achieving F1-scores above 0.8, compared to approximately 0.6 for GPT models. Advanced prompting strategies did not substantially boost GPT performance. In conclusion, these findings highlight that, even though GPT models benefit from advanced capabilities and extensive training data, their performance in niche tasks like scientific multi-class study classification is inferior to smaller fine-tuned models. Nevertheless, the use of automated methods remains promising for reducing the volume of records, making the screening of large reference libraries more feasible. Our corpus is openly available and can be used to develop and evaluate other natural language processing (NLP) approaches.
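For readers who want to reproduce a baseline of this kind, the following is a minimal sketch of fine-tuning a pretrained BERT model for 14-way study-type classification and scoring macro F1. The checkpoint, file names, and column names ("abstract", "label") are assumptions.

```python
# Fine-tune a BERT-style encoder for multi-class study-type classification.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
ds = ds.map(lambda b: tok(b["abstract"], truncation=True), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=14)

def metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"macro_f1": f1_score(p.label_ids, preds, average="macro")}

trainer = Trainer(model=model,
                  args=TrainingArguments("out", num_train_epochs=3),
                  train_dataset=ds["train"], eval_dataset=ds["test"],
                  tokenizer=tok, compute_metrics=metrics)
trainer.train()
print(trainer.evaluate())
```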
Word processing during reading is known to be influenced by lexical features, especially word length, frequency, and predictability. This study examined the relative importance of these features in word processing during second language (L2) English reading. We used data from an eye-tracking corpus and applied a machine-learning approach to model word-level eye-tracking measures and identify key predictors. Predictors comprised several lexical features, including length, frequency, and predictability (e.g., surprisal). Additionally, sentence, passage, and reader characteristics were considered for comparison. The analysis found that word length was the most important variable across several eye-tracking measures. However, for certain measures, word frequency and predictability were more important than length, and in some cases, reader characteristics such as proficiency were more significant than lexical features. These findings highlight the complexity of word processing during reading, the shared processes between first language (L1) and L2 reading, and their potential to refine models of eye-movement control.
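One common way to rank predictors in such a setup is permutation importance over a tree ensemble; the sketch below illustrates this under assumed file and column names (the study's exact model and feature set may differ).

```python
# Rank word-level predictors of an eye-tracking measure by permutation importance.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("eyetracking_words.csv")  # assumed: one row per word token
features = ["length", "log_frequency", "surprisal", "proficiency"]
X_tr, X_te, y_tr, y_te = train_test_split(df[features], df["gaze_duration"],
                                          random_state=0)

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
imp = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for name, score in sorted(zip(features, imp.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```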
Technical standards provide order and consistency in application domains; however, standards development organizations produce large families of related documents containing significant amounts of information that can be difficult to access, evaluate, and produce consistently. We describe standards as linguistically, socially, and conceptually dynamic constructs using theory drawn from systems engineering and linguistics to create a model of standards documents that can be updated, evaluated, and queried to retrieve information reliably. We describe the theoretical basis for this model from multiple perspectives and explain broadly how it can be used to retrieve relevant information from standards.
Cannabis use is elevated in youth with depression and attention-deficit/hyperactivity disorder (ADHD), but drivers of this increase remain underexplored. The self-medication hypothesis suggests cannabis is used by patients for mood regulation, a common difficulty in ADHD and depression. This study aimed to examine associations between mood instability and cannabis use in a large, representative clinical cohort of adolescents diagnosed with ADHD and/or depression.
Methods
Natural language processing (NLP) approaches were utilised to identify references to mood instability and cannabis use in the electronic health records of adolescents (aged 11–18 years) with primary diagnoses of ADHD (n = 7,985) or depression (n = 5,738). Logistic regression was used to examine mood instability as the main exposure for cannabis use in models stratified by ADHD and depression.
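A minimal sketch of this kind of stratified logistic regression, assuming a dataframe of binary indicators with illustrative variable names, is shown below; exponentiating the coefficients gives adjusted odds ratios of the kind reported next.

```python
# Stratified logistic regression for cannabis use (variables are illustrative).
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("ehr_cohort.csv")  # one row per adolescent, binary indicators
adhd = df[df["diagnosis"] == "ADHD"]

X = sm.add_constant(adhd[["mood_instability", "age", "sex", "deprivation"]])
res = sm.Logit(adhd["cannabis_use"], X).fit()

# Adjusted odds ratios with 95% CIs (exponentiated coefficients).
ors = pd.concat([np.exp(res.params), np.exp(res.conf_int())], axis=1)
ors.columns = ["aOR", "2.5%", "97.5%"]
print(ors.loc["mood_instability"])
```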
Results
Mood instability was associated with a 25% higher probability of cannabis use in adolescents with ADHD compared to those with depression. Following adjustment for available sociodemographic and clinical covariates, mood instability was associated with increased cannabis use in both ADHD (aOR: 1.61 [95% CI: 1.41–1.84]) and depression (aOR: 1.38 [95% CI: 1.21–1.57]) groups.
Conclusions
This was the first study to explore the differential impact of mood instability on adolescent cannabis use across distinct diagnostic profiles. NLP analysis proved an efficient tool for examining large populations of adolescents accessing psychiatric services and provided preliminary evidence of a link between mood instability and cannabis use in ADHD and depression. Longitudinal studies using direct measures or tailored NLP techniques can further establish the directionality of these associations.
Electronic Health Record (EHR) data are critical for advancing translational research and AI technologies. The ENACT network offers access to structured EHR data across 57 CTSA hubs. However, substantial information is contained in clinical narratives, requiring natural language processing (NLP) for research. The ENACT NLP Working Group was formed to make NLP-derived clinical information accessible and queryable across the network.
Methods:
We established the ENACT NLP Working Group with 13 sites selected based on criteria including clinical notes access, IT infrastructure, NLP expertise, and institutional support. We divided sites into five focus groups targeting clinical tasks within disease contexts. Each focus group consisted of two development sites and two validation sites. We extended the ENACT ontology to standardize NLP-derived data and conducted multisite evaluations using the Open Health Natural Language Processing (OHNLP) Toolkit.
Results:
The working group achieved 100% site retention and deployed NLP infrastructure across all sites. We developed and validated NLP algorithms for rare disease phenotyping, social determinants of health, opioid use disorder, sleep phenotyping, and delirium phenotyping. Performance varied across sites (F1 scores 0.53–0.96), highlighting the impact of data heterogeneity. We extended the ENACT common data model and ontology to incorporate NLP-derived data while maintaining Shared Health Research Informatics NEtwork (SHRINE) compatibility.
Conclusion:
This work demonstrates the feasibility of deploying NLP infrastructure across large, federated networks. The focus group approach proved more practical than general-purpose approaches. Key lessons include the challenge of data heterogeneity and the importance of collaborative governance. It also provides a foundation that other networks can build on to implement NLP capabilities for translational research.
This article develops the first dynamic method for systematically estimating the ideologies and other traits of nearly the entire federal judiciary. The Jurist-Derived Judicial Ideology Scores (JuDJIS) method derives from computational text analysis of over 20,000 written evaluations by a representative sample of tens of thousands of jurists as part of an ongoing, systematic survey initiative begun in 1985. The resulting data constitute not only the first such comprehensive federal-court measure that is dynamic, but also the only such measure that is based on judging, and the only such measure that is potentially multi-dimensional. The results of empirical validity tests reflect these advantages. Validation on a set of several thousand appellate decisions indicates that the ideology estimates predict outcomes significantly more accurately than the existing appellate measures, such as the Judicial Common Space. In addition to informing theoretical debates about the nature of judicial ideology and decision-making, the JuDJIS initiative might lead courts scholars to revisit some of the lower-court research findings of the last two decades, which are generally based on static, non-judicial models. Perhaps most importantly, this method could foster breakthroughs in courts research that, until now, were impossible due to data limitations.
Protest event analysis (PEA) is the core method to understand spatial patterns and temporal dynamics of protest. We show how large language models (LLMs) can be used to automate the classification of protest events, and of political event data more broadly, with levels of accuracy comparable to humans, while reducing the necessary annotation time by several orders of magnitude. We propose a modular pipeline for the automation of PEA (PAPEA) based on fine-tuned LLMs and provide publicly available models and tools that can be easily adapted and extended. PAPEA makes it possible to go from newspaper articles to PEA datasets with high precision and without human intervention. A use case based on a large German news corpus illustrates the potential of PAPEA.
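The modular design can be pictured as a chain of swappable components; the toy sketch below shows the shape of such a pipeline, with trivial placeholders where PAPEA would plug in its fine-tuned LLMs.

```python
# Conceptual shape of a modular PEA pipeline (all components are placeholders).
from dataclasses import dataclass

@dataclass
class ProtestEvent:
    date: str
    location: str
    issue: str

def is_protest_report(article: str) -> bool:
    """Stage 1: relevance filter (placeholder for a fine-tuned classifier)."""
    return "protest" in article.lower() or "demonstration" in article.lower()

def extract_event(article: str) -> ProtestEvent:
    """Stage 2: attribute extraction (placeholder for a fine-tuned LLM)."""
    return ProtestEvent(date="unknown", location="unknown", issue="unknown")

def run_pipeline(articles: list[str]) -> list[ProtestEvent]:
    """Chain the stages: filter articles, then extract structured events."""
    return [extract_event(a) for a in articles if is_protest_report(a)]
```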
Improving media adherence to World Health Organization (WHO) guidelines is crucial for preventing suicidal behaviors in the general population. However, there is currently no valid, rapid, and effective method of evaluating adherence to these guidelines.
Methods
This comparative effectiveness study (January–August 2024) evaluated the ability of two artificial intelligence (AI) models (Claude Opus 3 and GPT-4o) to assess the adherence of media reports to WHO suicide-reporting guidelines. A total of 120 suicide-related articles (40 in English, 40 in Hebrew, and 40 in French) published within the past 5 years were sourced from prominent newspapers. Six trained human raters (two per language) independently evaluated articles based on a WHO guideline-based questionnaire addressing aspects such as prominence, sensationalism, and prevention. The same articles were also processed using AI models. Intraclass correlation coefficients (ICCs) and Spearman correlations were calculated to assess agreement between human raters and AI models.
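For reference, agreement statistics of this kind can be computed from long-format ratings along the following lines; the file and column names, and the pivot to one human/model pair, are illustrative.

```python
# Agreement between human raters and AI models: ICC plus a pairwise Spearman check.
import pandas as pd
import pingouin as pg
from scipy.stats import spearmanr

ratings = pd.read_csv("ratings_long.csv")  # assumed columns: article, rater, score

icc = pg.intraclass_corr(data=ratings, targets="article",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])

# Pairwise check for one human rater against GPT-4o (names illustrative).
wide = ratings.pivot(index="article", columns="rater", values="score")
rho, p = spearmanr(wide["human_1"], wide["gpt4o"])
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```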
Results
Overall adherence to WHO guidelines was ~50% across all languages. Both AI models demonstrated strong agreement with human raters, with GPT-4O showing the highest agreement (ICC = 0.793 [0.702; 0.855]). The combined evaluations of GPT-4O and Claude Opus 3 yielded the highest reliability (ICC = 0.812 [0.731; 0.869]).
Conclusions
AI models can replicate human judgment in evaluating media adherence to WHO guidelines. However, they have limitations and should be used alongside human oversight. These findings may suggest that AI tools have the potential to enhance and promote responsible reporting practices among journalists and, thus, may support suicide prevention efforts globally.
The covert administration of medicines is associated with multiple legal and ethical issues. We aimed to develop a natural language processing (NLP) methodology to identify instances of covert administration from electronic mental health records. We used this NLP method to pilot an audit of the use of covert administration.
Results
We developed a method that was able to identify covert administration through free-text searching with a precision of 72%. Pilot audit results showed that 95% of patients receiving covert administration (n = 41/43) had evidence of a completed mental capacity assessment and best interests meeting. Pharmacy was contacted for information about administration for 77% of patients.
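A minimal sketch of such a free-text search, assuming simple regular-expression patterns over note text (the study's actual search terms are not reproduced here):

```python
# Flag notes suggesting covert administration, then estimate precision
# against a manually reviewed sample of flagged notes.
import re

PATTERNS = [r"covert(ly)?\s+administ", r"medication\s+in\s+food",
            r"disguised\s+in\s+(food|drink)"]
regex = re.compile("|".join(PATTERNS), flags=re.IGNORECASE)

def flag_note(note: str) -> bool:
    return bool(regex.search(note))

def precision(reviewed: list[tuple[str, bool]]) -> float:
    """reviewed: (note, manually confirmed True/False) pairs."""
    flagged = [truth for note, truth in reviewed if flag_note(note)]
    return sum(flagged) / len(flagged) if flagged else float("nan")
```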
Clinical implications
We demonstrate a simple, readily deployable NLP method with potential wider applicability to other areas. The method could also be applied to real-time health record processing to prompt and facilitate active monitoring of the covert administration of medicines.
A critical challenge for biomedical investigators is the delay between research and its adoption, yet there are few tools that use bibliometrics and artificial intelligence to address this translational gap. We built a tool to quantify translation of clinical investigation using novel approaches to identify themes in published clinical trials from PubMed and their appearance in the natural language elements of the electronic health record (EHR).
Methods:
As a use case, we selected the translation of known health effects of exercise for heart disease, as found in published clinical trials, with the appearance of these themes in the EHR of heart disease patients seen in an emergency department (ED). We present a self-supervised framework that quantifies semantic similarity of themes within the EHR.
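A minimal sketch of the semantic-similarity step, assuming a general-purpose sentence encoder from the sentence-transformers library; the model name and the example theme and treatment plans are illustrative.

```python
# Score EHR treatment-plan text against a clinical-trial theme by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

theme = "aerobic exercise or strength training is recommended for heart disease"
plans = ["encouraged 30 minutes of brisk walking five days a week",
         "prescribed beta blocker, follow up in two weeks"]

theme_vec = model.encode(theme, convert_to_tensor=True)
plan_vecs = model.encode(plans, convert_to_tensor=True)
scores = util.cos_sim(theme_vec, plan_vecs)[0]

for plan, score in zip(plans, scores):
    print(f"{score:.2f}  {plan}")
```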
Results:
We found that 12.7% of the clinical trial abstracts dataset recommended aerobic exercise or strength training. Of the ED treatment plans, 19.2% related to heart disease. Of these, the treatment plans that included heart disease identified aerobic exercise or strength training only 0.34% of the time. Treatment plans from the overall ED dataset mentioned aerobic exercise or strength training less than 5% of the time.
Conclusions:
Having access to publicly available clinical research and associated EHR data, including clinician notes and after-visit summaries, provided a unique opportunity to assess the adoption of clinical research in medical practice. This approach can be used for a variety of clinical conditions, and if assessed over time could measure implementation effectiveness of quality improvement strategies and clinical guidelines.
The transformation in the purposes, instruments, and conditions for the deployment of coercion was a central aspect of the modernization of Western European states during the long nineteenth century. Nowhere is this transformation as evident as in the emergence and diffusion of public, specialized, and professional police forces at the time. In this article, we employ automated text analysis to explore legislative debates on policing in the United Kingdom from 1803 to 1945. We identify three distinct periods in which policing was highly salient in Parliament, each of them related to more general processes driving the modernization of the British state. The first period (1830s–1850s) was marked by the institutionalization of modern police forces and their spread across Great Britain. The second period (1880s–1890s) was dominated by Irish MPs denouncing police abuses against their constituents. The third period (1900s–1940s) was characterized by discussions around working conditions for the police in the context of mounting social pressures and war-related police activities. Whereas the first and third periods have attracted much scholarly interest as they culminated in concrete police reforms, the second period has not been as central to historical research on the British police. We show, however, that policing became a major issue in the legislative agenda of the 1880s and 1890s, as it highlighted the tensions within a modernizing British state, torn between the professionalization of domestic police forces under control of local authorities and the persistence of imperial practices in its colonial territories.
One of the most significant challenges in research related to nutritional epidemiology is the achievement of high accuracy and validity of dietary data to establish an adequate link between dietary exposure and health outcomes. Recently, the emergence of artificial intelligence (AI) in various fields has filled this gap with advanced statistical models and techniques for nutrient and food analysis. We aimed to systematically review available evidence regarding the validity and accuracy of AI-based dietary intake assessment methods (AI-DIA). In accordance with PRISMA guidelines, an exhaustive search of the EMBASE, PubMed, Scopus and Web of Science databases was conducted to identify relevant publications from their inception to 1 December 2024. Thirteen studies that met the inclusion criteria were included in this analysis. Of the studies identified, 61·5 % were conducted in preclinical settings. Likewise, 46·2 % used AI techniques based on deep learning and 15·3 % on machine learning. Correlation coefficients of over 0·7 were reported in six articles concerning the estimation of calories between the AI and traditional assessment methods. Similarly, six studies obtained a correlation above 0·7 for macronutrients. In the case of micronutrients, four studies achieved the correlation mentioned above. A moderate risk of bias was observed in 61·5 % (n 8) of the articles analysed, with confounding bias being the most frequently observed. AI-DIA methods are promising, reliable and valid alternatives for nutrient and food estimations. However, more research comparing different populations is needed, as well as larger sample sizes, to ensure the validity of the experimental designs.
This study examines the impact of temperature on human well-being using approximately 80 million geo-tagged tweets from Argentina spanning 2017–2022. Employing text mining techniques, we derive two quantitative estimators: sentiments and a social media aggression index. The Hedonometer Index measures overall sentiment, distinguishing positive and negative ones, while social media aggressive behavior is assessed through profanity frequency. Non-linear fixed effects panel regressions reveal a notable negative causal association between extreme heat and the overall sentiment index, with a weaker relationship found for extreme cold. Our results highlight that, while heat strongly influences negative sentiments, it has no significant effect on positive ones. Consequently, the overall impact of extremely high temperatures on sentiment is predominantly driven by heightened negative feelings in hot conditions. Moreover, our profanity index exhibits a similar pattern to that observed for negative sentiments.
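The profanity-based aggression index can be illustrated as a simple per-tweet rate aggregated to a region-day panel, as in the sketch below; the two-word lexicon and field names are placeholders, and the fixed-effects regressions themselves are omitted.

```python
# Build a region-day profanity index from tokenized tweets.
import pandas as pd

PROFANITY = {"damn", "hell"}  # placeholder lexicon

def profanity_rate(tokens: list[str]) -> float:
    return sum(t.lower() in PROFANITY for t in tokens) / max(len(tokens), 1)

tweets = pd.DataFrame({
    "region": ["BA", "BA", "CBA"],
    "date": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "tokens": [["so", "damn", "hot"], ["lovely", "day"], ["hell", "no"]],
})
tweets["profanity"] = tweets["tokens"].map(profanity_rate)

# Aggregate to the region-day panel used in the regressions.
panel = tweets.groupby(["region", "date"])["profanity"].mean()
print(panel)
```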
As the use of guided digitally delivered cognitive-behavioral therapy (GdCBT) grows, pragmatic analytic tools are needed to evaluate coaches’ implementation fidelity.
Aims
We evaluated how natural language processing (NLP) and machine learning (ML) methods might automate the monitoring of coaches’ implementation fidelity to GdCBT delivered as part of a randomized controlled trial.
Method
Coaches served as guides to 6-month GdCBT with 3,381 assigned users with or at risk for anxiety, depression, or eating disorders. CBT-trained and supervised human coders used a rubric to rate the implementation fidelity of 13,529 coach-to-user messages. NLP methods abstracted data from text-based coach-to-user messages, and 11 ML models predicting coach implementation fidelity were evaluated.
Results
Inter-rater agreement by human coders was excellent (intra-class correlation coefficient = .980–.992). Coaches achieved behavioral targets at the start of the GdCBT and maintained strong fidelity throughout most subsequent messages. Coaches also avoided prohibited actions (e.g. reinforcing users’ avoidance). Sentiment analyses generally indicated a higher frequency of coach-delivered positive than negative sentiment words and predicted coach implementation fidelity with acceptable performance metrics (e.g. area under the receiver operating characteristic curve [AUC] = 74.48%). The final best-performing ML algorithms that included a more comprehensive set of NLP features performed well (e.g. AUC = 76.06%).
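As an illustration of the model family involved, the sketch below trains one text-based fidelity classifier and reports AUC; the TF-IDF plus logistic-regression pipeline and the file and column names ("messages.csv", "text", "fidelity") are assumptions, not the study's best-performing configuration.

```python
# One text-based fidelity classifier: TF-IDF features + logistic regression.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("messages.csv")  # assumed columns: text, fidelity (0/1)
X_tr, X_te, y_tr, y_te = train_test_split(df["text"], df["fidelity"],
                                          random_state=0)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=5),
                    LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC = {auc:.2%}")
```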
Conclusions
NLP and ML tools could help clinical supervisors automate monitoring of coaches’ implementation fidelity to GdCBT. These tools could maximize allocation of scarce resources by reducing the personnel time needed to measure fidelity, potentially freeing up more time for high-quality clinical care.
Machine learning has exhibited substantial success in the field of natural language processing (NLP). For example, large language models have empirically proven capable of producing text of high complexity and cohesion. At the same time, however, they are prone to inaccuracies and hallucinations. As these systems are increasingly integrated into real-world applications, ensuring their safety and reliability becomes a primary concern. There are safety-critical contexts in which such models must be robust to variability or attack and give guarantees over their output. Computer vision pioneered the use of formal verification of neural networks for such scenarios and developed common verification standards and pipelines, leveraging precise formal reasoning about geometric properties of data manifolds. In contrast, NLP verification methods have only recently appeared in the literature. While presenting sophisticated algorithms in their own right, these papers have not yet crystallised into a common methodology. They are often light on the pragmatic issues of NLP verification, and the area remains fragmented. In this paper, we attempt to distil and evaluate the general components of an NLP verification pipeline that emerges from the progress in the field to date. Our contributions are twofold. First, we propose a general methodology for analysing the effect of the embedding gap, the discrepancy between the verification of geometric subspaces and the semantic meaning of the sentences those subspaces are supposed to represent, and we propose a number of practical NLP methods that can help quantify its effects. Second, we give a general method for training and verification of neural networks that leverages a more precise geometric estimation of the semantic similarity of sentences in the embedding space and helps to overcome the effects of the embedding gap in practice.
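To make the embedding-gap idea tangible, the toy probe below samples perturbations of a fixed norm around a sentence embedding and checks label stability. Note this is an empirical robustness check, not formal verification; the linear classifier and the random "embedding" are stand-ins.

```python
# Empirical robustness probe over an embedding-space ball (not a formal proof).
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=384)  # stand-in linear classifier weights

def predict(x: np.ndarray) -> int:
    return int(x @ w > 0)

def locally_stable(x: np.ndarray, eps: float, n: int = 1000) -> bool:
    """True if no sampled perturbation of norm eps flips the predicted label."""
    base = predict(x)
    noise = rng.normal(size=(n, x.size))
    noise *= eps / np.linalg.norm(noise, axis=1, keepdims=True)
    return all(predict(x + d) == base for d in noise)

x = rng.normal(size=384)  # stand-in sentence embedding
print(locally_stable(x, eps=0.1))
```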
Understanding and tracking societal discourse around essential governance challenges of our times is crucial. One possible heuristic is to conceptualize discourse as a network of actors and policy beliefs.
Here, we present an exemplary and widely applicable automated approach to extract discourse networks from large volumes of media data, as a bipartite graph of organizations and beliefs connected by stance edges. Our approach leverages various natural language processing techniques, alongside qualitative content analysis. We combine named entity recognition, named entity linking, supervised text classification informed by close reading, and a novel stance detection procedure based on large language models.
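Two of these stages can be sketched compactly: organization extraction with spaCy NER and the assembly of a stance-detection prompt for an LLM. The prompt wording is an illustrative assumption, and the model call itself is left abstract.

```python
# Extract organizations with spaCy NER, then build a stance-detection prompt.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_orgs(paragraph: str) -> list[str]:
    return [ent.text for ent in nlp(paragraph).ents if ent.label_ == "ORG"]

def stance_prompt(org: str, belief: str, paragraph: str) -> str:
    return (f"Paragraph: {paragraph}\n"
            f"Does {org} support, oppose, or not mention the belief "
            f"'{belief}'? Answer with one word.")

para = "The VCS opposes new parking garages in the city centre."
for org in extract_orgs(para):
    print(stance_prompt(org, "expand car infrastructure", para))
```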
We demonstrate our approach in an empirical application tracing urban sustainable transport discourse networks in the Swiss urban area of Zürich over 12 years, based on more than one million paragraphs extracted from slightly less than two million newspaper articles.
We test the internal validity of our approach. Based on evaluations against manually annotated data, we find support for what we call the window validity hypothesis of automated discourse network data gathering: internal validity increases when inferences are combined over sliding time windows.
Our results show that, by leveraging data redundancy and stance inertia through windowed aggregation, automated methods can recover the basic structure and higher-level structurally descriptive metrics of discourse networks well. They also demonstrate the necessity of high-quality test sets and close reading, and that the effort invested in automation should be weighed carefully.
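As a concrete reading of windowed aggregation, the sketch below pools stance edges over a sliding window of years and resolves each organization-belief pair by majority vote; the edge format and the majority-vote rule are simplifying assumptions.

```python
# Aggregate stance edges (org, belief, stance, year) over a sliding time window.
from collections import Counter

def window_network(edges, center_year, width=3):
    """Pool edges within +/- width//2 years and resolve stances by majority."""
    lo, hi = center_year - width // 2, center_year + width // 2
    votes = {}
    for org, belief, stance, year in edges:
        if lo <= year <= hi:
            votes.setdefault((org, belief), Counter())[stance] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in votes.items()}

edges = [("VCS", "car reduction", "support", 2014),
         ("VCS", "car reduction", "support", 2015),
         ("TCS", "car reduction", "oppose", 2015)]
print(window_network(edges, center_year=2015))
```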