
Decolonizing Archival Narratives: Exploring Digital Bias in the Catalogs of Portuguese-Colonized African Territories

Published online by Cambridge University Press:  14 October 2025

Agata Błoch*
Affiliation:
Tadeusz Manteuffel Institute of History, Polish Academy of Sciences, Poland
Guillem Martos Oms
Affiliation:
University of Barcelona, Spain
Clodomir Santana
Affiliation:
Tadeusz Manteuffel Institute of History, Polish Academy of Sciences and University of California Davis, USA
*Corresponding author: Agata Błoch; Email: agata.natalia.bloch@gmail.com

Abstract

This study discusses the intersection between Black/African Digital Humanities and computational methods, including natural language processing (NLP) and generative artificial intelligence (AI). We have structured the narrative around four critical themes: biases in colonial archives; postcolonial digitization; linguistic and representational inequalities in Lusophone digital content; and technical limitations of AI models when applied to the archival records from Portuguese-colonized African territories (1640–1822). Through three case studies relating to the Africana Collection at the Arquivo Histórico Ultramarino, the Dembos Collection, and Sebestyén’s Caculo Cangola Collection, we demonstrate the infrastructural biases inherent in contemporary computational tools. This begins with the systematic underrepresentation of African archives in global digitization efforts and ends with biased AI models that have not been trained on African historical corpora.

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press.

Introduction to Knowledge Production in the Black and African Digital Humanities: The Intersection of Technology and African Scholarship

The Digital Humanities have emerged as a transformative field that is reshaping how knowledge is produced across disciplines in both the Global North and South, drawing contributions from academics and non-academics.Footnote 1 As a multidisciplinary approach that combines computational tools with the humanities, it has found wide application within scholarly practices, with the United States unquestionably leading the field.Footnote 2

Within this broader landscape, various approaches have surfaced at the crossroads of Digital Humanities and Black Studies. In particular, Susanna Allés-Torrent identifies four main epistemological currents: Black Digital Humanities, Postcolonial Digital Humanities, Data Feminism, and Digital Black Atlantic. She shows how the intersection of modern tools with a broadly understood Black History intervenes in the fields of cultural critique, postcolonialism, decoloniality, feminism, and social justice.Footnote 3 The latter in particular was the driving force behind the founding of Black Digital Humanities (BHD), which, according to Kim Gallon, lies in combining digital tools with Black Studies rooted in the civil rights and Black nationalist movements. On this basis, Gallon argues for a “technology of recovery” that not only seeks to recover the voices and narratives of colonized Black diasporas but also critically interrogates the digital infrastructures used in this process. In this context, Safiya Umoja Noble added that computational processes are often fraught with racialized and hegemonic biases, which can further marginalize Black narratives in digital systems.Footnote 4 According to Roopika Risam, reclaiming and democratizing digital spaces is essential to empowering marginalized communities and improving representation in knowledge production, as well as creating platforms for voices that go unheard in dominant narratives about literature, history, and culture.Footnote 5

Starting from the concept of a “technology of recovery,” this paper looks at its practical side, in particular the digital challenges that scholars, archivists, and technologists face when engaging with the Luso-African experience while questioning the digital systems on which they rely. We observe three main approaches worthy of brief discussion: the digitization of historical sources of African/Black diaspora communities, relational datasets, and the proliferation of digital humanities projects focused on African/Black pasts, each of which creates potential biases in knowledge production.

The digitization of slave narratives and their broad accessibility was a central concern of DH practitioners in the United States in the early 1990s. Marcus P. Nevius highlighted that African American history was a pioneering field in the early days of digital scholarship. According to Nevius, projects such as Documenting the American South, led by the University of North Carolina Libraries, and the Trans-Atlantic Slave Trade Database were particularly important. The former resulted in digitized collections of slave narratives, while the latter produced CD-ROMs with data on over 30,000 transatlantic slave trade voyages. However, this focus also had its drawbacks, as Nevius found that more attention was often paid to the slave narratives, while other historical sources, such as trade ledgers, took a back seat.Footnote 6 In fact, this already shows an early bias in Black digital scholarship, as decisions about which sources to digitize and make publicly accessible were shaped by audience interest rather than strict archival standards.

After the digitization phase, two further approaches can be applied: the creation of structured relational databases and the development of digital, user-friendly projects. Sara Collini argues for the first approach, stressing the importance of well-defined and organized metadata that can be cross-referenced with other datasets according to the principles of Linked Open Data. In practice, this approach allows for the inclusion of new data from historical sources, the creation of new ontologies and the integration of these into existing metadata frameworks. However, Collini also warns of the potential for “digital violence” in the process of metadata creation when those categories originally imposed by colonial agents are repeated and such biases are transferred into digital infrastructures.Footnote 7

The third approach involves the creation of user-friendly digital projects designed for broad public access. Within African-oriented digital humanities, such initiatives generally fall into two categories: projects about Africa that are developed and run in Western countries and those that are initiated in Africa by Africans themselves.Footnote 8 Jennifer Guiliano and Roopika Risam noted that many of these initiatives are based in North America and Europe, while relatively few originated in Africa itself.Footnote 9

Examples of projects in the first category are Slave Voyages, Freedom Narratives, Freedom on the Move, and Dambudzo Marechera Archive, to name but a few. In the second category, South Africa is the undisputed leader in digital projects in the region. Among its pioneering efforts were Digital Imaging South Africa, funded in 2002 at the University of KwaZulu-Natal in Durban, and Timbuktu Rare Manuscripts Project. Between 2018 and 2023, several universities in South Africa worked on a program dedicated to African Digital Humanities. These institutions include the University of Cape Town, the University of Pretoria, Stellenbosch University, the University of the Western Cape, and the University of the Witwatersrand.Footnote 10 The Digital Humanities Association of Southern Africa still considers this field to be relatively underdeveloped in the region.Footnote 11 Despite the country’s well-functioning data protection regulations, there are still problems with data sharing. In contrast, Rwanda has made the most progress in the region in using local data for AI algorithms through initiatives such as the National Data Strategy.Footnote 12 Although such initiatives are important and necessary, it is worth mentioning that Rwanda’s National Data Strategy is recent and has not yet been completely implemented. Hence, its effect and potential to change the current scenario cannot be assessed.

Other examples of more recent African-led digital initiatives include Open Restitution Africa, Paint Me Black, African Digital Heritage, Pollicy, and The Corpus of Nigeria New Media Discourse in English Project (CONNMDE).Footnote 13 Those digital projects that target African societies and Black diasporas share some common characteristics in the realm of BHD, which Kim Gallon defines as an approach that uses digital platforms to reclaim Black humanities.Footnote 14 In both contexts, those projects focus on promoting broader social engagement through digital platforms, using materials such as podcasts, webinars, reports, and videos to encourage community participation.Footnote 15 Both seem to appeal to academic and non-academic audiences alike, a feature Gallon emphasizes as an important element of BHD. Ravynn K. Stringfield also pointed to the importance of digital platforms for self-creation, collaboration, and community building within Black communities and how the digital sphere contributes to the development of self-consciousness.Footnote 16 Fungai Machirori, on the other hand, points out how important it is to construct African self-representation in the digital sphere in order to simultaneously challenge Western-influenced stereotypes and the narratives of African digital elites and to argue for transnational social integration.Footnote 17

Setting aside social engagement, the evaluation of which does not fit within the scope of this article, we can identify further distortions in knowledge production. A disproportionate number of projects are based in the West, follow Western paradigms, and are mainly conducted in English. In contrast, there are significantly fewer African-led initiatives, almost none of them in local languages, and the development of African digital infrastructure is often overlooked. It is worth explaining that the underrepresentation of languages other than English in the digital realm affects not only African countries but also the United States itself. Allés-Torrent criticizes the English-language monopoly in Digital Humanities projects, discussing how Iberian languages such as Spanish and Portuguese are underrepresented in the North American digital sphere, and calls for resistance to linguistic imperialism.Footnote 18 Finally, one of the most tangible biases in knowledge production is infrastructure development. Noble argues that modern technologies are grounded in neo-colonial violence, rooted in the extraction of raw materials in the Global South to fuel technological advancement in the Global North. This process is often based on racialized violence and environmental degradation, all in service of meeting the technological and social demands of the Global North.Footnote 19

The so-called “digital turn” has developed very differently in North America and Africa. In the United States, it began in the mid-1990s, supported by well-funded projects and solid infrastructure, and focused on expanding access to historical materials through digital libraries. This encouraged free, open-access projects that led to widespread digital openness.Footnote 20 In contrast, Africa’s digital turn has been a topic of academic debate since the beginning of the twenty-first century, which was more a critical reflection on the lack of infrastructure, resources, and strategic digital planning. At that time, Peter Limb criticized the inequalities in access to digital technology between the Global North and Africa and introduced the term “information imperialism” to describe these disparities. The West’s digital lead was due to factors such as high internet costs, inadequate infrastructure, and limited local resources in many African regions, which only widened the global digital divide.Footnote 21 Similar concerns about digital imperialism were expressed around the same time by Yunusa Z. Ya’u. He noted that Africa was poorly positioned in cyberspace, including a low share of global phone connections, limited access to digital technologies, an underrepresentation of African content and languages, and the low visibility of African governments on the internet. These factors have only deepened the digital divide between Africa and the Global North.Footnote 22 According to James T. Murphy, the long-awaited global information revolution ten years later, which was originally expected to end the digital isolation of African populations, was in fact hierarchically structured, which has only reinforced marginalization and deepened Africa’s digital periphery.Footnote 23 The growing use of AI models in Africa raises significant concerns about digital colonization, particularly in low-resource settings where existing inequalities can be exacerbated by the digital divide and external dominance over digital resources. Without incorporating local data into AI algorithms, these technologies risk perpetuating and deepening these disparities.Footnote 24

In response to digital marginalization, two parallel processes have emerged in Africa: an exogenous one, dependent on global information flows, and an endogenous one, centered on a turn toward local digital practices. The exogenous process reflects what Fabienne Chamelot, Vincent Hiribarren, and Marie Rodet have termed “digital imperialism.”Footnote 25 They described a scenario in which institutions from the Global North dominate African digitization projects, mainly due to their dependence on funding from international donors such as the British Library and private foundations. Although these initiatives had good intentions, they often focused more on preserving endangered documents than on developing long-term strategies for local archives. This focus on short-term digitization projects neglected the urgent need to preserve and support Africa’s physical archives.

The endogenous process, by contrast, involves strategic decisions by African governments to invest in their digital infrastructure and position it as a flagship for national development. An example of this is the Southern African Development Community (SADC) strategy for 2020–30, but while the economic benefits of digital integration take center stage, the protection of digital human rights often remains secondary. Many experts are concerned about the rise of digital dictatorships in the region, which can violate digital human rights through restricted access to information, censorship, state-sanctioned internet shutdowns, and unregulated data extraction, among other things.Footnote 26

All of the above concerns have directly influenced our perspective as digital practitioners working in the field, particularly in applying DH methods to primary sources related to African colonial history. Our understanding of the concept of “technology of recovery” addresses biases on three fronts: historical sources, their accessibility/digitization/preservation, and AI systems. In the Lusophone context, these issues present numerous challenges, including limited digital access to primary sources in Africa, the lack of long-term preservation and cataloging strategies for local Luso-African archives, fragmented and inconsistent digitization initiatives, and the disappearance of historical actors from the digital landscape. Collectively, these problems intensify biases in computational analysis and contribute to a distorted framework of knowledge production within the Luso-African Digital Humanities.

As a practical illustration of these problems, we turn to three Luso-African collections as case studies: the Africana Collection produced by Portuguese colonial agents, now housed in the Arquivo Histórico Ultramarino (AHU) in Lisbon; the Dembos Collection, originally part of the State Archives of Dembos Caculo Cacahenda in Angola; and Sebestyén’s Caculo Cangola Collection.Footnote 27 To do this, we adopt a DH methodology grounded in AI-based techniques for the extraction and analysis of historical corpora. This methodology relies on raw data to train a range of models, primarily through natural language processing (NLP) tools. Working with raw data has already been successfully implemented in projects such as Slave Voyages and Slave Narratives, which collect and extract data on enslaved people from historical sources in various archives, including in Africa.Footnote 28 In our work, instead of manually extracting data from the data sources, we automate this process using NLP and generative AI tools. The tools we employ are aligned with efforts like those of the South African Center for Digital Language Resources (SADiLaR), which specializes in language technology.Footnote 29

The motivation behind this paper is to emphasize that access to data is essential for efficient and accurate training of AI models. The limited digital availability of African historical archives and data resources significantly hinders model training, that is, teaching algorithms to recognize patterns and generate predictions or decisions based on historical data. In digital history, this creates representational imbalances and biases that affect a model’s ability to extract information from African sources.

In this study we assume that while a one-size-fits-all digital approach would be ideal for all Portuguese colonial correspondence records, its practical implementation requires nuanced adaptations to accommodate the linguistic and regional specificities of African-based catalogs. Furthermore, the integration of African languages and customized digital methods is crucial for a better understanding and analysis of African historical documents.

To clarify our approach, this article is divided into six sections. Following the introduction, section two, “Bias in Colonial Archives,” examines archival practices in West Central Africa during the Portuguese colonial period and discusses the coexistence and limitations of the local and Portuguese archival systems. Section three, “Bias in Postcolonial Archives,” looks at digitization efforts and the uneven progress in making archival records available online. We emphasize that the absence of local African archives in the digital sphere is a crucial factor contributing to biases in the construction of knowledge about Africa’s past. This problem is exacerbated by the underrepresentation of African languages, particularly African Portuguese dialects, while content creation is dominated by Brazilian and European Portuguese. This is discussed in section four, “Biases in Lusophone Digital Content.” In section five, “Biases in Systems,” we present our experiments with machine learning and large language models that we applied to Luso-African archival records. In this way, we show how data-related limitations constrain the effectiveness of AI-powered historical research. Finally, in section six, “Challenges and Recommendations,” we summarize our observations, discuss the challenges of data scarcity, and provide recommendations for building a more inclusive African Digital Humanities.

Bias in the Colonial Archives: Towards Literacy and Early Modern Archival Practices in West Central Africa during the Portuguese Colonization

The first step in applying the “technology of recovery” to rethink Black and African voices in the Portuguese digital landscape is to examine archival practices during the colonial period. Understanding the ways in which documents were managed by both Portuguese colonial officials and African representatives, particularly in terms of their accessibility or inaccessibility, helps to visualize whose histories were preserved and whose were silenced, and how these historical imbalances still influence digital representation today. Furthermore, the preservation of these records in the twentieth century requires attention, particularly concerning the challenges in accessing local African archives.Footnote 30 These historical and contemporary practices contribute to inherent biases, which, in turn, influence computational models.

To understand these distortions, we must start with the broader historical and archival context of the Portuguese presence in West Central Africa. This presence dates back to 1483, when Diogo Cão reached the mouth of the Congo River. However, it was not until 1576 that Portugal founded its first settlement, São Paulo de Luanda. From its foundation until the eighteenth century, a network of forts and outposts was built and consolidated to control the territory and trade routes. Notable fortresses along the Cuanza and Lucala rivers were Massangano (1583), Muxima (1599), Cambambe (1604), Ambaca (1614), and Pedras de Pungo Andongo (1671), while further south stood São Filipe de Benguela (1617) and to the north São José de Encoge (1759).Footnote 31

In the years following the expulsion of the Dutch in 1648, the new governors, many of whom came from Brazilian states, initiated military actions aimed at consolidating their power and obtaining a greater number of enslaved people.Footnote 32 Despite the peace with Queen Nzinga of Ndongo and Matamba and the defeat of the Kongo at the Battle of Mbwila in 1665, there was a series of events in the following decades that led the Portuguese to withdraw from the central areas and concentrate their efforts on the south.Footnote 33 Kisama and Libolo posed a major challenge for the governors, as these regions resisted even in the face of several military campaigns.Footnote 34 Additionally, in the eighteenth century the governors concentrated their efforts on administrative tasks and the territories under their direct control, in particular the tributary communities and the lands along the Bengo, Cuanza, and Lucala rivers and east of Massangano.Footnote 35

Parallel to the territorial expansion, the Portuguese subjugated the sobas, the traditional local authorities who acted as village chiefs or community leaders, or accepted their vassalage. Between the late sixteenth and early seventeenth centuries, the Portuguese divided their territories and introduced a system similar to the Spanish encomienda. From 1607, however, the sobados, the territories ruled by the sobas, became vassals of the Portuguese crown. As vassals, they not only had to pay tribute, often in the form of slaves, but also provide military support for the Portuguese army. The ongoing interactions with the Portuguese, due to economic and political dependencies, led some of them to gradually adopt written culture as a means of managing these relationships. These interactions fostered extensive correspondence between the sobas and the Portuguese monarch, as well as between local African leaders.Footnote 36

It is important to note that there are no records indicating the use of written language in Angola before the arrival of the Portuguese. The societies that inhabited the region were predominantly oral. However, through continued trade and diplomatic contacts with the Iberian monarchy, they gradually began to adopt the Portuguese writing system.Footnote 37 Nevertheless, there were alternative methods of communication, including the use of pictograms, which served as written support for the oral tradition. The continuous contact between the Portuguese and local populations led to the latter adopting Portuguese methods and techniques for teaching and learning to read and write.Footnote 38 From the late fifteenth century, the Catholic missions in West Central Africa introduced writing techniques, which led to local elites recognizing the importance of written documentation, especially for correspondence.Footnote 39 In the Jesuit colleges in Luanda and Mbanza Kongo, the sons of local leaders were taught the art of writing.Footnote 40 The arrival of the Capuchins (and to a lesser extent, the Carmelites) in the interior of the country further contributed to the spread of literacy.Footnote 41 Proof of this is the “Monumenta Missionária Africana,” in which António Brásio transcribed numerous letters from missionaries in the region, thus preserving a wealth of information about the African kingdoms of the region.

The emergence of a local elite that was literate by European standards of education, together with the growing need to collect correspondence and other documents (including those relating to property, contracts, or inheritance), most likely led to the establishment of the first African archives in the region. For example, the mwene Kongo, king of Kongo, one of the major kingdoms with long-standing contacts with Europe, did indeed develop a variety of diplomatic and correspondence relationships.Footnote 42 Yet information about these archives and their possible contents is scarce due to factors such as the loss of documents and local conflicts.Footnote 43 The latter could have led to these documents being scattered or destroyed, which may have been the case several times between the sixteenth and seventeenth centuries, for example, during the invasion of Jaga in 1568 or after the abandonment and destruction of the kingdom’s capital, São Salvador, in 1678. According to John K. Thornton, some documents were even kept after the death of King António I in 1665. The proof of this is an instance when the Capuchins met Pedro IV in Kibangu in 1698; the king showed the missionaries a papal bull from 1677, a year before São Salvador was destroyed.Footnote 44

There is also other evidence for the existence of African royal archives. After the reunification of the kingdom, the royal archives of the Kongo were restored.Footnote 45 We learn this from Zacharia da Cruz, who was given access to selected documents in 1858, which he then reproduced with the Queen’s intercession. Later, in 1880, Alfredo de Sarmento wrote in “Os sertões d’África” about his stay in Mbanza Kongo and his visit to the kingdom’s archives. According to Sarmento, the king’s son, D. Álvaro, facilitated his access, as he held “the eminent position of personal secretary of state and was entrusted with the preservation of the royal archive and all official correspondence,” which was “duly endorsed by the respective ministers.”Footnote 46 However, at this time, Sarmento was also warned that the reproduction of any documents was forbidden under penalty of treason. Nonetheless, he found in these archives important documents for the kingdom’s history, especially concerning relations between Portugal and the Central African kingdom.Footnote 47 Yet the contents of these archives remain unknown to us, as Louis Jadin later discovered in 1952 that they had been destroyed during the Buta War in 1918.Footnote 48

In the nineteenth century, reports by European explorers attested to the importance that the Kongo elite placed on the value of documentary evidence (known as ius archivi) and the existence of smaller, private archives maintained by members of the Kongolese elite. One example comes from the German anthropologist Adolph Bastian, who in 1856 described how “one of the visitors showed me his diploma of Knight of the Order of Christ, issued in the name of the king and authenticated with the red seal of the Kingdom of the Congo, for Dom Domingo de Agua Rosada… it is curious to see this same obsession with ridiculous titles that has prevailed here for so long, and which has recently reappeared in Haiti.”Footnote 49

While the Kingdom of Kongo was the largest in the region to adopt writing, smaller ones such as the Dembos also developed written traditions. Located between the Dande and Bengo rivers, these smaller political organizations also cultivated a written culture in the eighteenth and nineteenth centuries, with the oldest surviving document dating back to 1661, a result of contact with Portugal. Although the original organization of these documents is not known, the extant collection contains records from the Dembo states of Caculo Cacahenda and possibly from Ngombe Amuquiama, Mufuque, and other jurisdictions under the colonial outpost of Fort of Dande.Footnote 50 Among the materials gathered in 1934 by António de Almeida, the most notable are letters exchanged between these lords and the Portuguese-Angolan authorities and the king of Kongo, as well as wills, acts of vassalage, and judicial proceedings. Furthermore, the documentation is not limited to Portuguese and Latin but also features elements written in Kimbundu.Footnote 51 Here it is important to explain that although Portuguese was the official language of communication, there were more than forty languages in Angola, most of which are of Bantu origin, such as Kikongo, Umbundu, Chokwe, and Kimbundu. During the early Portuguese expansion, the colonizers encountered Kimbundu-speaking kingdoms (Dembos, Ndongo, Matamba, and Kassanje). Just a few years after their arrival, the Jesuits heard confessions in Kimbundu and wrote the first catechisms to support the spread of Christianity.Footnote 52 As in other parts of the Portuguese empire, interactions between the local population and the smaller European settler population resulted in racial and cultural mixing. This led to the emergence of Luso-African communities, many of whom spoke Kimbundu as their mother tongue. In the late nineteenth century, members of these communities often held influential positions in the administration and the military.Footnote 53 Jan Vansina argues that Kimbundu was of particular importance in Portuguese Angola into the nineteenth century, especially in Luanda, where it was widely used as a lingua franca.Footnote 54

Alongside these African royal archives, Portuguese officials also set up several archives in the region. However, little is known about the extent and size of these archives in the Luso-African territories, although some institutions may have preserved parts of them.Footnote 55 In Angola, for example, at least three institutions administered their archives between the seventeenth and nineteenth centuries: the library of the Episcopal Palace, which was under the jurisdiction of the bishop, the library of the Luanda City Hall, and the archives of the Museu de Angola. In other Portuguese-African territories, the municipal chambers and diocesan authorities generally played a similar role in preserving documents, especially in Cabo Verde and São Tomé. These archives also included documents kept in religious monasteries.Footnote 56

However, the documents preserved in the royal archives of Africa were exposed to numerous threats. In addition to the damaging effects of the climate, human activities such as conflicts, fires, and shipwrecks have also contributed to the loss of many of these records. During the Dutch occupation of Luanda, much of the Câmara Municipal’s records were lost. Soldiers from the Dutch West India Company (WIC) intercepted a convoy carrying the documents and threw them into the Bengo River.Footnote 57 This is the reason why many of the records that exist today date from 1649, when the Portuguese regained control of Angola. Yet, the restoration of the Portuguese historical records in Luanda faced numerous challenges. Joseph C. Miller highlighted the different approaches of Portuguese governors regarding the preservation and transmission of these documents. While some records were publicly documented, many others were kept in private archives, making long-term archival practices more difficult. The preservation of correspondence also depended heavily on the administrative instruments chosen by individual governors, such as alvarás (royal decrees), bandos (proclamations), or patentes (appointments). These practices were significantly influenced by the style of government of the Portuguese governor in Luanda at the time. Another obstacle was the lack of qualified public scribes; the first professional scribe funded by the crown was not appointed until 1688 under Governor João de Lencastre (1688–91). Furthermore, the archival practices lacked clear protocols for recording correspondence exchanged with Lisbon. Records were not always systematically separated according to whether they were sent, received, or destined for the metropolis of Lisbon or for other colonies. The first significant reorganization of Angola’s archives took place during the administration of Francisco Inocêncio de Sousa Coutinho (1764–72).Footnote 58

From 1642, the Conselho Ultramarino (Overseas Council) played a central role in managing the correspondence of the Lisbon court with its African colonies, in particular with the governors and other colonial officials stationed there. This council served as the central political authority overseeing colonial affairs between Portugal and its overseas territories. The council, which at the time was composed of highly experienced officials who were not only familiar with colonial affairs and territories but had also served overseas on occasion, had a significant influence on the development of early modern Portuguese overseas policy. Lisbon’s communication with its colonies took the form of various types of correspondence, including a large number of letters. Among the most frequent were petitions, royal decrees, and similar official documents. During the week in which the Overseas Council met, each colonial region was dealt with on specific days. Asia and East Africa were dealt with most frequently, namely three times a week, on Mondays, Tuesdays, and Wednesdays. Brazil was dealt with twice a week, on Thursdays and Fridays. West Africa, by contrast, was dealt with only once a week, on Saturdays.Footnote 59 This division shows that the correspondence from overseas was organized geographically and not thematically. The African collection was further organized by region and included Angola, Cabo Verde with Guinea, Mozambique, São Tomé and Príncipe, and North Africa. A similar structure was applied to the Brazilian collections, with the documents sorted by administrative regions such as Bahia, Rio de Janeiro, and São Paulo. In addition, the Portuguese archipelagos in the Atlantic, Madeira and the Azores, were included as territories under Portuguese jurisdiction.Footnote 60

This collection is now housed in AHU, an institution established in 1931 to preserve, organize, and document materials critical to understanding different aspects of Portuguese colonization.Footnote 61 The initial management under the first two directors, António José Pires Avelanoso and Manuel Múrias, played a decisive role in the transfer of documents from various archives to the newly founded institution.Footnote 62 While the AHU centralized and safeguarded these records, its creation marked only the beginning of a broader effort to institutionalize colonial archives, both in Portugal and in the overseas territories. Efforts to preserve colonial records commenced with the founding of the Arquivo Geral e Histórico da Índia Portuguesa in Goa in 1930, the Arquivo Histórico de Moçambique in Lourenço Marques in 1934, and the Museu de Angola in Luanda in 1938. As Gautier Garnier notes, the AHC itself did not contribute directly to these colonial archiving initiatives.Footnote 63 However, this legacy would later face major challenges in the postcolonial era as new African nations struggled to preserve and systematize these historical records.

Bias in Postcolonial Archives: From Forgetfulness to Selective Digital Memory

Building on the historical context of archiving and the institutional efforts described above, the challenges facing colonial and postcolonial archives illustrate the twin problems of preservation and neglect of heritage. While the systematization of historical documentation was already a challenge in the early modern period, the long-term sustainability of these archives also became a challenge in the African nations that transitioned to independence.Footnote 64 These difficulties persist today as many archives suffer from neglect, environmental degradation and insufficient resources.Footnote 65 The post-independence financial crises in many African countries have further complicated efforts to adequately fund the cataloging and systematic management of state archives.Footnote 66 In the context of Lusophone Africa, the newly independent states after the Carnation Revolution in 1974 faced urgent challenges that diverted attention and resources away from archival administration. After independence in November 1975, the Angolan government focused more on building government structures. Practical steps towards the institutionalization of historical archives and the prioritization of history as a discipline were a long time coming. However, from 1975 onwards, the Arquivo Nacional de Angola was under the direction of Carlos do Couto. His efforts to systematically catalog the archive’s documents were decisive in preserving and facilitating the study of Angolan history.Footnote 67 In the case of Cabo Verde, Marilla MacGregor makes an important observation: “the first libraries created in Lusophone Africa were meant for the use of Portuguese officials and not for the community at large.” The creation of the Cabo Verde National Historical Archive in 1988, along with other local libraries, was an important milestone in the educational reforms of the 1980s, which emphasized literacy and cultural preservation. 
However, Cabo Verde has long faced persistent challenges, including limited funding, a lack of trained professionals and inadequate maintenance of facilities.Footnote 68 Isaías Barreto da Rosa points out that there are several challenges to building an inclusive digital information society in Cabo Verde by 2025. One of the main problems is the limited access to technology: only 11.4 percent of households have computers and only 23.9 percent of inhabitants have access to the internet. In addition, the country does not have a National Digital Library.Footnote 69 These difficulties are likely reflected in the archives of other Lusophone African countries such as Mozambique, São Tomé and Príncipe, and Guinea-Bissau. Nevertheless, Angola, Cabo Verde, and Mozambique launched a joint project in 2025 to preserve the documentary heritage of humanity. The initiative, entitled “Recenseamento dos Escravos em Angola, Cabo Verde e Moçambique (1854),” aims to collect registration books and gradually add these local documents to the UNESCO Memory of the World Register.Footnote 70

In addition to the issues mentioned above, there is an even more pressing problem in several African countries such as Angola, Nigeria, Mozambique, Namibia, South Africa, and Zimbabwe, where governments often systematically silence historical memory. These governments often refuse to open public archives out of political interests, thus restricting access to the past. Justin Pearce describes this phenomenon as a struggle over memory politics—that is to say, the hegemonic efforts of post-liberation governments to construct dominant historical narratives. He emphasizes that the process of nation-building is often rooted in nostalgia.Footnote 71 This dynamic contributes to what James Yékú calls historical amnesia, a political strategy aimed at erasing marginalized voices from the past.Footnote 72 In response to these challenges, individuals across Africa are seeking alternative ways to preserve and engage with memory, ranging from institutional initiatives to grassroots efforts. One example is the Tchiweka Documentation Association (ATD) in Angola, founded to preserve and disseminate archives related to the liberation struggle. The ATD houses the private collection of Lúcio Lara, a member of the People’s Movement for the Liberation of Angola.Footnote 73 On a more informal level, social media provides accessible and community-run spaces for sharing memories, where projects such as the “Memórias da Luta Armada em Angola” and “The Nigerian Nostalgia 1960 -1980 Project” allow users to share personal memories and images to promote collective remembrance.Footnote 74 For Yékú, social media acts as a technology of recovery, making the internet an important resource for the reconstruction of African history. As digital practitioners and historians, we need to fill these gaps left by the biased official archives and remain open to alternative sources of knowledge.Footnote 75

Consequently, many of the publicly available, digitized, and cataloged historical records that document the varying extent of relations between Portugal and its former African colonies during the early modern period are located outside the African continent, particularly in Portugal. The dispersal of African archives across different countries, institutions, and even private collections has long been a central theme in postcolonial discourse. As early as the beginning of the twenty-first century, Limb argued for the necessity of digital repatriation using the example of Namibia’s intellectual heritage, whose historical documents were scattered across Germany, Great Britain, and South Africa.Footnote 76

The Luso-African records are currently accessible in four major Portuguese repositories: the National Archives of Torre do Tombo, Biblioteca Nacional, Biblioteca da Sociedade de Geografia de Lisboa, and the Arquivo Histórico Ultramarino, all in Lisbon.Footnote 77 The latter is of particular importance for our research, as it preserves documents from the previously discussed Conselho Ultramarino and contains extensive catalogs from the Africana Collection.Footnote 78 This collection, so named by José C. Curto, comprised a large number of materials, including correspondence with Mozambique, Angola, Cabo Verde, Guinea and São Tomé, as well as a general catalog of codices and an inventory of manuscripts.Footnote 79

Efforts to make the collections of AHU accessible to a wider public were uneven, with notable differences between the Africana and Brazilian catalogs. The latter, known as “Projeto Resgate Barão do Rio Branco,” underwent intensive documentary treatment starting in 1983. In 1995, the project was formalized through a protocol between Portugal and Brazil, which led to the creation of the bilateral Luso-Brazilian Commission for the Preservation and Promotion of Documentary Heritage.Footnote 80 By 2006, thirty-one digital inventories had been created as part of the collaboration with various national and international institutions. Since 2015, the initiative has been funded by the United Nations Educational, Scientific and Cultural Organization (UNESCO).Footnote 81 The project has also benefited from the development of various dissemination tools, including online platforms, CD catalogs, and printed reference works highlighting selected records.Footnote 82

In contrast, the Africana collections were rather scattered and less systematically organized. Although all the collections were under the supervision of AHU, the involvement of several funding institutions contributed to the fragmentation of the collection. As early as 1988, Curto pointed out the challenges within the Angolan collection, noting that it required fifteen rooms for storage alone. He lamented the lack of suitable instruments for an efficient search in the collection and noted that although it was the largest, it was also the most disorganized.Footnote 83

These research initiatives culminated in the inventory of the Africana collection. By 2024, the documents from Cabo Verde were fully digitized and made accessible online via the archive’s website. In contrast, documents from Angola and Guinea were only partially digitized, and those from Mozambique remained neither digitized nor microfilmed. However, the inclusion of these documents in the Recovery and Resilience Plan indicates that these gaps could be closed in future digitization measures.Footnote 84

Having considered these points, it is important to reflect on the implications for digital humanities practitioners. Early cataloging efforts of colonial documents, such as the inventories published in the late twentieth century, provided an important starting point, but were hampered by a lack of standardization. For example, when the Arquivo Científico Tropical in Lisbon initially made these catalogs available as PDF files, they lacked essential metadata necessary for comprehensive research. These limitations underscore the urgent need to standardize, thoroughly document, and digitally optimize archival collections to meet the demands of modern digital research and scholarship. Data transformation techniques, including encoding and format adaptation, are essential to prepare the data for modern computational tools.

Despite the development of digital archiving practices in Portugal and the increasing availability of metadata on online platforms, significant problems persist. Human-introduced errors such as typos, misspellings, and omissions of information during the digitization process create inconsistencies that undermine the reliability of metadata for analysis, highlighting the need for further refinement of this process. Data cleansing processes, such as addressing missing values and standardizing formats, are critical to overcoming these challenges and ensuring data quality. To address the remaining limitations, digitization experts should adopt principles such as Linked Open Data (LOD), which provides a structured and relational framework for organizing and enriching historical datasets. LOD draws on relational data structures and semantic web technologies, such as URIs and RDF, so that the data is interoperable. This approach facilitates the creation of cross-references between related datasets, which can in turn be linked to both internal and external data sources, improving the connectivity of historical research.Footnote 85
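The cleansing steps named here (handling missing values and standardizing formats) can be illustrated with a minimal sketch. The field names, sample rows, and the assumption that slash-separated dates follow a day/month/year order are all hypothetical and are not taken from the AHU catalogs.

```python
import re
import unicodedata

# Hypothetical catalog rows; field names and values are illustrative only.
RAW_RECORDS = [
    {"id": "1", "place": " Angola ", "date": "1688-03-12", "language": "Portuguese"},
    {"id": "2", "place": "ANGOLA", "date": "12/03/1688", "language": ""},
    {"id": "3", "place": "Cabo  Verde", "date": "", "language": "portuguese"},
]


def normalize_text(value):
    """Apply Unicode NFC, collapse whitespace, and title-case; empty becomes None."""
    value = unicodedata.normalize("NFC", value)
    value = re.sub(r"\s+", " ", value).strip()
    return value.title() if value else None


def normalize_date(value):
    """Return an ISO 8601 date, or None if the value is missing or unrecognized."""
    value = value.strip()
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return value
    # Assumption: slash-separated dates are written as DD/MM/YYYY.
    match = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", value)
    if match:
        day, month, year = match.groups()
        return f"{year}-{month}-{day}"
    return None


def clean_record(record):
    """Standardize one row and list the fields that still lack a value."""
    cleaned = {
        "id": record["id"],
        "place": normalize_text(record["place"]),
        "date": normalize_date(record["date"]),
        "language": normalize_text(record["language"]),
    }
    # Missing values are flagged rather than guessed, so gaps stay visible.
    cleaned["missing"] = sorted(k for k, v in cleaned.items() if v is None)
    return cleaned


CLEANED = [clean_record(row) for row in RAW_RECORDS]
```

Flagging gaps instead of imputing them is a deliberate choice in this sketch: for archival metadata, a silently filled value could be mistaken for evidence.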

What could this mean in practice for the Africana Collection? Many records could be easily cross-referenced using platforms such as Wikidata, a centralized data repository that can be read by both humans and machines. Spatial data sources such as the World Historical Gazetteer could also be integrated to add information on precolonial African subregions.Footnote 86
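As a rough illustration of such cross-referencing, the sketch below serializes one catalog record as N-Triples, using Dublin Core terms for the description and `owl:sameAs` for the link to an external authority. All `example.org` URIs are placeholders that a real project would replace with its own identifiers and with an actual Wikidata or World Historical Gazetteer URI; the record title is likewise invented.

```python
DCTERMS = "http://purl.org/dc/terms/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text):
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def record_to_ntriples(record_uri, title, place_uri, external_place_uri):
    """Describe a record, attach a place, and link that place to an external authority."""
    triples = [
        (uri(record_uri), uri(DCTERMS + "title"), literal(title)),
        (uri(record_uri), uri(DCTERMS + "spatial"), uri(place_uri)),
        (uri(place_uri), uri(OWL_SAME_AS), uri(external_place_uri)),
    ]
    return [f"{s} {p} {o} ." for s, p, o in triples]


# Placeholder identifiers only; none of these URIs refers to a real resource.
LINES = record_to_ntriples(
    "https://example.org/record/1",
    "Letter to a soba",
    "https://example.org/place/angola",
    "https://example.org/authority/angola",
)
```

Because the external link is stored as data, the same record could later be enriched from several authorities without changing the catalog entry itself.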

However, other Luso-African historical records, produced not by Portuguese colonial agents but by African rulers, charted different paths. The Dembos Collection originally belonged to the State Archives of Dembos Caculo Cacahenda in Angola, before being brought to Lisbon in 1934 by Antonio Almeida as part of the Portuguese ethnological missions in African territories.Footnote 87 The collection entered the AHU between 2006 and 2016, after which it was digitized and made publicly accessible. Notably, the metadata identifies Portuguese as the language of the documents, overlooking the presence of Kimbundu.Footnote 88

Sebestyén’s Caculo Cangola Collection was compiled between 1986 and 1988 by Éva Sebestyén during her anthropological fieldwork in the villages of the Dembos and Samba Cajú, today part of the provinces of Bengo and Kwanza Norte, with the official permission of the respective provincial governors. During her visit to the villages of Kimbundo and Caculo Cangola, Sebestyén came across materials from the colonial era, which were kept in small wooden boxes and also contained traditional insignia of power. Among them were handwritten documents in Portuguese, some of which dated back to the late eighteenth century. She later discovered another box in Ndalatando containing typewritten copies of documents from the eighteenth and nineteenth centuries that originally came from the community of Samba Cajú. During her research, she photographed and transcribed many of the documents. The current collection consists of these transcriptions (not the photographs) and is available in PDF format, although, as far as we are aware, it has never been published.Footnote 89

More broadly, the limited access to the digital holdings of local African archives leads to a significant bias in the production of knowledge about colonial Africa in the digital humanities. The limited online availability of the collections of the National Archives of Luanda, for example, hampers our ability to fully understand the wealth of documents beyond the Portuguese presence in the region. Important artifacts such as the Carta Patente of 1591, written on lambskin and one of the most valuable pieces in the archive, and significant regional treaties from areas such as Cabinda, Xifuma, Simulambuco, and Dembos, remain largely inaccessible.Footnote 90 Digital access to the catalogs, especially those from the provinces of Huíla and Cunene, could enrich our training models with historical data. These catalogs contain information about sobas, their immediate environment, family members, politics, and treaty-making. In addition, they could provide details about names, lands, people, and geographic spaces that were central to local historical narratives.Footnote 91 If the digital space is not fed with African data, there is a risk that local linguistic variations, cultural nuances, and historical place names will be ignored by the training models.

This risk is compounded by the fact that most of the available data is shaped by instruments and frameworks that predominantly reflect Western paradigms of knowledge.Footnote 92 For example, the famous female ruler Nzinga of Ndongo and Matamba is represented on Wikipedia, with entries in thirty-eight languages. On Wikidata, her profile is enriched with metadata such as gender, country of citizenship, given name, title of nobility, dates and places of birth and death, family connections (including father and siblings), official positions, place of residence, and religious affiliation.Footnote 93 This case recalls Nevius’s observation that slave narratives are more attractive to a wider audience and therefore tend to receive more digital attention. The cultural representation of Nzinga of Ndongo and Matamba has become marketable in popular culture, which has increased her digital visibility. At the same time, this case illustrates the form of digital violence that Collini warns against: categories such as gender, title, and religious affiliation are imported into digital infrastructures that are shaped by Western epistemologies, often at the expense of indigenous worldviews and cosmologies of Central West Africa. In contrast, lesser-known sobas such as Becampolo Có, a local ruler from the late seventeenth century in Bissau, remain virtually invisible in the digital world, while the names of many other sobas have been entirely lost.Footnote 94

Bias in Lusophone Digital Content: (Mis)Representation in Portuguese-Language Contexts

In this discussion, we emphasize once again that data is critical to the development and success of digital models because it determines their ability to learn, generalize, and perform accurately.Footnote 95 High-quality and relevant data enables models to recognize and internalize correct patterns, ensuring effective generalization to new, unseen data.Footnote 96 This capability is crucial for the use of AI in scenarios where the models need to perform robustly in different contexts.

The problem of availability and access to data in Portuguese has long been the subject of debate.Footnote 97 Important initiatives such as the creation of the Historical Dictionary of Brazilian Portuguese (HDBP) aimed to create a dictionary from historical texts spanning three centuries (the sixteenth through eighteenth centuries).Footnote 98 Its main goal was to normalize the different historical variations in vocabulary. In particular, a strong linguistic focus was placed on the vocabulary of historical Portuguese texts, which could have considerable potential for the creation of various linguistic tools such as semantic labeling, automatic text search and segmentation, and corpus linguistics.Footnote 99 Consequently, various historical corpora and text collections in Portuguese (both European and Brazilian) have been created that serve as resources for tasks such as automatic transcription.

One of these is “Corpus do Português,” which contains a billion Portuguese words from a million pages.Footnote 100 It is divided into three sections: Genre/Historical, which contains 45 million words from 1200–1900; Web/Dialects, with data collected in 2013–14 from websites in four Portuguese-speaking countries (Brazil, Portugal, Angola, and Mozambique); and the analogous NOW (News on the Web) section, collected from 2012–19. Although the word “soba” appears with 184 entries in the historical corpus, most of the cases come from Brazilian novels such as “O Braço Direito” by Otto Lara Resende and “Os Bruzundangas” by Lima Barreto, as well as from Portuguese fiction such as “Terra Morta” by Fernando Monteiro de Castro Soromenho. None of these occurrences comes from African archives.Footnote 101 This case shows the overrepresentation of Brazilian sources and the underrepresentation of African sources in the historical documentation and portrayal of sobas.

A similar problem concerns the availability of African journals in the digital sphere. African Journals Online (AJOL), originally based in Oxford and relocated to Africa in 2005, was optimistically praised by Limb that same year as a successful tool for promoting African scholarly publishing.Footnote 102 As of 2025, however, South Africa leads the region with 111 journals available online. In contrast, Angola offers just five, Mozambique only one, and Cabo Verde has none.Footnote 103

This imbalance in the data needed for satisfactory training illustrates the broader challenge of addressing bias and fairness in AI systems. Data that is not representative of different demographics or contextual spectra can lead to biased results that jeopardize the fairness of the model and its ethical use. Therefore, curating a representative and balanced data set is essential to mitigate these risks and promote fair AI systems.Footnote 104
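A simple way to make such imbalance visible before training is to compute each source's share of the available material. The sketch below does this for the AJOL journal counts reported above, restricted to the four countries named in the text; the 5 percent threshold used to flag underrepresentation is an arbitrary assumption chosen for illustration.

```python
# Journal counts as reported in the text for African Journals Online (2025).
AJOL_JOURNALS = {"South Africa": 111, "Angola": 5, "Mozambique": 1, "Cabo Verde": 0}


def representation_shares(counts):
    """Return each source's fraction of the total number of items."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must contain at least one item")
    return {name: count / total for name, count in counts.items()}


def underrepresented(counts, threshold):
    """List the sources whose share falls below the given threshold."""
    shares = representation_shares(counts)
    return sorted(name for name, share in shares.items() if share < threshold)


SHARES = representation_shares(AJOL_JOURNALS)
# Assumption: a 5 percent cut-off, used here only to illustrate the idea.
FLAGGED = underrepresented(AJOL_JOURNALS, 0.05)
```

The same calculation could be applied to documents, tokens, or entities per region when auditing a training corpus.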

Textual manuscripts such as historical documents, academic papers, and literary works provide valuable data sources for AI systems. These manuscripts provide high-quality information that can be particularly useful for specialized models such as natural language processing or historical analysis.Footnote 105 While textual manuscripts offer depth and richness, they are often limited in scope and may require extensive pre-processing, such as digitization or language translation, before they can be used.

In addition to textual manuscripts, online content (such as websites, social media posts, and digital publications) is another rich source of data for AI systems.Footnote 106 This content is extensive, constantly updated and often publicly accessible, making it an invaluable resource for training machine learning models. Web scraping techniques can be used to compile large data sets from these sources that capture different topics, languages, and user behaviors.Footnote 107 The advantages of online data include its real-time availability, which ensures that models are trained with the most up-to-date information, and its diversity, which helps models generalize better across different contexts and applications.Footnote 108 In contrast to textual manuscripts, online data tends to be more accessible, does not need to be digitized, and can be quickly integrated into machine learning workflows, making it a more practical choice for many applications.

Regarding the availability of online content in Portuguese, it is estimated that this language is among the top ten languages used on the internet.Footnote 109 The prevalence of online content in Portuguese reflects the linguistic, cultural, and demographic diversity of the Lusophone world, which includes countries such as Angola, Brazil, Cabo Verde, Equatorial Guinea, Guinea-Bissau, Mozambique, Portugal, São Tomé and Príncipe, and Timor-Leste. The online presence of these countries shows varying degrees of influence and focus, shaped by their unique sociocultural and geopolitical contexts. The differences in the language spoken in these locations and in their online availability, if not adequately addressed, can skew the behavior of AI systems towards a particular location.Footnote 110

Although Portuguese is an important language on the internet and accounts for around 3.16 percent of all websites, Brazil and Portugal dominate the production of digital content.Footnote 111 As the largest Portuguese-speaking country, Brazil is a leader in the production and distribution of online content in Portuguese. With over 200 million inhabitants and widespread internet access, Brazil accounts for the majority of digital content in this language. Brazilian online content is diverse and includes entertainment, education, news, and social media. The country has a strong presence on social platforms such as YouTube, Instagram, and TikTok. This solid online presence has led to Brazilian Portuguese being the most recognized online variety.Footnote 112

Although Portugal has a smaller population than Brazil, it contributes significantly to the Portuguese online corpus, especially in the fields of journalism, science, and literature. Over 67 percent of the Portuguese population is online, and the robust Portuguese academic sector provides a steady stream of scholarly publications in Portuguese, many of which are distributed worldwide via online platforms.Footnote 113 As major African Lusophone countries, Angola and Mozambique are increasingly contributing to the Portuguese-speaking online landscape. The expansion of internet infrastructure in these countries has enabled greater digital participation, especially among the younger population.Footnote 114 In Angola, online content often revolves around music, cultural expression, and sociopolitical issues. Content from Mozambique, while still developing, is gaining traction, particularly in the areas of education, social activism, and cultural preservation. These countries represent emerging voices in the Lusophone digital sphere.Footnote 115 Other Portuguese-speaking countries such as Cabo Verde, Guinea-Bissau, and Timor-Leste are contributing more niche content to the Portuguese-speaking online world. Although these countries have smaller populations and lower internet penetration, they offer unique perspectives that often focus on local culture, language preservation, and regional issues.

The dominance of Brazil and Portugal in the production of digital content and, consequently, in the datasets available for AI-driven tools has recently attracted the attention of the scientific community.Footnote 116 This topic has gained importance with the widespread adoption of these methods.Footnote 117 Recent studies on AI tools trained on general Portuguese-language data stress the need for specialized models to address the unique linguistic nuances from African Lusophone countries.Footnote 118 This problem is even more complex in the context of historical research, as differences in language can be amplified: not only due to the natural evolution of language over time, but also due to the lack of digital repositories that accurately represent the language used in earlier times. As a result, the implications of the underrepresentation of African-Portuguese data in the training of AI models for historical research remain largely unexplored.

Bias in Systems: The Impact of African Data Scarcity on AI Models in Historical Research

As we approach the final part of our discussion on the practical challenges of using AI models, we turn to concrete examples that illustrate the impact of the underrepresentation of African-Portuguese data, and reflect on the broader implications for digital history studies.

To put our theoretical concerns into practice, we compare three previously presented collections through the lens of AI: the Africana Collection at AHU, produced by Portuguese colonial agents; and two locally produced archives, the Dembos Collection and Sebestyén’s Caculo Cangola Collection. The Africana Collection is composed entirely in Portuguese and reflects the colonial administrative structure, imposing Portuguese norms, terminology, and hierarchical classifications. In contrast, the Dembos and Sebestyén collections provide an excellent example of the kind of archival material we expect to find in other Luso-African archives. The Dembos communities adopted writing based on their interactions with the Portuguese in Angola, whose language they incorporated into their own culture. This led to the production of documents written in Portuguese, yet infused with Kimbundu vocabulary and cultural expressions.Footnote 119 As Sebestyén notes, these locally produced African collections clearly reflect African perspectives and a microcosm.Footnote 120 Typically, a soba would deliver oral declarations which a local scribe would then transcribe into Portuguese, often preserving Kimbundu terms within the text. These linguistically and culturally hybrid documents are an example of the challenges that arise when analyzing African archival material with modern digital tools.

It is precisely when it comes to such hybrid and underrepresented sources that the limitations of AI become most apparent. Computational methods, in particular Natural Language Processing (NLP) models such as Named Entity Recognition (NER) and Large Language Models (LLMs), together with Social Network Analysis (SNA), have become increasingly vital in digital history research, allowing scholars to extract, analyze, and interpret vast amounts of information from historical archives.Footnote 121 However, these methods often encounter significant challenges when applied to African digital sources due to biases and representativeness issues inherent in the data.Footnote 122 In this section, we present experiments and discuss the uneven historical representation of African voices and perspectives and its effects on these standard methods in digital history studies.

The first example presented concerns the Named Entity Recognition models. NER is a computational technique within Natural Language Processing that automatically identifies and categorizes named entities (such as people, locations, organizations, titles, occupations, dates, and other significant concepts) within a text corpus.Footnote 123 By identifying and classifying these elements, NER transforms unstructured texts into structured data, facilitating a range of valuable analyses in digital history. For example, this process enables researchers to automatically and efficiently locate, track, and analyze specific historical figures, places, and events across large-scale archives, including digitized manuscripts, newspapers, letters, and other historical documents. Model selection plays an essential role in identifying suitable algorithms tailored for tasks like NER, enabling efficient historical text analysis.Footnote 124

In digital history studies, NER offers substantial benefits for uncovering patterns and connections that would take considerable time to surface through manual analysis. For instance, historians can employ NER to trace the appearance and frequency of individuals or social groups over time, revealing significant cultural or political shifts. Similarly, NER can help reconstruct historical networks by mapping relationships between people and places, which is essential for understanding past societies’ social and geopolitical dynamics.Footnote 125
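To illustrate how NER output can feed frequency counts and network reconstruction, the sketch below takes lists of person names per document and derives how often each name appears and which names occur together. The first list reuses names from the Dembos example discussed in this article; the second document is invented for illustration and does not correspond to any archival record.

```python
from collections import Counter
from itertools import combinations

# Entities per document, as an NER step might return them.
DOCUMENTS = [
    ["Ngolla Tumba", "Bulo Atumba", "Quilé Quissamba", "Paulo Domingos"],
    ["Ngolla Tumba", "Paulo Domingos"],  # invented second document
]


def entity_frequencies(documents):
    """Count the number of documents in which each entity appears."""
    return Counter(name for doc in documents for name in set(doc))


def cooccurrence_edges(documents):
    """Count how often each pair of entities appears in the same document."""
    edges = Counter()
    for doc in documents:
        # Sorting makes each pair order-independent: (A, B) equals (B, A).
        for pair in combinations(sorted(set(doc)), 2):
            edges[pair] += 1
    return edges


FREQUENCIES = entity_frequencies(DOCUMENTS)
EDGES = cooccurrence_edges(DOCUMENTS)
```

The resulting weighted pairs can be loaded into any network analysis tool. Co-occurrence is only a proxy for a historical relationship, so edges would still need to be checked against the documents themselves.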

Researchers have two main approaches when applying NER to historical corpora: using off-the-shelf pretrained models or developing custom models.Footnote 126 Pretrained models are machine learning models initially trained on large, general-purpose datasets. Fig. 1 (below) shows examples of entities detected by a popular off-the-shelf NER model for texts in Portuguese. As discussed before, these models leverage datasets, usually from the internet, to learn how to extract the entities. However, the dominance of content from Brazil and Portugal creates biases in these models, making them unable to identify entities with African-based vocabulary correctly.

Figure 1. Examples of entities recognized by a pretrained general-purpose model in a text with vocabulary specific to the Dembos Collection. The green, red, and gray boxes indicate, respectively, correct, incorrect, and missed entities. The original text can be translated as “Letters addressed to soba Ngolla Tumba, to Dembo Bulo Atumba, to Mane Quilé Quissamba, and to D. Paulo Domingos.”

The text in Fig. 1 was extracted from the Dembos Collection and features terms originating from the Kimbundu language (including soba, dembo, and mane). The colored boxes around parts of the text mark the entities identified by the model, together with the model’s classification of each entity. In this example, LOC corresponds to a location, PER to a person’s name, and TITLE to titles and occupations. Lastly, the color indicates whether the entity was correctly identified (green), incorrectly identified (red), or missed (gray).
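The three-way color scheme can be expressed as a small comparison between reference annotations and model output. The sketch below is a minimal illustration in Python. The function name `grade_entities`, the matching by exact surface string, and the example labels are assumptions made for illustration; they do not reproduce the annotations or the model output shown in Fig. 1.

```python
from typing import Dict


def grade_entities(gold: Dict[str, str], predicted: Dict[str, str]) -> Dict[str, str]:
    """Grade each reference entity as 'correct', 'incorrect', or 'missed'.

    Both arguments map an entity's surface string to its label
    (for example 'PER', 'LOC', or 'TITLE').
    """
    grades = {}
    for entity, gold_label in gold.items():
        if entity not in predicted:
            grades[entity] = "missed"      # gray box: not detected at all
        elif predicted[entity] == gold_label:
            grades[entity] = "correct"     # green box: detected, right label
        else:
            grades[entity] = "incorrect"   # red box: detected, wrong label
    return grades


# Hypothetical annotations and predictions, invented for illustration only.
gold = {"soba": "TITLE", "Ngolla Tumba": "PER", "D. Paulo Domingos": "PER"}
predicted = {"Ngolla Tumba": "LOC", "D. Paulo Domingos": "PER"}

grades = grade_entities(gold, predicted)
for entity, grade in grades.items():
    print(f"{entity}: {grade}")
```

Matching on exact strings keeps the sketch short; a real evaluation would more likely compare character offsets so that partially overlapping spans can be handled.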

Looking at the performance of the off-the-shelf model, we can see in Fig. 1 that the model identified most entities, missing only “soba.” However, three of the four entities it identified were classified incorrectly. All misclassifications involved entities containing African-based words, and the only correct one has a common Portuguese name structure. Although there are models trained on historical sources rather than on the internet and other modern data sources, such models are rare and are usually available only for specific languages such as English. To the best of our knowledge, no NER models are openly available for historical texts in African variants of Portuguese. In fact, there is a lack of resources for African languages in general.

One could argue that the performance issue observed in Fig. 1 might be solely due to archaic language, and to test this hypothesis, we also built a custom model. Custom-trained models have been specifically designed or adapted for a particular dataset or task rather than relying on generic, pretrained models. In digital history, this often means training the model to recognize patterns, relationships, or features unique to historical sources. In our case, we trained an NLP model on a Portuguese AHU data sample. This sample includes documents from the Africana Collection, and the resulting model has an overall accuracy of 96.5 percent in correctly identifying entities in texts from the AHU. More information on how this model was created can be found in our previous publications and code repository.Footnote 127

Fig. 2 (below) shows the performance of the custom model on the same text as in Fig. 1. In this example, the colors have the same meaning as in the previous example. For the entity labels, MALE corresponds to men’s names, and TITLE defines titles and occupations. We can see in Fig. 2a that the custom model presents the same issue as the pretrained one: the entities containing African vocabulary are not adequately identified. As a second experiment with the custom model, we replaced the missed and mislabeled entities with their equivalents in European Portuguese. Fig. 2b shows that after the substitution, the model was able to correctly identify all the entities in the text.

Figure 2. Examples of entities recognized in texts with vocabulary specific to the Dembos Collection using a model trained on data from the Portuguese Overseas Archive. Note: (a) shows the original text; and (b) illustrates the same text but with African vocabulary replaced by its equivalent in European Portuguese. The green, red, and gray boxes indicate, respectively, correct, incorrect, and missed entities. The original text can be translated as “Letters addressed to soba Ngolla Tumba, to Dembo Bulo Atumba, to Mane Quilé Quissamba, and to D. Paulo Domingos.”

The experiments depicted in Fig. 2 indicate that simply training models with historical data might not be enough to solve performance issues with NER for historical research. Although the text used in this example was extracted from the same digital archive used to train the model, there are still performance issues due to biases and the lack of representative examples of African vocabulary and sources in this archive. Despite the model being trained with samples from the Africana Collection of the AHU, this dataset is unbalanced and dominated by documents from Brazilian and Portuguese catalogs. Hence, this model does not properly capture the regional dialects or culturally specific names and terms in African documents. This imbalance in the data is likely the reason behind the poor model performance on African sources. In our case study, we are dealing with words from an African language, and the inclusion of material from other Portuguese-speaking countries allowed the creation of a vast corpus of digital documents. Unfortunately, solving the problem will likely be more complex for other African languages that lack machine-readable archives, thereby restraining the development of custom historical models.

In the following example, we look into the consequences of the issues discussed in Social Network Analysis. SNA is another well-known method that is applied in digital history research.Footnote 128 SNA and NER models are deeply connected as they work to map and analyze relationships within historical data.Footnote 129 While NER identifies and classifies named entities, these extracted entities serve as the foundation for SNA, where the relationships between these entities are analyzed to reveal historical social networks and interactions. For instance, a historian could use NER to extract references to key individuals from archival letters and then apply SNA to visualize networks of influence, alliance, or communication among those individuals over time.

Fig. 3 (below) depicts a social network of occupations from a sample of documents from the Africana and Dembos collections. This network was constructed by linking the occupations detected by the custom NER model described in the previous experiment. As we showed, this model cannot identify titles and occupations using African vocabulary, so these kinds of occupations were extracted using a method to identify terms defined in a dictionary.
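A dictionary-based lookup of this kind can be sketched in a few lines of Python. The article does not describe the actual method or the dictionary used, so the function name `find_dictionary_terms`, the three-term list, and the case-insensitive whole-word matching are all assumptions made for illustration; the sample sentence is the English translation quoted in the figure captions.

```python
import re
from typing import Iterable, List, Tuple


def find_dictionary_terms(text: str, terms: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Return (matched text, start, end) for every dictionary term found in text."""
    # Try longer terms first so that multi-word entries win over their prefixes.
    ordered = sorted(set(terms), key=len, reverse=True)
    if not ordered:
        return []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), m.start(), m.end()) for m in pattern.finditer(text)]


# Illustrative dictionary: the three Kimbundu-derived terms named in the text.
TITLE_TERMS = {"soba", "dembo", "mane"}

sentence = (
    "Letters addressed to soba Ngolla Tumba, to Dembo Bulo Atumba, "
    "to Mane Quilé Quissamba, and to D. Paulo Domingos."
)

matches = find_dictionary_terms(sentence, TITLE_TERMS)
for matched, start, end in matches:
    print(matched, start, end)
```

A lookup like this only finds spellings that are already in the dictionary, so orthographic variants in the sources would each need their own entry or a normalization step.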

Figure 3. Example of a network of entities and other significant words connected to them. Note: The highlighted blue nodes are entities linked to “soba” which would not be properly identified due to the issues with the NER model.

With a balanced dataset, the computational model can identify, extract, and link the entities from the text. This automatic creation can be seen in Fig. 3 for terms such as “capitão” and “governador” (captain and governor in English, respectively), indicating their relevance in the documentation’s communication patterns. However, the blue part of the network with the connections related to the soba, created with human assistance in this example, would be missed entirely. This missing part of the network could lead to wrong interpretations that the sobas were not as relevant as the other two actors or at least not as present in the documentation. As in the previous example, this is not an issue of the method employed but a limitation of the data derived from the low availability of historical texts in machine-readable format from African sources/languages.
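The effect of a missed entity type on the resulting network can be illustrated with a toy co-occurrence graph. The following is a minimal sketch using only the Python standard library. The three "documents" are invented lists of detected occupations, not data from the Africana or Dembos collections, and linking entities that appear in the same document is an assumed construction rule, since the article does not specify how edges were defined.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List


def cooccurrence_edges(docs: Iterable[List[str]]) -> Counter:
    """Count how often each pair of entities appears in the same document."""
    edges: Counter = Counter()
    for entities in docs:
        # Sort so that each undirected pair always has the same key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Invented toy input: occupations detected in three hypothetical documents.
docs = [
    ["capitão", "governador"],
    ["governador", "soba"],
    ["capitão", "governador", "soba"],
]

full_network = cooccurrence_edges(docs)

# Simulate an NER model that never detects "soba".
without_soba = cooccurrence_edges(
    [[entity for entity in doc if entity != "soba"] for doc in docs]
)

print("with soba:   ", dict(full_network))
print("without soba:", dict(without_soba))
```

In the second network every edge involving "soba" disappears, which mirrors the interpretive risk described above: the actor looks absent from the correspondence even though the documents mention it.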

The issues of bias, representativity, and context that impact NER and SNA models are also present in more modern methodologies, such as Large Language Models.Footnote 130 LLMs, such as GPT or BERT, are pretrained on massive datasets that reflect societal, cultural, and historical biases.Footnote 131 When applied to historical data, due to representativity issues, LLMs may present issues interpreting and representing historically marginalized voices or specific languages, as these elements are often underrepresented in the training data.Footnote 132 This limitation leads to gaps in understanding, particularly for historically nuanced or culturally diverse contexts. It can produce outputs that flatten historical narratives by imposing modern or generalized interpretations on historical sources.

In the next experiment, we selected a document from the Dembos Collection as an input for different LLMs. Our goal with this experiment was twofold: first, to assess the susceptibility of state-of-the-art AI models to representativity issues; and, second, to verify whether the large training data and the context sensitivity of LLMs would help them to recognize terms from the Dembos Collection properly. The input consisted of the following prompt: “You are an NER model specializing in historical African texts. Extract all the entities and their labels from the text ‘Letters addressed to soba Ngolla Tumba, to Dembo Bulo Atumba, to Mane Quilé Quissamba, and to D. Paulo Domingos.’” No additional information was provided to the model in this experiment. Moreover, due to limited contextual memory, the LLMs cannot process all texts simultaneously. Retraining the LLMs with our data was also not considered due to time and computational resource limitations. Fig. 4 (below) shows a few examples of outputs from three of the largest and most powerful LLMs.

Figure 4. Extraction of named entities from the historical Dembos Collection using LLMs. Note: (a), (b), and (c) present, respectively, the results of ChatGPT, Gemini, and Meta AI, three representatives of state-of-the-art LLM architectures. Notice that all models identified the entities but gave them the incorrect label of “Person.”

Besides the issues shown in Fig. 4, we also notice problems identifying the meaning of specific terms from African languages. For instance, the terms “muene nwale” and “muene mubanda,” found in documents from the Dembos Collection, refer to the first and second wives of the dembo (king), respectively. However, the historical sources with these interpretations are likely not present in the databases used to train the LLMs. Hence, the model could not “learn” the meaning of these terms in the provided context.

In previous experiments, we focused on recognizing key entities (NER) from text and the relations between them (SNA). The following experiments focus on gauging the digital history methods’ capability of interpreting the meaning of words rooted in the Kimbundu language. The meaning of words can shift depending on textual, historical, and geographic context. For instance, in literature, the term “liberty” may refer to different interpretations depending on whether it appears in a revolutionary-era pamphlet or a modern political speech. Geographically, a single word may carry distinct connotations or entirely different meanings, for example, “boot” refers to footwear in the US but to the trunk of a car in the UK. These variations demonstrate the importance of contextual information in accurately identifying the meaning of a word.

We designed the experiments to test the performance of digital history methods with different levels of contextual information. Fig. 5 (below) shows the level of contextual information used in these experiments. Level 1 tests the method’s capability of identifying the meaning just by knowing the language of the term. At this level, the process will output the usual sense of words commonly known. Otherwise, it will attempt to guess the meaning based on its etymology. Level 2 adds the textual context where the word appears. If the methods know the word, they will be able to make a better guess of its meaning in that particular context. If the model doesn’t know the word, it will use this context, along with the etymology, to guess the meaning. In the last level, level 3, we add the geographic and historical context, which, again, will aid in further narrowing down the possible meaning of the word in that context.

Figure 5. LLMs terms knowledge assessment. Note: Level 1 (Language Context) represents the most basic context you can provide to LLMs and expect meaningful results about the meaning of a term. In Level 2, besides the language, we also offer an example of text where the term appears. Lastly, in Level 3, we also provide the historical and geographic context of the text where the term appears.

For these experiments, we utilize Sebestyén’s Caculo Cangola Collection to determine whether LLMs can accurately identify the meanings of eight terms present in this collection. LLMs were chosen again since they represent the current state-of-the-art computational models for automatically identifying the meanings of a term in different contexts. Following are the prompts used for each context level:

  • Level 1: “Could you please explain the meaning of the term ‘X’, which has its roots in Kimbundu?”

  • Level 2: “Could you please explain the meaning of the term ‘X’ in this text, given its origin from Kimbundu? Textual Context: ‘Y’”

  • Level 3: “Could you please explain the meaning of the term ‘X’ in this text, which was extracted from an 18th-century document in the Dembos region of Angola? The language used in this document is a blend of Portuguese and Kimbundu. Textual Context: ‘Y’”
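The three prompts above differ only in how much context they supply, so they can be generated from a single template table. The following Python sketch shows one way to do this; the function name `build_prompt` and the validation rules are assumptions for illustration, while the template wording is copied from the prompts listed above with ‘X’ and ‘Y’ turned into placeholders.

```python
# Prompt wording taken from the three context levels listed above.
PROMPT_TEMPLATES = {
    1: (
        "Could you please explain the meaning of the term ‘{term}’, "
        "which has its roots in Kimbundu?"
    ),
    2: (
        "Could you please explain the meaning of the term ‘{term}’ in this text, "
        "given its origin from Kimbundu? Textual Context: ‘{context}’"
    ),
    3: (
        "Could you please explain the meaning of the term ‘{term}’ in this text, "
        "which was extracted from an 18th-century document in the Dembos region "
        "of Angola? The language used in this document is a blend of Portuguese "
        "and Kimbundu. Textual Context: ‘{context}’"
    ),
}


def build_prompt(level: int, term: str, context: str = "") -> str:
    """Fill the template for the given context level (1, 2, or 3)."""
    if level not in PROMPT_TEMPLATES:
        raise ValueError(f"unknown context level: {level}")
    if level > 1 and not context:
        raise ValueError("levels 2 and 3 need the passage where the term appears")
    return PROMPT_TEMPLATES[level].format(term=term, context=context)


# Example: the Level 1 prompt for one of the terms discussed in the article.
print(build_prompt(1, "banza"))
```

Keeping the templates in one table makes it easier to send exactly the same wording to every model, so that differences in the answers are not caused by differences in phrasing.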

Table 1 (below) shows the results of the experiment for eight terms/words from Sebestyén’s Caculo Cangola Collection. We compared six of the most advanced LLMs currently available: ChatGPT (GPT-4-turbo), Claude (Claude Sonnet 4), DeepSeek (DeepSeek-R1), Grok (Grok 3), Gemini (Gemini 2.5 Pro), and Llama (Llama 4). For each term, the technique scores 3, 2, or 1, respectively, if it correctly identifies the meaning using levels 1, 2, or 3. A value of 0 means that it was unable to provide the correct meaning after all the context information was provided.

Table 1. Experiments assessing the LLMs’ knowledge of the meaning of some terms originating from Kimbundu in Sebestyén’s Caculo Cangola Collection

Note: Three points mean that the LLM correctly reported the meaning of the term with Level 1 of contextual information. Two points mean that the correct meaning was identified only in Level 2. One point was given when the LLM identified the correct meaning with Level 3 of contextual information. Lastly, zero means that the technique was not able to identify the proper meaning with all contextual information provided.
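The scoring rule used for Table 1 can be written as a short function. The sketch below is a minimal Python illustration; the name `score_term` and the input format (one boolean per context level, recording whether the answer at that level was judged correct) are assumptions, and the judgment of correctness itself is taken to be made by a human reader.

```python
from typing import Sequence


def score_term(correct_at_level: Sequence[bool]) -> int:
    """Score one term for one model.

    correct_at_level[0], [1], and [2] record whether the answer was judged
    correct at context Level 1, 2, and 3. The score is 3, 2, or 1 for the
    first level at which the meaning was correct, and 0 if it never was.
    """
    if len(correct_at_level) != 3:
        raise ValueError("expected exactly three context levels")
    for level, is_correct in enumerate(correct_at_level, start=1):
        if is_correct:
            return 4 - level  # Level 1 -> 3 points, Level 2 -> 2, Level 3 -> 1
    return 0


# Hypothetical judgments for a single term, invented for illustration.
print(score_term([False, False, True]))
```

Under this rule a higher score means that the model needed less context, so a column of zeros and ones signals that the term is effectively unknown to the model.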

The results in Table 1 reveal that for most words, all techniques were unable to provide the expected meaning based on the contextual information. The two main exceptions were the case of “banza” and “sanzala.” In both cases, the answers provided by the LLMs indicate that the data source from which they “learned” the meaning came from locations outside Africa. For instance, one of the justifications for the meaning of “banza”—provided by ChatGPT—is based on official documents from the Portuguese Empire: “The Portuguese colonial administration adopted and adapted the term, using it in documents to refer to Indigenous settlements or administrative units.”

Concerning the “sanzala,” most of the techniques pointed to Brazilian sources that used a variation of this word. Observing the response from DeepSeek, for example:

The term gained particular historical significance during the period of the Atlantic slave trade and colonialism. In Brazil, “Senzala” (a Portuguese adaptation of the Kimbundu term) came to refer to the slave quarters on plantations - a tragic transformation of a word that originally signified community and belonging into one associated with oppression and forced labor.

The only exception was when the contextual information seemed to aid in identifying the proper meaning of the word “Cajinga.” After the textual, historical, and geographical context, all methods assumed that the word in that context could indicate a symbol of the soba’s power.

It is also worth noting that these LLMs have different main applications. While ChatGPT, DeepSeek, Gemini, and Llama are intended for general usage and are more susceptible to model hallucination (that is to say, when the model generates false, nonsensical, or factually incorrect information that isn’t grounded in its training data or reality), Claude was designed for writing and research tasks, prioritizing the safety and reliability of the answers. Lastly, Grok differentiates itself from the others due to its real-time access to information on the X/Twitter platform.

The differences in the characteristics of the LLMs help us understand their capability and reliability for translation and interpretation tasks. For instance, when asked about the meaning of dambo, the general-purpose LLMs would give answers such as the one provided by Gemini:

It was a physically constructed feature dug illegally by agents of Soba Cacullu Cacabassa on land claimed by Soba N’Bango a Caputo. Its primary purpose was not drainage or agriculture, but as an audacious act of trespass and a concrete symbol of asserting control and governance (governor) over the disputed territory on behalf of the intruder Cambanda.

The links provided for the sources often did not work, raising concerns about the reliability of the answers. Of all the methods assessed, Claude was the only one that remarked on being unable to find sources to support its answers. A representative statement in these cases was provided for the term “cajinga”: “While the exact meaning remains uncertain due to limited documentation, the evidence suggests it is a physical item with deep cultural importance in 18th-century Ambundu society.” Still, the answers provided in most cases had an incorrect interpretation of the word in the given context.

Nevertheless, it is important to stress that the challenges machine learning models face in capturing accurate representations of historical and cultural data are not inherent flaws in the models themselves but rather reflect the limitations of the data used to train them. Data-driven models, such as the ones exemplified in this manuscript, are fundamentally shaped by their training datasets’ quality, diversity, and representativeness. The examples provided show that the models inevitably reflect gaps and biases inherited from the lack of detailed, balanced digital representations of African historical figures, events, and cultural contexts.

Moreover, LLM training processes have raised significant concerns about the fair use of copyrighted text. As a result, a substantial fraction of the digital knowledge available (such as books, scientific papers, non-public archives, and private collections), even when publicly accessible, cannot be used in large-scale AI training. This leads to situations where the models might “know” that these books or sources exist because the title and abstract are publicly available, while their content is not publicly available or has copyright protection. Hence, the AI models will not learn the additional interpretations of terms defined in these sources.

In a context where African digital archives are incomplete or disproportionately feature perspectives from certain groups, the applicability, reliability, and usefulness of digital data-driven AI methods for historical research are limited. Addressing these limitations requires a fundamental effort to expand and diversify digital archives and datasets to improve model performance and ensure that the richness and complexity of historical and cultural narratives are accurately represented in digital historical research.

Challenges and Recommendations in African Digital Projects

As discussed above, African Digital Humanities projects continue to face a number of challenges. From an infrastructural perspective, two main problems stand out: limited support for low-resource languages and the persistent scarcity of data. The first refers to the shortage of digital resources, such as annotated corpora or linguistic tools, which complicates modeling and narrows its possibilities. Since many African languages are still severely underrepresented in computational tools, the application of NLP methods to historical data is made more difficult.

However, the inclusion of local African languages in the digital domain is a challenge that can potentially be overcome. In recent years, more initiatives have been launched to facilitate the integration of African languages into digital platforms. Examples include the Masakhane project and the Knowledge 4 All Foundation, both of which focus on the collection of data and statistical modeling that can be used for computer-aided learning and natural language processing. These projects try to develop effective machine tools and models for some African native languages, notably including Ewe, Wolof, and Yoruba.Footnote 133 David Adelani, who is responsible for the development of the Yoruba-language model, points out several challenges. These include the complex structure of the language with its numerous dialects, which limits the availability of resources for creating fine-tuned models or supervised learning systems. He also emphasizes the urgent need to develop more cross-domain datasets to ensure the model works across different topics.Footnote 134 Even if there are no corresponding proposals for NLP for the historical texts yet, this is already a big step towards bringing more African languages into focus.

Fortunately, this linguistic progress parallels the growth of the local tech ecosystem in Africa, which is a promising prospect for Africa’s digital landscape. In recent years, four cities—Nairobi, Lagos, Cape Town, and Cairo—have emerged as tech hubs. Homemade digital technology startups, as Tevin Tafese calls the companies founded and built in Africa that adapt to a structurally weak environment and are developed by locals for locals, can have a positive impact on digital inclusion as they train young professionals in digital and programming skills and as professional software developers.Footnote 135

The scarcity of historical documentation, the second and arguably more pressing challenge, arises both from the lack of adequate historical records dating back to the early modern period and from persistent problems with archive management. This has led to a significant deficit of colonial historical records from African archives. To address this, a concerted effort is needed to collect data from local archives in Angola, Cabo Verde, and other African countries to ensure that historical information is accurately represented. However, such initiatives require greater government commitment, funding for education, and investment in heritage preservation. Converting data into machine-readable formats should be the next step for digital practitioners. This would not only help to protect endangered records, but also ensure their sustainability in the long term. Finally, it is also important to encourage data sharing initiatives. Such collaboration can enable the fine-tuning of models, even those pretrained on non-historical topics, to maximize their utility in the representation and analysis of historical records. Although there are efforts from the scientific community toward developing AI models capable of learning and extracting knowledge from a reduced amount of text, it is crucial to stress that even these models will struggle due to the current lack of digital historical sources. Moreover, researchers aiming to apply these AI models in analyzing historical texts must acknowledge and address their current limitations to minimize the risk of incorrect or biased interpretations. Additionally, a comprehensive package of measures is needed to ensure broader and more efficient access to African archives. While a budget increase is an important element, it should primarily be used to promote digitization, the recruitment of qualified staff, specialized training, the acquisition of the necessary material resources, and the development or adaptation of a platform to facilitate online consultation of these records. A greater dissemination of archives and digitized material would also help to increase the number of available sources while reducing bias in academic and scientific research.

These efforts, however, cannot be separated from the broader context of Africa’s digital landscape, as digital infrastructures are shaped by both external and internal pressures. Melody Musoni notes that these infrastructures are under the competing influence of three dominant digital spheres: the US, China, and Europe, each seeking to exert influence through its own principles. The US emphasizes the “promotion of open, secure and interoperable digital ecosystems” in line with American values. China offers security technologies, training, and support for surveillance and reconnaissance systems as part of its Digital Silk Road initiative. Europe, meanwhile, is focusing on digital human rights and robust data protection standards. Musoni has also formulated policy recommendations for African governments, especially for the protection of human rights offline and online, such as internet freedom and security, personal data protection, and more. These digital policies should encourage governments to enact new laws that recognize digital rights and give African communities the opportunity to learn digital skills and capabilities, and eventually gain more confidence in digital platforms.Footnote 136

Although many of the recommendations are beyond our immediate reach, there is one crucial step we can take as digital humanities professionals: to share our data and actively contribute to a culture of data sharing. Dylan Ruediger and Ruby MacDougall rightly identify a number of challenges associated with data sharing in the humanities and point out that a fundamental problem is that too little attention has been paid in the past to the value of data sharing and the importance of reproducibility and replicability in scholarly methods. Not only do we agree with these observations, but we also advocate for more transparency in our digital practices. The more openly we discuss our workflows and acknowledge the biases present in both archival sources and technological systems, the better we contribute to the “technology of recovery.”Footnote 137

Acknowledgements

This research is supported by the National Science Centre of Poland (Narodowe Centrum Nauki - NCN) grant number 2022/45/B/HS3/00473, project: “Imperial Commoners of Brazil and West Africa (1640-1822): global history from a correspondence network perspective.” The authors would like to thank the editors and reviewers for their valuable and constructive feedback. They are also grateful to Professor John K. Thornton for sharing Éva Sebestyén’s collection and Fernando Hélder Panzo Macaia for important information on the National Archives of Angola (Arquivo Nacional de Angola), in particular on the colonial documents held in these archives.

References

1 Roopika Risam and Kelly Baker Josephs, eds., The Digital Black Atlantic (Minneapolis: University of Minnesota Press, 2021); Kaby Wing-Sze Kung, ed., Reconceptualizing the Digital Humanities in Asia: New Representations of Art, History and Culture, vol. 2 (Singapore: Springer Nature, 2020); Héctor Fernández L’Hoeste and Juan Carlos Rodríguez, eds., Digital Humanities in Latin America (Gainesville: University Press of Florida, 2023).

2 Projects created in the USA: Robert Nowatzki, “From Datum to Databases: Digital Humanities, Slavery, and Archival Reparations,” The American Archivist 83, no. 2 (2020): 429–48; Sharon Block, “#DigEarlyAm: Reflections on Digital Humanities and Early American Studies,” The William and Mary Quarterly 76, no. 4 (2019): 611–48.

3 Susanna Allés-Torrent, Megan Jeanette Myers, and Élika Ortega, “New Dialogues in Spanish and Portuguese Studies: Pedagogical and Theoretical Perspectives from the Digital Humanities,” Hispania 104, no. 4 (2021): 535–41.

4 Kim Gallon, “Making a Case for the Black Digital Humanities,” in Debates in the Digital Humanities 2016, eds. Matthew K. Gold and Lauren F. Klein (Minneapolis: University of Minnesota Press, 2016), 42–49; Sara Collini, “Building a Relational Database to Explore Enslaved Midwives’ Work in Early America,” in American Revolutions in the Digital Age, eds. Nora Slonimsky, Mark Boonshoft, and Ben Wright (Ithaca, NY: Cornell University Press, 2024), 65–80; Safiya Umoja Noble, “Toward a Critical Black Digital Humanities,” in Debates in the Digital Humanities 2019, eds. Matthew K. Gold and Lauren F. Klein (Minneapolis: University of Minnesota Press, 2019), 27–35.

5 Roopika Risam, “Decolonizing the Digital Humanities in Theory and Practice,” in The Routledge Companion to Media Studies and Digital Humanities, ed. Jentery Sayers (New York: Routledge, 2018), 78–86.

6 Marcus P. Nevius, “Who Stands in the Digital Shadows?: ‘City of Refuge’ at the Intersection of ‘Old’ and ‘New’ Media in the Age of the Digital Humanities,” in Slonimsky, Boonshoft, and Wright, American Revolutions, 220–34.

7 Collini, “Building a Relational Database,” 65–80.

8 Peter Limb, “The Digitization of Africa,” Africa Today 52, no. 2 (2005): 3–19.

9 Jennifer Guiliano, Roopika Risam, Leah Junck, and James Yékú, “Editors’ Note: April 2024,” Reviews in Digital Humanities 5, no. 4 (2024), accessed 2 Dec. 2024, https://reviewsindh.pubpub.org/pub/editors-note-april-2024/release/1; Mary Minicka, “Towards a Conceptualization of the Study of Africa’s Indigenous Manuscript Heritage and Tradition,” Tydskrif vir Letterkunde 45, no. 1 (2008): 143–63; Tunde Opeibi, “Digitizing the Humanities in an Emerging Space,” in Risam and Josephs, Digital Black Atlantic, 162; Mary Minicka, “Timbuktu Rare Manuscripts Project: Promoting African Partnerships in the Preservation of Africa’s Heritage,” ESARBICA Journal 25 (2006): 28. In 2017, a list of over 130 Black Digital Humanities projects was published in Fire!!!: @CCP_org., “Black Digital Humanities Projects & Resources: A List of Projects, Resources, Events, and Anything Else,” Fire!!! 4, no. 1 (2017): 134–39.

10 Programme in African Digital Humanities, “Programme in African Digital Humanities, 2018–2023,” Wits Institute for Social and Economic Research, University of the Witwatersrand, accessed 10 Dec. 2024, https://wiser.wits.ac.za/page/programme-african-digital-humanities-2018-2023-13069.

11 Journal of the Digital Humanities Association of Southern Africa (DHASA), South African Centre for Digital Language Resources, accessed 10 Dec. 2024, https://digitalhumanities.org.za/jdhasa-journal/.

12 Galen Ayana et al., “Decolonizing Global AI Governance: Assessment of the State of Decolonized AI Governance in Sub-Saharan Africa,” Royal Society Open Science 11, no. 8 (2024): 231994, accessed 25 Aug. 2025, https://doi.org/10.1098/rsos.231994.

13 Guiliano, Risam, Junck, and Yékú, “Editors’ Note.”

14 Minicka, “Towards a Conceptualization,” 143–63; Opeibi, “Digitizing the Humanities,” 162; Minicka, “Timbuktu Rare Manuscripts Project,” 28.

15 Dominique Somda, “Review: Open Restitution Africa,” Reviews in Digital Humanities 5, no. 4 (2024), accessed 25 Aug. 2025, https://doi.org/10.21428/3e88f64f.c79d3856; Frank Onuh, “Review: Paint Me Black,” Reviews in Digital Humanities 5, no. 4 (2024), accessed 25 Aug. 2025, https://doi.org/10.21428/3e88f64f.b3a87aa3; Ama Bemma Adwetewa-Badu, “Review: African Digital Heritage,” Reviews in Digital Humanities 5, no. 4 (2024), accessed 25 Aug. 2025, https://doi.org/10.21428/3e88f64f.e16773b2; Titilola Aiyegbusi, “Review: Pollicy,” Reviews in Digital Humanities 5, no. 4 (2024), accessed 25 Aug. 2025, https://doi.org/10.21428/3e88f64f.7c9196d9.

16 Ravynn K. Stringfield, “Exploring Constellations of Care and Professionalization in Black Feminist Digital Humanities: A Black Woman Graduate Student’s Reflection,” in Feminist Digital Humanities: Intersections in Practice, eds. Lisa Marie Rhody and Susan Schreibman (Urbana: University of Illinois Press, 2025), 143–58.

17 Fungai Machirori, “An Exploration of African Digital Cosmopolitanism” in Doing Digital Migration Studies: Theories and Practices of the Everyday, eds. Koen Leurs and Sandra Ponzanesi (Amsterdam: Amsterdam University Press, 2024).

18 Allés-Torrent, Myers, and Ortega, “New Dialogues,” 535–41.

19 Umoja Noble, “Toward a Critical Black Digital Humanities,” 31–32.

20 Nevius, “Who Stands in the Digital Shadows?,” 220–34.

21 Limb, “The Digitization of Africa,” 3–19. See also: Janet L. Stanley, “African Material Culture Information Network,” History in Africa 21 (1994): 377–79.

22 Yunusa Z. Ya’u, “Globalisation, ICTs, and the New Imperialism: Perspectives on Africa in the Global Electronic Village,” Africa Development / Afrique et Développement 30, nos. 1/2 (2005): 98–124.

23 James T. Murphy, Pádraig Carmody, and Björn Surborg, “Industrial Transformation or Business as Usual? Information and Communication Technologies and Africa’s Place in the Global Information Economy,” Review of African Political Economy 41, no. 140 (2014): 264–83.

24 Abeba Birhane, “Algorithmic Colonization of Africa,” SCRIPTed 17, no. 2 (2020): 389, accessed 25 Aug. 2025, https://script-ed.org/?p=3888, DOI: 10.2966/scrip.170220.389; Naome A. Etori, Maurice Dawson, and Maria Gini, “Double-Edged Sword: Navigating AI Opportunities and the Risk of Digital Colonization in Africa,” in MWAIS 2024 Proceedings (2024), accessed 25 Aug. 2025, https://aisel.aisnet.org/mwais2024/25; Ayana et al., “Decolonizing Global AI Governance.”

25 Fabienne Chamelot, Vincent Hiribarren, and Marie Rodet, “Archives, the Digital Turn, and Governance in Africa,” History in Africa 47 (2020): 101–18.

26 Melody Musoni, Litha Mzinyati, and Deon Cloete, “Futures of Digital Rights in SADC: Towards a Common Approach to Digital Protection and Data Justice,” South African Institute of International Affairs, Occasional Paper No. 358 (2024), accessed 25 Aug. 2025, https://saiia.org.za/wp-content/uploads/2024/09/SAIIA_OP_358_FutureDigitalRights.pdf; Ronak Gopaldas, “Digital Dictatorship versus Digital Democracy in Africa,” South African Institute of International Affairs, Policy Insights 75 (2019), accessed 25 Aug. 2025, https://saiia.org.za/wp-content/uploads/2019/10/Policy-Insights-75-gopaldas.pdf; Idris Ademuyiwa and Adedeji Adeniran, “Policies to Promote Digital Enterprise and Entrepreneurship in Africa,” Centre for International Governance Innovation, Assessing Digitalization and Data Governance Issues in Africa, 1 Jul. 2020, 13–15, accessed 25 Aug. 2025, https://www.cigionline.org/sites/default/files/documents/no244_0.pdf.

27 Maria Emília Madeira Santos, “Prefácio,” in Africæ Monumenta: A Apropriação da Escrita pelos Africanos, eds. Ana Paula Tavares and Catarina Madeira Santos (Lisboa: Instituto de Investigação Científica Tropical, 2002), 9–20; António de Almeida, Relações com os Dembos: Das Cartas do Dembado de Kakulu-Kahenda (Lisboa: Sociedade Nacional de Typographia, 1938); David Magno, Guerras Angolanas: A Nossa Ação nos Dembos (Porto: Companhia Portuguesa Editora, 1937); Daiana Lucas Vieira, “As Cartas do Dembo Caculo Cacahenda: Um Pouco da História dos Dembos e da Relação Deste com as Autoridades Portuguesas Situadas em Angola (1780–1850),” Veredas da História 7, no. 1 (2014); David Magno, “Os Dembos nos Anais de Angola e Congo (1484–1912),” Revista Militar 69 (1917): 48–63; David Magno, Guerras Angolanas: A Nossa Ação nos Dembos (Porto: Companhia Portuguesa Editora, 1934). Éva Sebestyen’s collection from the archives at Caculo Cangola includes: Luamba (1717 and 1796); Ngolombe a Queta (1779); Samba Caju (1689); and Tuto (1671). Quoted in John Thornton, A History of West Central Africa (Cambridge: Cambridge University Press, 2020), 293n113. The authors thank Prof. Thornton for sharing the collection.

28 Érika Melek Delgado, “Freedom Narratives: The West African Person as the Central Focus for a Digital Humanities Database,” History in Africa 48 (2021): 35–59; Carmelita N. Pickett, “The Trans‐Atlantic Slave Trade Database: Voyages,” Reference Reviews 24, no. 5 (2010): 65–66.

29 South African Centre for Digital Language Resources, “SADiLaR – South African Centre for Digital Language Resources,” accessed 27 Dec. 2024, https://sadilar.org/en/.

30 The Arquivo Histórico Ultramarino was founded in the twentieth century, originally in 1931 as the Arquivo Histórico Colonial. Its purpose was to centralize colonial documents in one place. However, during the period under study, the Portuguese administration, together with other institutions, produced documents that were later incorporated into the Arquivo Histórico Ultramarino. Arquivo Histórico Ultramarino, website, “História,” accessed 5 Jan. 2025, https://ahu.dglab.gov.pt/historia/. Some of the most notable works on the Arquivo Histórico Ultramarino are: Maria Luísa Cunha Meneses Abrantes and José Sintra Martinheira, “A Modernização do Arquivo Histórico Ultramarino e a Valorização do Património Documental,” Africana 24 (2002): 121–48; Ana Canas Delgado Martins, Governação e Arquivos: D. João VI no Brasil (Lisboa: Instituto dos Arquivos Nacionais/Torre do Tombo, 2007); Maria da Conceição Lopes Casanova and Ana Canas Delgado Martins, “Práticas e Políticas Arquivísticas e de Conservação no AHU: Passado e Presente,” in Viagens e Missões Científicas nos Trópicos, 1883–2010, eds. Ana Cristina Martins and Teresa Albino (Lisboa: Instituto de Investigação Científica Tropical, 2010), 175–81; Filipa Lowndes Vicente, “Arquivo Histórico Ultramarino. Como Reinventar um Arquivo Histórico Colonial?,” Project Re-Mapping Memories. Lisboa-Hamburg. Lugares de Memória, 2021, accessed 25 Aug. 2025, https://www.re-mapping.eu/pt/lugares-de-memoria/arquivo-historico-ultramarino; Gautier Garnier, “Arquivos da memória colonial portuguesa: o Arquivo Histórico Ultramarino, 1931–1974,” Ler História 85 (2024): 143–64.

31 David Birmingham, Portugal and Africa (London: Palgrave Macmillan, 1999), 51–81; Miguel Geraldes Rodrigues, “The Portuguese Conquest of Angola in the Sixteenth and Seventeenth Centuries (1575–1641),” in The First World Empire: Portugal, War and Military Revolution, eds. Hélder Carvalhal, André Murteira, and Roger Lee de Jesus (New York: Routledge, 2021), 169–85; Mariana Candido, An African Slaving Port and the Atlantic World: Benguela and its Hinterland (Cambridge: Cambridge University Press, 2013); Gastão Sousa, Os Portugueses em Angola (Lisboa: Agência Geral de Ultramar, 1959); Ana Paula Tavares and Catarina Madeira Santos, “Fontes escritas africanas para a história de Angola,” Fontes & Estudos. Revista do Arquivo Histórico Nacional 4–5 (1998/99): 103; António Manuel Hespanha, Filhos da Terra: Identidades Mestiças nos Confins da Expansão Portuguesa (Lisboa: Tinta-da-china, 2019), 74; Ilídio do Amaral, O Consulado de Paulo Dias de Novais: Angola no Último Quartel do Século XVI e Primeiro do Século XVII (Lisboa: Instituto de Investigação Científica Tropical, 2000).

32 Luiz Felipe de Alencastro, O trato dos viventes: Formação do Brasil no Atlântico Sul, séculos XVI e XVII (São Paulo: Companhia das Letras, 2000), 111; Charles Ralph Boxer, Salvador de Sá and the Struggle for Brazil and Angola, 1602-1686 (London: London University Press, 1952).

33 Arquivo Histórico Ultramarino, Lisboa (AHU), Conselho Ultramarino (CU), Serviço de Partes, caixa 1, doc. 80; Angola, caixa 54, doc. 4951; and Angola, caixa 69, doc. 5962, referencing events including the defeat at Soyo (1670), Pyrrhic victory at Mpungu over Ndongo (1672), and the invasion of Matamba and Kasanje; John Thornton, Warfare in Atlantic Africa, 1500–1800 (London: Routledge, 1999).

34 Thornton, A History of West Central Africa, 250.

35 Boxer, Salvador de Sá, 138–39; Provisão de João Fernandes Vieira criando a Câmara de Massangano, 18 July 1658, in António Brásio, Monumenta Missionaria Africana (MMA), series 12 (Lisboa: Agência Geral de Ultramar, 1981), 169–70; Thornton, A History of West Central Africa, 250–52.

36 AHU, CU, Angola, caixa 12, d.1441; caixa 29, doc.280; Beatrix Heintze, “Luso-African Feudalism in Angola? The Vassal Treaties of the 16th to the 18th Century,” Revista Portuguesa de História 18 (1980): 111–31; Beatrix Heintze, “The Angolan Vassal Tributes of the 17th Century,” Revista de História Económica e Social 6 (1980): 57–78; David Wheat, “Garcia Mendes Castelo Branco, Fidalgo de Angola y Mercader de Esclavos en Veracruz y el Caribe a Principios del Siglo XVII,” in Debates Históricos Contemporáneos: Africanos y Afrodescendientes en México y Centroamérica, ed. María Elisa Velásquez (México: Centro de Estudios Mexicanos y Centroamericanos, 2011).

37 About the African oral tradition see: Phillip Curtin, “Oral Traditions and African History,” Journal of the Folklore Institute 6, no. 2/3 (1969): 137–55; Jan Vansina, “Once Upon a Time: Oral Traditions as History in Africa,” Daedalus 100, no. 2 (1971): 442–68; Gerald Moser, “Oral Traditions in Angolan Story Writing,” World Literature Today 53, no. 1 (1979): 40–45; Beatrix Heintze, “Written Sources, Oral Traditions as Written Sources: The Steep and Thorny Way to Early Angolan History,” Paideuma 33 (1987): 263–87.

38 Tavares and Madeira Santos, “Fontes escritas africanas,” 86–87, 91.

39 Diplomatic correspondence remains one of the most important sources for the extent and existence of this practice. Historians have analyzed the correspondence between major regional rulers and the Europeans, paying particular attention to the Kingdom of Kongo because of its longstanding relations with Portugal and, to a lesser extent, with the Holy See. The use of writing in diplomatic correspondence is also documented in other regions of the continent, for example in correspondence with the Negus of Ethiopia or the rulers of Dahomey. Luís Nicolau Parés, “Cartas do Daomé: uma introdução,” Afro-Ásia 47 (2013): 295–395.

40 Mbanza Kongo or São Salvador (according to European documents) was the capital and political center of the Kingdom of Kongo. Ilídio do Amaral, “Mbanza Kongo, Cidade do Congo, ou São Salvador: Contribuição para o conhecimento geográfico de uma aglomeração urbana africana ao sul do Equador, nos séculos XVI e XVII,” Garcia de Orta: revista da Junta de Missões Geográficas e de Investigações do Ultramar: Série de Geografia 12, no. 1–2 (1987): 1–40; Linda Heywood, “Mbanza Kongo/São Salvador: Culture and the Transformation of an African City, 1491 to 1670s,” in Africa’s Development in Historical Perspective, eds. Emmanuel Akyeampong, Robert H. Bates, Nathan Nunn, and James Robinson (Cambridge: Cambridge University Press, 2014), 366–90.

41 AHU, CU, Cx, 39, doc. 3718, Carta do [governador e capitão-general de Angola], conde do Lavradio, António de Almeida Soares Portugal de Alarcão Eça e Melo, ao rei D. João V, Angola; Richard Gray, “Come vero Prencipe Catolico: the Capuchins and the rulers of Soyo in the late seventeenth century,” Africa 53, no. 3 (1983): 39–54; David Birmingham, Portugal and Africa (London: Palgrave Macmillan, 1999), 63–81; Jan Vansina, “Portuguese vs Kimbundu: Language Use in the Colony of Angola (1575 – c. 1845),” Bulletin des séances 47, no. 7 (2001): 272.

42 The rulers of Kongo bore various titles. The best known was mwene, which, as Olfert Dapper emphasizes, was followed by the name of the territory or place over which a person ruled. They also bore other titles, such as ntotela/ntotila, which according to Jan Vansina means “the one who gathers the people around him.” Dapper, Description de l’Afrique, contenant les noms, la situation & les confins de toutes ses parties… (Amsterdam: Chez Wolfgang, Waesberge, Boom & van Someren, 1686): 351–52; Jan Vansina, Paths in the Rainforest: Toward a History of Political Tradition in Equatorial Africa (Madison: University of Wisconsin Press, 1990): 156; Koen Bostoen, Odjas Ndondu Tshiyayi, and Gilles-Maurice de Schryver, “On the Origin of the Royal Kongo Title Ngangula,” Africana Linguistica 19 (2013): 54–75.

43 There is also evidence of communication between mwene and agents of the West India Company (WIC) in the seventeenth century. These would be the cases of Dom Garcia Afonso, mwene Mbamba and later King of Kongo, or Dom Daniel da Silva, mwene Soyo. Nationaal Archief (National Archives of the Netherlands), The Hague (NA), Chamber of Zeelandia, Letter in Portuguese from Don Paulo, Count of Sonho, to the Governor General Johan Maurits of Nassau and to the Councils of Brazil, 16 Aug. 1638; Biblioteca Nacional de España, Madrid (BNE), Manuscritos S-106, Alvará de D. Garcia II recomendando los capuchinos a su pueblo, 19 Sep. 1648; BNE, Manuscritos S-106, Respuesta del duque de Bata, D. Manuel Afonso, al alvará de Garcia II, 16 Nov. 1648.

44 John Thornton, “Kongo Administration and Written Documentation,” Portuguese Studies Review 30, no. 2 (2022): 25.

45 Between 1709 and 1715, the Kingdom of Kongo was reunited under the leadership of Pedro IV after a long civil war. John Thornton, The Kingdom of Kongo: Civil War and Transition, 1641–1718 (Madison: University of Wisconsin Press, 1983); John Thornton, The Kongolese Saint Anthony: Dona Beatriz Kimpa Vita and the Antonian Movement, 1684–1706 (Cambridge: Cambridge University Press, 1998), 10–35; Adrian Hastings, “The Christianity of Pedro IV of the Kongo, ‘The Pacific’ (1695–1718),” Journal of Religion in Africa 28, no. 2 (1998): 145–59.

46 Alfredo de Sarmento, Os sertões d’Africa (apontamentos de viagem) (Lisboa: Francisco Arthur da Silva, 1880), 52.

47 Ibid., 59.

48 Louis Jadin, “Rapport sur les recherches aux archives d’Angola, du 4 juillet au 7 septembre 1952,” Bulletin des Séances de l’I.R.C.B. 24, no. 1 (1952): 159.

49 Adolph Bastian, Ein Besuch in San Salvador, der Hauptstadt des Königreichs Kongo (Bremen: Druck und Verlag von H. Strack, 1859), 134.

50 Almeida, Relações com os Dembos, 6.

51 The documentation, which was originally collected in 1934 and later brought to Lisbon, was only included in the Arquivo Histórico Ultramarino at the beginning of the twenty-first century. Between 2006 and 2009, three additions were made, starting with materials from the Centro de Antropobiologia do Instituto de Investigação Científica Tropical. Later, in 2014, a further ninety-six documents were added to the archive. Arquivo Histórico Ultramarino, Dembos, accessed 10 Dec. 2024, https://digitarq.ahu.arquivos.pt/details?id=1157639.

52 The earliest known reference to a translation into Kimbundu can be found in a letter from Father Baltasar Afonso from 1581, which reports on the translation of part of the Lord’s Prayer. Later, Jesuits such as Francesco Pacconio and Pedro Dias continued to develop catechisms in Kimbundu. António Brásio, MMA 15 (Lisboa: Agência Geral do Ultramar, 1988), 272; Francesco Pacconio & António do Couto, Gentio de Angola svfficientemente instruido nos mysterios de nossa sancta fé: Obra posthuma composta pello Padre Francisco Pacconio, redvsida a methodo mais breve pello Padre Antonio de Couto da mesma companhia (Lisboa: Domingos Lopes Resa, 1645); Pedro Dias, Arte da lingua de Angola: offerecida a Virgem Senhora N. do Rosario, Mãy, & Senhora dos mesmos Pretos (Lisboa: Oficina de Miguel Deslandes, 1697); John Thornton, “Conquest and Theology: The Jesuits in Angola, 1548–1650,” Journal of Jesuit Studies 1 (2014): 252.

53 Arlindo Manuel Caldeira, “Luanda in the 17th Century: Diversity and Cultural Interaction in the Process of Forming an Afro-Atlantic City,” Nordic Journal of African Studies 22, nos. 1–2 (2013): 79–81.

54 Vansina, “Portuguese vs Kimbundu,” 270–71.

55 In Brazil, some archives such as municipal, ecclesiastical, and notarial archives have been preserved. Anthony John Russell Russell-Wood, “Brazilian Archives and Recent Historiography on Colonial Brazil,” Latin American Research Review 36, no. 1 (2001): 75–105.

56 David Birmingham, “Themes and Resources of Angolan History,” African Affairs: The Journal of the Royal African Society 73, no. 291 (1974): 188–203; Joseph C. Miller, “The Archives of Luanda, Angola,” The International Journal of African Historical Studies 7, no. 4 (1974): 551–90; José Curto, “Fontes eclesiásticas para a história de Angola antes de 1900: O caso do Arquivo do Bispado de Luanda,” Estudos Ibero-Americanos 48, no. 1 (2022): 1–16.

57 Charles Ralph Boxer, Portuguese Society in the Tropics: The Municipal Councils of Goa, Macao, Bahia, and Luanda, 1510–1800 (Madison: University of Wisconsin Press, 1965), 11.

58 Miller, “The Archives of Luanda,” 551–90; Joseph C. Miller and John K. Thornton, “The Chronicle as Source, History, and Hagiography: The ‘Catálogo dos Governadores de Angola,’” Paideuma 33 (1987): 359–89.

59 Adriana Romeiro, “O Universo do Arquivo Histórico Ultramarino,” História Social 3 (1996): 231–35.

60 Arquivo Científico Tropical, Portugal Collection, accessed 5 Dec. 2024, https://actd.iict.pt/collection/actd:CUF001.

61 Portuguese Government, Diário do Governo, Iª série, no. 133 (1931), 1080.

62 Its documents were transferred to the Secretaria de Estado dos Negócios da Marinha e Ultramar in 1833. In 1889, the collection was moved to the Biblioteca Nacional de Lisboa (BNL), due to a lack of space, where it became part of the Secção Ultramarina from 1901 onward.

63 Garnier, “Arquivos da memória.”

64 Vimbai C. Kwashirai, “Environmental History of Africa,” Encyclopedia of Life Support Systems (EOLSS), 2012, 1–13, accessed 25 Aug. 2025, https://www.eolss.net/Sample-Chapters/C09/E6-156-35.pdf.

65 Chris Prom, Gladys Kemboi, Sarah Cummings, and Ruby Martinez, “Catalyzing African Community Archives for Social Good,” in iPRES 2024 Papers—International Conference on Digital Preservation, accessed 25 Aug. 2025, https://doi.org/10.21428/5676bf2d.058388ed.

66 Stephen Ellis, “Writing Histories of Contemporary Africa,” The Journal of African History 43, no. 1 (2002): 1–26.

67 Joseph C. Miller, “Angola Before 1900: A Review of Recent Research,” African Studies Review 20, no. 1 (1977): 103–16.

68 Marilla MacGregor, “Confronting Different Realities: Libraries in Cabo Verde and the Case for Comparative Librarianship,” e-Journal of Portuguese History 20, no. 2 (2022): 68–85.

69 Isaías Barreto da Rosa, Promover o desenvolvimento de Cabo Verde através das TIC (Praia: ISEditorial, 2025).

71 Justin Pearce, “Contesting the Past in Angolan Politics,” Journal of Southern African Studies 41, no. 1 (2015): 103–19; Vasco Martins, “Hegemony, Resistance and Gradations of Memory: The Politics of Remembering Angola’s Liberation Struggle,” History and Memory 33, no. 2 (2021): 80–106; Vasco Martins, “‘Grande Herói Da Banda’: The Political Uses of the Memory of Hoji Ya Henda in Angola,” The Journal of African History 63, no. 2 (2022): 231–47; Terence Ranger, “Nationalist Historiography, Patriotic History and the History of the Nation: the Struggle over the Past in Zimbabwe,” Journal of Southern African Studies 30, no. 2 (2004): 215–23. For a discussion of how the creation of archives intersects with the formation of political power in a different geographical context, in this case Cambodia, see: Michelle Caswell, Archiving the Unspeakable: Silence, Memory, and the Photographic Record in Cambodia (Madison: University of Wisconsin Press, 2014).

72 James Yékú, “Social Media Images as Digital Sources for West African Urban History,” Oxford Research Encyclopedia of African History, 17 Apr. 2024, accessed 10 July 2025, https://doi.org/10.1093/acrefore/9780190277734.013.977.

73 Associação Tchiweka de Documentação, accessed 27 April 2025, https://tchiweka.org/.

74 “Memórias da Luta Armada em Angola,” accessed 27 April 2025, https://www.facebook.com/profile.php?id=100088390988139; “The Nigerian Nostalgia 1960 -1980 Project,” accessed 27 April 2025, https://www.facebook.com/groups/nigeriannostalgiaproject.

75 Yékú, “Social Media Images.”

76 Limb, “The Digitization of Africa,” 3–19.

77 Miller, “Angola Before 1900,” 103–16.

78 One important project created with data from the Arquivo Histórico Ultramarino is the Portugal-based “Counting Colonial Population.” See Jelmer Vos and Paulo Teodoro de Matos, “The Demography of Slavery in the Coffee Districts of Angola, c. 1800–70,” The Journal of African History 62, no. 2 (2021): 213–34; Paulo Teodoro de Matos, “Counting Portuguese Colonial Populations, 1776–1875: A Research Note,” The History of the Family 21, no. 2 (2016): 267–80.

79 José C. Curto, “The Angolan Manuscript Collection of the Arquivo Histórico Ultramarino, Lisbon: Toward a Working Guide,” History in Africa 15 (1988): 163–89.

80 Projeto Ultramar, accessed 10 Oct. 2024, http://www.liber.ufpe.br/ultramar/modules/home/resgate.php.

81 Projeto Resgate Barão do Rio Branco, accessed 27 December 2024, https://bndigital.bn.gov.br/dossies/projetoresgate/sobre-o-projeto-resgate-barao-do-rio-branco/; Caio César Boschi, “Projeto Resgate: História e Arquivística (1982-2014),” Revista Brasileira de História 38 (2018): 187–208.

82 Juciene Ricarte Cardoso, Josinaldo Sousa de Queiroz, Janailson Macêdo Luiz, and Thiago Silveira de Melo, Catálogo Geral dos Manuscritos Avulsos e em Códices Referentes à Escravidão Negra no Brasil Existentes no Arquivo Histórico Ultramarino (Campina Grande: EDUEPB, 2016); Juciene Ricarte Cardoso, Josinaldo Sousa de Queiroz, Janailson Macêdo Luiz, and Thiago Silveira de Melo, Catálogo Geral dos Manuscritos Avulsos e em Códices Referentes à História Indígena no Brasil Existentes no Arquivo Histórico Ultramarino (Campina Grande: EDUEPB, 2016); Research Instruments on the Brazilian Collection, accessed 27 Dec. 2024, https://www.gov.br/bn/pt-br/central-de-conteudos/projeto-resgate/novos-instrumentos-de-pesquisa.

83 Curto, “The Angolan Manuscript Collection,” 163–89.

84 Arquivo Histórico Ultramarino, “Cabo Verde,” accessed 27 Dec. 2024, https://digitarq.ahu.arquivos.pt/details?id=1119364; Arquivo Histórico Ultramarino, “Moçambique,” accessed 27 Dec. 2024, https://digitarq.ahu.arquivos.pt/details?id=1119366.

85 Petri Leskinen and Eero Hyvönen, “Linked Open Data Service about Historical Finnish Academic People in 1640–1899,” CEUR Workshop Proceedings 2612 (2020): 284–92, accessed 25 Aug. 2025, https://ceur-ws.org/Vol-2612/short14.pdf.

86 Henry B. Lovejoy, “Pre-Colonial African Subregions,” World Historical Gazetteer, accessed 27 Oct. 2024, https://doi.org/10.60681/whg-dataset-1155; Henry B. Lovejoy et al., “Defining Regions of Pre-Colonial Africa: A Controlled Vocabulary for Linking Open-Source Data in Digital History Projects,” History in Africa 48 (2021): 9–34.

87 Lucas Vieira, “As Cartas.”

88 The Dembos collection (PT/AHU/DEMBOS) is available on the Digitarq website, accessed 15 July 2025, https://digitarq.arquivos.pt/documentDetails/381e8e909f864a7089972d159f54d1e7.

89 Éva Sebestyén’s collection from the archives at Caculo Cangola includes: Luamba (1717 and 1796); Ngolombe a Queta (1779); Samba Caju (1689); and Tuto (1671). Quoted in Thornton, A History of West Central Africa, 293n113.

90 As part of our project, Fernando Hélder Panzo Macaia interviewed Luzia de Carvalho, a staff member at the National Archives of Angola (Arquivo Nacional de Angola), on issues related to the existing colonial records in this archive. She reported that the oldest document held by the ANA dates back to 1591 and is a parchment written on lambskin. The archive also preserves documents relating to the treaties of Cabinda, Xifuma, and Simulambuco, among others.

91 Aurélio Filipe Mussengue et al., “Potencialidades e Desafios do Trabalho em Arquivos de Angola: Memórias, Relatos e Experiências de Pesquisadores Brasileiros e Angolanos,” in Brasil e Áfricas Redes, Circulação Cultural e Trânsitos Artísticos, eds. Crislayne Alfagali et al. (Rio de Janeiro: Autografia, 2024), 403–56; Mário João Lázaro Vicente, Os Sobas e a Construção de Angola nos Séculos XVI e XVII (Master’s thesis, Universidade Nova de Lisboa, 2021).

92 Ayana et al., “Decolonizing Global AI Governance,” 231994.

93 Wikidata, s.v. “Nzinga of Ndongo and Matamba,” last modified 27 Dec. 2024, https://www.wikidata.org/wiki/Q467650; Wikipedia, s.v. “Nzinga of Ndongo and Matamba,” last modified 27 Dec. 2024, https://en.wikipedia.org/wiki/Nzinga_of_Ndongo_and_Matamba.

94 Peter Mark, “The Central Upper Guinea Coast in the Pre-Contact and Early Portuguese Period, Fifteenth to Seventeenth Centuries,” Paideuma: Mitteilungen zur Kulturkunde 67 (2021): 113–44.

95 Alexandra L’Heureux, Katarina Grolinger, Hany F. Elyamany, and Miriam A.M. Capretz, “Machine Learning with Big Data: Challenges and Approaches,” IEEE Access 5 (2017): 7776–97.

96 Jean-Claude Müller, Robert Weibel, Jean-Philippe Lagrange, and François Salgé, “Generalization: State of the Art and Issues,” in GIS and Generalisation, eds. Jean-Claude Müller, Robert Weibel, Jean-Philippe Lagrange, and François Salgé (London: Taylor & Francis, 2020), 3–17.

97 Claudia Wanderley, “Oral Cultures and Multilingualism in a World of Big Digital Data: The Case of Portuguese Speaking Countries,” Education for Information 34, no. 3 (2018): 239–54.

98 Clotilde de Almeida Azevedo Murakawa, “Lexicografia e História: O Dicionário Histórico do Português do Brasil - Séculos XVI, XVII e XVIII,” in Os Estudos Lexicais em Diferentes Perspectivas, vol. 1, eds. Ieda Maria Alves et al. (São Paulo: FFLCH/USP, 2009), accessed 25 Aug. 2025, www.fflch.usp.br/dlcv/neo.

99 Rafael Giusti et al., “Automatic Detection of Spelling Variation in Historical Corpus: An Application to Build a Brazilian Portuguese Spelling Variants Dictionary,” Proceedings of the Corpus Linguistics Conference (2007): 5, accessed 25 Aug. 2025, https://ucrel.lancs.ac.uk/publications/CL2007/paper/238_Paper.pdf; Arnaldo Junior Candido and Sandra Aluísio, “Building a Corpus-based Historical Portuguese Dictionary: Challenges and Opportunities,” Traitement Automatique des Langues 50, no. 2 (2009): 73–102; Iris Hendrickx, Michel Généreux, and Rita Marquilhas, “Automatic pragmatic text segmentation of historical letters,” in Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series, eds. Caroline Sporleder, Antal van den Bosch, and Kalliopi Zervanou (Berlin: Springer, 2011), accessed 25 Aug. 2025, DOI 10.1007/978-3-642-20227-8_8; Michael Piotrowski, “Natural Language Processing for Historical Texts,” in Synthesis Lectures on Human Language Technologies, ed. Graeme Hirst (Cham: Springer, 2012), 115, accessed 25 Aug. 2025, https://doi.org/10.1007/978-3-031-02146-6.

100 Mark Davies, “The Corpus do Português and the Frequency Dictionary of Portuguese,” in Working with Portuguese Corpora, eds. Tony Berber Sardinha and Telma de Lurdes São Bento Ferreira (London: Bloomsbury Academic, 2014), 89.

101 Corpus do Português, accessed 25 Nov. 2024, https://www.corpusdoportugues.org.

102 Limb, “The Digitization of Africa.”

103 African Journals Online, accessed 15 Apr. 2025, https://www.ajol.info/index.php/ajol.

104 Weixin Liang et al., “Advances, challenges and opportunities in creating data for trustworthy AI,” Nature Machine Intelligence 4, no. 8 (2022): 669–77.

105 Hayden White, “The Historical Text as Literary Artifact,” in The History and Narrative Reader, ed. Geoffrey Roberts (London: Routledge, 2001), 223.

106 Ifeanyi Madu and Chidiebere Nzenwa, “AI in Web Development: Enhancing Accessibility,” 1 July 2024, accessed 25 Aug. 2025, http://dx.doi.org/10.2139/ssrn.4949516.

107 Moaiad Ahmad Khder, “Web Scraping or Web Crawling: State of Art, Techniques, Approaches and Application,” International Journal of Advances in Soft Computing & Its Applications 13, no. 3 (2021): 145–68.

108 Samuel Lefever, Michael Dal, and Ásrún Matthíasdóttir, “Online Data Collection in Academic Research: Advantages and Limitations,” British Journal of Educational Technology 38, no. 4 (2007): 574–82.

109 Fábio Bif Goularte, Bruno Emanuel da Graça Martins, Paula Cristina Quaresma da Fonseca Carvalho, and Miguel Won, “SentPT: A Customized Solution for Multi-Genre Sentiment Analysis of Portuguese-Language Texts,” Expert Systems with Applications 245 (2024): 123075.

110 Alessandro Ghio, “Democratizing Academic Research with Artificial Intelligence: The Misleading Case of Language,” Critical Perspectives on Accounting 98 (2024): 102687.

111 Daniel Pimienta, “Enhanced and Second Version of an Alternative Approach to Produce Indicators of Languages in the Internet” (2021), accessed 1 Dec. 2024, https://funredes.org/lc2022/ALI%20V2-EN.pdf; Daniel Pimienta, Daniel Prado, and Álvaro Blanco, Twelve Years of Measuring Linguistic Diversity in the Internet: Balance and Perspectives (Paris: UNESCO, 2009).

112 Sarah Anne Ganter and Fernando Oliveira Paulino, “Between Attack and Resilience: The Ongoing Institutionalization of Independent Digital Journalism in Brazil,” in Digital Journalism in Latin America (New York: Routledge, 2023), 106–25.

113 Luís Antero Reto, Potencial Económico da Língua Portuguesa (Lisboa: Leya, 2012).

114 Libuseng Malephane, “Digital Divide: Who in Africa Is Connected and Who Is Not,” Afrobarometer Pan-Africa Profile, no. 582 (2022): 1–20.

115 Jean Mayer, “Development Problems and Prospects in Portuguese-Speaking Africa,” International Labour Review 129, no. 4 (1990): 459–78.

116 Joaquim Mussandi and Andreas Wichert, “NLP Tools for African Languages: Overview,” in Proceedings of the 16th International Conference on Computational Processing of Portuguese, eds. Pablo Gamallo et al. (Santiago de Compostela: Association for Computational Linguistics, 2024): 73–82.

117 Denilson Alves Pereira, “A Survey of Sentiment Analysis in the Portuguese Language,” Artificial Intelligence Review 54, no. 2 (2021): 1087–115.

118 Diego Fernando Válio Antunes Alves, “An Evaluation of Portuguese Language Models’ Adaptation to African Portuguese Varieties,” in Gamallo et al., Proceedings, 544–550.

119 Santos, “Prefácio,” 9–20; Almeida, Relações com os Dembos; Magno, Guerras Angolanas; Daiana Lucas Vieira, “As Cartas do Dembo Caculo Cacahenda: Um Pouco da História dos Dembos e da Relação Deste com as Autoridades Portuguesas Situadas em Angola (1780–1850),” Veredas da História 7, no. 1 (2014); David Magno, “Os Dembos,” 48–63.

120 Sebestyén, Caculo Cangola Collection, 17.

121 Venkatesh Shankar and Sohil Parsana, “An Overview and Empirical Comparison of Natural Language Processing (NLP) Models and an Introduction to and Empirical Application of Autoencoder Models in Marketing,” Journal of the Academy of Marketing Science 50, no. 6 (2022): 1324–50; David Nadeau and Satoshi Sekine, “A Survey of Named Entity Recognition and Classification,” Lingvisticae Investigationes 30, no. 1 (2007): 3–26; Shazia Tabassum, Fabiola S. F. Pereira, Sofia Fernandes, and João Gama, “Social Network Analysis: An Overview,” WIREs Data Mining and Knowledge Discovery 8, no. 5 (2018): e1256; Wayne Xin Zhao et al., “A Survey of Large Language Models,” updated 1 Mar. 2025, arXiv preprint arXiv:2303.18223.

122 Wilhelmina Nekoto et al., “Participatory Research for Low-Resourced Machine Translation: A Case Study in African Languages,” updated 6 Nov. 2020, accessed 25 Aug. 2025, arXiv preprint arXiv:2010.02353.

123 Marco Humbel et al., “Named-Entity Recognition for Early Modern Textual Documents: A Review of Capabilities and Challenges with Strategies for the Future,” Journal of Documentation 77, no. 6 (2021): 1223–47.

124 Ibid.

125 Agata Błoch, Demival Vasques Filho, and Michał Bojanowski, “Networks from Archives: Reconstructing Networks of Official Correspondence in the Early Modern Portuguese Empire,” Social Networks 69 (2022): 123–135.

126 Andrew B. Speer, James Perrotta, and Tobias L. Kordsmeyer, “Taking It Easy: Off-the-Shelf Versus Fine-Tuned Supervised Modeling of Performance Appraisal Text,” Organizational Research Methods (2024): 10944281241271249.

127 Błoch, Vasques Filho, and Bojanowski, “Networks from Archives.” Clodomir Santana, Demival Vasques Filho, Michał Bojanowski, and Agata Błoch, “Unveiling the Critical Nexus of Data Preprocessing and Transparent Documentation for Result Quality and Reproducibility in Digital History,” Digital Humanities Quarterly 19, no. 3; a version of the dataset used in this research can be accessed on Zenodo, https://doi.org/10.5281/zenodo.15766967.

128 Emily Buchnea and Ziad Elsahn, “Historical Social Network Analysis: Advancing New Directions for International Business Research,” International Business Review 31, no. 5 (2022): 101990.

129 Sam Fields, Camille Lyans Cole, Catherine Oei, and Annie T. Chen, “Using Named Entity Recognition and Network Analysis to Distinguish Personal Networks from the Social Milieu in Nineteenth-Century Ottoman–Iraqi Personal Diaries,” Digital Scholarship in the Humanities 38, no. 1 (2023): 66–86.

130 Emilio Ferrara, “Should ChatGPT Be Biased? Challenges and Risks of Bias in Large Language Models,” arXiv preprint arXiv:2304.03738 (2023).

131 Gokul Yenduri et al., “GPT (Generative Pre-Trained Transformer): A Comprehensive Review on Enabling Technologies, Potential Applications, Emerging Challenges, and Future Directions,” IEEE Access 12 (2024): 54608–49, doi: 10.1109/ACCESS.2024.3389497. Mikhail V. Koroteev, “BERT: A Review of Applications in Natural Language Processing and Understanding,” updated 22 Mar. 2021, arXiv preprint arXiv:2103.11943.

132 Angelina Wang, Jamie Morgenstern, and John P. Dickerson, “Large Language Models Cannot Replace Human Participants Because They Cannot Portray Identity Groups,” updated 3 Feb. 2025, arXiv preprint arXiv:2402.01908.

133 Jade Abbott, PE, et al., Masakhane MT: Decolonise Science, website, accessed 27 Dec. 2024, https://www.masakhane.io/ongoing-projects/masakhane-mt-decolonise-science.

134 Knowledge 4 All Foundation, “Cracking the Language Barrier for a Multilingual Africa,” website, accessed 27 Dec. 2024, https://k4all.org/project/language-dataset-fellowship/. See also Shamsuddeen Hassan Muhammad et al., “Afrisenti: A Twitter Sentiment Analysis Benchmark for African Languages,” updated 4 Nov. 2023, arXiv preprint arXiv:2302.08956; Jesujoba O. Alabi, David I. Adelani, Marius Mosbach, and Dietrich Klakow, “Adapting Pre-Trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning,” updated 18 Oct. 2022, arXiv preprint arXiv:2204.06487.

135 Tevin Tafese, “Digital Africa: How Big Tech and African Startups Are Reshaping the Continent,” German Institute of Global and Area Studies (GIGA), no. 6 (2022), https://doi.org/10.57671/gfaf-22062.

136 Musoni, Mzinyati, and Cloete, “Futures of Digital Rights.”

137 Dylan Ruediger and Ruby MacDougall, “Are the Humanities Ready for Data Sharing?” ITHAKA S+R (2023): 1–14.

Figure 1. Examples of entities recognized in texts with vocabulary specific to the Dembos Collection using a pretrained general-purpose model. The green, red, and gray boxes indicate, respectively, correct, incorrect, and missed entities. The original text can be translated as “Letters addressed to soba Ngolla Tumba, to Dembo Bulo Atumba, to Mane Quilé Quissamba, and to D. Paulo Domingos.”

Figure 2. Examples of entities recognized in texts with vocabulary specific to the Dembos Collection using a model trained on data from the Portuguese Overseas Archive. Note: (a) shows the original text; (b) shows the same text with the African vocabulary replaced by its equivalents in European Portuguese. The green, red, and gray boxes indicate, respectively, correct, incorrect, and missed entities. The original text can be translated as “Letters addressed to soba Ngolla Tumba, to Dembo Bulo Atumba, to Mane Quilé Quissamba, and to D. Paulo Domingos.”

Figure 3. Example of a network of entities and other significant words connected to them. Note: The highlighted blue nodes are entities linked to “soba,” which would not be properly identified because of the issues with the NER model.

Figure 4. Extraction of named entities from the historical Dembos Collection using LLMs. Note: (a), (b), and (c) present, respectively, the results of ChatGPT, Gemini, and Meta AI, three representatives of state-of-the-art LLM architectures. Notice that all models identified the entities but gave them the incorrect label of “Person.”

Figure 5. Assessment of LLMs’ knowledge of terms. Note: Level 1 (Language Context) is the most basic context that can be provided to an LLM while still expecting a meaningful answer about the meaning of a term. In Level 2, besides the language, we also offer an example of text where the term appears. Lastly, in Level 3, we also provide the historical and geographic context of the text where the term appears.
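
The three context levels described in this caption can be sketched as a small prompt builder. This is a minimal illustration only, not the authors' actual prompts: the function name `build_prompt`, its parameters, the prompt wording, and the sample values below are all hypothetical.

```python
def build_prompt(term, level, language="Kimbundu", example=None, context=None):
    """Build a prompt asking an LLM for the meaning of `term`.

    Level 1: language only.
    Level 2: language plus an example of text where the term appears.
    Level 3: language, example text, and historical/geographic context.
    All wording is a hypothetical illustration.
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    if level >= 2 and not example:
        raise ValueError("levels 2 and 3 require an example text")
    if level == 3 and not context:
        raise ValueError("level 3 requires historical and geographic context")

    # Level 1: the most basic context, naming only the language.
    parts = [f"What does the term '{term}' mean? It originates from {language}."]
    # Level 2 adds an example of text in which the term appears.
    if level >= 2:
        parts.append(f"Example of text where the term appears: {example}")
    # Level 3 adds the historical and geographic setting of that text.
    if level == 3:
        parts.append(f"Historical and geographic context: {context}")
    return "\n".join(parts)


# Illustrative placeholder values, not the experimental inputs.
sample_text = "Letters addressed to soba Ngolla Tumba, to Dembo Bulo Atumba."
sample_context = "Portuguese-colonized African territories, 1640-1822."

for lvl in (1, 2, 3):
    print(f"--- Level {lvl} ---")
    print(build_prompt("soba", lvl, example=sample_text, context=sample_context))
```

Each level strictly extends the previous one, so any difference in a model's answers across levels can be attributed to the added context rather than to a change in the question.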

Table 1. Experiments assessing the LLMs’ knowledge of the meaning of some terms originating from Kimbundu in Sebestyén’s Caculo Cangola Collection