Cultural resources are the tangible and intangible elements connected to the presence, practices, and identity of past and present societies. In the United States, they are managed and protected under multiple local, state, tribal, and federal laws/agreements. Archaeological resources are a component of cultural resources that constitute the physical evidence of past human activity and are inventoried and managed on public lands. However, the logistics, costs, and capacity to complete extensive pedestrian surveys to document and monitor archaeological resources present challenges to land managers. While the increasing availability of high-resolution airborne laser scanning (ALS) datasets, referred to as aerial lidar, is revolutionizing data collection and analysis across multiple fields, including cultural resource management (CRM), many cultural resource managers lack the dedicated time or computing resources to process these data, let alone leverage them for analytical or management applications. To address these issues, we have developed a lidar and deep learning (DL) workflow specifically designed to create planning and management geographic information system (GIS) layers for agency CRM professionals to help them refine the focus of their pedestrian surveys rather than to replace them. The workflow contains a streamlined set of Python scripts that train DL models to detect specific archaeological features, as well as denoising post-processing steps that can be done in QGIS to reduce the burden of manual investigation of the results. Here, we present this workflow through a case study focused on detecting historic agricultural terraces in the Piedmont National Wildlife Refuge, Georgia, USA. While this workflow is focused on detecting a single archaeological resource type, the flexibility of our approach means that it can be modified for other site and feature types. This article has two aims: (1) to showcase the high performance of DL models to detect terraces in lidar-derived imagery; and (2) to identify the minimum number of terraces required to train DL models to produce highly accurate detections of terraces in new areas.
Following the lead of Traviglia and others (Reference Traviglia, Cowley and Lambers2016), we first define the terminology used in this study to promote clarity and consistency for multidisciplinary readers. Here, we follow the computer vision terminology and use the term “object” to signify archaeological “sites” and “features.” A “class” refers to the type of object a model learns to detect (e.g., terrace, tar kiln, charcoal pit, etc.). Finally, a “model” refers to an algorithm trained to recognize patterns. In the first part of this article, the term “model” encompasses different types of DL algorithms, whereas in Materials and Methods and subsequent sections it refers primarily to the U-Net structure we used for our case study.
Background
Aerial Laser Scanning
Archaeologists have used ALS for decades to detect archaeological objects in remote locales or areas where ground visibility is obscured by dense vegetation (e.g., Clode et al. Reference Clode, Kootsookos and Rottensteiner2004; Freeland et al. Reference Freeland, Heung, Burley, Clark and Knudby2016; Hesse Reference Hesse2010; Suh and Ouimet Reference Suh and Ouimet2023). Multiple studies have demonstrated aerial lidar’s potential in identifying archaeological object types, such as barrows and cairns (Cerrillo-Cuenca Reference Cerrillo-Cuenca2017; Olivier and Verschoof-van der Vaart Reference Olivier and Verschoof-van der Vaart2021), mining pits (Gallwey et al. Reference Gallwey, Eyre, Tonkins and Coggan2019), charcoal hearths (Carter et al. Reference Carter, Blackadar and Conner2021; Davis and Lundin Reference Davis and Lundin2021; Olivier and Verschoof-van der Vaart Reference Olivier and Verschoof-van der Vaart2021; Suh et al. Reference Suh, Anderson, Ouimet, Johnson and Witharana2021), shell rings (Davis et al. Reference Davis, Lipo and Sanger2019), agricultural field borders (Ladefoged et al. Reference Ladefoged, McCoy, Asner, Kirch, Puleston, Chadwick and Vitousek2011), irrigation ditches and terraces (McCoy et al. Reference McCoy, Asner and Graves2011), and other archaeological structures (Bundzel et al. Reference Bundzel, Jaščur, Kováč, Lieskovský, Sinčák and Tkáčik2020; Guyot et al. Reference Guyot, Lennon, Lorho and Hubert-Moy2021; Somrak et al. Reference Somrak, Džeroski and Kokalj2020). While analytical methods vary by research design and objectives, the growing number of recent studies showcases the impact of lidar in archaeological applications. As more lidar datasets become freely available online, archaeologists are able to study large areas across multiple regions at a relatively low cost (see Carter et al. Reference Carter, Blackadar and Conner2021; Cerrillo-Cuenca Reference Cerrillo-Cuenca2017; Gallwey et al. Reference Gallwey, Eyre, Tonkins and Coggan2019; Herrault et al. Reference Herrault, Poterek, Keller, Schwartz and Ertlen2021; Küçükdemirci et al. Reference Küçükdemirci, Landeschi, Ohlsson and Dell’Unto2023; Meng Reference Meng2023; Olivier and Verschoof-van der Vaart Reference Olivier and Verschoof-van der Vaart2021; Quintus et al. Reference Quintus, Davis and Cochrane2023, who all used open-access lidar datasets). However, while lidar data help enhance the visibility of obscured archaeological objects, their processing and analysis can require long hours and intensive computing resources.
Deep Learning
Until relatively recently, most archaeological applications of lidar data were confined to generating multiple types of bare-earth visualizations and their derivatives, before manually identifying archaeological objects (e.g., Bennett et al. Reference Bennett, Welham, a. Hill and Ford2012; Cerrillo-Cuenca Reference Cerrillo-Cuenca2017; Chase et al. Reference Chase, Chase, Weishampel, Drake, Shrestha, Clint Slatton, Awe and Carter2011; Evans et al. Reference Evans, Fletcher, Pottier, Chevance, Soutif, Suy Tan and Sokrithy2013; Zakšek et al. Reference Zakšek, Oštir and Kokalj2011). However, recent developments in computer vision have broadened the analytical possibilities through DL models. DL is a subset of machine learning; it is called “deep” learning because it relies on computer models made of multiple connected layers. Convolutional neural network (CNN) models are a subset of DL models where connected layers perform convolutions (hence “Convolutional”) on the input to change its size and dimensionality, and where the layer connections resemble the structure of the human brain (hence “neural network”). CNNs are well suited for lidar data because they specialize in detecting patterns in imagery. Therefore, CNNs provide an efficient new way to automate—as much as possible—the detection of archaeological sites and features in lidar-derived imagery.
CNNs are usually separated into four categories based on the desired output: image classification, object detection, semantic segmentation, and instance segmentation. While many have successfully used CNNs for object detection and image classification, here we use semantic segmentation because it works particularly well on large and heterogeneously shaped archaeological objects such as terraces. Segmentation locates an object by assigning every pixel that overlaps the object to that object’s class and every other pixel to the background or to a different object’s class. In semantic segmentation, all instances of the same class share the same label, whereas instance segmentation goes a step further by separating each instance of the same class. Semantic segmentation training datasets require binary masks showing the presence/absence of the object classes to detect (Figure 1).

Figure 1. (Left) Example of terraces visible on an SLRM (20 m radius) tile and (right) its associated binary mask showing the location of terraces (white) versus the background (black).
The different CNN categories use different structures (e.g., YOLO [Redmon et al. Reference Redmon, Divvala, Girshick and Farhadi2016], U-Net [Ronneberger et al. Reference Ronneberger, Fischer, Brox, Navab, Hornegger, Wells and Frangi2015], Faster R-CNN [Ren et al. Reference Ren, Kaiming, Girshick and Sun2017], Mask R-CNN [He et al. Reference He, Gkioxari, Dollár and Girshick2017], and the more recent SAM [Kirillov et al. Reference Kirillov, Mintun, Ravi, Mao, Rolland, Gustafson and Xiao2023]). Furthermore, each structure can use different backbones (e.g., ResNet18 [He et al. Reference He, Zhang, Ren and Sun2015], VGG16 [Simonyan and Zisserman Reference Simonyan and Zisserman2014]), which define the type and number of convolution layers within each block of the general model structure. While they can be simply used as untrained structures, backbones can also come already pretrained on millions of images (e.g., ImageNet [Russakovsky et al. Reference Russakovsky, Deng, Hao, Krause, Satheesh, Sean and Huang2015], MS COCO [Lin et al. Reference Lin, Maire, Belongie, Bourdev, Girshick, Hays, Perona, Deva Ramanan and Dollár2014], or the Segment Anything 1-Billion mask dataset [Kirillov et al. Reference Kirillov, Mintun, Ravi, Mao, Rolland, Gustafson and Xiao2023]), which sets the weights of the nodes in the network. Using pretrained weights is called transfer learning. It assumes that the pretraining helped the model recognize general patterns, such as vertical or horizontal edges, and that we can fine-tune it by retraining only its last layers to recognize a new object, thus saving computational time and providing good results even with small training datasets. Research has shown that pretrained models perform better than untrained ones, even when they were trained on very different images (Nogueira et al. Reference Nogueira, Penatti and dos Santos2017). To date, archaeologists have successfully used pretrained YOLO and RetinaNet for object detection (Davis and Lundin Reference Davis and Lundin2021; Olivier and Verschoof-van der Vaart Reference Olivier and Verschoof-van der Vaart2021; Verschoof‐van der Vaart et al. Reference Verschoof‐van der Vaart, Bonhage, Schneider, Ouimet and Raab2023), U-Net for semantic segmentation (Bundzel et al. Reference Bundzel, Jaščur, Kováč, Lieskovský, Sinčák and Tkáčik2020; Gallwey et al. Reference Gallwey, Eyre, Tonkins and Coggan2019; Küçükdemirci et al. Reference Küçükdemirci, Landeschi, Ohlsson and Dell’Unto2023; Suh et al. Reference Suh, Anderson, Ouimet, Johnson and Witharana2021), and Mask R-CNN for instance segmentation (Bundzel et al. Reference Bundzel, Jaščur, Kováč, Lieskovský, Sinčák and Tkáčik2020; Carter et al. Reference Carter, Blackadar and Conner2021; Davis et al. Reference Davis, Caspari, Lipo and Sanger2021), among others. The first research projects using DL models focused on a limited set of relatively homogeneously shaped objects, such as charcoal kilns, pits, mounds, and barrows. However, methodological and computational improvements have allowed teams to expand the range of object types they could detect. Particularly relevant for our case study, two teams recently used image classification and instance segmentation models to detect agricultural terraces in lidar data (Herrault et al. Reference Herrault, Poterek, Keller, Schwartz and Ertlen2021; Quintus et al. Reference Quintus, Davis and Cochrane2023, respectively) with impressive results. 
Here, we continue to advance DL approaches in archaeological feature detection through our study of agricultural terraces in the Piedmont Region of Georgia, USA.
Case Study: Piedmont National Wildlife Refuge, Georgia, USA
The Piedmont National Wildlife Refuge (PNWR) is located in central Georgia, approximately 90 km southeast of Atlanta in the southeastern United States (Figure 2). The refuge is situated in the Piedmont geographic region of Georgia, which lies between the Blue Ridge Mountains and the low elevation coastal plain. This region is characterized by hilly terrain and red clay soil that was cleared for agriculture in the eighteenth and early nineteenth centuries and was intensively cultivated throughout the late nineteenth and early twentieth centuries. However, land degradation and the early twentieth-century boll weevil infestation led to the regional collapse of the cotton industry by the 1930s (Higgs Reference Higgs1976; Wettstaed Reference Wettstaed and Bernard2013). In an effort to mitigate economic hardship and restore the land, the US government created a program to convert cotton farms into public lands and relocate farmers to areas with greater economic opportunities (US Department of Agriculture 1965). By 1939, the US government had purchased almost 40,500 ha of land, of which they assigned just over 12,000 ha to the US Biological Survey, with the goal of turning that land into a refuge (Wettstaed Reference Wettstaed and Bernard2013). Today, this land is managed by the US Fish and Wildlife Service (USFWS) as the Piedmont National Wildlife Refuge. In an area that was once called “one of the most severely eroded agricultural areas in the United States” (Garcia and Brook Reference Garcia and Brook1996:194), secondary-growth mixed pine and hardwood forests now cover most of the refuge, providing habitat for wildlife and recreational opportunities for visitors. However, the dense canopy and understory of the refuge’s forests make it difficult for USFWS archaeologists—who are responsible for managing all its archaeological and historical resources—to efficiently identify, inventory, and monitor new and known heritage sites. Our study therefore leverages aerial lidar datasets of the PNWR and CNN models to locate historical agricultural terraces throughout the refuge. Agricultural terraces were selected for this study because of their ubiquity on the landscape and their clear visibility in visualizations derived from digital elevation models (DEMs). Additionally, terraces are often located in association with other archaeological resources of interest for the PNWR CRM program, such as homesteads and historic cemeteries. This approach is the first step in our larger project, which aims to create feature detection tools and mapped archaeological datasets that will assist public lands cultural resource managers in inventorying and monitoring archaeological resources.

Figure 2. Study area of the PNWR and the coverage of the lidar-derived DEM (in grayscale) within its broader spatial context (the white arrow points to the white dot that shows its location). Basemap: ESRI satellite imagery.
Materials and Methods
Overview
For this study, we created a comprehensive workflow that begins with lidar point-cloud data and ends with vector maps of terrace locations, using multiple open-access programs along the way (Figure 3). The R and Python scripts we used are available on our GitHub repository (https://github.com/NMC-CRS/semantic-segmentation-of-historial-terraces-on-public-lands), which is also accessible through Zenodo (https://doi.org/10.5281/zenodo.14107060); however, due to legislation protecting archaeological sites and collections on federal lands (see Johnson et al. Reference Johnson, Ives, Ouimet and Sportman2021) and the inherent sensitivity of their locations, we will only share raw data on request and with permission from the USFWS.

Figure 3. Workflow from raw data to mapped terraces.
In this section, we first present an overview of our chosen CNN models because that choice informed the training, validation, and testing dataset format requirements, which are discussed afterward. We then go back to the model structure and provide a more detailed description of the different steps we used to train, validate, and test the models, as well as how we automated post-training data cleaning.
CNN Models
To fit the size and sinuosity of terraces, we opted for U-Net (Ronneberger et al. Reference Ronneberger, Fischer, Brox, Navab, Hornegger, Wells and Frangi2015), a semantic segmentation CNN model structure, because its training dataset requires binary masks, which are more flexible than the bounding boxes required by popular object detection models such as Mask R-CNN. Generally speaking, U-Net models use encoder blocks and decoder blocks, which are linked by skip connections (see Figure 4). The encoder blocks use convolutions to reduce the spatial size of the input image while increasing its dimensionality, whereas the decoder blocks use convolutions to restore it to its original size and dimensions. Skip connections between encoder and decoder blocks help the model retain the detailed patterns it learned during the encoder phase.

Figure 4. Figure of the different U-Net structures (done by CGM based on visualization in Ronneberger et al. Reference Ronneberger, Fischer, Brox, Navab, Hornegger, Wells and Frangi2015). Top: VGG16 and VGG19 (the gray boxes show the convolutions that are not present in VGG16), bottom: ResNet18.
We used Python 3.10 (van Rossum Reference van Rossum1995) and the PyTorch library (Ansel et al. Reference Ansel, Yang, He, Gimelshein, Jain, Voznesensky and Bao2024) to design three different U-Net structures that each used a different encoder backbone (VGG16 and VGG19 [Simonyan and Zisserman Reference Simonyan and Zisserman2014] and ResNet18 [He et al. Reference He, Zhang, Ren and Sun2015]) from the torchvision package (Ansel et al. Reference Ansel, Yang, He, Gimelshein, Jain, Voznesensky and Bao2024; Figure 4). This choice of backbones was informed by the successful work of others (Meng Reference Meng2023; Somrak et al. Reference Somrak, Džeroski and Kokalj2020; Trier et al. Reference Trier, David and Anders2019; Verschoof-van der Vaart and Lambers Reference Verschoof-van der Vaart and Lambers2019; Verschoof-van der Vaart et al. Reference Verschoof-van der Vaart, Lambers, Kowalczyk and Bourgeois2020), the complexity of popular backbones, and our available computing resources. We used the latest available versions of the pretrained backbone weights (last downloaded between June 1 and June 15, 2024) to speed up training. To use these weights, we formatted our training dataset into square, three-band tiles to match the format of the images on which the backbones were pretrained.
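To illustrate the general mechanics (a sketch, not our exact script), the example below loads an ImageNet-pretrained VGG16 backbone from torchvision and slices its convolutional stages into encoder blocks whose outputs feed the U-Net skip connections; the slice indices and variable names are ours for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

# Load a VGG16 backbone with ImageNet-pretrained weights (transfer learning).
backbone = vgg16(weights=VGG16_Weights.DEFAULT).features

# Split the backbone at its max-pooling layers so each piece can serve as one
# U-Net encoder block; the outputs of these blocks feed the skip connections.
encoder_blocks = nn.ModuleList([
    backbone[:4],     # 64-channel block
    backbone[4:9],    # 128-channel block
    backbone[9:16],   # 256-channel block
    backbone[16:23],  # 512-channel block
])

x = torch.rand(1, 3, 256, 256)  # one three-band 256 x 256 tile
skips = []
for block in encoder_blocks:
    x = block(x)
    skips.append(x)  # saved for the corresponding decoder block
```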
Creating the Training, Validation, and Testing Datasets
Raw Data. Aerial lidar data collection was performed by Kucera International on January 4–5, 2021, under contract for the USFWS and the US Department of Agriculture (USDA) Forest Service. They used a PA-31 Piper Navajo Chieftain aircraft with a Leica ALS80 System mounted to its underside for the lidar acquisition flights. Data was collected over the entirety of the PNWR in 63 flight lines, with an average point density of 50–60 points/m2 and point spacing of 0.23 m. Average elevation variation was calculated at 0.01 m from eight ground control points placed throughout the refuge. Horizontal accuracy was reported at <1 m in the metadata generated by Kucera International. The final datasets include calibrated, georeferenced, and unclassified lidar point clouds (LAZ files) separated per flight line.
We used R (version 4.2.0 [R Core Team 2021]) and the lidR package (version 4.0.2 [Roussel et al. Reference Roussel, Auty, Coops, Tompalski, Goodbody, Sánchez Meador, Bourdon, de Boissieu and Achim2020]) to classify ground points within the point clouds and create a digital elevation model (DEM). We first aggregated the compressed lidar aerial survey (LAZ) files to merge overlaps from different flight lines, and then tiled that dataset to reduce the computational memory required for classification. We then used a progressive morphological filter (PMF) algorithm to classify ground points within each LAZ tile (Keqi Zhang et al. Reference Zhang, Chen, Whitman, Shyu, Yan and Zhang2003). The PMF algorithm deploys an iterative moving-window approach to distinguish ground points from vegetation, structures, and other objects. We used a triangular irregular network (TIN) algorithm and the resulting ground-classified point clouds to generate a rasterized, bare-earth DEM for the PNWR. While the ground-classified point density (20.84 points/m2) allowed us to create a DEM at a higher resolution, we decided to use a resolution of 1 m to keep our computational time relatively fast. Terraces are highly visible in 1 m resolution DEMs, and using a higher resolution would have considerably increased the number of tiles to feed the model (see the model workflow later), thus increasing its computation time. Additionally, 1 m resolution DEMs are a standard lidar data product created by the US Geological Survey’s (USGS) 3D Elevation Program, and thus we designed our workflow to be compatible with other high-resolution lidar datasets elsewhere in the United States.
DEM Visualizations. To enhance the visibility of terraces on the landscape, we used the DEM to create different visualization maps. We followed previously published archaeological studies, which demonstrated that slope, Terrain Ruggedness Index (TRI), Positive Openness (PosOp), and Simple Local Relief Model (SLRM) visualizations obtained the best balance between computation time and model accuracy (Meng Reference Meng2023; Quintus et al. Reference Quintus, Davis and Cochrane2023; Somrak et al. Reference Somrak, Džeroski and Kokalj2020; Suh et al. Reference Suh, Anderson, Ouimet, Johnson and Witharana2021). Therefore, we used the slope and TRI tools in QGIS (QGIS.org 2023), along with the Relief Visualization Toolbox (Kokalj and Somrak Reference Kokalj and Somrak2019; Zakšek et al. Reference Zakšek, Oštir and Kokalj2011), to create those visualization maps from the lidar-derived DEM (Figure 5). To enhance the visibility of different-sized terraces, we created two SLRM maps (10 m and 20 m radii). All visualization maps are one-band geoTIFF images; however, the pretrained model requires three-band images. Therefore, to take advantage of the different information each visualization provides, we combined three different visualizations into three-band images for training. In Python, images are represented as two-dimensional arrays of numbers (height × width). To combine different visualizations, we stacked these arrays along a new dimension, creating a three-dimensional array (height × width × band).
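As a minimal sketch of this stacking step (file names are hypothetical, and rasterio is our illustrative choice for reading the geoTIFFs):

```python
import numpy as np
import rasterio

def read_band(path):
    # Read the single band of a visualization raster as a 2-D array.
    with rasterio.open(path) as src:
        return src.read(1)

# Three one-band visualizations sharing the same extent and 1 m resolution.
slrm = read_band("slrm_10m.tif")
posop = read_band("posop.tif")
slope = read_band("slope.tif")

# Stack the 2-D arrays (height x width) along a new last axis to create a
# three-band array (height x width x band) for the pretrained backbone.
three_band = np.stack([slrm, posop, slope], axis=-1)
print(three_band.shape)  # (rows, cols, 3)
```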

Figure 5. Set of terraces visible in different visualization one-band maps (SLRM with 10 m radius, PosOp, and Slope) as well as the RGB they produce when combined as a three-band tile (where SLRM is used for the red channel, PosOp is used for the green channel, and Slope is used for the blue channel).
We then wrote our own Python script to cut all visualization maps into overlapping small tiles (256 × 256 pixels). To produce the tile overlap, we used a 128-pixel vertical and horizontal stride. Creating this overlap ensured that any terraces cut by the edge of a tile would be more complete in a corresponding overlapped tile. The script automatically removed any tile where >50% of the pixels had no data (tiles at the edge) to avoid feeding tiles with large sections of missing data to the model. The script also normalized the values of each tile to 0–1 using its minimum and maximum values.
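A condensed, in-memory sketch of that tiling logic (the actual script operates on the georeferenced visualization maps) could look like this:

```python
import numpy as np

def cut_tiles(image, tile_size=256, stride=128, max_nodata_frac=0.5, nodata=np.nan):
    """Cut a visualization array into overlapping, normalized tiles (illustrative sketch)."""
    tiles = []
    rows, cols = image.shape
    for r in range(0, rows - tile_size + 1, stride):
        for c in range(0, cols - tile_size + 1, stride):
            tile = image[r:r + tile_size, c:c + tile_size].astype("float32")
            nodata_mask = np.isnan(tile) if np.isnan(nodata) else (tile == nodata)
            # Skip tiles where more than half of the pixels have no data.
            if nodata_mask.mean() > max_nodata_frac:
                continue
            # Min-max normalize each tile to 0-1 using its own valid pixels.
            valid = tile[~nodata_mask]
            tile = (tile - valid.min()) / (valid.max() - valid.min() + 1e-8)
            tiles.append(((r, c), tile))
    return tiles
```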
Annotations. Training a CNN model requires detailed and accurate information on the objects it learns to recognize. Therefore, a model will perform better if it is provided multiple different versions of the object to learn. Creating a good training dataset of archaeological objects can be difficult because archaeological data are sparse and because taphonomic processes may lead similar objects to look different based on their age. Fortunately, our study area encompasses a relatively large area (265 km2) containing thousands of legacy cotton terraces that we were able to use for training. To help us and others determine the minimum necessary training dataset size, we decided to annotate as many of the terraces as we could find in the whole area covered by the lidar point cloud, which extends beyond the administrative boundary of the PNWR. As discussed later, this extensive annotation allowed us to test training models with different numbers of terraces to identify an ideal workflow for other case studies with fewer resources.
To ensure that our methods could be replicated, we defined an annotation workflow implemented in QGIS. We divided the study area into a grid of 2.5 × 2.5 km zones (6.25 km2) further subdivided into 0.5 × 0.5 km cells (0.25 km2) to organize feature annotations and keep track of progress. Those grids have no impact on the resulting annotations. We then switched back and forth between our SLRM, slope, and hillshade maps, as well as USGS topographic maps, to visually identify terraces within the entirety of the PNWR (toggling between different visualizations allowed us to find terraces that were more visible in one visualization than in another). Because the exact extents of the terraces’ flat areas were difficult to identify from the imagery, each terrace was annotated as a polyline following its most visible contour along the hillslope. We did not annotate lines created by modern furrows from silviculture parcels because our focus was on legacy cotton terraces. Due to the heavy erosion that has impacted this landscape since the start of the twentieth century, multiple terraces were disrupted by gullying; in these situations, we attempted to connect the terrace on each side of the gully to create continuous terraces that follow their original contours rather than short, discontinuous segments. All annotations were assigned a confidence value (low or high) based on how evident the terraces were in the visualizations, and low-confidence annotations were revisited and reassessed after all zones were completed. To enhance our quality assurance/quality control (QA/QC) protocols, terrace annotations were also double-checked by a senior member of our research team who added missing lines, merged different parts of single eroded terraces, and removed lines that were not terraces. This annotation process took approximately one month of intermittent work. The resulting dataset includes 9,223 terraces. The annotated terrace lengths range from 8.16 to 1,391.41 m, with a median of 78.86 m and a mean of 101.21 m.
To create raster binary masks from the annotated terraces, we converted all polylines to polygons using QGIS’s buffer tool. We generated buffers of 5 m, 10 m, and 20 m to provide the model with information about the context of each terrace (see Verschoof‐van der Vaart et al. [Reference Verschoof‐van der Vaart, Bonhage, Schneider, Ouimet and Raab2023] for discussion on buffers around objects). We then rasterized the polygons dataset to create a binary map of the whole region, where 0 represents the background and 1 represents the presence of at least one terrace buffer. The mask maps have the same extent and resolution as the visualization maps to ensure that their cut tiles overlap perfectly with the visualization tiles that share their location. We finally used the Python script mentioned above to cut the mask maps into the same tiles as the visualization maps. In summary, the visualization and annotation portions of the workflow created eight sets of normalized tiles (five visualizations and three annotations).
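For readers who prefer a scripted version of these two QGIS steps, a roughly equivalent sketch using geopandas and rasterio (not the tools we used, and with hypothetical file names) would be:

```python
import geopandas as gpd
import rasterio
from rasterio import features

# Hypothetical inputs: annotated terrace polylines and a reference visualization raster.
terraces = gpd.read_file("terrace_lines.shp")
with rasterio.open("slope.tif") as ref:
    transform, shape = ref.transform, (ref.height, ref.width)
    profile = ref.profile

# Buffer each polyline (here 20 m) to give the model context around the terrace.
buffered = terraces.geometry.buffer(20)

# Burn the buffers into a binary mask aligned with the visualization raster:
# 1 = at least one terrace buffer, 0 = background.
mask = features.rasterize(
    ((geom, 1) for geom in buffered),
    out_shape=shape,
    transform=transform,
    fill=0,
    dtype="uint8",
)

profile.update(count=1, dtype="uint8", nodata=None)
with rasterio.open("terrace_mask_20m.tif", "w", **profile) as dst:
    dst.write(mask, 1)
```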
Implementing the U-Net in PyTorch
We wrote a Python script that automatically runs through all the steps required to organize and augment the training tiles, train the model, and evaluate it on validation and testing data. For this workflow, we used PyTorch (Ansel et al. Reference Ansel, Yang, He, Gimelshein, Jain, Voznesensky and Bao2024), Albumentations (Buslaev et al. Reference Buslaev, Iglovikov, Khvedchenya, Parinov, Druzhinin and Kalinin2020), numpy (Harris et al. Reference Harris, Jarrod Millman, van der Walt, Gommers, Virtanen, Cournapeau and Wieser2020), sklearn (Pedregosa et al. Reference Pedregosa, Varoquax, Gramfort, Michel, Thirion, Grisel and Blondel2011), torch_snippets (sizhky; github.com/sizhky/torch_snippets, v0.533), and the base random and os Python libraries. To ensure reproducibility and allow result comparison across different trainings, we set the random seeds at the beginning of our script.
The first part of this workflow is to define the training, validation, and testing datasets. The script first identifies the set of mask tiles that have at least a certain number of pixels with the value 1 (i.e., pixels that belong to a terrace buffer), thus ignoring “negative” tiles (i.e., tiles that do not have terraces). Our script allows us to change this pixel threshold to evaluate the impact of such a parameter on model training. This procedure creates a list of tile names to use for training, validation, and testing. The script then separates that list into training, validation, and testing subsets and removes overlapping tiles if they are in different datasets to ensure that no features are found in more than one dataset. The separation can be random (80% training, 10% validation, and 10% testing) or geographically set from a bounding box provided by the user. When the separation is geographical, all tiles within the bounding box are used for training. The remaining tiles are divided into two equal areas divided by the mean longitude (validation on the left and testing on the right). The model is trained only on the training dataset; the validation dataset is used during training only to calculate its performance on new data, and the testing dataset is used after training to evaluate the model’s performance on previously unseen data. Comparing the in-training metrics on training and validation datasets allows us to identify overfitting (i.e., when a model learns the training images instead of learning the pattern in them) and stop training before it occurs.
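The sketch below illustrates the tile-selection and random-split logic in simplified form (the overlap-removal and geographic-split options are omitted, and the names are ours):

```python
import random

def select_positive_tiles(mask_tiles, min_positive_pixels=1000):
    """Keep tiles whose mask contains at least min_positive_pixels terrace pixels."""
    return [name for name, mask in mask_tiles.items()
            if int(mask.sum()) >= min_positive_pixels]

def random_split(tile_names, train_frac=0.8, val_frac=0.1, seed=42):
    """Randomly split tile names into training, validation, and testing subsets."""
    rng = random.Random(seed)
    names = sorted(tile_names)
    rng.shuffle(names)
    n_train = int(len(names) * train_frac)
    n_val = int(len(names) * val_frac)
    return names[:n_train], names[n_train:n_train + n_val], names[n_train + n_val:]
```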
The second part of the workflow is to load and preprocess the training dataset tiles. CNN models written in Python use a “dataloader,” which is a block of code that loads training images in batches, preprocesses them, and runs them through the model. For our study, preprocessing involved combining three one-band visualizations to form three-band tiles and changing their appearance through augmentations from the Albumentations Python library (Buslaev et al. Reference Buslaev, Iglovikov, Khvedchenya, Parinov, Druzhinin and Kalinin2020). Augmentations are slight changes that are applied with a given probability to each image when they are loaded in their batch (Table 1). When training a model over multiple epochs (each epoch is one complete pass over the whole dataset), different augmentations are applied to the same tiles at each epoch, which increases the number of different images the model trains on.
Table 1. Augmentations Used in This Research.

Note: The model runs each image through the augmentations from the top down, applying each with its listed probability; therefore, one image can be blurred first, then flipped both horizontally and vertically, and finally rotated.
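A minimal Albumentations pipeline in the spirit of Table 1 might look like the following; the specific transforms and probabilities here are placeholders rather than our exact values, and the input arrays are stand-ins for a three-band tile and its binary mask.

```python
import numpy as np
import albumentations as A

# Hypothetical inputs: a three-band 256 x 256 tile and its binary mask.
three_band_tile = np.random.rand(256, 256, 3).astype("float32")
mask_tile = np.random.randint(0, 2, (256, 256)).astype("uint8")

# Illustrative pipeline (probabilities are placeholders). Each transform is
# applied independently with its probability every time a tile is loaded, so
# the same tile can look different at each epoch.
augment = A.Compose([
    A.Blur(p=0.2),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
])

# Spatial transforms are applied to the image and its mask together, so the
# terrace labels stay aligned with the augmented tile.
augmented = augment(image=three_band_tile, mask=mask_tile)
image, mask = augmented["image"], augmented["mask"]
```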
The third part of this workflow is the training, validation, and testing of the model. We relied on the torchvision library (Ansel et al. Reference Ansel, Yang, He, Gimelshein, Jain, Voznesensky and Bao2024) to create the structure of our U-Nets and load the pretrained weights of the backbones we selected. However, we made a few changes to the code to fit our project’s needs. First, to fine-tune the pretrained models, we froze the weights of all encoder nodes by setting their learning rate to zero; therefore, only the decoder nodes learned the new pattern shown in the training dataset. Second, we added a 30% dropout layer at the base of the decoder blocks to reduce overfitting. Dropout layers deactivate a given percentage of nodes in that layer (in this case, 30%), which forces the remaining active nodes to learn the pattern better (Srivastava et al. Reference Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov2014).
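In PyTorch terms, these two modifications can be sketched as follows; the decoder here is a placeholder module (our real decoder mirrors the encoder with up-convolutions and skip connections), and assigning the encoder a learning rate of zero in the optimizer's parameter groups is one way to freeze it.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

# Stand-ins for this sketch: a pretrained encoder and a minimal placeholder decoder.
encoder = vgg16(weights=VGG16_Weights.DEFAULT).features
decoder = nn.Sequential(
    nn.Dropout(p=0.3),                  # 30% dropout at the base of the decoder
    nn.Conv2d(512, 1, kernel_size=1),   # placeholder decoder layers
)

# Freeze the encoder by giving its parameters a learning rate of zero, so only
# the decoder learns the terrace pattern during fine-tuning.
optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 0.0},   # frozen, pretrained encoder
    {"params": decoder.parameters(), "lr": 1e-3},  # trainable decoder
])
```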
Because there are no agreed-upon parameter values that perform best for models trained on archaeological objects, we tested different combinations on randomly separated datasets (Table 2). We proceeded iteratively, first comparing the results obtained by the three backbones. We selected the best backbone based on the metrics detailed below and used it to train different models using a different loss function. We then selected the best backbone and loss combination and used it to train models using different learning rates, and so on until we had one set of parameters that performed best. While this approach may have missed model parameters that could have provided better results, it allowed us to identify a satisfactory model quickly.
Table 2. Parameters Tested in the Order They Were Tested.

Note: For each row, the values are presented in the order they were tested. The value in bold shows the choice that obtained the best validation recall.
We initially ran all those exploratory models for 30 training epochs, at which point we logged performance metrics (accuracy, recall, precision, F1, and the Matthews Correlation Coefficient [MCC]) calculated on the pixels of training, validation, and testing tiles. Accuracy and MCC take into consideration the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN; see equation 4), whereas recall, precision, and F1 overlook TN (see equations 1–3 below). While accuracy, recall, precision, and F1 have been used extensively (e.g., Abdollahi et al. Reference Abdollahi, Pradhan, Shukla, Chakraborty and Alamri2020; Anttiroiko et al. Reference Anttiroiko, Jan Groesz, Ikäheimo, Kelloniemi, Nurmi, Rostad and Seitsonen2023; Freeland et al. Reference Freeland, Heung, Burley, Clark and Knudby2016; Li et al. Reference Li, Zhe, Wei, Gong, Zhang and Chen2023; Olivier and Verschoof-van der Vaart Reference Olivier and Verschoof-van der Vaart2021; Quintus et al. Reference Quintus, Davis and Cochrane2023; Suh and Ouimet Reference Suh and Ouimet2023; Verschoof-van der Vaart et al. Reference Verschoof-van der Vaart, Lambers, Kowalczyk and Bourgeois2020; Verschoof-van der Vaart and Lambers Reference Verschoof-van der Vaart and Lambers2019), MCC (equation 4 below) is better suited to handle imbalanced datasets (Chicco and Jurman Reference Chicco and Jurman2020), which are common in archaeology. For our purpose, we paid close attention to MCC and recall but decided to select the model with the highest recall on the validation dataset because we find it more important to maximize the number of archaeological sites and features detected than to minimize the number of false detections, which can be manually excluded afterward.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4}$$
To fulfill the first aim of this study, we used the parameter settings of the best model to train a new model over 200 epochs. For this model, we separated the training/validation/testing datasets geographically; the training dataset (∼80% of all terraces) is the north part of the refuge (see Figure 6), whereas the remaining south portion is divided in half along the x axis to create the validation and the testing datasets. We logged the training and validation metrics over each epoch to look for overfitting and to identify the epoch with the highest validation recall and MCC, which we deemed the optimal epoch. We ran a new training using the same parameter combinations but ended it at that optimal epoch, which is equivalent to stopping the training when improvement stops, a practice commonly used to improve model performance (Banasiak et al. Reference Banasiak, Berezowski, Zapłata, Mielcarek, Duraj and Stereńczak2022; Davis et al. Reference Davis, Caspari, Lipo and Sanger2021; Herrault et al. Reference Herrault, Poterek, Keller, Schwartz and Ertlen2021; Quintus et al. Reference Quintus, Davis and Cochrane2023). For the second aim of this research, we used the same settings to train models using different percentages of training versus validation/testing datasets (Table 3). To create those datasets, we simply changed the size of the training dataset’s geographical bounds to roughly encompass a given percentage of all annotated terraces. The rest of the map was used for validation and testing, again separated equally along the x axis; therefore, a model training on 10% of tiles was validated on a different 45% of tiles and tested on the remaining 45%. We defined the training areas starting from the south portion of the map, because it is narrower and thus easier to cut into areas small enough to get as close as possible to the percentage needed.

Figure 6. Extent of the different training datasets against the extent of the lidar data (black) and the annotated terraces (white lines).
Table 3. Number of Tiles Used for Each Training on Smaller Datasets.

We ran those trainings for 75 epochs and logged their metrics at every epoch to see if a model trained on a reduced training dataset for a long time performs as well as a model trained for fewer epochs on a larger dataset.
Calculating Metrics on Objects. Default model output metrics evaluate performance on a pixel-by-pixel basis, which is useful to follow a model’s performance during training; however, that level of precision is not necessary for the detection of archaeological objects. Therefore, to calculate more representative measures, we developed a Python script that calculates the metrics using object overlap instead of pixel values and allows us to denoise the predictions beforehand to remove unrealistic predictions (i.e., prediction areas that are smaller than any of the annotated terraces). The script uses the tiles that are created by the model at the end of training. Those tiles are the prediction of the presence/absence of a terrace in each of the testing tiles fed into the model during the testing step. The script merges those tiles into one raster and polygonizes it so that all touching pixels with a value of one form a polygon. The script then deletes any predicted polygon with an area smaller than a given threshold to perform an automatic first-pass denoising.
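A simplified sketch of this merge-polygonize-filter step, starting from a hypothetical merged prediction raster and using rasterio and geopandas, is shown below.

```python
import geopandas as gpd
import rasterio
from rasterio import features
from shapely.geometry import shape

# Hypothetical input: the merged prediction raster (1 = predicted terrace pixel).
with rasterio.open("merged_predictions.tif") as src:
    pred = src.read(1).astype("uint8")  # integer dtype expected by features.shapes
    transform = src.transform
    crs = src.crs

# Polygonize groups of touching pixels with a value of 1.
polygons = [
    shape(geom)
    for geom, value in features.shapes(pred, mask=pred == 1, transform=transform)
    if value == 1
]
pred_gdf = gpd.GeoDataFrame(geometry=polygons, crs=crs)

# First-pass denoising: drop predicted polygons smaller than the area of the
# smallest buffered annotated terrace (1,600 m2 in our case).
denoised = pred_gdf[pred_gdf.geometry.area >= 1600]
denoised.to_file("predictions_denoised.shp")
```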
When provided with a shapefile of the actual annotated terraces covering the same area as the testing dataset, the script then uses the overlap of the actual vs. predicted polygons to compute the metrics object by object rather than pixel by pixel (Figure 7). Since these metrics are calculated after unrealistic predictions are removed, and archaeologists would look for entire terraces rather than meters of terrace cover, we believe that those metrics are a better representation of the model’s performance.
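The overlap logic itself reduces to a few lines; the sketch below is a simplification (it assumes the denoising threshold has already been applied) that counts any annotation intersecting a prediction as a TP.

```python
import geopandas as gpd

def object_metrics(annotations, predictions):
    """Object-by-object recall, precision, and F1 from polygon overlap (sketch).

    annotations and predictions are GeoDataFrames in the same CRS. An annotated
    terrace that intersects at least one prediction is a TP, one that intersects
    none is a FN, and a prediction that intersects no annotation is a FP.
    """
    pred_union = predictions.geometry.unary_union
    annot_union = annotations.geometry.unary_union

    tp = int(annotations.geometry.intersects(pred_union).sum())
    fn = len(annotations) - tp
    fp = int((~predictions.geometry.intersects(annot_union)).sum())

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return recall, precision, f1
```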

Figure 7. Example of the object-by-object metric workflow. Windows A–C are from a different area than windows D and E: (A) predicted terraces shown as black pixels; (B) polygons with their area (m2), colored to show whether they will be automatically deleted (dark gray) or kept (white) based on the 1,600 m2 denoising threshold; (C) denoised predicted polygons; (D) example of annotated terraces; (E) superposition of the annotated terraces with the predicted terraces. The fill colors show how they are used to calculate the object-by-object metrics; TP and FN are annotations that do and do not overlap with predictions, respectively, whereas FP are predictions that do not overlap with annotations.
We used this script to compute the object-by-object metrics on the predictions of the best model at its optimal epoch, as well as on the predictions of all models run on smaller training datasets. To define the denoising threshold, we used the rounded area of the smallest buffered annotated terrace (1,600 m2), and we tested two bigger thresholds (2,000 m2 and 2,400 m2) to see how they improved precision and the F1 score without reducing recall too much. We visualized the detection map created by our best denoised model to manually investigate the FPs and FNs, correct any mistakes in our annotations, and better understand the strengths and weaknesses of our model. Finally, to make our results comparable with a larger set of studies, we rasterized the corrected final predictions and compared that map with the raster map of annotated terraces to compute MCC.
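For the final rasterized comparison, MCC can be computed with scikit-learn on the flattened binary rasters, for example (the small arrays below are hypothetical stand-ins for the prediction and annotation maps):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Hypothetical aligned binary rasters (1 = terrace pixel, 0 = background); in
# practice these would be the rasterized predictions and annotated-terrace mask.
annotation_raster = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]])
pred_raster = np.array([[0, 1, 0], [0, 1, 0], [1, 0, 0]])

# MCC on the flattened pixels (equation 4); it accounts for all four
# confusion-matrix terms and is robust to class imbalance.
mcc = matthews_corrcoef(annotation_raster.ravel(), pred_raster.ravel())
print(round(mcc, 3))
```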
Contextualizing the Terraces
Some results discussed below led us to investigate the spatial context of the terraces used for model training. For this investigation, we created heatmaps of the location of all annotated terraces to get an overview of their location, as well as a resampled slope map at a coarser resolution to get a more general view of the slope in each area. Both maps were created in QGIS (QGIS.org 2023).
To create the terrace heatmaps, we reduced the buffer annotations to their centroids, which we used for a Kernel Density interpolation with a radius of 1,000 m and one with a radius of 250 m. To create the coarser slope map, we used the GRASS r.resamp.stats tool (GRASS Development Team 2024) in QGIS, using the median value and exporting the result to a resolution of 100 m2. We then used the v.what.rast tool (GRASS Development Team 2024) to extract the coarse slope, and the two heatmap values at the location of each terrace centroid. We used the tidyverse package (Wickham et al. Reference Wickham, Averick, Bryan, Chang, McGowan, François and Grolemund2019) in R to visualize those values, grouped geographically.
Results
We trained our models on an Intel Core i9-14900K 3.20 GHz desktop computer with a NVIDIA GeForce RTX 4090 graphics card with 24 GB of dedicated memory. The 30-epoch iterative trainings took 22–30 minutes each, the best-run 200-epoch training took 2 h 40 min, and the 75-epoch smaller dataset trainings took 20 min to 1 h, depending on the size of the training dataset.
Best Model
The best model used the VGG19 backbone with the latest pretrained weights, the AdamW optimizer, IoU loss, a constant learning rate of 0.001, and a combination of slope, PosOp, and TRI 256 × 256-pixel tiles loaded in batches of 16 tiles, where mask tiles were created from a 20 m buffer around the annotations, and where tiles with positive masks smaller than 1,000 pixels were ignored.
Although the rate of improvement in validation loss decreased after epoch 20, it continued to improve slightly, indicating that our model is not overfitting (Figure 8). Using the combination of MCC and recall on the validation dataset, we can see that the model performs best at epoch 56; therefore, our best model (henceforth called model N for reasons we explain below) was trained using the parameter settings identified above for 56 epochs.

Figure 8. Smoothed metrics per epoch for the 200-epoch training of the best parameters (smoothing span 0.2). The gray represents the 95% confidence interval around the smoothed value.
At the end of the 56 epochs, model N achieved pixel recall = 0.67, precision = 0.64, F1 score = 0.66, and MCC = 0.55 on the validation dataset. The object-by-object recall metrics obtained on the denoised predictions on the test area indicate that model N was able to correctly predict 88%–90% of terraces found on the landscape, even if our precision metrics were slightly lower (Table 4). Increasing the denoising threshold from 1,600 m2 to 2,000 m2 removes 31 FP but only 5 TP, whereas further increasing the threshold to 2,400 m2 removes 34 additional FP and 8 TP, which is less of an improvement. Therefore, we selected the predictions larger than 2,000 m2 for the visual evaluation, through which we were able to further improve precision.
Table 4. Object-by-Object Performance Metrics after Automatic Denoising of the Predicted Vector Map.

Note: The denoising involved deleting predicted polygons smaller than the given threshold. The lowest and highest values in each metric are shown in italic and bold, respectively.
Visual Evaluation. Visual examination of model N’s denoised predictions on the test area showed that it identified 11 new terraces that we had missed during annotation. For the most part, those terraces were relatively small and indistinct (see left prediction in Figure 9a). Interestingly, many of the remaining 115 FP were either straight lines that may have been old land boundaries or pedestrian paths, both of which can easily be mistaken for terraces because they sometimes follow slopes (see path in Figure 9b). Therefore, these results show that, in general, model N performed very well when identifying the pattern of terraces on sloped landscape.

Figure 9. Examples of true positives (a), false positives (b), and false negatives (c) identified by the model: (a) small and indistinct terrace (left) and more regular terraces that are likely modern (right); (b) pedestrian path that was incorrectly flagged; (c) annotated terrace not predicted by the model that is likely a pedestrian path. The basemaps represent slope overlayed on a hillshade map for increased topography visibility. All white scale lines represent 50 m.
Conversely, FN (the annotated terraces that were not predicted by the model) are mostly small and isolated lines, three of which seem to have been misclassified by our team during annotation (see Figure 9c). We removed those three annotations and recalculated the object-by-object metrics of this model after manually reclassifying FN and FP, which resulted in an even better model performance (recall = 0.901, precision = 0.850, and F1 = 0.875). The rasterized version of those cleaned predictions led to a satisfactory MCC score of 0.562.
Minimum Training Dataset Size
To evaluate the impact of training size on a model’s performance, we plotted the metrics over epochs for all trainings done with smaller training datasets for 75 epochs. The validation loss, recall, precision, and MCC metrics of those trainings improve as the size of the dataset increases; however, that improvement is not linear, and its rate varies by dataset size (see Figure 10). The model trained on only 5% of the dataset has the worst value for most metrics, except validation recall, which is better than model N’s (an interesting result we discuss below). The validation recall curves of the 5% and 10% models also differ from the other models’ curves; they decrease substantially during the first ∼20 epochs, whereas for all other models, the first ∼20 epochs show a steady improvement that slows down afterward. This is an interesting difference, especially given that the curves of their other metrics (loss, precision, and MCC) follow the general pattern of the other models. This suggests that results can be erratic when the training dataset is very small (in this case, only 541 terraces over 276 training tiles for the 5% training). Similar to the pattern seen with model N, all metrics improve considerably during the first ∼20 epochs, after which they still improve, but at a slower pace.

Figure 10. Metrics per epoch for each model trained on different dataset sizes. The gray represents the 95% confidence interval around the smoothed value (span 0.2). Refer to Figure 6 for the extent of the training dataset for each model.
The training metrics of all models converge to the same approximate values after 75 epochs, while most validation metrics stay separate. In terms of the training metrics, this suggests that the models have learned as much as they can from their training dataset after 75 epochs. The validation metrics may suggest that models trained on smaller datasets are not as transferable to new data as models trained on larger datasets; however, the difference between the metrics decreases as the size of the training dataset increases. This suggests that increasing the size of a small training dataset leads to more substantial improvements than increasing the size of an already large one. Moreover, as discussed below, carefully curating the quality of the training dataset has as much of an impact on model performance as its size, if not more.
One intriguing result is that model N appears to perform poorly, in terms of recall, when compared with models trained on smaller datasets. More importantly, recall improves as the training size increases for all smaller models, but the model using a similar percentage of training tiles as model N still outperforms it. This result prompted a closer look at our training data: for model N, the training dataset was the northern 80% of the map, whereas for the smaller models, we used the south part because it was narrower and thus easier to divide to reach the percentages we wanted. This difference in training areas produced markedly different results: a model trained on northern terraces performs worse when predicting southern terraces than vice versa. We investigated possible causes for this pattern by examining heatmaps of terrace locations and the coarser slope map (see Figure 11 top). We found that the model trained on the 80% southern terraces (called model S henceforth) was trained on terraces located on steeper slopes than the landscape it was validated on, whereas model N was trained on terraces positioned on slightly gentler slopes than it was validated on. Given that terraces located on steeper slopes are easier to distinguish than those located on gentler slopes, this pattern was contrary to what we expected and did not explain why model S performed best. Instead, we found that the aggregation of terraces was a better explanation: model S was trained on more isolated terraces than model N, which was trained on regions that had larger aggregations of terraces (Figure 11 bottom). Because the masks of multiple terraces located near one another blend into large contiguous masks, model N likely became much better at recognizing groups of terraces than individual ones, leading to more FN when validated on southern areas that had more isolated terraces. On the other hand, model S learned to recognize the visual signature of individual terraces, which could then be applied more easily to their aggregations in the northern validation areas.

Figure 11. Values extracted at the location of terrace centroids. Top: median slope within a 100 m radius. Bottom: number of terrace centroids within a 250 m radius. The notches show the 95% confidence interval around the median.
While the in-training pixel metrics suggest that a larger training dataset is best, the object-by-object metrics calculated on denoised prediction maps covering the 10% testing area (Table 5) show that the finished products of the different trainings are very similar. When denoising at threshold 2,000, recall ranges from 0.922 to 0.964, and precision from 0.856 to 0.899, which are very good values. The tight range of recall values suggests that we can obtain very good results even with models trained for 20 min on small training datasets. A visual comparison of the 5% to the 80% models’ predictions (Figure 12) supports this conclusion.

Figure 12. Comparing some of the test predictions of two models trained on southern terraces.
Table 5. Object-by-Object Metrics Calculated on Denoised Predictions (Using Threshold 2,000) Filtered to Focus Only on the Northeastern Region That Was Used for Testing in All Models.

Figure 12 shows that while the southern model trained with 80% of the terraces produces predictions that follow the contours of the annotations more closely, the model trained on only 5% generally predicts most of the same areas as containing terraces. Therefore, small numbers of training annotations should still lead to useful results for CRM companies.
Final Map
To obtain a general map of all terraces, we applied the best model to the whole area covered by the PNWR and removed all predictions smaller than 2,000 m2 (Figure 13). We did not calculate the metrics on this dataset because they would be misleading, given that the model was trained on most of those images already. While we were not able to ground-truth our model’s predictions, we shared this map with the PNWR archaeologists and resource managers to help them inventory and manage archaeological terraces moving forward. They will evaluate the predictions if and when they need to manage those archaeological resources.

Figure 13. Terraces predicted by our best model cleaned using 2,000 threshold and clipped to the extent of the PNWR.
Discussion
This research adds to recent applications of CNNs to detect archaeological terraces (Herrault et al. Reference Herrault, Poterek, Keller, Schwartz and Ertlen2021; Quintus et al. Reference Quintus, Davis and Cochrane2023) as it provides a robust model that is sensitive enough to detect the vast majority of terraces. In terms of the general research on using semantic segmentation on archaeological objects, our methodology reached performance metrics that are comparable to the work of others who similarly used U-Net. Notably, our MCC metrics did not reach the high score obtained in the study by Suh and Ouimet (Reference Suh and Ouimet2023) on stone walls in the US Northeast; however, we would argue that such straight walls are easier to recognize in imagery than the sinuous edges of terraces found in our dataset. Our final results are more comparable to the works of Bundzel and colleagues (Reference Bundzel, Jaščur, Kováč, Lieskovský, Sinčák and Tkáčik2020), who obtained an MCC score of 0.618 on Mayan structures, Gallwey and colleagues (Reference Gallwey, Eyre, Tonkins and Coggan2019), who obtained recall values of 0.80 to 0.83 for their segmentation of mining pits in England, and the research by Banasiak and colleagues (Reference Banasiak, Berezowski, Zapłata, Mielcarek, Duraj and Stereńczak2022), which obtained an F1 score of 0.79 on the validation dataset of their model trained to recognize eight different archaeological objects in Poland. Those studies influenced our choice of DL model structure, as they showed that U-Net could successfully be used to find irregular objects in lidar-derived imagery. Moreover, while Herrault and colleagues (Reference Herrault, Poterek, Keller, Schwartz and Ertlen2021) showed that terrace locations can be obtained using simpler models, such as Random Forest, the model we present here is only the first of a set of models we will train to recognize different archaeological objects (e.g., homesteads, cemeteries, trails, even looter pits), for which we thought we could benefit from the flexibility and code simplicity of the U-Net structure. We also show that our additional denoising steps are effective at improving the results, and thus provide a more accurate and useful map of potential terraces to our USFWS partners who will use this dataset to inform future archaeological surveying and monitoring at the PNWR.
This study expands the application of DL in cultural resource management for land-managing agencies by advancing methods for archaeological inventorying, monitoring, and preservation at a landscape scale. The high recall values obtained by our model can amplify an agency CRM program’s ability to effectively conduct pedestrian survey and testing in areas with high potential for encountering historical infrastructure and other sites of interest. For example, the archaeological cotton terraces of the PNWR provide a landscape-scale inventory of intensive agricultural land use (Figure 13) that may help pinpoint the locations of associated homesteads, historic roads/grades, and other features. The scale and distribution of historic terraces, coupled with a lack of ground visibility in this region, mean that this would not be possible with pedestrian survey alone. Given the statutory obligations of federal agencies to manage cultural resources on their properties, this approach to DL has the potential to greatly expand their capabilities and efficiencies. Moreover, this method is relatively quick and computationally inexpensive, given that it uses transfer learning of pretrained models. The most time-intensive part of this project was annotating all the terraces we could see on visualization maps; however, we showed here that our model can successfully learn the pattern with only 5% of the annotations we created, which will help us better anticipate the annotation needs of our other models in the future. Moreover, we now possess a set of trained models we can apply to new partnering refuges to find more terraces. Therefore, while this specific project still required a considerable amount of time, it will decrease the time required to detect terraces in new areas.
The use of DL models to detect archaeological objects unfortunately has drawbacks that must be weighed against its advantages. For one thing, it still requires a considerable amount of time and a high level of coding proficiency to create and train models. While commercial GIS software now includes DL tool kits with a graphical user interface that can replace the code development part of this process, we found that those tools did not give us as much flexibility in parameter choices, or as much information on model training, as we wanted, which is why we chose to develop our own Python model instead. However, the coding skills required to create this workflow are not available to all archaeologists or cultural resource managers. Fortunately, we are building alliances with CRM organizations to help them circumvent this very time-intensive step. Another potential drawback is that, even with a good model structure, DL requires good data to produce good outcomes (the classic “garbage in, garbage out”). Therefore, CNN models applied to lidar-derived imagery rely on good point-cloud data to provide good visualization maps. Fortunately for our project, our point density remained consistently high throughout the refuge, which allowed the model to detect terraces in all areas. However, we are aware that these conditions do not exist everywhere, and additional modifications may need to be made in areas with higher topographic roughness (see Quintus et al.’s [Reference Quintus, Davis and Cochrane2023] discussion on this topic).
For the PNWR, this study provides welcome data on the location of cotton terraces that were abandoned almost a century ago and were not catalogued in their CRM database. This research reveals a forgotten, intensively cultivated, and ultimately unsustainable agricultural landscape which continues to influence contemporary erosion, forest vegetation, and soil health. Our research shows just how common terraces are, despite their infrequent consideration in archaeological research in the US Southeast. Moreover, as mentioned above, this updated terrace database will assist in further research to locate historic homes and farmsteads for which we have some historical records and inventories (Hartman and Wooten Reference Hartman and Wooten1935), but no precise locations.
Acknowledgments
We specifically thank Rick Kanaski and Jon Wallace of the USFWS for their support and subject-matter expertise during this project.
Funding Statement
This material is based on work supported by the US DOI FWS award #F22AC02819 and the USDA Forest Service award #22-PA-11080600-228. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the US DOI FWS and USDA Forest Service. These institutions are equal opportunity providers.
Data Availability Statement
Lidar data are available on request by contacting the USDA Forest Service Southern Research Station via their website (https://www.fs.usda.gov/research/srs/contactus). Archaeological data may be available to interested researchers by contacting the USFWS’s Piedmont National Wildlife Refuge via their website (https://www.fws.gov/refuge/piedmont/contact-us). All of the code and scripted analyses used in this study are available on our GitHub repository (https://github.com/NMC-CRS/semantic-segmentation-of-historial-terraces-on-public-lands) or via the following DOI: https://doi.org/10.5281/zenodo.14107060. To help others learn how to use our code without sharing protected archaeological data, we created a dummy dataset of annotated roads in the Juliette community, which neighbors the PNWR. That dataset can be found alongside the code in the GitHub repository.
Competing Interests
The authors declare none.