Introduction
Circular dichroism (CD) spectroscopy is a widely used technique for studying the structures of chiral molecules (Gottarelli et al., 2008; Nordén et al., 2010). While extensively employed in biology to determine the secondary structure of proteins (Greenfield, 2006), CD can also be applied to investigate other chiral biomolecules, such as nucleic acids, which adopt many forms (Gray et al., 1981; Steely et al., 1986; Johnson, 1990; Gray et al., 1995; Kypr et al., 2009; Del Villar-Guerra et al., 2018). This makes CD suitable for studying their folding patterns as well, which is important for understanding the functions of nucleic acid sequences. The advantage of CD measurements is that they offer a fast, non-destructive way to identify nucleic acid folding. Compared to other techniques that provide 3D information, such as crystallography, NMR, or cryo-electron microscopy (Neidle and Sanderson, 2022a, 2022b), CD can be applied to aqueous solutions of nucleic acids without impacting their structure. For protein studies, a circular dichroism structural database (PCDDB) exists (Ramalli et al., 2022), enabling the indexing of unknown structures by comparing their CD spectra to those referenced in the PCDDB and other sources (Micsonai et al., 2022; Nagy et al., 2024).
For nucleic acids, a similar database exists, the NACDDB, comprising previously published and new spectra (Cappannini et al., 2023). However, due to the flexibility and structural variability of nucleic acids, and because their electronic transitions are distributed over a larger distance than in the peptide bond, the number of possible spectra observed for polynucleotides is far larger than for proteins. Four major secondary structural types have been assigned in proteins (α-helix, β-sheet, turn, and random coil) (Manavalan and Johnson, 1987; Sreerama and Woody, 1994; Wallace, 2009; Kuril et al., 2024). For protein CD, basis spectra have been correlated to secondary structure elements. They correspond to secondary structure subclasses, distinguishing regular and distorted α-helices, parallel β-sheets (including three distinct twisting patterns) and antiparallel β-sheets, turns, and others (Micsonai et al., 2022; Burastero et al., 2025). Multivariate statistical analysis performed in the NACDDB has revealed a greater number of potential reference spectra for nucleic acids than for proteins, owing to their numerous sequence specificities and chemical variations (Cappannini et al., 2023).
As a result, there is currently no well-established list of reference spectra for known nucleic acid secondary structures, which hinders the assignment of unknown nucleic acid CD spectra to specific structures.
Classical approaches, such as multivariate statistical analysis or other unsupervised classification methods, as shown in Supplementary Figure 1a,b, are unsuitable for establishing these reference spectra due to the structural heterogeneity and the limited availability of CD spectroscopic data for nucleic acids. One relevant factor to consider is the correct annotation of spectra, as the literature may contain different spectra, spanning a wide spectral range, assigned to the same structural family. Also, while most spectra in the literature have been acquired in the 190 to 300 nm range, it has been demonstrated that spectral extension down to the far UV (170 nm), accessible by synchrotron radiation circular dichroism (SRCD) or with the very latest benchtop CD spectrometers, is crucial for discriminating structural families (Gray et al., 1992; Le Brun et al., 2020). Because of all these limitations, there is currently no robust method to classify nucleic acid CD spectra. To address this issue, we have developed a workflow that identifies different structural classes and determines their corresponding reference spectra.
The established workflow enabled us to determine reference spectra for five well-known structures: parallel DNA quadruplexes, DNA triplexes, Z-DNA, DNA stem loops, and RNA stem loops (Sinden, 1994; Vanegas et al., 2012). Moreover, the method's robustness was demonstrated by assigning unknown spectra to the correct spectral family (thereby predicting their structure) and by reclassifying spectra that had been manually assigned to the incorrect family. This workflow can thus serve as a useful tool to create a list of reference spectra for the various structures of nucleic acids, akin to those existing for proteins, and to assign unknown spectra to a defined family. Due to redundancy issues and the limited number of available spectra compared to assigned structures, it is currently not possible to expand the list beyond five structures (basis spectra). However, we are convinced that this number will increase as additional spectra are published or made publicly available. In the future, a complementary approach could be developed to determine the number or percentage of distinct structures in larger and more complex nucleic acids.
Methods
CD data sets
The dataset used for developing our workflow comprises 118 spectra. Among these, 64 were sourced from the NACDDB (Cappannini et al., 2023), with 59 initially acquired on the DISCO beamline of the Synchrotron SOLEIL and the other 5 originating from the literature (Gray et al., 1981; Steely et al., 1986; Johnson, 1990; Gray et al., 1995; Del Villar-Guerra et al., 2018). Among the remaining 54 spectra not yet included in the NACDDB, 48 were acquired on the DISCO beamline and 6 were taken from the literature (Holm et al., 2010; Vanloon et al., 2023). All spectra in the dataset are scaled to differential molar ellipticity (Δε) using the formula:
$$ \Delta\varepsilon = \frac{\theta \times 0.1 \times \mathrm{MRW}}{\mathrm{PL} \times C \times 3298} $$
where θ is the circular dichroism measured in millidegrees (mdeg), MRW is the mean residue weight of the sample (g.mol−1.residue−1), PL is the path length in centimeters (cm), C is the concentration of the sample in grams per liter (g.L−1), and 3298 is a constant used for unit conversion. Δε is thus expressed in M−1.cm−1.residue−1.
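As an illustration of this scaling step, the short sketch below applies the conventional conversion Δε = θ × 0.1 × MRW / (PL × C × 3298). The function name, the argument names, and the example values (a mean residue weight of 330 g.mol−1 and a 0.1 cm cell) are illustrative assumptions and are not taken from the dataset.

```python
import numpy as np


def delta_epsilon(theta_mdeg, mrw, path_cm, conc_g_per_l):
    """Convert a CD signal in mdeg to delta-epsilon per residue.

    Assumes the conventional relation
    delta_eps = theta * 0.1 * MRW / (PL * C * 3298).
    """
    theta_mdeg = np.asarray(theta_mdeg, dtype=float)
    return theta_mdeg * 0.1 * mrw / (path_cm * conc_g_per_l * 3298.0)


# Hypothetical values chosen only to make the arithmetic easy to follow.
signal = delta_epsilon([32.98, -16.49], mrw=330.0, path_cm=0.1, conc_g_per_l=1.0)
print(signal)
```

Because the conversion is a single multiplicative factor per sample, it changes the amplitude of a spectrum but not its shape.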
The spectra retained were those including signals between 175 and 300 nm. This range presents characteristic UV absorption maxima and minima corresponding to electronic transitions associated with the base pairing, stacking, and overall twisting of the polynucleotide, e.g. n→π*, π→π*, as well as n→σ* (Miyahara et al., 2012, 2016). Although a clear attribution of electronic transitions within a strand of nucleic acids has not yet been established, there are a few indications, such as the 260–280 nm CD-absorption band for the base stacking, a band around 190 nm for the backbone conformation, and another one in between for the twisting of helical nucleic acids (Holm et al., 2010).
For workflow validation, a validation subset of 56 spectra, each corresponding to a well-known nucleic acid structure, was established. The list of spectra used and their corresponding structures is presented in Supplementary Table 1. It contains 7 families: parallel DNA quadruplexes (3 spectra), DNA triplexes (6 spectra), Z-DNA (3 spectra), DNA loops (3 spectra), RNA loops (15 spectra), DNA loops (6 spectra), and unclassified spectra (20 spectra). The latter group comprises spectra belonging to 11 other structural families, each with fewer than 3 representative spectra.
Statistical tools
Spectra normalization
All scaled spectra used in this work have been normalized to a mean of 0 and a standard deviation of 1. This normalization makes the spectra comparable on a centered, unit-variance scale and reduces the weight of high amplitudes that would otherwise bias the analysis. To achieve this, we calculated the mean and standard deviation over all wavelengths of each spectrum, then subtracted that mean from the spectrum and divided the result by the standard deviation.
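The normalization described above can be sketched in a few lines of NumPy. The function name and the use of the population standard deviation (`ddof=0`, the NumPy default) are assumptions made for this example.

```python
import numpy as np


def normalize_spectra(spectra):
    """Scale each spectrum (one per row) to mean 0 and standard deviation 1."""
    spectra = np.asarray(spectra, dtype=float)
    # Statistics are taken across wavelengths, separately for each spectrum.
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std


# Two toy spectra with very different amplitudes.
raw = np.array([[1.0, 2.0, 3.0, 4.0], [10.0, 0.0, -10.0, 0.0]])
normalized = normalize_spectra(raw)
print(normalized.mean(axis=1), normalized.std(axis=1))
```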
Self-organizing mapping
For classification, we employed a simple neural network known as the Kohonen self-organizing map (SOM), which has previously been used to classify nucleic acid CD spectral data (Sathyaseelan et al., 2021). The SOM was implemented using the MiniSom Python package (https://github.com/JustGlowing/minisom/). The neural network parameters were set following the recommendations given in the built-in help of the MiniSom package.
Multivariate statistical analysis
Multivariate analysis was conducted on the entire dataset to group the spectra into families. Initially, hierarchical clustering was employed using the Ward method and Euclidean distances between each pair of spectra. This analysis was conducted using the Python package SciPy (http://www.scipy.org/). Additionally, principal component analysis was carried out simultaneously using the SIMCA software (V17) to identify clusters and significant components for class differentiation.
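A minimal sketch of the hierarchical clustering step with SciPy is given below. The helper name `ward_families`, the choice to cut the tree into a fixed number of clusters, and the toy Gaussian bands are assumptions for illustration; the source does not state how the dendrogram was cut.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def ward_families(spectra, n_families):
    """Group spectra (one per row) into clusters using Ward linkage."""
    spectra = np.asarray(spectra, dtype=float)
    # Ward linkage on the observation matrix relies on Euclidean distances.
    tree = linkage(spectra, method="ward", metric="euclidean")
    # Cut the dendrogram so that at most n_families clusters remain.
    return fcluster(tree, t=n_families, criterion="maxclust")


# Toy input: two positive-band and two negative-band "spectra".
wavelengths = np.linspace(175.0, 300.0, 126)
band = np.exp(-((wavelengths - 260.0) / 10.0) ** 2)
toy = np.array([band, 1.1 * band, -band, -0.9 * band])
labels = ward_families(toy, n_families=2)
print(labels)
```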
Singular value decomposition
Singular value decomposition (SVD) was performed using the NumPy Python package (Harris et al., 2020) to identify the initial references during workflow initialization.
Normalized correlation coefficient and normalized mutual information
A normalized correlation coefficient (NCC) was used to measure the linear resemblance between a family's reference spectrum and all spectra of our dataset, whereas normalized mutual information (NMI) was used to measure non-linear resemblance.
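The NCC is defined as
$$ \mathrm{NCC}(X,Y) = \frac{\sum_i (x_i - m_x)(y_i - m_y)}{\sqrt{\sum_i (x_i - m_x)^2}\,\sqrt{\sum_i (y_i - m_y)^2}} $$
where X and Y correspond to the spectra value vectors, x_i and y_i to the spectra values at wavelength i, and m_x and m_y to the means of the spectra.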
These coefficients were computed using the Python package SciPy (http://www.scipy.org/).
Each NMI was computed using the mutual information coefficient (I) defined as
$$ I(X,Y) = \sum_{X,Y} P(X \cap Y)\,\log\frac{P(X \cap Y)}{P(X)\,P(Y)} $$
where X and Y correspond to spectra value vectors, P(X) and P(Y) to the probability for each spectrum to reach a certain value, and P($ X\bigcap Y $) to the probability for both spectra to reach the same value at the same wavelength. Probability values are calculated from integer-rounded spectral intensities.
Entropy (H) is defined as
$$ H(X) = -\sum_{X} P(X)\,\log P(X) $$
The NMI was then calculated by applying the following equation:
$$ \mathrm{NMI}(X,Y) = \frac{2\,I(X,Y)}{H(X) + H(Y)} $$
as implemented in the scikit-learn Python package.
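The two coefficients can be obtained with the library calls named above, as in the following sketch. The helper names and the toy vectors are assumptions; the NMI uses the scikit-learn default normalization (arithmetic mean of the two entropies), and whether that default matches the setting used in this work is also an assumption.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import normalized_mutual_info_score


def ncc(x, y):
    """Normalized (Pearson) correlation coefficient between two spectra."""
    statistic, _ = pearsonr(x, y)
    return float(statistic)


def nmi(x, y):
    """Normalized mutual information of integer-rounded intensities."""
    x_int = np.rint(x).astype(int)
    y_int = np.rint(y).astype(int)
    return float(normalized_mutual_info_score(x_int, y_int))


def score(x, y):
    """Product of NCC and NMI."""
    return ncc(x, y) * nmi(x, y)


# Toy normalized spectrum and its mirror image.
spectrum = np.array([-1.6, -0.7, 0.2, 1.4, 0.9, -0.2])
mirror = -spectrum
print(score(spectrum, spectrum), score(spectrum, mirror))
```

In this toy case the mirrored spectrum keeps an NMI of 1, because mutual information does not depend on the sign of the values, while its NCC is −1. The product therefore separates the two cases, which is one way to see why combining both coefficients can be useful.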
Workflow initialization
The workflow was initiated by manually defining structural families based on the theoretical understanding of their structures. The assignment of a spectrum to a particular family relies on the anticipated structure of an oligonucleotide sequence and the characteristics of the spectrum, including the position and intensity of its peaks when normalized. Once a structural family accumulates at least four normalized spectra (heuristically determined value), an SVD is performed on it to define an initial reference spectrum (first eigen-vector) for the family. Subsequently, the initial reference spectrum is validated by ensuring that the spectra forming the basis of the reference exhibit an NCC and an NMI whose product exceeds a threshold value. This threshold value is determined by identifying the first peak above the baseline in the derivative of this product.
Workflow validation
To assess the robustness of the workflow, three metrics were calculated: sensitivity, specificity, and similarity (Jaccard index). Each of the seven families within the validation subset was individually evaluated against the entire validation subset by running the workflow. For each run, the number of true positives, false positives, and false negatives was determined. The total count for each category was then calculated by summing the results obtained for each family. These cumulative totals were utilized to compute the value of each figure of merit.
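The three figures of merit can be computed from the summed counts as sketched below. The function name and the counts are purely illustrative and are not the values obtained in this work. Specificity also requires the number of true negatives, which is included here as an additional assumed input.

```python
def figures_of_merit(per_family_counts):
    """Sensitivity, specificity, and Jaccard index from per-family counts.

    per_family_counts: iterable of (tp, fp, fn, tn) tuples, one per family.
    """
    # Sum each category over all families before computing the ratios.
    tp, fp, fn, tn = (sum(values) for values in zip(*per_family_counts))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    jaccard = tp / (tp + fp + fn)
    return sensitivity, specificity, jaccard


# Hypothetical counts for two families.
sensitivity, specificity, jaccard = figures_of_merit([(3, 1, 0, 4), (5, 1, 0, 6)])
print(sensitivity, specificity, jaccard)
```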
Results
Due to the large structural heterogeneity and the yet limited availability of relevant CD spectroscopic data, the use of multivariate statistical analysis and neural networks does not produce relevant and reproducible results. Specifically, the first eigen-vectors explaining the highest percentage of variance by principal component analysis do not correspond to any spectrum having physical significance. Moreover, hierarchical classification (Supplementary Figure 1) merges spectra belonging to different structural families. Equivalent results, with inconsistent family assignments, were observed for self-organizing mapping. Therefore, we chose to combine approaches targeting two different types of information: shape similarity (by using the NCC) and probability of value occurrence (NMI).
Workflow for defining nucleic acid structural classes from CD spectra
Based on NCC and NMI, we have established an iterative workflow (Figure 1) to determine the reference spectrum for each structural family. The workflow is applied to each manually defined family determined during the initialization process as follows:
(1) NCC and NMI values are calculated between every normalized N(0,1) spectrum from our dataset and the reference spectrum for the family.
(2) The products of these values (Score = NCC × NMI) are sorted from highest to lowest; the rank defines the abscissa (Xn) and the Score defines the ordinate (Yn).
(3) The first derivative of the (Xn, Yn) array is computed to determine the position of the first inflection point.
(4) The coordinates of the first inflection point are used as the family membership threshold (Figure 2).
(5) Spectra whose Score is above the Score at the inflection point are included in the family, regardless of whether they were part of the initial group used for the family definition.
(6) SVD is computed from all spectra of the family, and the first component is used as the new reference spectrum for that family.
(7) The process is repeated from (1) until the set of included spectra no longer changes (convergence of the iterative workflow).

Figure 1. Graphical diagram of the iterative workflow. Yellow marks the points where data are selected, red the mathematical operations, blue the decision point, and green the output.

Figure 2. Example of threshold for class assignment. (a) Plot of the correlation multiplied by the mutual information, ordered from higher to lower values. The red line marks the inflection point; spectra with scores above it are kept for the class. (b) First derivative of the data shown in (a). The red circle highlights the position of the inflection point used to define the threshold shown in (a). Abscissas correspond to the spectrum rank, ordered from higher to lower NCC and NMI product. Ordinates have no unit, as they correspond to the coefficient product value.
Once convergence is reached, the first component of SVD computed from a family’s normalized spectra is set as the reference spectrum for that structural family.
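The iterative steps above can be sketched as follows. This is not the published implementation (which is available from the repository given in the data availability statement) but a simplified NumPy-only illustration under stated assumptions: the inflection point is approximated by the first drop in the sorted scores that exceeds `k` times the mean drop, the arbitrary sign of the SVD component is fixed so that it correlates positively with the family mean, the reference is rescaled to mean 0 and standard deviation 1 before scoring, and the NMI is computed by hand with arithmetic-mean normalization. All function names and the parameter `k` are hypothetical.

```python
import numpy as np


def zscore(v):
    """Scale one spectrum to mean 0 and standard deviation 1."""
    return (v - v.mean()) / v.std()


def ncc(x, y):
    """Pearson (normalized) correlation coefficient between two spectra."""
    return float(np.corrcoef(x, y)[0, 1])


def nmi(x, y):
    """Mutual information of integer-rounded values over the mean entropy."""
    xi, yi = np.rint(x).astype(int), np.rint(y).astype(int)

    def entropy(columns):
        counts = np.unique(columns, axis=1, return_counts=True)[1]
        p = counts / counts.sum()
        return float(-(p * np.log(p)).sum())

    h_x, h_y = entropy(xi[None, :]), entropy(yi[None, :])
    h_xy = entropy(np.stack([xi, yi]))  # joint entropy of paired values
    mean_h = 0.5 * (h_x + h_y)
    return (h_x + h_y - h_xy) / mean_h if mean_h > 0 else 1.0


def reference_spectrum(family):
    """First SVD component of a family (rows = spectra), sign-fixed, rescaled."""
    _, _, vt = np.linalg.svd(family, full_matrices=False)
    ref = vt[0]
    if ref @ family.mean(axis=0) < 0:  # assumed convention for the SVD sign
        ref = -ref
    return zscore(ref)


def members_above_first_drop(scores, k=2.0):
    """Indices ranked before the first drop larger than k times the mean drop."""
    order = np.argsort(scores)[::-1]      # ranks, highest score first
    drops = -np.diff(scores[order])       # discrete first derivative
    big = np.flatnonzero(drops > k * drops.mean())
    cut = big[0] + 1 if big.size else order.size
    return np.sort(order[:cut])


def refine_family(spectra, initial_members, max_iter=50):
    """Iterate scoring, thresholding, and SVD until membership is stable."""
    data = np.array([zscore(s) for s in spectra])
    members = np.sort(np.asarray(initial_members))
    for _ in range(max_iter):
        ref = reference_spectrum(data[members])
        scores = np.array([ncc(s, ref) * nmi(s, ref) for s in data])
        new_members = members_above_first_drop(scores)
        if np.array_equal(new_members, members):
            break
        members = new_members
    return members, reference_spectrum(data[members])


# Toy dataset: four sine-shaped and four cosine-shaped "spectra".
t = np.linspace(0.0, 1.0, 126)
family_a = [a * np.sin(2 * np.pi * t) for a in (1.0, 2.0, 3.0, 4.0)]
family_b = [a * np.cos(2 * np.pi * t) for a in (1.0, 2.0, 3.0, 4.0)]
members, ref = refine_family(np.array(family_a + family_b), [0, 1, 2])
print(members)
```

In this toy run the family is initialized with three of the four sine-shaped spectra, and the fourth is recruited during the first iteration. This mirrors step (5), where spectra outside the initial group can join a family.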
Evaluation of the workflow
Once the five CD reference spectra had been determined, the robustness and accuracy of the workflow were evaluated using a data set of 56 manually assigned spectra and standardized figures of merit, as described in the Methods. Sensitivity, specificity, and Jaccard (similarity) values were 1, 0.94, and 0.94, respectively. This confirms that the workflow is robust enough to assign unknown spectra to one of the defined families. Other workflows previously described in the literature appear to be less accurate, with 87.33%, 85.33%, and 78.66% for the XGBoost algorithm, neural network, and Kohonen approaches, respectively (Sathyaseelan et al., 2021).
Applications and workflow limits
Based on the robustness of the workflow, we have successfully defined reference spectra for five families using 118 normalized spectra, with or without initial manual assignment. These families are parallel DNA quadruplexes, DNA triplexes, Z-DNA, DNA loops, and RNA loops. The superposition of the five references (Figure 3a) allows us to identify regions (between 220 and 250 nm and between 275 and 300 nm) where the CD signal remains invariant (orange in Figure 3b). This observation holds even when the normalized spectra of the entire dataset are superposed (Figure 3c,d). It is noteworthy that, due to the limited number of available spectra in databases or published structures, we opted to apply the workflow without any discontinuity in the wavelength range.

Figure 3. Comparison of the reference spectra and the dataset used. (a) Reference spectra obtained after singular value decomposition and (b) the variance of this data set at each wavelength. (c) The whole dataset, shown normalized to have comparable intensities, and (d) its associated variance at each wavelength. The orange points are the ones where chirality is invariant in the two datasets. Abscissas correspond to the wavelength in nanometers (nm). Ordinates have no unit, as they correspond to normalized data or variances.
Furthermore, several spectra not initially assigned to any family were identified as belonging to one, consistent with their biological characteristics. For instance, the classification of the DNA sequence TT(GGGT)4, predicted by the workflow to belong to the quadruplex family, was confirmed by NMR (unpublished results, personal communication). Interestingly, of the four spectra that had been manually assigned as R-loops (orange lines in Figure 4a), two (dashed light and dark orange lines in Figure 4a) were rejected from that family because their NCC and NMI product was below the determined threshold. As these two spectra have a similar shape, and as their corresponding sequences are compatible with DNA loops, a new reference spectrum was generated for the DNA-loops family. By running the workflow with the DNA-loops family reference across the entire dataset, we identified an additional spectrum (dashed blue line in Figure 4a), which provided validation of this new family. In summary, running the workflow allowed us to define a new reference spectrum for both the DNA loops (Figure 4b) and the R-loops (Figure 4c) families.

Figure 4. Spectra used for the definition of the DNA loop family. (a) The normalized spectra used to define the family. (b) The spectra of the newly defined DNA loop family. (c) The spectra defined for the R-loop family. Abscissas correspond to the wavelength in nanometers (nm). Ordinates have no unit, as they correspond to normalized data or variances.
Originally designed to determine reference CD spectra, the workflow described here can also predict yet unknown secondary structures. However, it is limited to identifying elementary reference spectra from sequences with a single secondary structure. It cannot be used to determine the percentage of different structures in complex spectra from sequences with multiple secondary structures.
In summary, the workflow introduced here, with its Python code available in the supplementary data, helps identify elementary reference CD spectra for nucleic acids. Currently limited to five families, this number is expected to grow as more nucleic acid spectra are added to public reference datasets. This advancement lays the groundwork for an online tool to determine the percentage of structures in complex CD spectra, similar to existing tools for proteins. The next steps include designing this tool and developing accurate algorithms for deconvoluting complex spectra.
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/qrd.2025.10008.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/qrd.2025.10008.
Data availability statement
The Python script implementing the workflow can be downloaded from https://github.com/Sanofi-Public/CD-spectra-classification. License conditions apply, including limitation to non-commercial uses only.
Acknowledgments
We thank our Sanofi colleagues Marc François-Heude and Frédéric Greco for useful discussions and Jean-Sébastien Bolduc for critically reviewing the manuscript. We are thankful to Synchrotron SOLEIL (France) for attribution of SRCD beamtime (proposals 20201304, 20210819, and 20240053).
Author contribution
K.M. was responsible for designing and coding the workflow, compiling and acquiring spectra, evaluating spectra classification, performing critical results analysis, writing the first draft, and revising the manuscript following authors’ comments. S.V.H. proposed analytical methods implemented in the workflow, performed critical results analysis, and contributed to manuscript revisions. A.G. prepared and managed biological materials, coordinated biological aspects at Sanofi, and participated in scientific discussions. F.W. acquired spectra data, evaluated spectra classification, performed critical results analysis, and contributed to manuscript revisions. V.A. designed nucleic acid sequences for data acquisition, evaluated spectra classification, performed critical results analysis and contributed to manuscript revisions. S.M. designed the workflow, evaluated spectra classification, performed critical results analysis, wrote the first draft, and revised the manuscript following authors’ comments.
Financial support
This work was funded by Sanofi.
Competing interests
K.M., A.G., and S.M. are Sanofi employees and may hold shares and/or stock options in the company. The remaining authors declare no competing interests exist.
Comments
Dear Pr. Norden,
Thank you for your email dated 08-May-2025 and for the opportunity to revise our manuscript. We are grateful to you and the reviewers for the constructive feedback and for considering our work for publication in QRB Discovery.
We have carefully addressed the comments provided by the reviewers:
• Regarding Reviewer 1’s comment on Figure 3a: We have revised the figure to improve color differentiation and overall clarity. Specifically, we adjusted the color palette to ensure that all categories are easily distinguishable, including for readers with color vision deficiencies.
We appreciate Reviewer 2’s positive assessment and recommendation for acceptance.
We hope that the revised version addresses all remaining concerns and meets the standards for publication. Please do not hesitate to let us know if any further modifications are needed.
Thank you again for your consideration.