Introduction
Bulimia nervosa (BN) is an eating disorder (ED) characterized by recurrent binge-eating episodes and compensatory behaviors (e.g., self-induced vomiting, maladaptive exercise) and overvaluation of body shape/weight. The revised fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5-TR; APA, 2022 includes severity specifiers for BN based on compensatory behavior frequency, defined as: mild (1–3 episodes/week), moderate (4–7 episodes/week), severe (8–13 episodes/week), and extreme (≥14 episodes/week). While these severity levels ideally would suggest intervention targets and represent the degree of functional impairment, the DSM-5-TR severity scheme was not empirically validated. Thus, it is unknown: (1) whether compensatory behavior frequency is the most appropriate metric by which to define BN severity and (2) what are the precise levels of compensatory behaviors or a more appropriate metric that most accurately differentiates BN severity groups.
A systematic review and meta-analysis found that compensatory behavior frequency provides some clinical utility as a severity specifier for BN but also found support for alternative severity rating schemes (weight/shape overvaluation and drive for thinness) (Dang, Giles, Fuller-Tyszkiewicz, Kiropoulos, & Krug, Reference Dang, Giles, Fuller-Tyszkiewicz, Kiropoulos and Krug2022). While Dang et al. (Reference Dang, Giles, Fuller-Tyszkiewicz, Kiropoulos and Krug2022) found some support for the DSM-5 BN severity specifiers, their review and meta-analysis focused on ED psychopathology as a validator of the specifiers, and other variables across domains (e.g., general psychopathology, quality of life) may also be important in validating BN severity specifiers. Some other work has examined the validity and utility of the current DSM-5 BN severity specifiers by assessing whether DSM-5-defined severity groups differed on depression, anxiety, quality of life, and physical health. To our knowledge, three studies have compared BN severity groups on depression, with two finding no differences (Gianini et al., Reference Gianini, Roberto, Attia, Walsh, Thomas, Eddy, Grilo, Weigel and Sysko2017; Smith et al., Reference Smith, Ellison, Crosby, Engel, Mitchell, Crow, Peterson, Le Grange and Wonderlich2017) and one finding differences between groups (Grilo, Ivezaj, & White, Reference Grilo, Ivezaj and White2015). Additionally, Smith et al. (Reference Smith, Ellison, Crosby, Engel, Mitchell, Crow, Peterson, Le Grange and Wonderlich2017) found that DSM-5-defined BN severity levels differed on anxiety and quality of life but not physical health. Taken together, the validity of the BN specifiers in their current form is still unclear.
Alternative variables to classify ED severity have been explored, including use of multiple purging methods for BN (Gianini et al., Reference Gianini, Roberto, Attia, Walsh, Thomas, Eddy, Grilo, Weigel and Sysko2017), drive for thinness (Krug et al., Reference Krug, Binh Dang, Granero, Agüera, Sánchez, Riesco, Jimenez‐Murcia, Menchón and Fernandez‐Aranda2021), and shape/weight overvaluation in anorexia nervosa (Billman Miller et al., Reference Billman Miller, Abber, Hamilton, Ortiz, Jacobucci, Essayli, Smith and Forrest2025), binge-eating disorder (Forrest, Jacobucci, & Grilo, Reference Forrest, Jacobucci and Grilo2022), and transdiagnostically (Gianini et al., Reference Gianini, Roberto, Attia, Walsh, Thomas, Eddy, Grilo, Weigel and Sysko2017). Gianini et al. (Reference Gianini, Roberto, Attia, Walsh, Thomas, Eddy, Grilo, Weigel and Sysko2017) explored the number of purging methods, rather than the frequency of purging behaviors, to classify severity. They found that defining severity based on the number of purging methods was more strongly associated with psychopathology than compensatory behavior frequency. Krug et al. (Reference Krug, Binh Dang, Granero, Agüera, Sánchez, Riesco, Jimenez‐Murcia, Menchón and Fernandez‐Aranda2021) found that drive for thinness may be a useful transdiagnostic severity specifier with greater clinical utility than compensatory behavior frequency. Gianini et al. (Reference Gianini, Roberto, Attia, Walsh, Thomas, Eddy, Grilo, Weigel and Sysko2017) explored clinically significant shape/weight overvaluation as a severity grouping, and found that this severity scheme outperformed the DSM-based severity specifiers. Similarly, Jenkins, Luck, Cardy, and Staniford (Reference Jenkins, Luck, Cardy and Staniford2016) found that classification of BN severity based on the DSM-5 indicators (frequency of compensatory behaviors) was more accurate when considered in the presence of shape/weight overvaluation.
Shape/weight overvaluation may be a valid, clinically useful metric of severity for BN. Importantly, shape/weight overvaluation can take on different meanings in different contexts, as it can be considered a symptom of BN or a mechanism specific to enhanced cognitive behavior therapy for EDs (Cooper & Fairburn, Reference Cooper and Fairburn2010). In the context of this study, we refer to shape/weight overvaluation as a theory-agnostic representation of the DSM-5’s BN criterion D (“self-evaluation is unduly influenced by body shape and weight,” APA, 2022). Prior work (Forrest, Jones, Ortiz, & Smith, Reference Forrest, Jones, Ortiz and Smith2018; Gianini et al., Reference Gianini, Roberto, Attia, Walsh, Thomas, Eddy, Grilo, Weigel and Sysko2017; Grilo et al., Reference Grilo, Ivezaj and White2015; Ojserkis, Sysko, Goldfein, & Devlin, Reference Ojserkis, Sysko, Goldfein and Devlin2012) evaluating the utility of shape/weight overvaluation as an alternative severity specifier has relied on a binary scheme, where individuals either do or do not experience clinically significant overvaluation. Given that shape/weight overvaluation is a continuous construct, it is possible that there may be more than one meaningful overvaluation cutpoint that differentiates BN severity. Thus, exploring shape/weight overvaluation beyond a binary operationalization is critical to advance our understanding of this symptom as a potential metric of BN severity.
Structural equation modeling trees (SEM Trees), which empirically determine specific thresholds of continuous variables that best differentiate groups (Brandmaier, von Oertzen, McArdle, & Lindenberger, Reference Brandmaier, von Oertzen, McArdle and Lindenberger2013), may be particularly useful in answering outstanding questions about what symptoms best reflect BN severity and what cutpoint levels are most valid. SEM Trees can accommodate more than one variable on which to determine cutpoints (Brandmaier et al., Reference Brandmaier, von Oertzen, McArdle and Lindenberger2013), allowing for comparison of how different variables contribute to groupings (e.g., compensatory behavior frequency and shape/weight overvaluation). SEM Trees contain confirmatory and exploratory components, where a confirmatory outcome model is based on theory or evidence; an exploratory decision tree uses covariates to split the data into subgroups that best fit the outcome model (Brandmaier et al., Reference Brandmaier, von Oertzen, McArdle and Lindenberger2013). SEM Trees have been used to empirically determine ED severity for patients with binge-eating disorder (Forrest et al., Reference Forrest, Jacobucci and Grilo2022), other specified feeding or eating disorder (Ortiz, Forrest, Kinkel-Ram, Jacobucci, & Smith, Reference Ortiz, Forrest, Kinkel-Ram, Jacobucci and Smith2021), and anorexia nervosa (Billman Miller et al., Reference Billman Miller, Abber, Hamilton, Ortiz, Jacobucci, Essayli, Smith and Forrest2025). These studies modeled a latent ED severity variable comprising ED symptoms and comorbid symptoms (e.g., depression symptoms, anxiety symptoms, number of psychiatric comorbidities) and used SEM Trees to identify whether shape/weight overvaluation was an appropriate (and/or superior to existing DSM-5 severity specifiers, for AN and BED) metric of severity. Past work (Billman Miller et al., Reference Billman Miller, Abber, Hamilton, Ortiz, Jacobucci, Essayli, Smith and Forrest2025; Forrest et al., Reference Forrest, Jacobucci and Grilo2022; Ortiz et al., Reference Ortiz, Forrest, Kinkel-Ram, Jacobucci and Smith2021) has included comorbid symptoms as part of the latent outcome model given high comorbidity between AN, BED, and OSFED and depression and anxiety symptoms. Similarly, there is high comorbidity between these symptoms and BN in both adolescent (Herpertz-Dahlmann, Reference Herpertz-Dahlmann2015) and adult (Bulik, Reference Bulik, Fairburn and Brownell2002; Godart et al., Reference Godart, Perdereau, Rein, Berthoz, Wallier, Jeammet and Flament2007) populations. Across studies, two key findings have been observed: (1) SEM Trees identified multiple severity subgroups defined by increasing levels of (continuously modeled) shape/weight overvaluation (though see Ortiz et al., Reference Ortiz, Forrest, Kinkel-Ram, Jacobucci and Smith2021, where only two severity subgroups were identified) and (2) SEM Tree groupings explained approximately 2 times and 20 times more variance in clinical characteristics (ED and comorbid symptoms) than DSM-5 severity groupings in BED and AN, respectively. No studies to our knowledge have utilized SEM Trees or other data-driven methods to empirically identify severity for BN.
Thus, we had three overarching aims: (1) determine optimal cutpoint values for severity levels in BN using compensatory behavior frequency and/or shape/weight overvaluation; (2) compare whether compensatory behavior frequency or shape/weight overvaluation accounted for more severity cutpoints in BN; and (3) compare the extent to which SEM Tree-defined severity groups, DSM-5 severity groups, and alternative severity groupings (clinically significant shape/weight overvaluation, >1 purging method) differentiate people with BN based on clinical characteristics. We created a latent outcome model of BN severity based on ED, depression, and anxiety symptoms, consistent with prior work. We did not have specific hypotheses for the exact levels of compensatory behaviors and/or shape/weight overvaluation that would differentiate groups, as analyses were exploratory in nature. We expected, based on prior work, that shape/weight overvaluation levels and the use of multiple purging methods would outperform the DSM-5 severity indicators based on compensatory behavior frequency.
Method
Participants
Participants (76% White, 99% women, M age = 24.71, SDage = 9.22; Table 1) receiving treatment for BN in residential (n = 770), partial hospitalization program (PHP; n = 192), and outpatient care (n = 55) were recruited from 2014 to 2021. The residential and PHP centers provided treatment only to women and girls, while all genders were served in the outpatient center. Data were collected from two residential sites, 20 PHP sites, and one outpatient site. Some participants were transferred between levels of care. When participants were present in more than one dataset, data were used for only their first admission. Data were combined across levels of care to increase the range of ED severity in our sample.
Table 1. Gender, race, and age distributions for SEM Tree-derived groups

Our initial sample size was N = 1064. Participants were excluded if: (1) they had missing data on all measures (n = 31); (2) their EDE-Q Global score was <0.5 (n = 10) and they were in residential or PHP treatment, given that this is an indicator of invalid responding at these levels of care (Thompson-Brenner et al., Reference Thompson-Brenner, Boswell, Espel-Huynh, Brooks and Lowe2019); or (3) reported no core diagnostic indicators of BN on the EDE-Q or had missing data for all of these items (n = 6; see Supplemental Material). These exclusions resulted in an analytic sample of n = 1017.
Procedures
Participants completed self-reports assessing cognitive ED, anxiety, and depression symptoms prior to beginning treatment. BN diagnoses were assigned at intake via unstructured clinical interview by a master’s level clinician and confirmed by the staff psychiatrist at admission. Interrater reliability data were not available. All participants provided informed consent for their data to be used for research and procedures were approved by the Pennsylvania State University Institutional Review Board (#00019147) as nonhuman subjects research due to the present analyses being secondary data analysis.
Measures
Outcome model: latent BN severity
Latent BN severity was modeled using cognitive BN symptoms, depression symptoms, and anxiety symptoms, where all factor loadings were constrained to equally represent the latent factor. We tested a model with unequal factor loadings, but estimation problems occurred. Depression and anxiety symptoms were selected as indicators given high comorbidity between these symptoms and BN in both adolescent (Herpertz-Dahlmann, Reference Herpertz-Dahlmann2015) and adult (Bulik, Reference Bulik, Fairburn and Brownell2002; Godart et al., Reference Godart, Perdereau, Rein, Berthoz, Wallier, Jeammet and Flament2007) populations, suggesting that BN is unlikely to occur in isolation without depression or anxiety. Further, research suggests anxiety and depression are related to higher levels of ED symptom severity and poorer outcome (Sander, Moessner, & Bauer, Reference Sander, Moessner and Bauer2021).
Cognitive ED symptoms. We assessed cognitive ED symptoms using 22 items from the EDE-Q (Fairburn & Beglin, Reference Fairburn and Beglin2008). Items are scored on a 0- to 6-point Likert scale and combined into a composite Global score (range: 0–6), with higher scores reflecting more severe symptoms. We calculated Global scores without the overvaluation items to minimize overlap between the variables by which severity may be defined (overvaluation) and the variables on which severity subgroups would be compared (EDE-Q Global score). Internal consistency was excellent across samples (αs = .92–.94).
Depression symptoms. We assessed depression symptoms using the Center for Epidemiological Studies Depression Scale (CESD, used in the residential and PHP samples; (Radloff, Reference Radloff1977) or the Patient Health Questionnaire-9 (PHQ-9, used in the outpatient sample; (Kroenke, Spitzer, & Williams, Reference Kroenke, Spitzer and Williams2001). The CESD comprises 20 items on a 0 (rarely or none of the time) to 3 (most or almost all the time) point Likert scale. Items are summed to a total score (range: 0–60), with higher scores representing more severe depression. The PHQ-9 comprises 9 items on a Likert scale ranging from 0 (not at all) to 3 (nearly every day). Items are summed to a total score (range: 0–27), with higher scores representing more severe depression. Internal consistency was excellent on the CESD (αs = .90). We were unable to calculate internal consistency for the PHQ-9, as itemized data were not available.
Anxiety symptoms. We assessed anxiety symptoms using the Overall Anxiety Severity and Intensity Scale (OASIS, used in the residential and PHP samples; (Campbell-Sills et al., Reference Campbell-Sills, Norman, Craske, Sullivan, Lang, Chavira, Bystritsky, Sherbourne, Roy-Byrne and Stein2009) or the Generalized Anxiety Disorder-7 (GAD-7, used in the outpatient sample; (Spitzer, Kroenke, Williams, & Löwe, Reference Spitzer, Kroenke, Williams and Löwe2006). The OASIS comprises five items on a 0–4 point Likert scale. Items are summed to a total score (range: 0–20), with higher scores representing higher anxiety. The GAD-7 comprises seven items on a Likert scale ranging from 0 to 4. Items are summed to a total score (range: 0–28). Internal consistency was excellent on the OASIS (αs = .86–.87). We were unable to calculate internal consistency for the GAD-7, as itemized data were not available.
Model covariates
We included two covariates (i.e., the variables used to partition data into subgroups) in the model: severity of shape/weight overvaluation and compensatory behavior frequency. Compensatory behavior frequency was defined broadly as self-induced vomiting, laxative or diuretic use, and excessive exercise. Shape/weight overvaluation was assessed using a composite of two strongly correlated EDE-Q items (“Has your weight influenced how you think about (judge) yourself as a person?” and “Has your shape influenced how you think about (judge) yourself as a person?”; r = 0.80, p < .001). Compensatory behavior frequency was the sum of the EDE-Q items assessing frequency of self-induced vomiting, laxative or diuretic use, and excessive exercise. To control for outliers, each behavioral frequency item was capped at 56 prior to summing into the composite variable, as this translates to 2 instances per day for 28 days (i.e., the closest approximation to the DSM-5 definition of extreme BN severity).
Statistical analysis
We z-scored all outcome model indicators so that constructs assessed using different measures (e.g., depression was assessed in the residential and PHP samples using the CESD but in the outpatient sample using the PHQ-9) could be combined into a single score on a common scale. Although cognitive ED symptoms were assessed using the same measure across samples, we also z-scored EDE-Q Global scores (within each level of care), so that all outcome model indicators had a common scale.
SEM Trees. Since there were only three indicators in the confirmatory BN severity outcome model, model fit of the latent severity variable could not be assessed. The exploratory decision tree recursively separated data into subgroups that explained the maximum variance in the outcome model; groups were based on values (splits) of the covariates: shape/weight overvaluation and compensatory behavior frequency. We used a ‘fair’ splitting criterion (Brandmaier et al., Reference Brandmaier, von Oertzen, McArdle and Lindenberger2013), where the sample is randomly divided in two equal parts, to control for the number of response options in the covariates. The outcome model is compared at every possible value of the covariates in the first part, and the value resulting in the largest model fit improvement is selected. That split value is then evaluated in the second part of the sample. A retained split indicates a cutpoint that differentiates severity subgroups (see Figure 1). Given that SEM Trees accommodate multiple covariates, we included both compensatory behavior frequency and shape/weight overvaluation in the same model, as this allows us to directly compare each variable’s contribution to model fit.

Figure 1. Decision tree with splits based on shape/weight overvaluation and compensatory behaviors.
Note. LR = likelihood ratio; resid1 = residual variance of cognitive ED symptoms; resid2 = residual variance of depression, resid3 = residual variance of anxiety. m1 = manifest mean of cognitive ED symptoms, m2 = manifest mean of depression, m3 = manifest mean of anxiety. Manifest means are the means of each variable used in the latent outcome model. Residual variance is unexplained variance in indicators that are not explained by the latent BN severity variable.
SEM forests. SEM Trees are singular trees with inherent variability that could drive splits and therefore bias conclusions regarding which covariate is the superior indicator of BN severity. SEM Forests can mitigate this issue by estimating multiple individual SEM Trees and evaluating within a “forest” of trees which variable is a better overall indicator of severity. We used SEM Forests to estimate 100 individual trees and compared one variable per split (Brandmaier, Prindle, McArdle, & Lindenberger, Reference Brandmaier, Prindle, McArdle and Lindenberger2016). The forest derives an importance parameter for each covariate, indicating the relative strength of shape/weight overvaluation versus compensatory behavior frequency in improving model fit.
Planned comparisons. We used ANOVAs with orthogonal, planned contrasts comparing lower SEM Tree severity groups to higher SEM Tree severity groups (e.g., Group 1 versus Groups 2–5) on clinical characteristics. Partial eta squared was used to indicate overall effect sizes. Cohen’s d was calculated to index contrast effect sizes.
In addition to the SEM Forests described above to test whether compensatory behavior frequency or shape/weight overvaluation was the superior metric of severity within our SEM Tree model, we assessed whether SEM Tree groups outperformed existing severity indicators by comparing clinical characteristics using three other severity subgrouping schemes: (1) DSM-5 levels; (2) clinically significant shape/weight overvaluation (either item rated as ≥4); and (3) single versus multiple purging methods. These comparisons were made using ANOVAs with orthogonal, planned contrasts, and t tests. Effect sizes were calculated for all comparisons. For each severity subgrouping scheme, we did not compare clinical characteristics that were used to define severity subgroups (e.g., did not compare compensatory behavior frequency across DSM-5 groups given that DSM-5 groups are defined by compensatory behavior frequency). Effect sizes were also descriptively compared for each severity classification scheme.
Normality, assumptions, and missing data. We inspected normality for all variables. Levene’s test was used to determine whether equal variances could be assumed. We report Welch-corrected ANOVAs and contrasts where indicated.
The maximum amount of missingness for all measures except the anxiety variable was 5.0%. The anxiety variable was missing 35% of responses. Much of the missingness (97%) is a result of the PHP and residential treatment facilities not adding in the OASIS to the intake battery until approximately 2 years after they began administering the EDE-Q and CESD. Participants admitted before versus after the inclusion of the OASIS did not differ on EDE-Q global scores or shape/weight overvaluation, although they did report higher depression and greater compensatory behavior frequency (Supplemental Table 2). However, all means were within 1 SD of one another, and effect sizes were small or small–moderate. Missing data for the outcome model were handled with full information maximum likelihood. Missing data for the comparisons among severity subgroups were handled using pairwise deletion.
Results
SEM tree and SEM forest. Four splits were identified, creating five severity subgroups (Figure 1). Group 1 was people with overvaluation <1.25 (n = 34). Group 2 was people with overvaluation 1.25–3.74 (n = 109). Group 3 was people with overvaluation 3.75–4.74 (n = 111). Group 4 was people with overvaluation 4.75–5.74 (n = 177). Group 5 was people with overvaluation ≥5.75 (n = 582). No splits occurred at any values of compensatory behaviors. SEM Forests found overvaluation resulted in 263.4 units of improvement in model fit (−2 log likelihood), whereas compensatory behavior frequency resulted in only 23.5 units of improvement in model fit.
ANOVAs showed SEM Tree groups differed on ED symptoms, depression, anxiety, and binge-eating frequency with medium-to-large effect sizes (Table 2; all ps < .01). Planned contrasts (Table 3) indicated that cognitive ED symptoms and depression increased as the level of overvaluation increased (all ps < .001). Exceptions to this pattern were found for anxiety and binge-eating frequency. Although anxiety was different for SEM Tree contrasts 1, 2, and 3, where lower severity groups had lower anxiety than higher severity groups (all ps < .001), anxiety did not differ between SEM Tree Groups 4 versus 5 (p = .34). No contrasts were significant for binge-eating frequency (all ps > .05).
Table 2. Clinical characteristics compared among the structural equation model tree-derived groups

a Unequal variances across groups;
b Presented for descriptive purposes. We did not statistically compare overvaluation across groups, as this was the metric by which groups were empirically defined.
For the level of care, percentages indicate the percent within each level of care within each severity group (i.e., rows sum to 100). n differs for anxiety as a result of missing data. Binge-eating frequency = weekly binge-eating frequency; ED = eating disorder.
Table 3. Structural equation model tree-derived groups’ contrast results

Note. Binge-eating frequency = weekly binge-eating frequency; ED = eating disorder.
a Significant heterogeneity of variance and thus Welch-corrected contrasts are reported. Comparisons are not presented for overvaluation, as overvaluation was the primary metric by which SEM Tree groups were defined.
DSM-5 severity specifiers. ANOVAs indicated DSM-5 groups differed on overvaluation, cognitive ED symptoms, depression, and binge-eating frequency (all ps < .001) with medium effect sizes but did not differ on anxiety (Table 4; p = .19). As expected, contrasts (Table 5) indicated all DSM-5 groups differed on binge-eating frequency, such that increased DSM-5 severity grouping was associated with greater binge-eating frequency (all ps < .001). Moderate through extreme groups had higher shape/weight overvaluation, cognitive ED symptoms, and depression than the mild group (all ps < .001) but similar levels of anxiety (p = .15). Severe and extreme groups had higher shape/weight overvaluation, cognitive ED symptoms, and depression than the moderate group (all ps < .05) but similar levels of anxiety (p = .55). Severe and extreme groups did not differ on any symptoms (all ps > .05).
Table 4. Group comparisons of demographic and clinical characteristics for Diagnostic and Statistical Manual for Mental Disorders-5-specified severity indicators.

Note. Comp. bx frequency = weekly compensatory behavior frequency; overvaluation = shape and weight overvaluation; binge eating = weekly binge-eating frequency; ED = eating disorder.
a Presented for descriptive purposes. We did not statistically compare compensatory behavior frequency across groups, as it was the metric by which DSM-5 groups were defined
b Significant heterogeneity of variance and thus Welch-corrected contrasts are reported.
Table 5. Contrast results for Diagnostic and Statistical Manual for Mental Disorders-5-specified severity indicators.

Note. Comp. bx frequency = weekly compensatory behavior frequency; overvaluation = shape and weight overvaluation; binge eating = weekly binge-eating frequency; ED = eating disorder.
a Significant heterogeneity of variance and thus Welch-corrected contrasts are reported.
Clinical overvaluation. People with clinical overvaluation had higher compensatory behavior frequency, cognitive ED symptoms, depression, anxiety, and binge-eating frequency compared to those with nonclinical overvaluation (all ps < .01), with medium-to-large effect sizes (Table 6).
Table 6. Comparisons of clinical characteristics based on shape/weight overvaluation clinical threshold of 4.

Note. binge eating = weekly binge-eating frequency; ED = eating disorder.
a Presented for descriptive purposes. We did not statistically compare shape/weight overvaluation across groups, as it was the metric by which these groups were defined.
b Significant heterogeneity of variance and thus Welch-corrected contrasts are reported.
Single versus multiple purging. People who used multiple purging methods had higher compensatory behavior frequency, overvaluation, cognitive ED symptoms, and depression (all ps < .001) with small-to-medium effect sizes but did not differ on anxiety (p = .15) or binge-eating frequency (Table 7; p = .44).
Table 7. Comparisons of clinical characteristics based on single versus multiple purging methods.

Note. binge eating = weekly binge-eating frequency; ED = eating disorder.
a Presented for descriptive purposes. We did not statistically compare shape/weight overvaluation across groups, as it was the metric by which these groups were defined.
b Significant heterogeneity of variance and thus Welch-corrected contrasts are reported.
Variance in clinical characteristics explained by severity schemes. 1–51% (M = 21.8%) of the variance in clinical characteristics was explained by SEM Tree-derived subgroups, 2–9% (M = 5.2%) by DSM-5 subgroups, 1–39% (M = 13.4%) by clinical overvaluation, and 0.1–12% (M = 3.4%) by single versus multiple purging methods. Thus, SEM Tree groups explained 4.19 times the variance explained by the DSM-5 groups, 1.63 times the variance explained by the clinical overvaluation groups, and 6.41 times the variance explained by the purging method groups.
Discussion
This study identified the compensatory behavior frequencies and/or shape/weight overvaluation levels that differentiated BN severity, determined whether compensatory behavior frequency or shape/weight overvaluation was the superior indicator of BN severity, and compared how SEM Tree-defined severity groups differentiated intensity of several clinical characteristics compared to the DSM-5 severity grouping and alternative severity schemes. The SEM Tree identified five levels of severity, all based on shape/weight overvaluation. Since the overvaluation composite was an average of two items, all individual participant composite scores were necessarily between 0 and 6 with 0.5-level increments possible. Thus, since not all cutpoints are possible values for individual participants, we round these numbers from this point forward to facilitate the practical implications of these findings. The first split was between people with overvaluation composite scores of 1 or less. The second split was between people with overvaluation composite scores between 1.5 and 3.5. The third split was between people with overvaluation composite scores between 4 and 4.5. The fourth split was people with overvaluation composite scores above 5. The highest-severity group was the largest in size (57%), which could mean that this group is the most common clinical presentation or may reflect that most participants were receiving treatment at a higher level of care. Within the SEM Forest, shape/weight overvaluation contributed to more improvement in model fit (263.4 units) than compensatory behavior frequency (23.5 units). When examining group differences, SEM Tree-defined severity groups explained 4.2 times the variance in clinical characteristics explained by DSM-5-defined groups, 1.6 times the variance explained by clinical overvaluation groups, and 6.4 times the variance explained by purging method groups. Altogether, findings indicate that relative to other currently proposed BN severity specification schemes, shape/weight overvaluation – when modeled beyond a binary operationalization – is the strongest marker of BN severity. While further investigation of the identified overvaluation cutpoints is warranted, our findings call for reconsideration of the current DSM-5 classification scheme.
Our results are consistent with prior work finding limited support for the DSM-5’s BN severity specifiers (Gianini et al., Reference Gianini, Roberto, Attia, Walsh, Thomas, Eddy, Grilo, Weigel and Sysko2017; Gorrell et al., Reference Gorrell, Hail, Kinasz, Bruett, Forsberg, Delucchi, Lock and Le Grange2019; Grilo et al., Reference Grilo, Ivezaj and White2015) and with broader findings in the field that data-driven judgments outperform clinical or expert judgments (Dawes, Faust, & Meehl, Reference Dawes, Faust, Meehl, Keren and Lewis1993; Meehl, Reference Meehl1954). Although DSM-5-defined groups differed on overvaluation, cognitive ED symptoms, depression, and binge-eating frequency, the pattern of results was inconsistent and nonlinear for almost all clinical variables. Moreover, effect sizes for DSM-5 groups were quite small, explaining a maximum of only 9% of the variance in clinical characteristics.
On balance, one reason why SEM Tree groupings and clinically significant overvaluation explained more variance in clinical characteristics than DSM-5 groupings or single versus multiple purging methods could be measurement bias, where items/measures assessing cognitions correlate more strongly with other items/measures of cognitions, as compared to items/measures assessing behaviors. While measurement bias is a possibility, such measurement bias would have equally impacted both the SEM Tree groupings and the clinically significant overvaluation groupings. Thus, while we cannot rule out or statistically control for measurement bias, even if measurement bias is held constant, SEM Tree groupings still emerge as the stronger severity specification scheme.
Behavioral symptoms, like compensatory behaviors, have long been thought of as hallmark ED symptoms. Thus, the facts that (1) shape/weight overvaluation was superior to compensatory behavior frequency in determining SEM Tree groupings and (2) SEM Tree groups outperformed DSM-5 severity groupings may lead some to wonder whether shape/weight overvaluation is more important to BN than compensatory behavior frequency, and whether clinicians should decrease focus on reducing compensatory behaviors. We do not draw this conclusion. Although emerging data, including our findings, suggests cognitive symptoms are more central to EDs and outperform behavioral symptoms in defining ED severity (Billman Miller et al., Reference Billman Miller, Abber, Hamilton, Ortiz, Jacobucci, Essayli, Smith and Forrest2025; Forrest et al., Reference Forrest, Jacobucci and Grilo2022), we do not suggest neglecting treatment of behavioral symptoms. Behavioral symptoms like compensatory behaviors can have severe medical consequences (Casiero & Frishman, Reference Casiero and Frishman2006; Mitchell, Seim, Colon, & Pomeroy, Reference Mitchell, Seim, Colon and Pomeroy1987) and are considered to maintain BN. Instead, we suggest that we may need to expand our conceptualization of “hallmark ED symptoms” beyond behavioral symptoms, and consider measuring and targeting shape/weight overvaluation early in treatment in addition to compensatory behaviors.
Although SEM Trees outperformed all other severity specification schemes, the pattern and extent to which SEM Trees captured group differences varied across clinical characteristics. SEM Tree groups differed on cognitive ED and depression symptoms, such that cognitive ED and depression symptoms increased as the level of shape/weight overvaluation increased. Anxiety symptoms followed a similar trend, although the difference between SEM Tree groups 4 and 5 was not statistically significant. Other research has also found nonlinear associations between anxiety and cognitive ED symptoms. For instance, in a sample of people with anorexia nervosa, Haynos et al. (Reference Haynos, Roberto and Attia2015) found that anxiety and cognitive ED symptoms were associated only among people with low emotion regulation difficulties. We found that binge eating frequency was similar across all SEM Tree groups, and no planned contrasts for this variable were statistically significant. Aligning with our findings, prior work in binge-eating disorder has found that binge-eating frequency is not a strong indicator of ED severity (Forrest, Smith, & Swanson, Reference Forrest, Smith and Swanson2017; Grilo et al., Reference Grilo, Ivezaj and White2015). Similar levels of binge eating across SEM Tree groups could be a byproduct of binge-eating frequency not being a strong correlate of ED severity.
Study strengths include a large sample comprising both adolescents and adults, strengthening the generalizability of these findings across the lifespan. Study limitations are as follows. First, all data reflect patients’ self-reported symptoms at a single timepoint. The lack of follow-up data means that we were unable to assess whether our SEM Tree groupings have predictive validity. Second, while the variables used as indicators for the latent BN severity model are evidence-based indices of severity, other unavailable variables (e.g., medical complications; Forney et al., Reference Forney, Buchman-Schmitt, Keel and Frank2016) are also important to consider when defining severity. Third, different measures of anxiety and depression were used across levels of care; we z-scored these measures to account for this.
Fourth, self-report measures like those used to define groups have several notable limitations. Participants could provide untruthful responses, withhold information, or be unaware of their experiences. This leads some to suggest that non-self-report metrics, such as implicit measures or physiological metrics, may be superior metrics of ED severity. However, we disagree with this conclusion, for several reasons. (1) Implicit measures have long been seen as circumventing the potential for self-report measures to be answered untruthfully. However, recent work actually suggests that self-report measures have better reliability and predictive utility than implicit measures (Corneille & Gawronski, Reference Corneille and Gawronski2024). (2) Our perception of the literature is that there is not consensus that physiological metrics are superior to self-report metrics when determining eating disorder severity. For example, weight is a relatively “physiological” metric that has for many decades been used (either formally or informally) to signify anorexia nervosa severity. However, empirical examinations of BMI as a severity metric in anorexia nervosa yield limited support for this severity operationalization (Billman Miller et al., Reference Billman Miller, Abber, Hamilton, Ortiz, Jacobucci, Essayli, Smith and Forrest2025): there are many quite ill individuals with anorexia nervosa with very low BMI, yet also many quite ill individuals with anorexia nervosa with not extremely low BMI (though still underweight). (3) Assessing physiological metrics is not within the scope of practice for the vast majority of healthcare professionals who treat people with EDs (social workers, licensed professional counselors, and psychologists). (4) In many (though not all) cases, affected individuls’ reports of their experiences provide valuable information. In fact, recent work has called for the ED field to incorporate self-report measures (e.g., shape/weight overvaluation and fear of weight gain) into models to inform diagnosis and potential DSM revisions (Hagan & Christensen Pacella, Reference Hagan and Christensen Pacella2024). Importantly, self-report measures are accessible and can be administered in a variety of settings, which would allow for our SEM Tree-derived severity groupings (or other schemes that may be developed or which prove to be superior) to be widely adopted.
There are also multiple sample-specific limitations. First, the outpatient sample was much smaller than residential and PHP samples, skewing our data toward those with higher severity. Second, diagnoses were assigned through unstructured clinical interviews that are specific to the clinics that provided data, and interrater reliability data are unavailable. Third, the sample was mostly White cisgender women, and data to capture socioeconomic status and education level were unfortunately unavailable. Limited sample diversity is of particular concern given evidence of bias in machine learning algorithms (Huang, Galal, Etemadi, & Vaidyanathan, Reference Huang, Galal, Etemadi and Vaidyanathan2022). Furthermore, cognitive ED symptoms may manifest differently in males and minoritized racial and ethnic groups (Bucchianeri et al., Reference Bucchianeri, Fernandes, Loth, Hannan, Eisenberg and Neumark-Sztainer2016), highlighting the need to validate our findings in other samples. Finally, given our use of a treatment-seeking sample, we must recognize differences between people with EDs who do versus do not seek treatment (Forrest et al., Reference Forrest, Smith and Swanson2017). Results therefore may not be representative of all people with EDs. We must also recognize the systemic barriers that exist for many people in seeking and receiving ED treatment. These barriers disproportionately impact people belonging to marginalized groups. The exclusion of marginalized groups from ED research and treatment is a significant challenge and limitation for our field (Egbert, Hunt, Williams, Burke, & Mathis, Reference Egbert, Hunt, Williams, Burke and Mathis2022; Goel et al., Reference Goel, Jennings Mathis, Egbert, Petterway, Breithaupt, Eddy, Franko and Graham2022), where much work is needed to advance health equity.
Next steps for evaluating the utility of our SEM Tree groupings include replicating these results, comparing SEM Tree groups on relevant physiological variables, such as electrolyte imbalances (Mitchell et al., Reference Mitchell, Seim, Colon and Pomeroy1987) and cardiovascular consequences associated with purging (Casiero & Frishman, Reference Casiero and Frishman2006), and assessing predictive validity. Additional research should test whether other variables or combinations of variables result in severity schemes with better validity while maximizing parsimony and clinical utility.
In sum, this study empirically defined BN severity and tested whether the empirically defined severity specification scheme outperformed existing BN severity classification schemes. We found five severity subgroups, all defined by increasing levels of shape/weight overvaluation. Our empirically defined classification scheme explained more variance in clinical characteristics than all other BN severity specification schemes, including the current DSM-5 severity groupings. Findings suggest reconsideration of the current DSM-5 BN severity definitions, and that (continuously modeled) overvaluation warrants consideration as a primary metric by which to define BN severity. As the ultimate goal of severity specifiers is to suggest targets for intervention, it is critical for future research to test the validity of these cutpoints in predicting BN course and treatment outcomes.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S0033291725100597.
Acknowledgments
We thank The Renfrew Center for the use of these data.
Author contribution
LNF and SNO conceptualized the study. LNF and RCJ were responsible for data curation and conducted formal analyses. SRA, LNF and MGBM wrote the initial draft of the manuscript. LNF, SRA, MGBM, AH, SNO, RCJ, JHE, TEJ, and ARS reviewed and edited the manuscript. LNF provided supervision.
Funding statement
This project was supported by the National Center for Advancing Translational Sciences (KL2 TR002015 and UL1 TR002014; JHE and LNF) and the National Institute of Minority Health and Health Disparities (K08MD019314; LNF). The content described is the responsibility of the authors and does not necessarily represent the official views of the NIH.
Competing interests
The authors have no conflicts of interest to disclose.
Ethical standard
The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008.