
Chapter 1 - Introduction

Published online by Cambridge University Press:  13 May 2021

Craig S. Wells
Affiliation:
University of Massachusetts, Amherst


What Is Measurement Invariance?

The concept underlying measurement invariance is often introduced using a metaphoric example via physical measurements such as length or weight (Millsap, 2011). Suppose I developed an instrument to estimate the perimeter of any object. My instrument is invariant if it produces the same estimate of the object’s perimeter, regardless of the object’s shape. For example, if my instrument provides the same estimate of the perimeter for a circle and a rectangle that have the same true perimeter, then it is invariant. However, if for a circle and a rectangle of the same true perimeter my measure systematically overestimates the perimeters of rectangles, then my measure is not invariant across objects. The object’s shape should be an irrelevant factor in that my instrument is expected to provide an accurate estimate of the perimeter, regardless of the object’s shape. However, when we have a lack of measurement invariance, the estimated perimeter provided by my instrument is influenced not only by the true perimeter but also by the object’s shape. When we lack measurement invariance, irrelevant factors systematically influence the estimates our instruments are designed to produce.

We can apply the concept of measurement invariance from physical variables to variables in the social sciences. To do so, let’s suppose I have a constructed-response item, scored 0 to 10, that measures Grade 8 math proficiency. For the item to be invariant, the expected scores for students with the same math proficiency level should be equal, regardless of other variables such as country membership. However, if, for example, Korean students with the same math proficiency level as American students have higher expected scores than Americans, then the item lacks measurement invariance. In this case, an irrelevant factor (i.e., country membership) plays a role in estimating item performance beyond math proficiency. When using my non-invariant instrument to estimate the perimeter of an object, I need the estimate from my instrument, as well as the shape of the object, to provide an accurate estimate. The same is true for the non-invariant math item. To estimate accurately a student’s math proficiency, I would need their response on the item and their country membership. For an invariant math item, however, I would only need their item response.

While the use of physical measurements can be useful for introducing the concept of measurement invariance, there are two important differences when extending the idea to constructs in the social sciences, such as math proficiency or depression. First, the variables we measure in the social sciences are latent and cannot be directly observed. Instead, we make inferences from our observations that are often based on responses to stimuli such as multiple-choice, Likert-type, or constructed-response items. As a result, we must deal with unreliability, which makes it more difficult to determine whether our measures (or items) are invariant. Second, in the physical world we can obtain a gold standard that provides very accurate measurements. The gold standard can be used to match object shapes based on their true perimeter, which then allows us to compare the estimates produced by my instrument between different object shapes of the same perimeter. Unfortunately, there are no gold standards in the social sciences for the latent variables we are measuring. Latent variables that are used to match students are flawed to a certain degree, which, again, makes it difficult to assess measurement invariance.

Measurement invariance in the social sciences essentially indicates that a measure (or its items) is behaving in the same manner for people from different groups. To assess measurement invariance, we compare the performance on the item or set of items between the groups while matching on the proficiency level of the latent variable. While the idea of the items behaving in the same way between groups is useful for conveying the essence of measurement invariance, it is too simple to provide an accurate technical definition to understand the statistical approaches for examining measurement invariance. To fully understand what I mean by an item being invariant across groups within a population, I will begin by defining the functional relationship between the latent variable being measured, which I will denote as θ, and item performance, denoted Y. The general notation for a functional relationship can be expressed as f(Y|θ), which indicates that the response to the item or set of items is a function of the latent variable. For example, if an item is scored dichotomously (Y = 0 for an incorrect response, Y = 1 for a correct response), then f(Y|θ) refers to the probability of correctly answering the item given an examinee’s level on the latent variable, and can be written as P(Y = 1|θ). For an item in which measurement invariance is satisfied, the functional relationship is the same in both groups; that is,

P(Y = 1 | θ, G = g1) = P(Y = 1 | θ, G = g2).    (1.1)

G refers to group membership, with g1 and g2 representing two separate groups (e.g., Korea and America). Another way of expressing measurement invariance is that group membership does not provide any additional information about the item performance above and beyond θ (Millsap, Reference Millsap2011). In other words,

P(Y = 1 | θ, G) = P(Y = 1 | θ).    (1.2)
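To make the definition concrete, here is a small numerical sketch of Equations (1.1) and (1.2) under a two-parameter logistic IRT model (introduced in Chapter 3). The item parameters and θ values are made up for illustration; the point is simply that when both groups share the same functional relationship, the conditional probabilities match at every θ.

```python
import numpy as np

def p_correct(theta, a, b):
    """Two-parameter logistic item response function: P(Y = 1 | theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item parameters; identical in both groups, so the item is invariant.
a_g1, b_g1 = 1.2, 0.3   # Group 1
a_g2, b_g2 = 1.2, 0.3   # Group 2

theta = np.linspace(-3, 3, 7)
p_g1 = p_correct(theta, a_g1, b_g1)
p_g2 = p_correct(theta, a_g2, b_g2)

# Equation (1.1): the conditional probabilities are equal at every theta,
# so group membership adds nothing beyond theta (Equation 1.2).
print(np.allclose(p_g1, p_g2))   # True
```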

To illustrate the idea of measurement invariance graphically, Figure 1.1 provides an example of the functional relationships for two groups on a dichotomously scored item that is invariant. The horizontal axis represents the proficiency level on the latent variable – in this book, I will refer to the level on the latent variable as proficiency. The vertical axis provides the probability of a correct response. Because the item is invariant, the functional relationships for both groups are identical (i.e., the probability of a correct response given θ is identical in both groups). The proficiency distributions for each group are shown underneath the horizontal axis. We can see that Group 1 has a higher proficiency distribution than Group 2. The difference in proficiency distributions between the groups highlights the idea that matching on proficiency is an important aspect of the definition and assessment of measurement invariance. If we do not control for differences on θ between the groups, then differences in item performance may be due to true differences on the latent variable, not necessarily a lack of measurement invariance. The difference between latent variable distributions is referred to as impact. For example, because Group 1 has a higher mean on the θ distribution than Group 2, Group 1 would, on average, have performed better on the item than Group 2, even if the functional relationship were identical, as shown in Figure 1.1. As a result, the proportion of examinees in Group 1 who answered the item correctly would have been higher compared to Group 2. However, once we control for differences in proficiency by conditioning on θ, item performance is identical. The fact that we want to control for differences in the latent variable before we compare item performance highlights the idea that we are not willing to assume the groups have the same θ distributions when assessing measurement invariance.

Figure 1.1 An example of functional relationships for two groups on a dichotomously scored item that is invariant.

Figure 1.2 illustrates an item that lacks measurement invariance. In this case, the probability of a correct response conditioned on θ is higher for Group 1, indicating that the item is relatively easier for Group 1. In other words, Group 1 examinees of the same proficiency level as examinees from Group 2 have a higher probability of answering the item correctly. When measurement invariance does not hold, as shown in Figure 1.2, then the functional relationships for Groups 1 and 2 are not the same (i.e., f(Y | θ, G = g1) ≠ f(Y | θ, G = g2)). Therefore, to explain item performance we need proficiency and group membership. In the case of non-invariance, the item is functioning differentially between the groups; in other words, the item is exhibiting differential item functioning (DIF). In this book, I will refer to a lack of measurement invariance as DIF. In fact, many of the statistical techniques used to assess measurement invariance are traditionally referred to as DIF methods.

Figure 1.2 An example of functional relationships for two groups on a dichotomously scored item that lacks invariance.
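The distinction between impact and DIF can also be illustrated with a small simulation. The sketch below uses made-up proficiency distributions and item parameters: the groups differ in mean θ, but the item is invariant, so the unconditional proportions correct differ (impact) while the probabilities conditioned on θ do not (no DIF).

```python
import numpy as np

rng = np.random.default_rng(1)

def p_correct(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical proficiency distributions: Group 1 is higher on average (impact).
theta_g1 = rng.normal(0.5, 1.0, 100_000)
theta_g2 = rng.normal(-0.5, 1.0, 100_000)

# Invariant item: the same parameters generate responses in both groups.
a, b = 1.0, 0.0
y_g1 = rng.binomial(1, p_correct(theta_g1, a, b))
y_g2 = rng.binomial(1, p_correct(theta_g2, a, b))

# Unconditional proportions correct differ because of impact, not DIF.
print(y_g1.mean(), y_g2.mean())        # roughly 0.60 vs 0.40

# Conditioning on theta removes the difference: compare examinees near theta = 0.
near_zero_g1 = y_g1[np.abs(theta_g1) < 0.1].mean()
near_zero_g2 = y_g2[np.abs(theta_g2) < 0.1].mean()
print(near_zero_g1, near_zero_g2)      # both close to 0.50
```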

The concept of measurement invariance can be applied to polytomously scored items; that is, items that have more than two score points (e.g., partial-credit or Likert-type items). For a polytomous item, Y could refer to the probability of responding in a particular category, or it could refer to the expected score on the item. For example, Figure 1.3 illustrates an invariant (top plot) and non-invariant (bottom plot) functional relationship for a polytomous item with five score categories. The vertical axis ranges from 0 to 4 and represents the expected score on the polytomous item conditioned on θ (i.e., E(Y | θ)). The expected item scores conditioned on proficiency are identical when invariance is satisfied but differ when the property of invariance is not satisfied.

Figure 1.3 An illustration of an invariant and non-invariant polytomous item.
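For polytomous items, the functional relationship being compared is the expected score E(Y | θ). As an illustration, the sketch below computes expected scores for a five-category item under a graded response model with made-up boundary parameters; shifting the boundaries for one group produces the kind of non-invariance shown in the bottom plot of Figure 1.3.

```python
import numpy as np

def grm_expected_score(theta, a, thresholds):
    """Expected score E(Y | theta) for a graded response model item.

    thresholds are ordered category boundary parameters; four boundaries
    imply five score categories (0-4).
    """
    theta = np.atleast_1d(theta)
    # Cumulative probabilities P(Y >= k | theta), bracketed by 1 and 0.
    p_star = [np.ones_like(theta, dtype=float)]
    for b_k in thresholds:
        p_star.append(1.0 / (1.0 + np.exp(-a * (theta - b_k))))
    p_star.append(np.zeros_like(theta, dtype=float))
    # P(Y = k | theta) = P(Y >= k) - P(Y >= k + 1); expected score sums k * P(Y = k).
    return sum(k * (p_star[k] - p_star[k + 1]) for k in range(len(thresholds) + 1))

# Hypothetical boundary parameters for the two groups.
thresholds_g1 = [-1.5, -0.5, 0.5, 1.5]
thresholds_g2 = [-1.0, 0.0, 1.0, 2.0]   # shifted: the item is harder for Group 2

theta = np.linspace(-3, 3, 5)
print(grm_expected_score(theta, a=1.2, thresholds=thresholds_g1))
print(grm_expected_score(theta, a=1.2, thresholds=thresholds_g2))  # lower at each theta
```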

Measurement invariance can also be extended to compare performance on a subset of items from a test (e.g., items that represent a content domain). In this case, the functional relationship looks a lot like a polytomous item in that we are comparing the expected score conditioned on the latent variable. When a scale based on a subset of items from a test lacks invariance, we often refer to it as differential bundle functioning (DBF). A special case of DBF is when we examine the performance of all items on a test. In this case, we are examining the invariance at the test score level. When the invariance is violated at the test score level, we refer to it as differential test functioning (DTF).

Why Should We Assess Measurement Invariance?

There are two basic reasons why we should care about whether a test and its items are invariant across groups in a population. The first reason pertains to test validity in that the presence of DIF can impede test score interpretations and uses of the test. The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014) describe five sources of validity evidence: content, response processes, internal structure, relations to other variables, and consequences of testing. Providing evidence to support measurement invariance is one aspect of internal structure. The presence of DIF is an indication that there may be a construct-irrelevant factor (or factors) influencing item performance. The consequences of having items that lack invariance in a test can be severe in some cases. For example, a lack of invariance at the item level can manifest at the test score level, leading to unfair comparisons of examinees from different groups. If the DIF is large enough, examinees may be placed into the wrong performance category (imagine how disheartening it would be, after working diligently to build skills in, say, math, to be placed into a performance category below your expectation because of something other than math proficiency). A lack of invariance is not just a validity concern for large-scale assessments but for any test in which a decision is being made using a test score, such as remediation plans for struggling students in schools, determining whether an intervention is effective for a student, assigning grades or performance descriptors to students’ report cards, etc. The examples I have discussed so far have pertained to educational tests. However, the importance of measurement invariance also applies to noncognitive tests such as psychological inventories, attitudinal measures, and observational measures. In fact, it is important to establish measurement invariance for any measure prior to making any group comparisons using its results. Essentially, any time we are planning on using a score from an instrument, we should collect evidence of measurement invariance so that we can be confident that no construct-irrelevant factor is playing a meaningful role in our interpretations and uses.

In addition to the direct effect that DIF can have on test score interpretation and use, it can also indirectly influence validity through its deleterious effect on measurement processes. For example, DIF can disrupt a scale score via its negative effect on score equating or scaling when using item response theory (IRT). A common goal in many testing programs is to establish a stable scale over time with the goal of measuring improvement (e.g., the proportion of proficient students within Grade 8 math increases over consecutive years) and growth (each student demonstrates improved proficiency over grades). Tests contain items that are common between testing occasions (e.g., administration years) that are used to link scales so that the scales have the same meaning. If some of the common items contain DIF, then the equating or scaling can be corrupted, which results in an unstable scale. This type of DIF, where the groups are defined by testing occasion, is referred to as item parameter drift in that some of the items become easier or harder over time after controlling for proficiency differences (Goldstein, 1983). A consequence of item parameter drift is that inferences drawn from test scores may be inaccurate (e.g., examinees may be placed into the wrong performance categories).

A second purpose for assessing measurement invariance is when we have substantive research questions pertaining to how populations may differ on a latent variable. For example, suppose we want to compare geographic regions on a math test. In addition to examining mean differences between regions, assessing measurement invariance could provide useful information about how the items are functioning across the regions. We could find that certain domains of items are relatively harder for a particular region, suggesting that perhaps that group did not have the same opportunity to learn the content. Assessing measurement invariance in this context could also be useful for psychological latent variables. For instance, we may be interested in comparing gender groups on a measure of aggressiveness. Items that are flagged as DIF may provide insight into differences between the groups.

Forms of DIF

It is helpful to have nomenclature to classify the types of non-invariance. There are two basic forms of DIF: uniform and nonuniform (Mellenbergh, 1982). Uniform DIF occurs when the functional relationship differs between the groups consistently or uniformly across the proficiency scale. The plot shown in Figure 1.2 provides an example of uniform DIF. In this case, the probability of a correct response for Group 1 is higher compared to Group 2 throughout the proficiency scale. At the item level, the difference in the functional relationships for uniform DIF is defined only by the item difficulty, whereas the item discrimination is the same in both groups. As I will describe in Chapter 3, where I address IRT, the item discrimination is related to the slope of the functional relationship curve. In uniform DIF, the curve shifts to the right or left for one of the groups, while the slope remains the same.

Nonuniform DIF occurs when the lack of invariance is due to a difference in discrimination between the groups, regardless of whether the difficulty also differs. Whereas for uniform DIF the item can only be harder or easier for one of the groups, nonuniform DIF can take on many forms. Figure 1.4 provides two examples of nonuniform DIF. In the top plot, the DIF is defined only by the difference in discrimination between the two groups; in this case, the curve is flatter for Group 2, indicating that the item is less discriminating than in Group 1. The difference between Groups 1 and 2 in answering the item correctly depends on the θ value; for lower θ values, the item is relatively easier for Group 2, whereas for higher θ values, it is relatively harder for Group 2. The bottom plot in Figure 1.4 provides another example of nonuniform DIF, but in this case the item differs with respect to discrimination and difficulty such that the item is less discriminating and more difficult in Group 2. When testing for DIF, our goal is often not only to detect DIF but also to describe the nature or form of the DIF.

Figure 1.4 Two examples of nonuniform DIF.
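In IRT terms (see Chapter 3), the two forms can be expressed directly through the item parameters. The sketch below uses hypothetical two-parameter logistic values: in the uniform case only the difficulty differs, so the group difference in P(Y = 1 | θ) has the same sign across the whole θ scale; in the nonuniform case the discriminations differ and the sign of the difference changes across θ.

```python
import numpy as np

def p_correct(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 13)

# Uniform DIF (hypothetical): same discrimination, the item is harder for Group 2.
diff_uniform = p_correct(theta, 1.0, 0.0) - p_correct(theta, 1.0, 0.5)
print(np.all(diff_uniform > 0))   # True: Group 1 is favored at every theta

# Nonuniform DIF (hypothetical): Group 2's curve is flatter (lower discrimination).
diff_nonuniform = p_correct(theta, 1.5, 0.0) - p_correct(theta, 0.6, 0.0)
print(diff_nonuniform[0] < 0, diff_nonuniform[-1] > 0)
# easier for Group 2 at low theta, harder for Group 2 at high theta
```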

Another important factor to consider when describing DIF is whether the set of DIF items is consistently harder or easier for one of the groups. When the DIF is consistent across items (e.g., the DIF items are all harder in one of the groups), then it is referred to as unidirectional DIF. If, on the other hand, some of the DIF items are easier in one group, while some of the other DIF items are harder, then that is referred to as bidirectional DIF. The reason it is helpful to make this distinction is that the effect of unidirectional DIF can often pose a more serious risk to psychometric procedures such as equating and making test score comparisons. In addition, unidirectional DIF can also make it more difficult to detect DIF items in that the DIF has a larger impact on the latent variable used to match examinees (see discussion on purification procedures for further details and how to mitigate the effect of unidirectional DIF).

Classification of DIF Detection Methods

The statistical techniques used to assess measurement invariance that we will explore in this book can be classified under three general approaches. Each of the approaches differs with respect to how the latent variable used to match examinees is measured. The first class of DIF detection methods, referred to as observed-score methods, uses the raw score as a proxy for θ. The raw scores are used to match examinees when comparing item performance. For example, measurement invariance is assessed by comparing item performance (e.g., the proportion correct for a dichotomously scored item) for examinees from different groups with the same raw score. Observed-score methods have the advantage of providing effect sizes to classify a detected item as nontrivial DIF. The observed-score methods addressed in this book include the Mantel–Haenszel procedure (Holland, 1985; Holland & Thayer, 1988), the standardization DIF method (Dorans & Kulick, 1986; Dorans & Holland, 1993), logistic regression (Swaminathan & Rogers, 1990), and the Simultaneous Item Bias Test (SIBTEST; Shealy & Stout, 1993a, 1993b). I will describe the observed-score methods in Chapter 2.
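As a preview of Chapter 2, here is a minimal sketch of the Mantel–Haenszel common odds ratio, the first of the observed-score methods listed above. The counts are hypothetical, and a complete analysis would also include the Mantel–Haenszel chi-square test and an effect-size classification; the −2.35 × ln transformation places the odds ratio on the ETS delta scale.

```python
import numpy as np

def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across raw-score strata.

    Each stratum is a 2x2 table (correct/incorrect by reference/focal group):
    (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    """
    num, den = 0.0, 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den

# Hypothetical counts for one item, stratified by total raw score.
strata = [
    (30, 20, 20, 30),   # low-score stratum
    (45, 15, 35, 25),   # middle stratum
    (55,  5, 45, 15),   # high-score stratum
]
alpha_mh = mantel_haenszel_odds_ratio(strata)
mh_d_dif = -2.35 * np.log(alpha_mh)   # delta-scale effect size
print(alpha_mh, mh_d_dif)             # an odds ratio above 1 favors the reference group
```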

The second class of methods uses a nonlinear latent variable model to define θ and subsequently the functional relationship. These methods rely on IRT models. The plots shown in Figures 1.1–1.4 are examples of item response functions provided by IRT models. One of the advantages of using IRT to examine DIF is that the models provide a convenient evaluation of DIF that is consistent with the definition of DIF. The IRT methods addressed in this book include b-plot, Lord’s chi-square (Lord, 1977, 1980), the likelihood-ratio test (Thissen, Steinberg, & Wainer, 1993), Raju’s area measure (Raju, 1988, 1990), and differential functioning of items and tests (DFIT; Raju, van der Linden, & Fleer, 1995). I will describe the basic ideas of IRT in Chapter 3 and the IRT-based DIF methods in Chapter 4.
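As a preview of the simplest IRT-based method, the b-plot compares item difficulty estimates obtained from separate calibrations in each group. The sketch below assumes the estimates are already on a common metric and uses a simple standardized-difference flagging rule chosen purely for illustration; the full procedure is described in Chapter 4.

```python
import numpy as np

# Hypothetical item difficulty (b) estimates from separate calibrations in each
# group, assumed to be already linked to a common scale.
b_group1 = np.array([-1.6, -1.1, -0.7, -0.3, 0.0, 0.3, 0.6, 1.0, 1.4, 1.8])
b_group2 = np.array([-1.5, -1.2, -0.65, -0.2, 1.0, 0.25, 0.7, 1.0, 1.3, 1.85])

# Standardize the differences and flag items that fall far from the identity line.
diff = b_group2 - b_group1
z = (diff - diff.mean()) / diff.std(ddof=1)
flagged = np.where(np.abs(z) > 2.0)[0]
print(flagged)   # [4]: the fifth item stands out as relatively harder for Group 2
```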

The third class of methods uses a linear latent variable model via confirmatory factor analysis (CFA). Although there is a strong relationship between CFA and IRT, and the methods used to examine DIF are similar in some respects, they are distinct in important ways. For example, CFA and IRT evaluate the fit of the respective latent variable model using very different approaches and statistics. One of the advantages of using CFA to assess measurement invariance is that it provides a comprehensive evaluation of the data structure and can easily accommodate complicated multidimensional models. The methods we will examine in this book include multigroup CFA for both continuous and categorical data, MIMIC (multiple indicators, multiple causes) models, and longitudinal measurement invariance. The fundamental ideas of CFA will be described in Chapter 5, with Chapter 6 containing a description of the CFA-based methods for assessing measurement invariance.

The Conditioning Variable: To Purify or Not to Purify?

As described previously, an important aspect of the definition of measurement invariance is the conditioning on the level of the latent variable; that is, the performance on the item (or set of items) is the same between the groups for examinees with the same proficiency or θ level. Therefore, it is crucial that a measure provides an accurate representation of the latent variable – a criterion that is not influenced by a lack of invariance or DIF. If the conditioning variable is influenced by DIF, then the examinees may not be accurately matched. For example, if several of the items on the criterion lack invariance in such a way that they are relatively harder in one of the groups, then examinees with the same raw score from different groups will not necessarily be equivalent on the latent variable. As a result, the comparison on the item performance may not be accurate because we have not matched examinees accurately, leading to a DIF analysis that identifies invariant items as DIF and DIF items as invariant. Therefore, it is crucial that we use a criterion that is invariant between the groups.

We can use either an internal or external criterion as the conditioning variable. An external criterion is one that is based on a variable that is not part of the items contained in the test being evaluated for measurement invariance. An internal criterion, on the other hand, uses a composite score based on the items from the test being assessed for measurement invariance. To illustrate the advantages and disadvantages of both approaches, consider a hypothetical situation where we are testing DIF between females and males on a math test used to place incoming university students into an appropriate math course. If all students take an admissions exam such as the SAT (Scholastic Aptitude Test), then we could feasibly use the SAT-Quantitative score as the conditioning variable. The advantage of using an external criterion is that if the math placement test contains DIF items, then those items will not influence our measure of θ. Of course, when choosing an external criterion, we would want evidence that it is invariant for the groups we are comparing. The challenge in using an external criterion is that it is rarely available, and the external criterion must be measuring the same latent variable as the test being assessed for measurement invariance (which may be questionable in many contexts). Both challenges typically preclude the use of an external criterion in most situations.

An internal criterion is the most common conditioning variable used when assessing measurement invariance, and in this book it will be used exclusively. The items used to define the conditioning variable are referred to as anchor items. The advantages of using an internal criterion are that it is readily available and it measures the relevant latent variable. The disadvantage of using an internal criterion is that if any of the anchor items contain DIF, then the conditioning variable, which is based on the items from the same test, may not be accurate (Dorans & Holland, 1993).¹ One way to address this limitation is to purify the conditioning variable by removing the DIF items (Holland & Thayer, 1988; Dorans & Holland, 1993; Camilli & Shepard, 1994). Although the details of purifying the conditioning variable vary across methods, the general procedure is as follows. First, we test each item for DIF using all items to define the conditioning variable. Second, we retest each of the items for DIF, but define the conditioning variable using only the items that were not identified as DIF in the first step. At this point, we can either continue this procedure until the same items are flagged as DIF in subsequent stages, or simply stop after the second step. There is evidence, however, that two stages are sufficient for purifying the anchor (Clauser, Mazor, & Hambleton, 1993; Zenisky, Hambleton, & Robin, 2003). The result of purifying the conditioning variable is that you have a set of anchor items that are invariant and, thus, provide an uncontaminated measure of the conditioning variable for both groups.
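The purification logic can be expressed as a short loop. In the sketch below, dif_test is a hypothetical placeholder for whichever DIF procedure is being used; it is assumed to return the set of items flagged as DIF when the conditioning variable is defined by the supplied anchor items.

```python
def purify_anchor(all_items, dif_test, max_stages=2):
    """Iteratively remove flagged items from the anchor used to match examinees.

    dif_test(anchor_items) stands in for any DIF procedure; it should return
    the set of items flagged as DIF when the conditioning variable is defined
    by anchor_items.
    """
    anchor = set(all_items)
    flagged = set()
    for _ in range(max_stages):
        new_flags = set(dif_test(anchor))
        if new_flags == flagged:           # same items flagged as in the previous stage
            break
        flagged = new_flags
        anchor = set(all_items) - flagged  # redefine the conditioning variable
    return anchor, flagged

# Hypothetical usage with a toy dif_test that always "detects" items 7 and 12.
toy_dif_test = lambda anchor: {7, 12}
anchor_items, dif_items = purify_anchor(range(1, 51), toy_dif_test)
print(sorted(dif_items), len(anchor_items))   # [7, 12] 48
```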

Although purifying the conditioning variable seems to solve the main disadvantage of using an internal criterion, it is not without limitations. First, because the statistical tests used to identify DIF items are not infallible, it is still possible to have DIF items in the anchor. In fact, unless the statistical power is very high, it is likely that at least one DIF item will remain in the anchor. Although this can be a serious limitation, the purification approach seems to remove the most egregious DIF items, leaving items that may not have a meaningful impact on the conditioning variable. The second limitation of the purification approach is that when we remove items from the anchor, we reduce the reliability of the conditioning variable, which can have a negative impact on the DIF statistics, especially those that use observed scores to define the conditioning variable, as described in Chapter 2. This can be particularly problematic when using large sample sizes, where statistical tests tend to flag many items as DIF. In this case, it may be prudent to use an effect size along with the statistical test to flag an item as nontrivial DIF when implementing the purification approach (more on the use of effect sizes will be described later in this chapter). Regardless of the limitations of the purification approach, it is wise to use it when testing for DIF.

Considerations When Applying Statistical Tests for DIF

Statistical and Practical Significance

Most approaches for assessing measurement invariance rely on traditional statistical significance testing in which the null hypothesis being tested states that the item is invariant between populations, and the alternative hypothesis states that the item functions differentially between the populations. Although significance tests are useful for assessing measurement invariance, the challenge in applying them is that the traditional null hypothesis is always false in real data (Cohen, 1994). To elaborate on this issue given the definition of measurement invariance, the traditional null hypothesis specifies that the item performance conditioned on θ is identical in both populations. For example, for a dichotomously scored item, this means that the probability of a correct response given θ is identical for all examinees (as shown in Figure 1.1). Unfortunately, this is an unrealistic hypothesis, even for items that function comparably in the populations. The consequence of testing this type of null hypothesis in real data is that, as the group sample sizes increase, the statistical tests tend to be statistically significant, even for differences that are trivial. This problem (which is ubiquitous in all statistical significance tests of a point-null hypothesis) has led some researchers to state that the test statistics have too much power or are oversensitive, when in fact the problem is that the null hypothesis is always false and, as a result, meaningless. Furthermore, testing the point-null hypothesis does not support the inferences we want to draw, including the main inference that the DIF is nontrivial in the population.

There are two basic strategies to help alleviate this problem. The most common solution is to examine effect sizes post hoc to help judge the practical significance of a statistically significant result. In this strategy, referred to as a blended approach (Gomez-Benito, Hidalgo, & Zumbo, 2013), we first test an item for DIF using traditional significance testing. If the DIF is statistically significant, we then use an effect size to qualify whether the DIF is trivial or nontrivial. The advantage of this approach is that it is relatively simple to apply (presuming you have an effect size available), and it helps support the claim we want to make – that is, that the DIF is nontrivial. The disadvantage of the blended approach is that it does not control the Type I error rate for our desired claim. A Type I error for this claim occurs when we conclude that the lack of invariance is nontrivial when, in fact, the DIF is trivial.
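A minimal sketch of the blended approach follows, using hypothetical p-values and Mantel–Haenszel delta-scale effect sizes; the α level and the effect-size benchmark of 1.0 are illustrative choices for the example, not prescriptions.

```python
# Hypothetical DIF results for five items: (p-value, |MH D-DIF| effect size).
results = {
    "item_01": (0.0010, 0.40),   # statistically significant but trivial
    "item_02": (0.4500, 0.20),   # not significant
    "item_03": (0.0030, 1.60),   # significant and large: nontrivial DIF
    "item_04": (0.0200, 0.85),
    "item_05": (0.0004, 1.10),
}

ALPHA = 0.05
EFFECT_BENCHMARK = 1.0   # illustrative benchmark on the MH delta scale

# Blended rule: flag as nontrivial DIF only if both criteria are met.
nontrivial = [item for item, (p, effect) in results.items()
              if p < ALPHA and effect >= EFFECT_BENCHMARK]
print(nontrivial)   # ['item_03', 'item_05']
```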

The second strategy is to test a range-null hypothesis (Serlin & Lapsley, 1993) instead of a point-null hypothesis. The range-null hypothesis approach specifies under the null hypothesis a magnitude of DIF that is considered trivial; in other words, the null hypothesis states that the item is essentially as DIF-free as can be expected in real data (but not exactly DIF-free). The alternative hypothesis states that the item displays nontrivial DIF. The advantage of this approach is that a statistically significant result implies the DIF is nontrivial and it controls the Type I error rate for the desired claim or inference. A further advantage is that it requires the test developer to consider what trivial and nontrivial DIF are prior to the statistical test. The disadvantage of this approach is that it requires large sample sizes to have sufficient power to reject a false null hypothesis, and it is difficult to apply for every statistical test.

For every statistical approach addressed in this book, my goal is to describe how to determine whether the DIF is trivial or nontrivial, not simply to determine if an item is functioning differentially. To help support this goal, I will rely heavily on the blended approach and will apply the range-null hypothesis when discussing Lord’s chi-square DIF statistic.

Type I Error Rate Control

When assessing measurement invariance, we often test many items for DIF. This introduces the problem of inflated Type I error rates. For example, if you test 50 items for DIF, then the chance of observing at least one Type I error (when the items are DIF-free) is very high; more specifically, for α = 0.05,

P(at least one rejection | H0 is true) = 1 − (1 − 0.05)^50 = 0.92.    (1.3)
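Equation (1.3) treats the 50 tests as independent; the arithmetic is easy to verify:

```python
# Probability of at least one Type I error across 50 independent tests at alpha = 0.05.
alpha, n_items = 0.05, 50
p_at_least_one = 1 - (1 - alpha) ** n_items
print(round(p_at_least_one, 2))   # 0.92
```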

This is problematic for at least two reasons. First, if you are trying to understand the causes of DIF, then flagging items that do not function differentially will only add confusion to the analysis, leading to a lack of confidence in the results or the manufacturing of erroneous claims about why the groups differ. Second, we may decide to remove an item that exhibits DIF from the test for fairness reasons or so that the item does not affect a psychometric operation (e.g., equating). In this case, we are removing an item unnecessarily, leading to less reliable results and a financial loss. Developing effective items is a time-consuming and financially expensive procedure – items cost a lot of money to develop. Both of these reasons illustrate the importance of having a controlled Type I error rate when testing for DIF.

There are two basic solutions to the problem of inflated Type I error rates. The first is to use a procedure to correct for the inflated Type I error rate, such as familywise error rate procedures (e.g., Dunn–Bonferroni; Holm, 1979) or the false discovery rate procedure developed by Benjamini and Hochberg (1995). Of these procedures, controlling the false discovery rate using the Benjamini–Hochberg procedure seems more appropriate than controlling the familywise error rate, given that it has more statistical power and the collection of items being tested for DIF does not necessarily comprise a family of tests connected to an overarching hypothesis. For example, when testing 50 items for DIF using α = 0.05, the Dunn–Bonferroni α level for each comparison would be 0.05/50 = 0.001. Unless we are working with very large sample sizes, this α level may be overly strict, which has a negative impact on power.
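To make the comparison concrete, the sketch below applies both corrections to a hypothetical set of p-values from ten DIF tests. The Benjamini–Hochberg step-up rule rejects every hypothesis whose p-value is at or below the largest ordered p(i) satisfying p(i) ≤ (i/m)·q.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected by the Benjamini-Hochberg procedure."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    # Find the largest rank i with p_(i) <= (i / m) * q; reject that hypothesis
    # and every hypothesis with a smaller p-value.
    below = np.nonzero(sorted_p <= (np.arange(1, m + 1) / m) * q)[0]
    if below.size == 0:
        return np.array([], dtype=int)
    return order[: below[-1] + 1]

# Hypothetical p-values for ten items tested for DIF.
p_vals = [0.001, 0.004, 0.012, 0.020, 0.031, 0.045, 0.120, 0.300, 0.620, 0.870]

bonferroni_rejects = [i for i, p in enumerate(p_vals) if p < 0.05 / len(p_vals)]
bh_rejects = sorted(benjamini_hochberg(p_vals, q=0.05).tolist())
print(bonferroni_rejects)   # [0, 1]: only the two smallest p-values survive
print(bh_rejects)           # [0, 1, 2, 3]: the false discovery rate procedure has more power
```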

The second solution is to use effect sizes to help qualify statistically significant results as nontrivial or meaningful. In this case, items that are falsely flagged as DIF must also pass a benchmark indicating that the DIF is nontrivial. As a result, many of the items that are Type I errors would be ignored as trivial and classified as essentially free of DIF. Requiring both statistical significance and an effect size that exceeds a certain value helps mitigate the problem of inflated Type I error rates addressed by the blended and range-null hypothesis approaches (although, technically, the range-null hypothesis does a better job of controlling the Type I error rate for trivial DIF items).

The inflated Type I error rate is not only an issue when testing many items within a test, but also when comparing more than two groups for each item. For example, in international assessments such as the Trends in International Mathematics and Science Study (TIMSS), we have access to examinees from many countries, and we may want to examine whether the items are performing differentially between several countries. In this case, we perform multiple tests of DIF on each item. Here, however, the analyses belong to a family of comparisons and using one of the familywise error rate control procedures is reasonable. If we are comparing three groups, for instance, then using Fisher’s least significant difference (LSD) is most powerful, presuming we can test the omnibus hypothesis using a test statistic. For analyses with four or more groups, a procedure such as Holm’s (1979) method is perhaps most appropriate and powerful.

Footnotes

¹ In fact, if all items are systematically harder for one group, then we cannot flag any items as DIF because the bias will be subsumed into the difference in the latent variable distributions of the groups.

