
Examining Differential Item Functioning from a Multidimensional IRT Perspective

Published online by Cambridge University Press:  01 January 2025

Terry A. Ackerman*
Affiliation:
The University of Iowa
Ye Ma
Affiliation:
Amazon Web Services
*
Correspondence should be made to Terry A. Ackerman, The University of Iowa, 8 North Shore Drive, Edwardsville, IL 62025, USA. [email protected]

Abstract

Differential item functioning (DIF) is a standard analysis for every testing company. Research has demonstrated that DIF can result when test items measure different ability composites and the groups being examined for DIF exhibit distinct underlying ability distributions on those composite abilities. In this article, we examine DIF from a two-dimensional multidimensional item response theory (MIRT) perspective. We begin by delving into the compensatory MIRT model, illustrating how items and the composites they measure can be graphically represented. Additionally, we discuss how estimated item parameters can vary based on the underlying latent ability distributions of the examinees. Analytical research highlighting the consequences of ignoring dimensionality and applying unidimensional IRT models, in which the two-dimensional latent space is mapped onto a unidimensional scale, is reviewed. Next, we investigate three different approaches to understanding DIF from a MIRT standpoint: 1. Analytically derived uniform and nonuniform DIF: when two groups of interest have different two-dimensional ability distributions and a unidimensional model is estimated. 2. Accounting for the complete latent ability space: we emphasize the importance of conditioning on the entire latent ability space in DIF analyses, which mitigates DIF effects. 3. Scenario-based DIF: even when the underlying two-dimensional distributions are identical for two groups, differing problem-solving approaches can still lead to DIF. Modern software programs facilitate routine DIF procedures for comparing response data from two identified groups of interest. The real challenge is to identify why DIF could occur with flagged items. Thus, as a closing challenge, we present four items (Appendix A) from a standardized test and invite readers to identify which group was favored by a DIF analysis.
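The mechanism the abstract describes can be sketched numerically. The following is a minimal Python simulation, not the authors' method: the item parameters, group distributions, and matching rule are hypothetical choices for illustration. An item that loads on both dimensions of a compensatory two-dimensional MIRT model still shows a group difference (DIF) after examinees are "matched" on only the first dimension, because the matched groups continue to differ on the second dimension.

```python
import numpy as np

def compensatory_mirt_prob(theta1, theta2, a1, a2, d):
    # Compensatory two-dimensional MIRT (M2PL) item response function:
    # P(correct) = 1 / (1 + exp(-(a1*theta1 + a2*theta2 + d)))
    z = a1 * theta1 + a2 * theta2 + d
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical groups: identical on theta1, focal group lower on theta2.
ref = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=n)
foc = rng.multivariate_normal([0.0, -0.5], np.eye(2), size=n)

# Hypothetical item loading on both dimensions.
a1, a2, d = 0.8, 1.2, 0.0

# "Match" examinees on theta1 alone (a unidimensional proxy) and compare
# conditional probabilities of a correct response near theta1 = 0.
ref_bin = ref[np.abs(ref[:, 0]) < 0.1]
foc_bin = foc[np.abs(foc[:, 0]) < 0.1]
p_ref_matched = compensatory_mirt_prob(ref_bin[:, 0], ref_bin[:, 1], a1, a2, d).mean()
p_foc_matched = compensatory_mirt_prob(foc_bin[:, 0], foc_bin[:, 1], a1, a2, d).mean()

# Matched groups still differ on theta2, so the item exhibits DIF.
print(f"P(correct | matched): reference={p_ref_matched:.3f}, focal={p_foc_matched:.3f}")
```

Conditioning on both ability dimensions instead of theta1 alone would remove the gap, which is the point of the second approach listed in the abstract.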

Type
Theory & Methods
Copyright
Copyright © 2024 The Author(s), under exclusive licence to The Psychometric Society

