How Serious Is IRT Misfit for Practical Decision-Making? (RR 15-04)
Item response theory (IRT) is a mathematical modeling approach used to support the development, analysis, and scoring of tests and questionnaires. For example, IRT describes item (i.e., question) characteristics, such as difficulty, as well as the proficiency level of test takers. Various IRT models are available, and choosing the most appropriate model for a particular test is essential. Because the fit of test data to the chosen model is never perfect, assessing model-data fit is imperative. After evaluating model fit, practitioners must decide whether to apply an alternative model or to remove misfitting items.
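As a concrete illustration of how item characteristics and test-taker proficiency enter such a model, the sketch below simulates dichotomous responses under a two-parameter logistic (2PL) item response function. The choice of the 2PL model, the item parameters, and the sample size are assumptions made purely for illustration; the report does not prescribe this particular model.

```python
import numpy as np

def irf_2pl(theta, a, b):
    """Two-parameter logistic item response function: probability of a
    correct response given proficiency theta, item discrimination a,
    and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

rng = np.random.default_rng(0)

# Hypothetical item parameters and simulated test takers (illustration only).
a = np.array([1.2, 0.8, 1.5, 1.0])   # discrimination
b = np.array([-1.0, 0.0, 0.5, 1.5])  # difficulty
theta = rng.normal(size=500)         # simulated proficiencies

# Simulate dichotomous (0/1) item responses under the model.
p = irf_2pl(theta[:, None], a, b)
responses = (rng.uniform(size=p.shape) < p).astype(int)

print(responses.shape)   # (500, 4): persons x items
print(p.mean(axis=0))    # expected proportion correct per item
```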
Using simulated data, we investigated the consequences of misfitting items and misfitting item score patterns for three outcome variables that are often used in practice: pass/fail decisions, the rank ordering of test takers by proficiency score, and the correlation between predictor and criterion scores. We concluded that, for most simulated conditions, the rank ordering of test takers was similar with and without the misfitting items. The presence of aberrant response patterns in the data further degraded model fit, as well as the performance of statistics specifically designed to detect such aberrant item score patterns.
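The following is a minimal sketch of the kind of comparison involved in the first two outcomes, assuming a hypothetical response matrix and a hypothetical set of flagged items; it uses total scores as a stand-in for IRT proficiency estimates and is not the report's actual analysis.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Hypothetical 0/1 response matrix (persons x items); in practice this would
# be real or model-simulated test data.
n_persons, n_items = 500, 40
responses = rng.integers(0, 2, size=(n_persons, n_items))

# Suppose items 5, 12, and 30 were flagged as misfitting (hypothetical).
misfit = [5, 12, 30]
keep = [j for j in range(n_items) if j not in misfit]

# Proficiency is proxied here by total score; an operational analysis would
# use IRT ability estimates instead.
score_all = responses.sum(axis=1)
score_trimmed = responses[:, keep].sum(axis=1)

# How similar is the rank ordering of test takers with and without the
# flagged items?
rho, _ = spearmanr(score_all, score_trimmed)
print(f"Spearman rank correlation: {rho:.3f}")

# Pass/fail agreement at a hypothetical cut score, rescaled proportionally
# for the shortened test.
cut = 20
cut_trimmed = cut * len(keep) / n_items
agree = np.mean((score_all >= cut) == (score_trimmed >= cut_trimmed))
print(f"Pass/fail agreement: {agree:.3f}")
```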