The scale metrics used in educational testing are often arbitrary, which can affect how scores are interpreted. Classical test theory (CTT) sum scores and item response theory (IRT) estimates measure the same underlying dimension, but differences between the two scales may make one preferable to the other when interpreting data. Mismatch between individual ability and test difficulty can further complicate the correct interpretation of developmental trends in longitudinal data. A limited earlier simulation by Embretson (2007) demonstrated that CTT sum scores lead to misinterpretation of linear trends of development, and that IRT estimates mitigate the problem. This study replicates those results and extends them by also simulating development following quadratic and cubic trends. Results indicate that although IRT scaling does improve estimates for the linear, quadratic, and cubic trends simulated, the two methods ultimately perform very similarly. IRT estimates produced marginally fewer Type I and Type II errors, especially when investigating interaction effects. The mismatch between test difficulty and the ability level of test takers has the strongest impact on correctly interpreting how individuals develop over time.
Advisor: James A. Bovaird
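The contrast the abstract describes can be illustrated with a minimal simulation sketch. The item count, difficulty range, and the particular growth trajectory below are hypothetical choices for illustration, not the study's actual design: dichotomous responses are generated under a Rasch model, then scored two ways, as a CTT sum score and as an IRT (EAP) ability estimate. When true ability rises above the test's difficulty range, sum scores compress near the ceiling while the IRT estimate remains on the latent metric.

```python
import math
import random

random.seed(0)

# Hypothetical test: 20 Rasch items with difficulties spread over [-2, 2].
ITEM_DIFFS = [-2 + 4 * i / 19 for i in range(20)]


def p_correct(theta, b):
    """Rasch probability of a correct response given ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def simulate_responses(theta):
    """Draw one dichotomous response vector for an examinee with ability theta."""
    return [1 if random.random() < p_correct(theta, b) else 0 for b in ITEM_DIFFS]


def sum_score(responses):
    """CTT number-correct score."""
    return sum(responses)


def eap_theta(responses):
    """IRT ability estimate: expected a posteriori under a standard-normal prior,
    evaluated on a coarse quadrature grid."""
    grid = [-4 + 8 * g / 80 for g in range(81)]
    num = den = 0.0
    for t in grid:
        weight = math.exp(-0.5 * t * t)  # unnormalized N(0, 1) prior
        for x, b in zip(responses, ITEM_DIFFS):
            p = p_correct(t, b)
            weight *= p if x else (1.0 - p)
        num += t * weight
        den += weight
    return num / den


# One simulated examinee whose true ability grows linearly across occasions.
for occasion, theta in enumerate([-1.0, 0.0, 1.0, 2.0]):
    resp = simulate_responses(theta)
    print(f"occasion {occasion}: sum score = {sum_score(resp):2d}, "
          f"EAP theta = {eap_theta(resp):+.2f}")
```

In a full study the loop would run over many replications and the resulting score trajectories would be fit with linear, quadratic, or cubic growth models; this sketch only shows the scoring step where the two metrics diverge.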