The growing nationwide interest in educational assessment and accountability emphasizes the importance of accuracy in educational measurement. The use of open-ended (i.e., constructed response) test items has become commonplace in standardized educational assessments, including national and state-level tests that influence educational policy. Responses to open-ended items are usually evaluated by human ``raters,'' often with multiple raters judging each response. A pragmatic model of an assessment system using rated open-ended items must accurately capture the variability and dependence among raters and must address rater performance, both collectively and individually. There is no consensus in the measurement community on an optimal modeling choice, and demonstrably inferior models are often used in practice. In this talk, I will compare several models for rating data. One particular choice (the Hierarchical Rater Model, or HRM, of Patz, Junker, Johnson and Mariano, 2002) provides a more realistic view of the uncertainty of inferences on parameters and latent variables from rated test items by appropriately accounting for dependence between ratings. I will use the HRM to demonstrate how various levels of rater performance affect the accuracy of examinee proficiency estimates. To illustrate the value of the HRM in understanding and improving rater performance, I will summarize the results of a study of rating modality---the design for distributing items among raters---in a state-level assessment program and discuss the potential of the model for incorporating other important covariates of rater behavior.
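For readers who have not seen the model, a rough sketch of the HRM's two-stage structure follows; the notation is mine, and the particular parameterization (a normal-kernel rating stage and a partial credit model for the ideal ratings) is only one of the variants discussed by Patz et al. (2002).
\[
\begin{aligned}
\theta_i &\sim N(\mu, \sigma^2), \\
P(\xi_{ij} = \xi \mid \theta_i) &\propto \exp\Big\{ \textstyle\sum_{k=1}^{\xi} (\theta_i - \beta_{jk}) \Big\}, \\
P(X_{ijr} = k \mid \xi_{ij} = \xi) &\propto \exp\Big\{ -\frac{[\,k - (\xi + \phi_r)\,]^2}{2\psi_r^2} \Big\},
\end{aligned}
\]
where $\theta_i$ is examinee $i$'s proficiency, $\xi_{ij}$ is the ``ideal'' rating of examinee $i$'s response to item $j$, $X_{ijr}$ is the rating assigned by rater $r$, $\phi_r$ measures rater $r$'s severity or leniency, and $\psi_r$ measures rater $r$'s variability. Conditional on the ideal rating $\xi_{ij}$, multiple ratings of the same response are independent; marginally they are dependent, and it is this dependence that the HRM is designed to capture.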
Accounting for Rater Variability and Dependence in Constructed Response Assessments
Room: 305