Here’s a challenging but fascinating article from ETS (Jodi M. Casabianca, Dan McCaffrey, Mathew S. Johnson, Naim Alper, Vladimir Zubenko) about using generative AI to score constructed responses.  Test watchers might enjoy the included “Demonstrative Study Using GPT4 for Scoring,” in which 1,581 responses (TOEFL, GRE, and Praxis) previously scored by trained human raters and by ETS’s e-rater (which scores responses based on the features they contain) were submitted to GPT4 for scoring.  GPT4 was provided the response, the question, and the rubric, and the scores it produced were compared with the earlier human ratings and e-rater scores.  In an earlier draft of this post I attempted to summarize the results.  Alas, that is somewhat beyond my meager abilities.  Do check out the article for yourself (beginning on page 20).
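
The article doesn’t publish the prompt ETS used, but the basic setup is easy to picture.  Here is a hypothetical sketch (mine, not the authors’) of asking a GPT-4-class model to score a response against a rubric; the prompt wording, the model name, and the score_response helper are all assumptions for illustration, not the study’s actual pipeline.

```python
# Hypothetical sketch of rubric-based scoring with an LLM. This is NOT the
# prompt or pipeline ETS used -- the article does not publish those details.
from openai import OpenAI  # assumes the openai Python package (v1.x)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def score_response(question: str, rubric: str, response_text: str) -> str:
    """Ask the model for a single score on the rubric's scale."""
    prompt = (
        "You are an essay rater. Score the response using the rubric.\n\n"
        f"Question:\n{question}\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Response:\n{response_text}\n\n"
        "Reply with the score only."
    )
    completion = client.chat.completions.create(
        model="gpt-4",  # model name is illustrative
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the scoring as deterministic as possible
    )
    return completion.choices[0].message.content.strip()
```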

Also mentioned is the possibility of combining a human rater’s score with both of the AI scores, and (more interestingly) the possibility of replacing human raters with GPT4 entirely.  But, as the authors note, “In this case, the three tests are all high stakes and the evidence is too weak to support the use of these scores in operational score reporting unless they are used in combination with e-rater scores and/or human ratings.”  And furthermore: “Based on the small sample sizes, the concordance with human ratings was borderline, especially for the TOEFL task. e-rater outperformed GPT4 for the three tests (GRE, TOEFL, Praxis). In these cases, without additional evidence we would retain the e-rater model.”
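
For readers wondering what “concordance” looks like in practice: agreement between automated and human scores is commonly summarized with quadratic weighted kappa.  The toy sketch below uses made-up scores (not data from the study) and a simple rounded average as one naive way of combining the two machine scores; scikit-learn’s cohen_kappa_score is the only real API involved.

```python
# Toy sketch: comparing machine scores to human ratings with quadratic
# weighted kappa, a common agreement statistic in automated scoring.
# All scores below are invented; they are not from the ETS study.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([4, 3, 5, 2, 4, 3])    # trained human ratings (illustrative)
gpt4 = np.array([4, 4, 5, 2, 3, 3])     # hypothetical GPT4 scores
erater = np.array([4, 3, 4, 2, 4, 3])   # hypothetical e-rater scores

print("GPT4 vs human QWK:   ", cohen_kappa_score(human, gpt4, weights="quadratic"))
print("e-rater vs human QWK:", cohen_kappa_score(human, erater, weights="quadratic"))

# One naive way to combine the two automated scores before comparison:
combined = np.rint((gpt4 + erater) / 2).astype(int)
print("combined vs human QWK:", cohen_kappa_score(human, combined, weights="quadratic"))
```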
