It is worth sharing a few notes from a long article that Pearson published this summer. It expands on the topics discussed in Jarrad Merlo’s webinar about the introduction of human raters to the PTE.
The article describes how a “gaming detection system” has been developed to aid in the evaluation of two of the speaking questions and one of the writing questions on the PTE. This system gives each response a numerical score from 0 to 1, with 0 indicating the complete absence of gaming and 1 indicating significant evidence of gaming.
This numerical approach seems wise, as the line between acceptable and unacceptable use of templated language in a response is sometimes blurry. In a future post, I’ll summarize some research from ETS that discusses this topic.
Meanwhile, Pearson’s article notes that:
“While more rudimentary systems may rely on a simple count of words matching known templates, PTE Academic’s gaming detection system has been designed to consider a number of feature measurements that quantify the similarity of the response to known templates, the amount of authentic content present, the density of templated content, and the coherence of the response.”
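Pearson doesn’t publish the model itself, but to make the idea concrete, here is a minimal sketch of how feature measurements like those four might be combined into a single 0-to-1 score. The feature names, the weights, and the logistic combination are all my own illustrative assumptions, not Pearson’s implementation:

```python
import math

def gaming_score(template_similarity: float,
                 authentic_content: float,
                 templated_density: float,
                 coherence: float) -> float:
    """Combine hypothetical feature measurements into a 0-1 gaming score.

    All four inputs are assumed to be normalized to [0, 1]. The weights
    and the logistic squashing are illustrative guesses, not Pearson's
    actual model.
    """
    # Features suggesting gaming push the score up; features suggesting
    # authentic work push it down.
    z = (2.5 * template_similarity
         + 2.0 * templated_density
         - 1.5 * authentic_content
         - 1.0 * coherence)
    return 1 / (1 + math.exp(-z))  # squash to the (0, 1) range

# A response closely matching a known template scores near 1:
print(round(gaming_score(0.9, 0.1, 0.8, 0.3), 2))   # ~0.97
# An original, coherent response scores near 0:
print(round(gaming_score(0.05, 0.9, 0.05, 0.9), 2)) # ~0.12
```

The appeal of combining several features this way, rather than counting matched words alone, is that a response can be flagged even when it pads a template with filler, or cleared when heavy surface overlap is incidental to genuinely coherent, original content.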
The article goes on to describe how the results of the AI checks are passed along to human raters, to aid in their decision-making regarding the content of responses. It notes that the newly-implemented system:
“enables raters to make better informed content scoring decisions by leveraging the Phase I gaming detection systems to provide them with information about the about the [sic] extent of gaming behaviours detected in the response.”
That’s fascinating. I’m not aware of another system where human raters can make use of AI-generated data when making decisions about test taker responses.
The article notes that human raters will not check all written responses for templated content. Checks of most responses will be done entirely by AI that has been trained on a regularly-updated database of templates discovered via crawls of the web and social media.
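As a toy illustration of what checking a response against such a database might involve, here is the sort of “simple count of words matching known templates” baseline the article contrasts itself with. The trigram-overlap approach and the sample template are my own assumptions, not a description of Pearson’s system:

```python
def ngram_set(text: str, n: int = 3) -> set:
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def best_template_match(response: str, templates: list[str]) -> float:
    """Return the highest trigram-overlap ratio between the response and
    any single known template (0 = no overlap, 1 = every trigram in the
    response appears in one template)."""
    resp = ngram_set(response)
    if not resp:
        return 0.0
    return max(
        (len(resp & ngram_set(t)) / len(resp) for t in templates),
        default=0.0,
    )

# Hypothetical database entry harvested from a web crawl:
templates = ["in conclusion the lecture and the reading passage discuss"]
print(best_template_match(
    "In conclusion the lecture and the reading passage discuss climate change",
    templates,
))  # ~0.78
```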
A challenge that goes unmentioned is the difficulty of detecting templates that never appear on the public web. In my neck of the woods, students pay high fees to “celebrity” test prep experts who create personalized templates that are neither shared publicly nor reused by future test takers. This came up in an article by Sugene Kim, which I’ll share in the comments.
Perhaps Pearson should go whole hog and bring in human raters for some or all responses in the writing section as well.
More on this in the days ahead.