New this month from John Norris and Larry Davis is a detailed comparison of the old TOEFL Independent Writing Task and the new TOEFL Writing for an Academic Discussion Task.
It notes that among test-takers who completed both tasks in operational settings, 50% received the same score (from 0 to 5) on each task from a single human rater, and another 47% received scores that differed by just one point.
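For readers curious about how figures like these are computed, here is a minimal sketch in Python of exact and adjacent agreement rates for paired task scores. It is my own illustration, not code from the report, and the score pairs are invented.

```python
# Exact and adjacent agreement between paired task scores on the 0-5 scale.
# Illustrative only: the (IND, WAD) score pairs below are invented.

def agreement_rates(pairs):
    """Return (exact, adjacent) agreement proportions for score pairs."""
    exact = sum(1 for a, b in pairs if a == b) / len(pairs)
    adjacent = sum(1 for a, b in pairs if abs(a - b) == 1) / len(pairs)
    return exact, adjacent

# Hypothetical single-rater scores: (Independent task, WAD task)
pairs = [(4, 4), (3, 4), (5, 5), (2, 3), (4, 3), (3, 3), (4, 5), (1, 1)]
exact, adjacent = agreement_rates(pairs)
print(f"Same score: {exact:.0%}; within one point: {adjacent:.0%}")
```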
Furthermore, the article notes: “We saw no difference in terms of the measures of cohesion that we evaluated, and overall very few differences in terms of specific measures of syntactic complexity, grammaticality and mechanics, or word use.”
Some differences were noted, though. According to the article:
“A few linguistic measures differed across tasks in a manner that may suggest a slightly greater orientation toward academic register in the IND writing task. These measures included slightly greater use of academic vocabulary, as well as somewhat longer noun phrases and clauses, features typical of academic writing. On the other hand, responses to the WAD task showed marginally higher lexical density (relative frequency of content words) and somewhat fewer word usage errors, both of which may be associated with shorter responses.”
Also, the Writing for an Academic Discussion task elicited more writing per unit of time.
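To make one of the quoted measures concrete: lexical density is the proportion of content words (nouns, verbs, adjectives, adverbs) among all words in a response. The sketch below approximates it with a small function-word list; published studies typically rely on a part-of-speech tagger, and the word list and sample sentence here are my own inventions.

```python
# Sketch: lexical density as the share of content words among all words.
# Crude approximation: anything not in a small function-word list counts
# as a content word. Real analyses usually use a POS tagger instead.

FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "because", "if", "that", "which",
    "i", "you", "he", "she", "it", "we", "they", "my", "your", "his", "her",
    "is", "am", "are", "was", "were", "be", "been", "to", "of", "in", "on",
    "at", "for", "with", "as", "by", "from", "not", "this", "these", "those",
}

def lexical_density(text: str) -> float:
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens) if tokens else 0.0

sample = "I agree with Maria because online discussion gives students more time to think."
print(f"Lexical density: {lexical_density(sample):.2f}")
```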
It is worth noting that e-rater scores were not available for most of the WAD responses studied in this report, as automated writing scoring was only added to the TOEFL Essentials Test in late 2022. I’d like to see more research into how human scores for WAD tasks compare to e-rater scores for the same responses. Also, given the change to the test, it may be a good time for a follow-up to the research done by Brent Bridgeman, Catherine Trapani and Yigal Attali in 2012 on the possibility that machine scores can differ (in terms of their closeness to human scores) for certain gender, ethnic and country groups.
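As a rough sketch of what such a follow-up might involve, the code below compares human and machine scores using quadratically weighted kappa, a common agreement statistic in the automated scoring literature, broken out by subgroup. The group labels, scores, and the choice of scikit-learn’s cohen_kappa_score are all my own assumptions for illustration; none of this comes from the studies mentioned above.

```python
# Sketch: human vs. machine score agreement, overall pattern by subgroup.
# All data below are invented for illustration; scores are on the 0-5 scale.
from collections import defaultdict
from sklearn.metrics import cohen_kappa_score

# Hypothetical (group, human score, machine score) triples
records = [
    ("group_A", 4, 4), ("group_A", 3, 4), ("group_A", 5, 5), ("group_A", 2, 2),
    ("group_B", 4, 3), ("group_B", 3, 3), ("group_B", 5, 4), ("group_B", 2, 3),
]

by_group = defaultdict(lambda: ([], []))
for group, human, machine in records:
    by_group[group][0].append(human)
    by_group[group][1].append(machine)

for group, (human, machine) in by_group.items():
    qwk = cohen_kappa_score(human, machine, weights="quadratic")
    print(f"{group}: quadratically weighted kappa = {qwk:.2f}")
```

A real analysis would of course use operational score data and defensible group definitions; the point of the sketch is only to show the shape of the comparison.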