In 2018, ETS published a report on the use of their e-rater technology in assessing the GRE essays. I am a stupid man and it is hard for me to read academic articles. But I still return to this article now and again to see what I can find.
My “TOEFL Gangnam Style” series of blog posts this month inspired me to take yet another look at the article to see if it had anything to say about the topic of memorized essays. I was pleased to find that it did explain this phenomenon somewhat. Regarding how ETS’s human raters treat this kind of “shell text,” when grading GRE essays it reports (emphasis mine):
The raters also found the use of shell text along with the text that is part of the writing prompt/question prevalent across the essay responses of test takers from China, particularly for argument, resulting in three-paragraph-long responses but suffering from lack of ideas and poor language. The raters identified instances of the use of generic (non-content-specific) statements, syntactic repetition at the beginning of each paragraph, repetition of ideas around generic statements or prompt text, and lack of cohesion in the essay after the first few memorized sentences as cues to shell text. The raters understood that test takers from China frequently use shell text, but they do not view shell text as problematic or as a negative style of writing. They emphasized that they are trained to be neutral to the use of shell text and that they look for original ideas and content beyond the shell text in the response to determine the appropriate score. In some cases, they expressed the idea that the examinees are able to use shell text cleverly to enhance the structure and framework of their responses without compromising originality, cohesion, and content.
How about that? That’s the first I’ve heard about how ETS raters handle templates. I repeat: “they are trained to be neutral to the use of shell text.”
But more interesting is this passage:
Presently, human raters are trained in scoring to identify and treat any shell text neutrally while scoring the essay response. The e-rater lacks any such training presently and may be overscoring essays with the heavy presence of shell text.
Yeah, it probably is.
For TOEFL teachers, the takeaway from the article is that shell text (a template) is not tricking the human raters into giving higher scores; they treat that text neutrally. It is the e-rater that may be getting tricked: the article reports that Chinese test takers receive much higher scores from the e-rater than from the human raters. That's the point!
The article suggests that the e-rater could be adjusted to include some kind of shell-text detection to reduce this discrepancy. I do not know whether that has been implemented since the research was done. Keep an eye on Chinese and Korean writing scores when the TOEFL score data for 2020 is reported this summer.
Anyways, this study was done with GRE essays, so take its relevance to the TOEFL with a grain of salt. To me, it seems to explain some of what was observed in the original “Gangnam Style” article in Assessing Writing that inspired this whole blog series. But I could be wrong.
Note: A much older article confirms that Chinese students get higher e-rater scores than human scores on the TOEFL, though it doesn’t discuss shell text. It also notes that Korean students get higher e-rater scores as well, though the difference is smaller.