Les Perelman has a new article in the Journal of Writing Assessment about current problems associated with automatic essay scoring in general, and the ETS e-rater in particular. This is stuff Perelman has written about before, but I do encourage you to read the article, even if you are familiar with his work. It is illuminating.
I want to touch on one observation made by Perelman in the article. He notes:
“Indeed, ETS researchers themselves acknowledge the susceptibility of e-rater to both coaching and gaming when discussing e-rater’s scoring mainland Chinese on average over a half a point higher than human raters (d = 0.60) on the GRE issue essay”
Perelman doesn’t mention it, but the effect is similarly pronounced on the TOEFL (d = 0.25 on a 5-point scale), according to research from ETS.
Note, also, that Korean students experience a similar benefit, though it is not as large as the one experienced by Chinese students.
How Do Teachers “Game the System”?
In case you are curious, here’s how this kind of TOEFL prep works.
If we assume that almost every prompt can be supported with an argument about health, we teach the student to begin their first body paragraph with a topic sentence like this:
“To begin with, _________ can improve our overall physical condition.”
They just need to fill in the blank with their choice from the prompt.
Looking at some of my sample questions, this works quite often:
- “To begin with, the widespread use of the Internet can improve our overall physical condition.”
- “To begin with, technologies in the modern world can improve our overall physical condition.”
- “To begin with, living in the country can improve our overall physical condition.”
- “To begin with, modern diets can improve our overall physical condition.”
Of course, not EVERY question can be answered with a comment about health, but if you teach the student five or six different sentences, you can cover most of the prompts. Like:
- To begin with, ________ can improve our career prospects.
- To begin with, ________ can improve our creativity.
- To begin with, ________ can improve our relationships with our loved ones.
And so on. These will all sound a bit clunky when used (like the four above do) but they will be grammatically correct and will please the e-rater.
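If it helps to see how mechanical this is, here is a minimal Python sketch of the fill-in-the-blank step. The template sentences are the ones listed above; the dictionary keys and the function name `topic_sentence` are just labels I made up for illustration.

```python
# Topic-sentence templates from the lists above, keyed by a made-up "angle" label.
TOPIC_TEMPLATES = {
    "health": "To begin with, {topic} can improve our overall physical condition.",
    "career": "To begin with, {topic} can improve our career prospects.",
    "creativity": "To begin with, {topic} can improve our creativity.",
    "relationships": "To begin with, {topic} can improve our relationships with our loved ones.",
}


def topic_sentence(topic: str, angle: str = "health") -> str:
    """Drop the student's choice from the prompt into a memorized template."""
    return TOPIC_TEMPLATES[angle].format(topic=topic)


print(topic_sentence("living in the country"))
print(topic_sentence("modern diets", angle="career"))
```

The student only has to supply the noun phrase from the prompt. Everything else is memorized.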
After that, we teach the student a totally memorized sentence to function as “explanatory” content immediately after the topic sentence. It just makes a generalized statement about the topic. Something like:
- “Most educated people agree that we cannot achieve anything in life without a body that is strong and healthy.”
- “It is undeniable that our overall quality of life is strongly affected by how much success we enjoy in our career.”
- “It is undeniable that we will live longer and more prosperous lives if we are imaginative.”
- “In my culture, everyone feels that maintaining close connections with loved ones is more important than anything else.”
Just one blandly generic sentence isn’t enough to get the student flagged for being off-topic. The student must depend on their own ability to write the rest of the essay, but if they use this technique in both body paragraphs, 15 or 20 percent of their essay will be written in perfect English. That’s a nice start.
Some students might use even more memorized content, but of course that increases the risk of being flagged as off-topic.
Does ETS Know?
Yeah. They say:
“Another possible explanation for the greater discrepancy between human and machine scores for essays from mainland China may be the dominance of coaching schools in mainland China that emphasize memorizing large chunks of text that can be recalled verbatim on the test. Human raters may assign a relatively low score if they recognize this memorized text as being somewhat off topic, though not so far off topic as to generate a score of 0. On the other hand, this grammatical and well-structured memorized text would receive a high score from e-rater. Although automated scoring engines can be trained to identify text that is not at all related to the assigned topic, they may not yet be sensitive enough to recognize this slightly off topic text.”
No shit. What surprises me, as a teacher, is that after saying this they just leave it hanging. Neither a solution nor a response is really offered.
They do say that:
“each essay is scored by at least one human plus e-rater. Second, if there is a discrepancy of 1.5 or more points (on the 0–5 score scale) between the human score and the e-rater score, an additional human score is obtained. The item score is then the mean of the three scores (2 human plus e-rater) unless one score is an outlier (more than 1.5 points discrepant), in which case the outlier is discarded and the remaining two scores are averaged.”
But this is meaningless, as a 1.5-point difference is huge. That’s a nine-point difference on the 30-point scale. By using memorized content only in the independent task, a student could get a bonus of 2.25 points overall in the writing section without any alarm bells being sounded.
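For anyone who wants to check the arithmetic, here is a rough Python sketch. It assumes three things that the ETS quote does not spell out: that raw 0–5 scores convert linearly to the 30-point scale, that the writing section is the average of its two tasks, and that a lone human score is simply averaged with the e-rater score when no second rating is triggered. The function names are mine.

```python
SCALE_FACTOR = 30 / 5   # raw 0-5 points -> 30-point scale (linear conversion assumed)
THRESHOLD = 1.5         # discrepancy that triggers a second human rating (from the ETS quote)


def item_score(human: float, erater: float) -> float:
    """Average one human score with e-rater when no second rating is triggered (assumed)."""
    if abs(human - erater) >= THRESHOLD:
        raise ValueError("discrepancy of 1.5 or more: a second human score would be obtained")
    return (human + erater) / 2


def section_bonus(raw_gap: float) -> float:
    """Scaled section gain when e-rater overscores one of the two tasks by raw_gap."""
    item_gain = raw_gap / 2                  # e-rater is only half of the item score
    scaled_gain = item_gain * SCALE_FACTOR   # convert that task to the 30-point scale
    return scaled_gain / 2                   # the section averages two tasks (assumed)


print(1.5 * SCALE_FACTOR)     # 9.0  -> the nine-point difference
print(section_bonus(1.5))     # 2.25 -> the overall bonus at the edge of the threshold
print(item_score(3.0, 4.0))   # 3.5  -> a one-point gap passes without review
```

Strictly speaking, a gap of exactly 1.5 already triggers the second human rating, so 2.25 is the ceiling that an undetected bonus approaches rather than a value it reaches.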
This is weird, because ETS uses a much better system to prevent this from being a problem in the GRE:
“The GRE uses a highly conservative approach in which the machine is used only to flag discrepant human scores to signal the need for a second human rating. Specifically, if the e-rater score rounded to the nearest whole number does not agree exactly with the first human score, then a second human score is obtained; the e-rater score is never averaged with a human score.”
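The GRE check described in that quote is simple enough to sketch in a few lines of Python. The quote doesn’t say how ties like 4.5 are rounded, so rounding halves up is my assumption, and the function names are made up.

```python
import math


def round_to_nearest(score: float) -> int:
    """Round to the nearest whole number; halves round up (an assumption, not from ETS)."""
    return math.floor(score + 0.5)


def needs_second_human(first_human: int, erater: float) -> bool:
    """Flag the essay for a second human rating when rounded e-rater and the human disagree."""
    return round_to_nearest(erater) != first_human


print(needs_second_human(4, 4.4))   # False: e-rater rounds to 4, matching the human
print(needs_second_human(4, 4.6))   # True: e-rater rounds to 5, so a second human is called in
```

Notice that the e-rater score never enters the final number here. It only decides whether another person looks at the essay, which is why an inflated machine score can’t add points directly.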
What about the Plagiarism Warning?
No, you cannot use the exact sentences mentioned above. Presumably ETS can write some software that will detect that you’ve copied from this website. However, wealthy students will just hire a teacher to write personalized content only for them. That content will not appear anywhere online, and it won’t even be used in anyone else’s tests. That’s how they can avoid being penalized by automatic plagiarism detection.
This is a Really Bad Thing
It is probably a bad thing that students from some ethnicities do better with the e-rater, especially since some other groups, particularly African-Americans (on the GRE) and Arabic- and Hindi-speaking students (on the TOEFL), do worse.
There aren’t just racial implications, but class implications as well. As Perelman indicates (emphasis is mine):
“It is the following paragraph, however, that contains the most egregious instance of misinformation. “The primary emphasis in scoring the Analytical Writing section is on your critical thinking and analytical writing skills rather than on grammar and mechanics.” (Educational Testing Service, n.d.-a, n.d.-b). E-rater provides half the final score. Yet, e-rater does not emphasize “critical thinking and analytic writing skills.” Indeed, it is completely oblivious to them. Its closest approximation is its highly reductive feature of development, which is calculated by the number of sentences in each paragraph and the number of paragraphs in the essay. Furthermore, grammar and mechanics compose a significant portion of the features included in e-rater’s calculations. Low-income students will believe these statements and focus on critical thinking and analytic skills. Affluent students who have taken test preparation classes, on the other hand, will be coached to provide e-rater with the proxies that will inflate their scores.”
Anyways. That’s all I have to report. Thank you for listening to my TED Talk.