E-Rater Background

Both TOEFL essays (independent and integrated) get a score from the e-rater and one human rater.  You can read my earlier study of the e-rater over here.  I strongly recommend you check that out, as it provides a very detailed summary of my research into what the e-rater checks for and how it scores essays.

To summarize, at that time I determined that the e-rater scores essays based on:  development, organization, mechanics, usage, grammar, word length, word frequency, vocabulary and style.  Those “macrofeatures” are broken down into smaller “microfeatures.”  Some of those are broken down into “sub-microfeatures” (my term).

That summary was mostly based on articles published before 2017.

Recent E-Rater Updates?

A few articles describing the e-rater have been published since then.  The most notable is a study of students in Germany and Switzerland published in March of 2019.  These students were given TOEFL independent and integrated questions, which they answered under realistic TOEFL conditions.

It includes an updated e-rater flow-chart that suggests a few things:

  • According to the article, “Style” is no longer a macrofeature that the e-rater checks for.  My earlier article indicated that “style” was the least important macrofeature, accounting for just 3% of the total score, but it is interesting to note that it has been removed.  This might mean that students are not automatically punished for repeating words, writing short or long sentences, writing “inappropriate” words (profanity) and using the passive voice.  
  • A new macrofeature called “Sentence Variety” has been added.  According to the article, “this macrofeature measures the diversity of the syntactic constructions of the sentences in an essay.”  No microfeatures are given.
  • A new macrofeature called “Grammaticality” has been added to the e-rater.  That’s some Chomsky shit that seems to refer to the grammatical correctness of a sentence beyond whether it is meaningful or even included in a corpus.  The difference between this macrofeature and the specific grammar features it looks for is that grammaticality is judged on a “whole sentence” level.   Wikipedia says that grammaticality judgements are based on “a native speaker’s linguistic competence, which is the knowledge that they have of their language, allows them to easily judge whether a sentence is grammatical or ungrammatical based on intuitive introspection. For this reason, such judgments are sometimes called introspective grammaticality judgements.”  This almost seems like an effort to get the e-rater to replace the role of the human rater, which I will write a bit more about later.  ETS has written about this concept over here.
  • The “proofread this” grammar microfeature has become “multiple adjacent errors.”
  • “Determiner-noun agreement” has been added as a sub-microfeature.
  • “Extraneous articles” has been added as a sub-microfeature (this is the use of an article when no article is needed).
  • “Wrong part of speech” has been added as a microfeature.
  • “Extra comma” has been added as a sub-microfeature.

Experimental E-Rater Features

The most interesting aspect of the article is that it describes a couple of “experimental” e-rater macrofeatures.  They are:

  • Discourse.  According to the article, it checks two things:  it “captures the organization of ideas in an essay. Based on the idea of lexical cohesion chains, it quantifies how ideas are initiated, continued, and terminated in essays.” Find a clear definition of lexical cohesion chains on Wikipedia.  Find an in-depth discussion in this article.  That article talks about “chains” of related words and how they can be used to grade essays.  This macrofeature “also encodes how ideas are presented in relation to the discourse cues used to organize the text for the reader.”  This second part seems to refer to the use of traditional discourse phrases like “therefore” or “in addition.”
  • Source-Use. According to the article, this is used only in the integrated task and “comprises three microfeatures that quantify (a) how much of the material in the essay is drawn from the lecture stimulus, (b) how much of the material in the essay is drawn from the lecture stimulus as compared to from the reading passage text, and (c) how important the information from the lecture stimuli that the test taker used is.”  Point (a) suggests that the lecture is the more important source to summarize. Point (b) suggests that the lecture should be weighted more heavily, but I am curious if there is some sort of ideal ratio.  Point (c) suggests that some information from the lecture is more important than other information.
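
To make points (a) and (b) a bit more concrete, here is a rough Python sketch of how ratios like these could be computed from simple word overlap. This is my own toy illustration, not ETS's method: the function names, the crude tokenizer, and the choice of counting shared words are all assumptions, and point (c) is left out because the article does not say how importance is measured.

```python
import re

def tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer (assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def source_use(essay, lecture, reading):
    """Toy stand-ins for Source-Use microfeatures (a) and (b).

    (a) share of essay words that also appear in the lecture
    (b) lecture-drawn words as a share of all source-drawn words
    """
    essay_toks = tokens(essay)
    lecture_vocab = set(tokens(lecture))
    reading_vocab = set(tokens(reading))

    # Count essay words found in each source (a word can count for both).
    from_lecture = sum(t in lecture_vocab for t in essay_toks)
    from_reading = sum(t in reading_vocab for t in essay_toks)

    a = from_lecture / len(essay_toks) if essay_toks else 0.0
    total = from_lecture + from_reading
    b = from_lecture / total if total else 0.0
    return a, b

essay = "the professor says bees are dying"
lecture = "bees are dying because of pesticides"
reading = "the author claims bees thrive"
print(source_use(essay, lecture, reading))
```

A real system would almost certainly ignore common words like “the” and match paraphrases rather than exact words, so treat the numbers from this sketch as illustrative only.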

Interestingly, these can be “prompt specific.”  That is, the e-rater can be programmed to look for specific lexical chains and specific details selected in advance by ETS to match the given prompt.  This is something that is already done in terms of the vocabulary macrofeatures. 
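
As a rough picture of what a prompt-specific lexical chain check might look like, here is a small Python sketch. It assumes the chains are just hand-picked sets of related words (the `chains` dictionary and the `chain_spans` name are hypothetical) and records which sentences each chain shows up in, so you can see where an idea is initiated, continued, and terminated. The real feature presumably works from a much richer notion of word relatedness.

```python
import re

def chain_spans(essay, chains):
    """For each named chain, list the sentence indexes where any of its words occur."""
    # Naive sentence split on ., ! and ? (assumption: good enough for a demo).
    sentences = [s for s in re.split(r"[.!?]+", essay.lower()) if s.strip()]
    spans = {}
    for name, words in chains.items():
        spans[name] = [
            i for i, s in enumerate(sentences)
            if words & set(re.findall(r"[a-z']+", s))
        ]
    return spans

essay = ("Bees pollinate crops. Farmers depend on them. "
         "Pesticides harm the hive. Without pollination, harvests shrink.")
chains = {
    "bees": {"bees", "hive", "pollinate", "pollination"},
    "farming": {"crops", "farmers", "harvests"},
}
for name, hits in chain_spans(essay, chains).items():
    # First index = where the idea starts, last index = where it ends.
    print(name, hits)
```

In this toy essay the “bees” chain skips the second sentence and the “farming” chain skips the third, which is the kind of pattern a chain-based feature could pick up on.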

It will be a long time before we know if these experimental features have been implemented in the e-rater used on the real test.  The Source-Use feature, notably, seems like another effort to have the e-rater replace the human rater.  My understanding is that the human raters are given a simple chart summarizing the important details from the sources and must check if they are present in the essays they grade.

Other Interesting Bits

The article included a few other interesting facts worth noting here:

  • The essays were also graded by two human raters who agreed with each other about 80% of the time (independent) and 84% of the time (integrated).  These raters were among the best employed by ETS, which suggests that human raters agree even less often when the regular TOEFL is graded.
  • The raters scored 14 essays per hour, and handled other administrative tasks as well.  That works out to a little over four minutes per essay at most.
  • The students scored a lot lower on the integrated task.
  • The “Discussion” part of the article suggests that, yeah, ETS would like to have an automated-scoring engine that can be used in high-stakes assessments without a human grader.  But they aren’t there yet.
  • On the real TOEFL “human ratings for the integrated task currently receive twice the weight of machine scores.”  This was news to me, and I’ve been teaching this test for a long time.
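
To show what “twice the weight” could mean in practice, here is a minimal Python sketch. It assumes the two scores are combined as a simple weighted average, which the article does not actually spell out; the function name and the example scores are made up for illustration.

```python
def combined_score(human, machine, human_weight=2.0, machine_weight=1.0):
    """Weighted mean of a human rating and a machine score (assumed formula)."""
    total = human_weight + machine_weight
    return (human_weight * human + machine_weight * machine) / total

# Hypothetical example: the human rater gives a 4, the e-rater gives a 3.
print(round(combined_score(4, 3), 2))
```

Under that assumption, the result sits two thirds of the way toward the human rating, so a disagreement between the two is settled mostly in the human rater's favor.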

Does this mean anything?

Maybe.  I don’t know.
