How does the TOEFL e-rater Work?

 

A General Introduction to the e-rater and Teaching TOEFL Writing

Published: December, 2018

Note: this website is not affiliated with ETS.

(I have written a second and a third article about this topic)

Both TOEFL essays (independent and integrated) get a score from the e-rater and one human rater.  These scores are averaged out to produce a final score out of thirty points.  It is important to remember that the e-rater and the human rater usually produce the same score when evaluating essays so the “averaging out” is somewhat irrelevant.

So where does the score come from?  And how can we use knowledge of the e-rater to help students pass the TOEFL?

In this article I want to describe:

  • The main categories the e-rater uses to score essays and how much of the writing score comes from each category
  • The smaller sub-categories these are sometimes broken down into
  • The sub-categories that most affect students on test day
  • How this information can be used to teach students better

Some problems are worth mentioning here:

  • Most of the published information about the e-rater is about the independent TOEFL essay.  This article mostly refers to that essay type.
  • My information is probably out of date.  The e-rater is adjusted every year, so all articles inevitably describe old versions.  It seems that changes are minor, though.

Note that at the end of the article there is a video version of everything.

Main e-rater Categories (Macrofeatures)

The e-rater gives students scores in specific categories, called macrofeatures, and each of these carries a different weight in the final score.  In 2010, the macrofeatures and their weights were listed as follows (with each weight converted to points out of 30):

  • Organization (32%, 9.6 points)
  • Development (29%, 8.7 points)
  • Mechanics (10%, 3 points)
  • Usage (8%, 2.4 points)
  • Grammar (7%, 2.1 points)
  • Lexical Complexity – word length (7%, 2.1 points)
  • Lexical Complexity – less frequent words (7%, 2.1 points)
  • Style (3%, 0.9 points)

Source. Yes, the percentages above add up to slightly more than 100%.  Blame rounding in the source.

Problem: These figures are from 2010.  Since then two more categories have been introduced.  They are “positive features” and “topic-specific vocabulary” (source).  My guess would be that each is weighted at about 3% and that the two “lexical complexity” categories have been deweighted, but that isn’t based on any documented evidence.
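For the curious, here is a minimal sketch of how percentage weights like these could be combined into a score out of 30.  The per-category scores, the normalization, and the scaling are all illustrative assumptions; ETS does not publish the actual calculation.

```python
# Illustrative only: combine hypothetical per-macrofeature scores (0.0 to 1.0)
# into a writing score out of 30 using the 2010 weights listed above.
# ETS does not publish the real normalization or scaling.

WEIGHTS_2010 = {
    "organization": 0.32,
    "development": 0.29,
    "mechanics": 0.10,
    "usage": 0.08,
    "grammar": 0.07,
    "lexical_word_length": 0.07,
    "lexical_word_frequency": 0.07,
    "style": 0.03,
}

def weighted_score(feature_scores):
    """Weighted sum of normalized feature scores, scaled to 30 points."""
    total_weight = sum(WEIGHTS_2010.values())  # sums to slightly over 1.0
    weighted = sum(weight * feature_scores.get(name, 0.0)
                   for name, weight in WEIGHTS_2010.items())
    return 30 * weighted / total_weight

# Example: strong organization and development, weak technical features.
example = {
    "organization": 1.0, "development": 1.0, "mechanics": 0.3, "usage": 0.3,
    "grammar": 0.3, "lexical_word_length": 0.5, "lexical_word_frequency": 0.5,
    "style": 0.5,
}
print(round(weighted_score(example), 1))  # roughly 22 out of 30
```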

Defining Macrofeatures

It is best to think of the macrofeatures in two groups: “technical stuff” (grammar, usage, mechanics, and style) and “content stuff” (organization, development, lexical complexity, positive features, and topic-specific vocabulary).

We can look at them one at a time.

Development (29% of score)

This is specifically defined by ETS as “background, thesis, main ideas, supporting ideas, and conclusion” (source).  All of these need to be presented using a series of paragraphs.  I teach my students to use a four-paragraph model to include all of these elements in the independent essay.  The “background” (which I call a “hook”) and “thesis” are contained in the introductory paragraph.  Each of the two body paragraphs contains a main idea (which I call a topic sentence) and supporting ideas (which I call elaboration sentences and personal examples). The model ends with a short conclusion.  I also teach the use of certain phrases (templates) that ensure the e-rater recognizes that these features are included.

For the integrated task, my model has students indicate the “background” (topic) of the sources in the first line. As a “thesis,” it indicates the relationship between the two sources (the lecture almost always casts doubt on the reading).  As main and supporting ideas, it presents the specific ways in which the lecture challenges the reading.

Problems: Old articles about the e-rater specifically state that a five-paragraph structure is necessary (that is, three main arguments).  As my students have gotten perfect scores with a four-paragraph structure, I feel this specific requirement is somewhat out of date.

Organization (32% of score)

According to ETS, “for the organization feature, e-rater computes the average length of the discourse elements (in words) in an essay” (source).

Obviously, then, a longer essay is better.  However, writing a longer essay can cause students to make more mistakes and reduce their scores in the categories of grammar, usage, mechanics and style.  It is important to find a “sweet spot” so that essay lengths match the abilities of specific students. I generally recommend about 400 words for the independent task and 300 words for the integrated task.

This isn’t supported by any published articles, but I also believe that this feature requires long body paragraphs, a shorter introduction and an even shorter conclusion.  Otherwise, the category would simply measure total overall word count.
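Taken literally, the computation ETS describes is simple.  Here is a minimal sketch, assuming the essay has already been split into its discourse elements (in reality the e-rater identifies these elements automatically):

```python
def average_element_length(discourse_elements):
    """Average length, in words, of an essay's discourse elements."""
    if not discourse_elements:
        return 0.0
    word_counts = [len(element.split()) for element in discourse_elements]
    return sum(word_counts) / len(word_counts)

# A four-paragraph independent essay: introduction, two body paragraphs, conclusion.
essay_parts = [
    "These days, many people believe that ...",  # introduction (hook + thesis)
    "To begin with, ...",                        # body paragraph one
    "Secondly, ...",                             # body paragraph two
    "In conclusion, ...",                        # conclusion
]
print(average_element_length(essay_parts))
```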

Lexical Complexity – Less Frequent Words (perhaps 4% of score)

This assesses the level of the words used in the essay based on their frequency in “a large corpus of text” (source).  Obviously, then, less frequently used words are considered more advanced, and therefore higher-scoring.  A chart of word frequency can be found online.  To meet this requirement I generally encourage students to use more advanced vocabulary (within reason).  More specifically, I encourage students to avoid very common adjectives like “good” or “big.”  These can easily be replaced with something less frequent.

Lexical Complexity – Word Length (perhaps 4% of score)

Students are rewarded for using longer words.  This is fairly straightforward. 
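For anyone who wants to measure these things, both lexical complexity features are easy to approximate.  The sketch below computes average word length and flags very common words that could be upgraded; the word set and the metrics are my own illustrative stand-ins, not anything published by ETS.

```python
import re

# Illustrative stand-in for a corpus frequency list; the e-rater uses
# frequencies from a large reference corpus, not a hand-picked set.
VERY_COMMON = {"good", "bad", "big", "small", "thing", "nice", "a", "the", "is"}

def average_word_length(text):
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return sum(len(w) for w in words) / len(words) if words else 0.0

def common_word_flags(text):
    """Words a student might replace with something less frequent."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return [w for w in words if w in VERY_COMMON and len(w) > 2]

sample = "Living in a big city is good because there are many good jobs."
print(average_word_length(sample))   # longer words -> higher word-length score
print(common_word_flags(sample))     # ['big', 'good', 'good']
```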

Positive Features (Collocations and Prepositions) (perhaps 3% of score)

First of all, students are rewarded for collocation use.  The e-rater identifies “the number of good collocations [divided by] the total number of words” (source).  So, more collocations equals a higher score.  I guess there is a limit, however. A giant list of possible collocations can be found here.  A more learner-friendly list can be found here.

Secondly, students are rewarded for preposition use.  This is described as “the mean probability of the writer’s prepositions” (source).  I am not entirely sure what this means, but presumably a statistical model estimates how likely each preposition is in its context, with more conventional preposition choices earning higher probabilities.
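The collocation part of this feature is easy to imitate in a rough way.  The sketch below counts recognized two-word collocations and divides by total words, echoing the formula quoted above; the tiny collocation set is my own invention, and ETS’s actual list and matching method are not public.

```python
# Illustrative two-word collocations; ETS's actual collocation list is not public.
GOOD_COLLOCATIONS = {
    ("make", "progress"), ("pay", "attention"), ("take", "advantage"),
    ("heavily", "dependent"), ("broaden", "horizons"),
}

def collocation_density(text):
    """Number of recognized collocations divided by total words."""
    words = text.lower().replace(",", "").replace(".", "").split()
    if not words:
        return 0.0
    hits = sum(1 for pair in zip(words, words[1:]) if pair in GOOD_COLLOCATIONS)
    return hits / len(words)

sample = "Students who pay attention in class make progress quickly."
print(collocation_density(sample))  # 2 collocations / 9 words, about 0.22
```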

Topic-Specific Vocabulary (perhaps 3% of score)

This one is new.  The vocabulary in the student’s essay is compared to vocabulary used in high-scoring essays based on the same prompt (source).  Obviously it is hard to prepare students for this, but if they are aware that it is a factor, they can be encouraged to use “advanced words” that are more closely related to the general theme of the given prompt.  For example, if the prompt is related to university life I encourage them to use some “advanced words” related specifically to attending university.
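ETS has not published exactly how this comparison works, but one simple way to picture it is as the overlap between an essay’s words and a reference vocabulary drawn from high-scoring essays on the same prompt.  The sketch below is purely illustrative; the function, the reference set, and the sample sentence are all invented for this example.

```python
def topic_vocabulary_overlap(essay, high_scoring_vocab):
    """Share of the essay's distinct words that also appear in a reference
    vocabulary drawn from high-scoring essays on the same prompt."""
    essay_words = {word.strip(".,!?;:").lower() for word in essay.split()}
    essay_words.discard("")
    if not essay_words:
        return 0.0
    return len(essay_words & high_scoring_vocab) / len(essay_words)

# Hypothetical reference vocabulary for a prompt about university life.
reference = {"lecture", "tuition", "dormitory", "curriculum", "professor", "campus"}
essay = "Living on campus lets students meet every professor after a lecture."
print(topic_vocabulary_overlap(essay, reference))  # about 0.27
```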

Grammar (7% of score)

The grammar macrofeature is broken down into nine microfeatures, all of which are weighted equally to produce the grammar score.  Each microfeature penalty is determined by dividing the number of related errors by the total number of words in the entire essay.  The microfeatures are listed below, and a rough sketch of the penalty arithmetic follows the list.  The number in parentheses is the percentage of students who received NO PENALTY during a study of ~95,000 TOEFL independent essays graded by the e-rater.  A lower number here indicates a potential area of concern for students studying for the TOEFL. The source of all of these figures, and of the rest of the microfeature lists on this page, is this article.  Further descriptions of each microfeature can be found in this article.

  • Sentence Fragments (79.3)
  • Run-on Sentences (73.2)
  • Garbled Sentence <five or more errors> (89.3)
  • Subject-verb agreement (48.8)
  • Ill-formed verb <the wrong verb tense for the given situation> (61.3)
  • Pronoun error (97)
  • Possessive error <missing apostrophe> (85.6)
  • Wrong or missing word (95.4)
  • Proofread this! <errors that cannot be analyzed> (76.6)
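As described above, each microfeature penalty is roughly the number of related errors divided by the essay’s word count, and the nine equally weighted penalties are then combined.  Here is a minimal sketch of that arithmetic, using made-up error counts; the actual mapping from penalties to the grammar score is not public.

```python
# Made-up error counts for a 400-word essay.  This only shows the
# errors-per-word arithmetic described above, nothing more.
total_words = 400
error_counts = {
    "fragments": 1,
    "run_ons": 0,
    "garbled_sentences": 0,
    "subject_verb_agreement": 3,
    "ill_formed_verbs": 2,
    "pronoun_errors": 0,
    "possessive_errors": 0,
    "wrong_or_missing_words": 1,
    "proofread_this": 0,
}

penalties = {name: count / total_words for name, count in error_counts.items()}
grammar_penalty = sum(penalties.values()) / len(penalties)  # equal weights

print(penalties["subject_verb_agreement"])  # 0.0075 errors per word
print(round(grammar_penalty, 5))
```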

The biggest area of concern – verbs – should come as no surprise to teachers.

Usage (8% of score)

Usage works the same way.  There are nine microfeatures that have equal weight. The number in parentheses is the percentage of students who received no penalty during the study period.

  • Determiner noun agreement <singular determiner with a plural noun and vice versa.  Also a/an errors> (63)
  • Article errors <wrong, missing and extraneous> (9.5)
  • Homophone errors  (59.9)
  • Verbs used as nouns (94.1)
  • Faulty comparisons <errors with more and most> (96.1)
  • Preposition errors <missing, incorrect and extraneous> (61.8)
  • Nonstandard word usage <gonna, kinda, wanna> (99.3)
  • Double negatives (99.6)
  • Wrong parts of speech (97.7)

It is no surprise that article errors (including determiner noun agreement) are a problem for students.  Likewise it is no surprise that prepositions are a problem area.  I don’t actually believe that homophone errors are such a problem, but that is what the study reported.  Perhaps this is a weakness of the e-rater.

Mechanics (10% of score)

This works the same as the above categories.  The microfeatures are:

  • Spelling errors (2.6)
  • Capitalization of proper nouns (79.7)
  • Capitalization of first word in a sentence (77.7)
  • Missing question marks (95.8)
  • Missing periods (86)
  • Missing apostrophes (96.4)
  • Missing commas (40.5) 
  • Missing hyphens <including in number constructions> (96.1)
  • Fused words <missing space between words> (97.7)
  • Compound word errors <two words that should be one> (63)
  • Duplicates <accidentally repeating words in a row> (91.5)
  • Extraneous comma (69)

Again, the problem areas for students are not too shocking.  Spelling mistakes are a problem for almost everyone.  Commas (both types of errors) are also hard.  It is somewhat surprising that students are “throwing away” a lot of points on things like capitalization errors and missing periods.  At first glance one would think that ETS is being generous by assigning ten percent of the total score to “mechanics,” but it isn’t such a giveaway after all.  I strongly encourage my students to spend a minute or two proofreading their essays.

Style (3% of score)

  • Repetition of words (22.3)
  • Inappropriate words <including expletives> (99.8)
  • Too many sentences beginning with a coordinate conjunction <“too many” is not defined>  (96.2)
  • Too many short sentences <more than four sentences with fewer than 7 words> (94.5)
  • Too many long sentences <more than four sentences with more than 55 words>  (88.4)
  • The use of “by passives” <defined as: “sentences containing BE + past participle verb form, followed somewhere later in the sentence by the word by”> (82.4)

The fact that repetition of words is the biggest problem really proves that vocabulary is critically important to a good TOEFL score.  The value here might amount to .5 points (out of 30), while the lexical complexity categories make up another 2.4 points. Topic specific vocabulary probably comes out to another 1.1 points.  These categories can only be satisfied by using a wide range of words. Teach your students to vary their vocabulary as much as possible.

Does this mean anything?

Maybe.  Here are a few things that inform my teaching of TOEFL writing:

  • An essay with perfect organization and development can score 18 points, even if its grammar is abysmal.  Indeed, I very rarely see score reports with a writing score of less than 17 points.  Even the absolute lowest-level students can score that much.  This is important to keep in mind when students have overall target scores (all sections) in the 70s or 80s.
  • The various features related to vocabulary come out to about four or five points.  As I said above, I always emphasize range of vocabulary when teaching TOEFL writing.
  • I teach my students to proofread.  They can “make up” for mistakes related to more difficult aspects of writing by fixing up easy punctuation and spelling errors.
  • It is possible now to know which kinds of grammar mistakes students usually make on the test, although none of these should be surprising. 
  • I always emphasize to my students that all mistakes are equal.  Those “sloppy” punctuation errors they make hurt just as much as the perplexing verb errors.
  • A longer essay can result in a higher score by “diluting” the mistakes, but it could obviously lead to more mistakes.  It is necessary to work with individual students to discover the best length for them.
  • Teachers have long held that the e-rater rewards the use of transitional adverbs.  I don’t know where they fit into the above categories, but I will continue to emphasize their use.

Beating the e-rater

Is it possible to “beat” the e-rater?  Yes, of course.  This was discussed in the New York Times some years ago.  However, anything off-topic will be flagged by the human rater, so the techniques described by the researcher won’t all work.  Moreover, it is likely that only a student with advanced English can actually “beat” the e-rater.  As the article says:

“E.T.S. officials say that [the researcher’s] test prep advice is too complex for most students to absorb; if they can, they’re using the higher level of thinking the test seeks to reward anyway. In other words, if they’re smart enough to master such sophisticated test prep, they deserve a [high score].”

Video Version

