The Educational Testing Service (ETS) just published a document called “Reimagining Educational Assessments: AI Innovations for Enhancing Test Taker Experience.”

The document says, seemingly in reference to scoring of the TOEFL iBT, that:

“ETS combines the efficiency of AI with essential human oversight. While AI manages most of the scoring, human raters review a sample of the machine-scored responses.”

That appears to be a departure from how the TOEFL has traditionally been scored.  Until now, every response (not just “a sample”) has been graded by both a human rater and AI, and it has never been accurate to say that AI “manages most of the scoring.”

That said, the phrasing used in the document is somewhat vague.  Maybe I’ve misunderstood it.  Perhaps someone from ETS can confirm what it means.

UPDATE: I have been informed by reliable sources that there has been no change to the scoring process.

This comes a few months after ETS began the process of offshoring human scoring of TOEFL test taker responses to facilities in India.

A quick update to my post about the fee for express scoring of the TOEFL test.  I mentioned a few days ago that the fee was increased to $149.  There is an exception, however.  For tests taken in India, the fee is about $75.

This is the sort of thing that drives test takers bonkers.  Though they all understand the concept of regional pricing, it does stick in the proverbial craw.

As I’ve mentioned here many times, a student who takes the TOEFL from his bedroom in Palestine pays $270 to ETS.  Meanwhile, a student here in Korea is charged $220 to take the test at home.  Some other kid, taking the test from his bedroom in Switzerland, will pay $470.  Though the tests are the same and the delivery method is the same, the prices are quite different.

Test takers are bright enough to know why it has to be that way… but some find it unfair.  In earlier posts I’ve explored how newer tests have increased their popularity by instituting a single global price for at-home administrations (or something very close to a single price).

Right now, it seems like Pearson is somewhat uninterested in at-home testing for the PTE. Meanwhile, the at-home IELTS remains limited to a small handful of countries. I certainly admire ETS’s continued commitment to this delivery method… but I think they could do a bit better.

 

Here’s a challenging but fascinating article from ETS (Jodi M. Casabianca, Dan McCaffrey, Mathew S. Johnson, Naim Alper, Vladimir Zubenko) about using generative AI to score constructed responses.  Test watchers might enjoy the included “Demonstrative Study Using GPT4 for Scoring,” in which 1,581 responses (TOEFL, GRE and Praxis) previously scored by trained human raters and by ETS’s e-rater (which scores responses based on the features they contain) were submitted to GPT4 for scoring.  GPT4 was provided with the response, the question and the rubric.  The scores it produced were compared to those assigned earlier by e-rater.  In an earlier draft of this post I attempted to summarize the results.  Alas, that is somewhat beyond my meager abilities.  Do check out the article for yourself (beginning on page 20).

Also mentioned is the possibility of combining a human rater’s score with both of the AI scores, and (more interestingly) the possibility of replacing human raters with GPT4.  But, as is mentioned by the authors, “In this case, the three tests are all high stakes and the evidence is too weak to support the use of these scores in operational score reporting unless they are used in combination with e-rater scores and/or human ratings.”  And furthermore:  “Based on the small sample sizes, the concordance with human ratings was borderline, especially for the TOEFL task. e-rater outperformed GPT4 for the three tests (GRE, TOEFL, Praxis). In these cases, without additional evidence we would retain the e-rater model.”
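For readers who want to picture the setup, here is a minimal sketch (my own, not taken from the article) of how a response, a question, and a rubric might be sent to GPT-4 for scoring. It assumes the OpenAI Python SDK; the prompt wording, model configuration, and scoring pipeline that ETS actually used are not public.

```python
# A minimal sketch (not ETS's actual pipeline) of asking GPT-4 to score a
# constructed response against a rubric, using the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def score_response(question: str, rubric: str, response: str) -> str:
    """Ask the model for a single holistic score based on the rubric."""
    prompt = (
        f"Question:\n{question}\n\n"
        f"Scoring rubric:\n{rubric}\n\n"
        f"Test taker's response:\n{response}\n\n"
        "Using only the rubric above, assign a holistic score. "
        "Reply with the numeric score only."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep the scoring as consistent as possible
        messages=[
            {"role": "system", "content": "You are an essay rater for a standardized English test."},
            {"role": "user", "content": prompt},
        ],
    )
    return completion.choices[0].message.content.strip()


# Example call with placeholder text (not real test content):
# print(score_response("Summarize the lecture...", "Score 0-5, where 5 means...", "The lecture explains..."))
```

In the study, the scores produced this way were then compared against the existing e-rater and human scores across many responses; a single call like this only illustrates what the model is given as input.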

In a recent interview with the Free Press Journal, ETS India head Sachin Jain said that ETS aims to provide TOEFL score reports in just 2 days.  That’s good news for test takers.  A typical TOEFL test taker in India is someone who is also planning to take the GRE. Consequently, the TOEFL sometimes gets shunted aside and left to the last minute.  This sort of test taker will appreciate faster results quite a lot. It will also be good for business, as the other tests in this category have provided two-day results for quite some time.

Jain also mentioned that sometime in Q1 of 2025, ETS India will start providing a free “intermediate guide to the TOEFL” to test takers.  That’s also good news.  I really like the beginner’s guide which is currently provided… but as I noted in my review it is a somewhat skimpy offering compared to what people taking the IELTS in India currently receive at no cost.

TOEFL’s “Enhanced Score Reporting” is no longer available. It was launched with some fanfare in October of last year and represented an effort to provide test takers with more than a numerical score. It included feedback about which reading and listening question types they did well (and poorly) on, along with insights into the grammar, language use and mechanics of their speaking and writing responses. Sample high-scoring responses were also provided.

I liked Enhanced Score Reporting, as it was something I had spent four or five years asking for.

A few days ago, I wrote about the option of paying $99 to get your TOEFL score in 24 hours.  But I’m not seeing that option anymore. I’ve asked a few folks to check and they aren’t seeing it either. Maybe it was just a test run.

Anyway, among the people I talked to, the initial reaction to the offer wasn’t great.  Some people* grumbled that ETS is once again charging for a service that other test makers provide at no cost.  Note that IELTS scores now arrive in 1-2 days at no extra charge.

*not me!!!

It is now possible to get TOEFL scores in 24 hours. Test takers who pay a $99 fee (now $149; see the updates below) during registration can benefit from this “express scoring” option.  Results normally arrive in 4-8 days.  Right now, it seems to be available only for tests taken at a test center… not tests taken at home.

Update, from December 22:  I don’t see it anymore.  Maybe this service has been removed.

Update, from February 9: Express scoring is back.  It now costs $149.

Below is the promotional image I saw during the registration process today.

According to reports that rolled in last week, the Educational Testing Service (ETS) has begun training individuals from outside the USA to score TOEFL test taker responses and to serve as scoring leaders.

This seems to represent something of a shift as far as the TOEFL scoring process goes.  To date, responses have been scored solely by individuals physically located in the USA (and in possession of a degree from an American university).  It is unclear at this time which countries the new raters will be located in.

Update:  For a little more confirmation, head over to the ETS Glassdoor page.

ETS has just published its “Test and Score Data Summary” for 2021.  This document contains a ton of valuable information, including average scores (and section scores) overall and in specific countries.

The average TOEFL score is now 88 points.  That’s an increase of one point since last year.

Here’s a history of the average score since the test began:

  • 2006: 79
  • 2007: 78
  • 2008: 79
  • 2009: 79
  • 2010: 80
  • 2011: not available
  • 2012: not available
  • 2013: 81
  • 2014: 80
  • 2015: 81
  • 2016: 82
  • 2017: 82
  • 2018: 83 
  • 2019: 83
  • 2020: 87
  • 2021: 88

As you can see, this year’s jump is not as wild as the one in 2020, but a one-point increase is still significant.

Here’s how the section scores changed (compared to last year):

  • The average TOEFL reading score is now 22.4 (+0.2)
  • The average TOEFL listening score is now 22.6 (+0.3)
  • The average TOEFL speaking score is now 21.1 (-0.1)
  • The average TOEFL writing score is now 21.6 (+0.1)

I pay special attention to trends in a few key markets.  I noticed that in all of the countries I track, the average score is unchanged.

  • The average score in China is 87 
  • The average score in Korea is 86 
  • The average score in Japan is 73 
  • The average score in Brazil is 90 
  • The average score in India is 96
  • The average score in the USA is 93

At first glance, it seems like the overall score increase is due to smaller markets “catching up” to the increases in the rest of the world that were observed in last year’s numbers.

 

Today I want to pass along a few details from the Virtual Seminar for English Language Teachers hosted by ETS last night.

One of the presenters provided new details about how the speaking questions are scored. He prefaced these details by sharing a sample type 3 speaking question.  Here is the reading part:

And here is a transcript of the listening part:

 

A standard question, right?

Then we were shown the “answer sheet” that is given to raters so they know how to assess topic development.  This is new information.  Here it is:

That’s interesting, right?  My assessment of this is that for an answer to receive a full score for topic development it must explicitly or implicitly reference the term and its definition.  It must also broadly summarize the example.  And then it must include just two of the four main details given in the example.  The last part is new to me.  Generally, I push students to include all of the details.  Perhaps I should reassess my teaching methods.

There you go, teachers.  Some new information about the TOEFL… in 2022.

A few questions remain:

  • Is this always the case?  Will there always be four main details in the example?  Will we always need to include just two of them?  Probably not.  Surely there are cases where more than two details are required.
  • How does this work in lectures that contain two entirely separate examples?  Often the reading is about a biological feature in animals, while the lecture describes two different animals that have this feature.  Is it okay to ignore one of them?  Probably not.
  • Can any of this learning be applied to TOEFL Speaking question four?  Probably not.
  • Does order matter?  Probably not.


The ETS research department has a report about whether essays rated on desktop computers get different scores than those rated on iPads.  The result of the research: they get the same scores.

This is why I love ETS.  Most organizations would just tell the raters to use whatever they want.  ETS, though, studied the matter very carefully.  I respect that.

As always, readers will find the most interesting details buried deep in the report. 

In this case, we learn that the raters participating in the study scored GRE essays.  We also learn that they scored twenty essays in one hour.  That’s three minutes per essay… including the time needed to queue up each essay, read it, click a button to submit the score and blink a few times before moving on to the next one. The report also indicates:

The 20 essay ratings per device were only a fraction of the number of essays a typical rater would score in a day

That tracks with what I’ve heard from former ETS raters, but I don’t think I’ve ever seen it in print. Obviously, the time taken to score TOEFL essays could be longer, but I suspect the workflow is similar.

I don’t know if this is useful information, but it is always nice to get a peek behind the curtain now and then.


As I mentioned earlier, the TOEFL writing rubrics are notoriously difficult to understand. Perhaps the most difficult part is the requirement that score-five and score-four independent essays demonstrate “syntactic variety” and that score-three essays include a “limited range of syntactic structures.”

What the heck is syntactic variety?  What is a syntactic structure?

Here’s what you should know:

Often I see essays that are quite long and have perfect grammar.  But I still can’t give them a perfect score.  This is because the sentences and clauses are all very similar.  Sometimes the student just uses simple sentences.  Sometimes they use too many compound sentences. Sometimes every sentence starts with a transitional adverb.  Sometimes every sentence starts with a pronoun.  That kind of writing is boring and lacks variety.

Syntax is the arrangement of words into sentences, clauses and phrases.  We don’t just put words anywhere.  They have to be arranged properly to convey meaning, and for our sentences to be considered correct.  Of course you know that.

“Syntactic variety” refers to the use of various types of sentences, clauses and phrases.

Sentence Types

The best way to ensure that your TOEFL essay has syntactic variety is to use the three main sentence types in English: simple, compound, and complex sentences.  You may already be familiar with these.  If not, start studying.

Simple sentences look like this:

Simon took the math test.  He was totally unprepared for it.

Compound sentences look like this:

Simon took the math test, but he was totally unprepared for it.

Complex sentences look like this:

Even though Simon took the math test, he was totally unprepared for it.

Note that complex sentences seem to be most important for the purposes of establishing syntactic variety and complexity.

Beyond Sentence Types – Noun Clauses, Adverb Clauses and Adjective Clauses

You can further increase your syntactic variety through the use of noun, adverb and adjective clauses.

Noun Clauses

A noun clause is a group of words that functions like a noun. Noun clauses often start with “how” or a “wh-” word.  Like:

Why she didn’t call me is a mystery.

What I did that day surprised my family.

She listened to whatever I suggested.

These demonstrate more variety and complexity than writing:

That is a mystery.

This surprised my family.

She listened to my ideas.

Placing a noun clause in the subject position of a sentence may be considered a sign of more mature and complex writing.

Adverb Clauses

An adverb clause is a group of words that functions as an adverb.  Adverb clauses usually describe when, why, or how something happens.  Like:

Although I was short on time, I finished the project.

Before he did anything else, Matthew turned on his computer.

These are a bit more impressive than:

“Quickly, I finished the project.”

“Eagerly, Matthew turned on his computer.”

Adjective Clauses

An adjective clause (also called a relative clause) is a group of words that functions like an adjective.  It describes a noun in a sentence.  Like:

“The test, which I have taken five times, is extremely difficult.”

“My friend Simone, who is three years older than me, is currently a university freshman.”

Don’t Go Crazy

Remember that your essay might only be 20 sentences in total.  You don’t have to do all of these things.  Just include a few compound sentences and a few complex sentences.  Try to work in a few of the above clauses along the way.  

Other Things

There are other ways to achieve syntactic variety. Standardized tests that have a more human touch explicitly mention some of them in their grading rubrics.  Consider the ALP Essay Test from Columbia University, which specifically mentions such techniques as:

  • Inversion
  • Noun clauses in subject position
  • Adverb/adjective/noun clauses
  • Appositives
  • Parallelism 

 

The TOEFL writing rubrics are famously difficult to understand. Even experienced teachers have a hard time turning them into something that students can actually make use of. Today’s blog post will kick off a series that attempts to explain what the rubrics actually refer to.  Starting with…

Idiomaticity

References in the rubric to “idiomatic word choice” and “idiomatic language” are particularly difficult to grasp. The rubric says that a score-five independent essay should contain “idiomatic word choice.” Meanwhile, it notes that a score-three might have “noticeable” errors in the “use of idiomatic language.”

But what does this actually mean?

Many students (and teachers) think that ETS wants test-takers to use idioms like “it was raining cats and dogs last week” or “I won’t beat around the bush.” That is not correct; idioms of that sort are a different matter.

“Idiomaticity” is tough to define, but in this sort of context it typically refers to “the extent to which a learner’s language resembles that of a native speaker.”

This is what your teachers are hinting at when they change one of your sentences not because of a specific grammar error, but because they think some of your word choices don’t seem natural.

Here’s a sentence I recently read:

“Business owners want employees to make quick decisions, which renders stress for those who take their time.”

There aren’t any grammar errors in that sentence. But “renders” sounds weird to me. Changing that to “causes” or “creates” will increase the idiomaticity of the sentence.

Here’s another one:

“When the shopping mall opened, many local shops ceased their business.”

That’s a lot more subtle. “Ceased their business” is pretty good, but it is a little bit awkward. A native speaker would probably say something like “went out of business.”

I would even complain about something like:

“I strongly think that children should attend all of their classes”

My preferred phrasing would be something like “I strongly believe that…”.

You might think I’m being needlessly picky, but to get a perfect score (5 on the rubric, 30 scaled) you need to use the best possible words at all times.

In TOEFL essays, problems related to idiomaticity seem to come from two sources:

  1. Inexperience with the language.
  2. A desire to shove a lot of fancy words into the essays to get a higher score.

The first source is normal. No one is perfect. You can overcome this by studying. Read sample essays. Get feedback on your own writing. Try studying with a collocations book like “Collocations in Use.” Consider using a learner’s dictionary.

The second source is not normal. Ignore advice from tutors who tell you that using obscure words will increase your score. It won’t. Sentences in some of the essays I’ve read come pretty close to Noam Chomsky’s famous “colorless green ideas sleep furiously.” That’s a beautiful sentence, but no meaning can be derived from it.