At least once a week, a student asks me if they should request a TOEFL score review (for speaking or writing).  The short answer is: “probably not.”  The long answer is: “if the cost is no problem for you, go ahead and do it.”

However, if you want more information, here’s what you should note:

  1. In my experience, a score review results in an increase about 10% of the time.  The rest of the time, the score stays the same.  In rare cases, the score goes down.  This rate is the same for both speaking and writing.
  2. However, I have heard from some students who have gotten large increases, up to four points.  I can’t really explain this.
  3. During a score review, the e-rater (for writing) and SpeechRater (for speaking) are not used.  This isn’t published information, but ETS has confirmed it when asked.  This means that score reviews are done entirely by humans.  If you think that you were punished by the automated scoring systems, you might want to request a review.
  4. The person doing the score review will not see your original score.  They will not be biased by your old score, but they will probably be aware that they are doing a score review.

In terms of money and timing:

  1. The score review is really expensive: eighty dollars for one section.  However, if your score changes, this money will be refunded.  This is not a published policy, though, so it could change at any time.
  2. The score review usually takes about one week, but it could take longer.

Note that most students are not able to request a score review at all, since it is not available once your scores have been sent to institutions.  If you selected free score recipients when you registered for the test, they receive your scores right away, and you will therefore not be able to request a review.

It’s a great day, everybody!  The TOEFL Test and Score Data Summary for 2019 is available!

These annual reports provide valuable data about test taker performance.  While this year’s figures are similar to last year’s figures, the following data points were mildly interesting to me:

  • The overall mean (average) score is still 83.  That figure is rounded, however, and the unrounded mean appears to have risen slightly this year.
  • The mean reading score is now 21.2 (+.4)
  • The mean listening score is now 20.9 (+.3)
  • The mean speaking score is now 20.6 (+.1)
  • The mean writing score is now 20.5 (-.2)
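Since means are linear, the overall mean is just the sum of the four section means, which shows where the rounded 83 comes from.  A quick sketch using the figures above:

```python
# Check: the overall mean equals the sum of the per-section means
# (means are linear, so mean(total) = sum of section means).
section_means_2019 = {
    "reading": 21.2,
    "listening": 20.9,
    "speaking": 20.6,
    "writing": 20.5,
}

total = sum(section_means_2019.values())
print(round(total, 1))  # 83.2
print(round(total))     # 83, the published overall mean
```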

It is interesting that the writing score has decreased.  That may represent an ongoing trend.  Here are writing scores since 2010:

  • 2019: 20.5
  • 2018: 20.7
  • 2017: 20.8
  • 2016: 20.9
  • 2015: 20.6
  • 2014: 20.3
  • 2013: 20.6
  • 2012: No data
  • 2011: No data
  • 2010: 20.7

Some students do claim that the writing section has been getting more difficult in recent years.  They may be correct about that, but it looks like the test was at its most challenging back in 2014.  And the mean is roughly where it was a decade ago.

Interestingly, the other sections are all up since 2010.  Some by a lot:

  • Reading: 20.1 → 21.2
  • Listening: 19.5 → 20.9
  • Speaking: 20.0 → 20.6

It is also worth noting that the use of automated speaking scoring does not appear to have affected average speaking scores, but that technology was only used during the last five months of 2019.


As always, it seems like a lot of the overall increase in scores is coming from the test-prep powerhouses of East Asia.  Scores in China are +1 (to 81), scores in Japan are +1 (to 72) and scores in Taiwan are +1 (to 83).  However, scores in Korea are -1 (to 83).

Scores in the key markets of Brazil (87) and India (95) are unchanged.

I would love to see which countries have the most test-takers, but I suspect that information is confidential. 

The highest scoring country is now Austria, where the average score is 100.


Women still outperform men in listening, speaking and writing. 

 

TOEFL score data for 2018 is now available.  Download a PDF right here.

The most notable bit of data is that the mean score of all test takers reached 83 points for the first time, after being stuck at 82 points for two years.

Here is a short history of mean score progression for a few selected dates.  Note that the mean score of the TOEFL iBT has increased by four points over the life of the test, and it seems to be increasing more rapidly these days than it did before.  That probably accounts for the “required score creep” that bugs a lot of students.

  • 2006: 79
  • 2007: 78
  • 2008: 79
  • 2009: 79
  • 2010: 80
  • 2013: 81
  • 2014: 80
  • 2015: 81
  • 2016: 82
  • 2017: 82
  • 2018: 83 

Note that the data summaries from 2011 and 2012 don’t contain an overall mean score, as far as I can tell.

Score recipients have revised their requirements to keep up with these increases, which represent a challenge for all students. 

What makes this a challenge for some students more than others is that this increase is likely driven by huge jumps in countries with well-developed test preparation industries (and tons of test-takers).  For example, the mean score in Korea has jumped twelve points since 2006.  Korea has the absolute best TOEFL preparation options in the world, and it shows.  Here are scores from Korea for a few selected years:

  • 2006: 72
  • 2007: 77
  • 2010: 81
  • 2014: 84
  • 2017: 83
  • 2018: 84 

Meanwhile, scores in Taiwan have jumped 11 points:

  • 2006: 71
  • 2007: 72
  • 2010: 76
  • 2014: 80
  • 2017: 81
  • 2018: 82 

It is worth noting that scores in China have increased less dramatically, rising only four points (from 76 to 80) between 2006 and 2018.  As has been pointed out elsewhere, China has a consistency problem when it comes to the test prep industry.  It has some of the best options for students… but some of the worst as well.  It seems like things are improving for Chinese students, though, as China is likely the source of more recent increases to the overall mean score.  Note that most of China’s increase has come since 2014.

In contrast to China’s recent growth, it is worth noting that the mean score in Korea has remained about the same since 2014. This indicates that there is a limit to the benefits that students can gain from research into test design and scoring. I imagine that the mean score in Taiwan will probably top out around the same level in a few years.

Once China reaches that level as well, ETS should probably start developing the “next generation” TOEFL to replace the iBT.  If too many teachers around the world can show students how to “beat” the test and score well above their actual level, the reliability of the iBT will be called into question.

For fun, here is the growth in a few notable countries from 2006 to 2018:

  • Germany: 96 to 98
  • Brazil: 85 to 87
  • Japan: 65 to 71
  • Russia: 85 to 87
  • Iran: 78 to 85
  • India: 91 to 95

In case you are curious, the top performing countries in 2018 were the Netherlands and Switzerland.  The mean score in both of those countries was 99 points.

Today I want to write a few words about an interesting new (December, 2019) text from ETS.  “Automated Speaking Assessment” is the first book-length study of SpeechRater, which is the organization’s automated speaking assessment technology.  That makes it an extremely valuable resource for those of us who are interested in the TOEFL and how our students are assessed.  There is little in here that will make someone a better TOEFL teacher, but many readers will appreciate how it demystifies the changes to the TOEFL speaking section that were implemented in August of 2019 (that is, when the SpeechRater was put into use on the test).

I highly recommend that TOEFL teachers dive into chapter five of the book, which discusses the scoring models used in the development of SpeechRater.  Check out chapter four as well, which discusses how recorded input from students is converted into something that can actually be graded.

Chapters six, seven and eight will be the most useful for teachers.  These discuss, in turn:  features measuring fluency and pronunciation, features measuring vocabulary and grammar, and features measuring content and discourse coherence.  Experienced teachers will recognize that these three categories are quite similar to the published scoring rubrics for the TOEFL speaking section. 

In chapter six, readers will learn how the SpeechRater measures a student’s fluency by counting silences and disfluencies.  They will also learn how it handles speed, chunking and self-corrections.  These are things that could influence how teachers prepare students for this section of the test, though I suspect that most teachers don’t need a book to tell them that silences in the middle of an answer are a bad idea.  There is also a detailed depiction of how the technology judges pronunciation, though that section was a bit too academic for me to grasp.

Chapter seven discusses the grammar and vocabulary features that SpeechRater checks for.  Conveniently, the book presents them in a simple list; a diligent teacher might turn it into a checklist to provide to students.  Finally, chapter eight discusses how the software assesses topic development in student answers.

Sadly, this book was finished just before ETS started using automated speaking scoring on high-stakes assessment.  Chapter nine discusses how the technology is used to grade TOEFL practice tests (low-stakes testing), but nothing is mentioned about its use on the actual TOEFL.  I would really love to hear more about that, particularly its ongoing relationship with the human raters who grade the same responses.

Schools Accepting TOEFL MyBest Scores

Important Update from 2020: ETS is now maintaining its own list of schools and organizations that accept TOEFL MyBest Scores.  I probably won’t update my own list anymore.  You can find the official list as a PDF file right here.

The following institutions have stated publicly that they will accept TOEFL MyBest Scores. Note that this list could be out of date. It is best to contact the school you are interested in directly.

Yale Graduate School of Arts and Sciences. Source: “If you wish to send us “MyBest Scores”, we will accept them. All TOEFL scores we receive will be made available to the program reviewing your application. “

Miami University. Source: “We accept MyBest scores for the TOEFL. This means that the highest scores for each section from different TOEFL exams will determine a combined highest sum score.”
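The mechanism Miami describes (taking the highest score per section across test dates, then summing those bests) can be sketched in a few lines.  The score reports below are invented for illustration:

```python
# MyBest ("superscore") sketch: take the best score per section across
# multiple test dates, then sum those bests into one combined score.
SECTIONS = ("reading", "listening", "speaking", "writing")

def my_best(reports):
    """reports: list of dicts mapping section name -> scaled score (0-30)."""
    best = {s: max(r[s] for r in reports) for s in SECTIONS}
    best["total"] = sum(best[s] for s in SECTIONS)
    return best

# Two hypothetical score reports from different test dates:
reports = [
    {"reading": 24, "listening": 22, "speaking": 20, "writing": 23},  # June
    {"reading": 22, "listening": 25, "speaking": 21, "writing": 22},  # August
]

print(my_best(reports))
# {'reading': 24, 'listening': 25, 'speaking': 21, 'writing': 23, 'total': 93}
```

Note that neither individual test date reached a total of 93; that is exactly why some schools hesitate to accept superscores.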

Carnegie Mellon School of Design. Source: “the School of Design also accepts MyBest scores for TOEFL iBT. “

Shoreline Community College. Source: “MyBest scores are accepted.”

University of British Columbia College of Graduate Studies. Source: “The College of Graduate Studies accepts MyBest Scores.”

Northwestern (Graduate School). Source: “GS accepts the “MyBest scores”. A new reporting structure released by ETS in August 2019. These scores may be entered in the TOEFL section on the “Test Scores” page of the application form.”

University of Arizona (Graduate College). Source: “Individual MyBest scores must also be dated within 2 years of the enrollment term to be considered valid.”

University of Buffalo. Source.

CalArts. Source: “CalArts accepts “MyBest” scores delivered directly from ETS.”

San Francisco Conservatory of Music. Source: “SFCM will consider accepting the MyBest scores. We must have all score reports the MyBest scores are from submitted with the application, and the scores must be from within the past two years.”

 

This week I was lucky enough to again have an opportunity to attend a workshop hosted by ETS for TOEFL teachers.  Here is a quick summary of some of the questions that were asked by attendees of the workshop.  Note that the answers are not direct quotes, unless indicated.

 

Q:  Are scores adjusted statistically for difficulty each time the test is given?

A: Yes.  This means that there is no direct conversion from raw to scaled scores in the reading and listening sections.  The conversion depends on the performance of all students that week.
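ETS has not published its equating procedure, so the following is only a loose illustration of why the same raw score can convert to different scaled scores on different weeks.  It uses a toy linear equating that anchors each administration’s mean and spread to an invented reference scale; nothing here reflects ETS’s actual method:

```python
from statistics import mean, stdev

# Toy linear equating: map a raw score onto a fixed reference scale by
# matching each administration's mean/spread to reference values.
# (Purely illustrative; ETS's actual equating method is not public.)
REF_MEAN, REF_SD = 20.0, 5.0   # invented reference scale for one section

def equate(raw, cohort_raw_scores):
    m, s = mean(cohort_raw_scores), stdev(cohort_raw_scores)
    scaled = REF_MEAN + REF_SD * (raw - m) / s
    return max(0, min(30, round(scaled)))  # clamp to the 0-30 section scale

easy_week = [28, 30, 27, 29, 26, 28, 30, 27]   # high raw scores: easier form
hard_week = [20, 22, 18, 21, 19, 20, 23, 17]   # low raw scores: harder form

# The same raw score of 25 converts differently depending on the cohort:
print(equate(25, easy_week))  # lower scaled score on the easier form
print(equate(25, hard_week))  # higher scaled score on the harder form
```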

 

Q: Do all the individual reading and listening questions have equal weight?

A: Yes.

 

Q:  When will new editions of the Official Guide and Official iBT Test books be published?

A:  There is no timeline.

 

Q:  Are accents from outside of North America now used when the question directions are given on the test?

A: Yes.

 

Q:  How are the scores from the human raters and the SpeechRater combined?

A:  “Human scores and machine scores are optimally weighted to produce raw scores.”  This means ETS isn’t really going to answer this question.

 

Q: Can the human rater override the SpeechRater if they disagree with its score?

A: Yes.

 

Q:  How many different human raters will judge a single student’s speaking section?

A:  Each question will be judged by a different human.

 

Q:  Will students get a penalty for using the same templates as many other students?

A:   Templates “are not a problem at all.”

 

Q: Why were the question-specific levels removed from the score reports?

A: That information was deemed unnecessary.

 

Q:  Is there a “maximum” word count  in the writing section?

A:  No.

 

Q:  Is it always okay to pick more than one choice in multiple choice writing prompts?

A:  Yes.

I was able to ask a few more questions at an ETS webinar. Here’s what I learned (the answers are not direct quotes):

Q: Will results come back in six calendar days or six business days now?
A: Six calendar days.

Q: How significant are pauses when students are answering questions in the speaking section?
A: They can be very significant and can affect the score a lot.

Q: Could the same human grader score all four speaking responses?
A: No.

Q: Will a new Official Guide be published in 2019?
A: No. That has not been prioritized.

Q: Could students get only NINE reading questions with a specific reading passage?
A: Yes. This will happen if a fill-in-a-table question is given.

Q: Is it okay to mention the reading first in integrated essay body paragraphs?
A: The order “does not matter.” The scoring rubric is “not that structured.”

At the 2019 TOEFL iBT Seminar in Seoul on September 5, ETS announced details of the new “Enhanced Speaking Scoring” for the TOEFL, which has actually been in place since August 1, 2019.

In the past, speaking responses were graded by two human graders. Now, however, speaking responses are graded by one human grader along with the SpeechRater software. This software is a sort of AI that can evaluate human speech, and has been used by ETS for various tasks since about 2008. Most notably, it provided score estimates for the “TOEFL Practice Online” tests they sell to students.

According to ETS:

“From August 1, 2019, all TOEFL iBT Speaking responses are rated by both a human rater and the SpeechRater scoring engine.”

They also note:

“Human raters evaluate content, meaning, and language in a holistic manner. Automated scoring by the SpeechRater service evaluates linguistic features in an analytic manner.”

To elaborate (and this is not a quote), ETS indicated that the human scorer will check for meaning, content and language use, while the SpeechRater will check pronunciation, accent and intonation.

It is presently unknown how the human and computer scores will be combined to create a single overall score, but looking at the speaking rubric could provide a few hints. Note that in the past the human raters would assess three categories of equal weight: delivery, language use, and topic development. If the above information is accurate, the SpeechRater now assesses delivery, while the human now assesses language use and topic development. It is possible, then, that the SpeechRater provides 1/3 of the score, and that the human rater provides the other 2/3.
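If that guess were right, the per-response score would be a simple weighted average.  The function and numbers below are my own speculation, not a published ETS formula:

```python
# Speculative weighting for one speaking response, scored on the 0-4 rubric.
# Assumes SpeechRater covers delivery (1/3) and the human rater covers
# language use + topic development (2/3). This is a guess, not ETS policy.
def combined_response_score(speechrater_delivery, human_score):
    return (1 / 3) * speechrater_delivery + (2 / 3) * human_score

# Example: SpeechRater judges delivery at 3.0, the human rates 3.5 overall.
print(round(combined_response_score(3.0, 3.5), 2))  # 3.33
```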

I will provide more information as I get it. In the meantime, check out the following video for more news and speculation.