TOEFL score data for 2018 is now available.  Download a PDF right here.

The most notable bit of data is that the mean score of all test takers reached 83 points for the first time, after being stuck at 82 points for two years.

Here is a short history of mean score progression for a few selected years.  Note that the mean score of the TOEFL iBT has increased by four points over the life of the test.  It also seems to be increasing more rapidly these days than it did in the past.  That probably accounts for the “required score creep” that bugs a lot of students.

  • 2006: 79
  • 2007: 78
  • 2008: 79
  • 2009: 79
  • 2010: 80
  • 2013: 81
  • 2014: 80
  • 2015: 81
  • 2016: 82
  • 2017: 82
  • 2018: 83 

Note that the data summaries from 2011 and 2012 don’t contain an overall mean score, as far as I can tell.

Score recipients have revised their requirements to keep up with these increases, which represent a challenge for all students. 

What makes this a challenge for some students more than others is that this increase is likely driven by huge jumps in countries with well-developed test preparation industries (and tons of test-takers).  For example, the mean score in Korea has jumped twelve points since 2006.  Korea has the absolute best TOEFL preparation options in the world, and it shows.  Here are scores from Korea for a few selected years:

  • 2006: 72
  • 2007: 77
  • 2010: 81
  • 2014: 84
  • 2017: 83
  • 2018: 84 

Meanwhile, scores in Taiwan have jumped eleven points:

  • 2006: 71
  • 2007: 72
  • 2010: 76
  • 2014: 80
  • 2017: 81
  • 2018: 82 

It is worth noting that scores in China have increased less dramatically, rising only four points (from 76 to 80) between 2006 and 2018.  As has been pointed out elsewhere, China has a consistency problem when it comes to the test prep industry: it has some of the best options for students… but some of the worst as well. It seems like things are improving for Chinese students, though, as China is likely the source of the more recent increases to the overall mean score. Note that most of China’s increase has come since 2014.

In contrast to China’s recent growth, it is worth noting that the mean score in Korea has remained about the same since 2014. This suggests there is a limit to the benefits students can gain from research into test design and scoring. I imagine the mean score in Taiwan will top out around the same level in a few years.

Once China reaches that level as well, ETS should probably start developing the “next generation” TOEFL to replace the iBT. If there are too many teachers around the world who can show students how to “beat” the test and score well above their actual level, the reliability of the iBT will be called into question.

For fun, here is the growth in a few notable countries from 2006 to 2018:

  • Germany: 96 to 98
  • Brazil: 85 to 87
  • Japan: 65 to 71
  • Russia: 85 to 87
  • Iran: 78 to 85
  • India: 91 to 95

In case you are curious, the top-performing countries in 2018 were the Netherlands and Switzerland.  The mean score in both of those countries was 99 points.

Today I want to write a few words about an interesting new (December 2019) text from ETS.  “Automated Speaking Assessment” is the first book-length study of SpeechRater, the organization’s automated speaking assessment technology.  That makes it an extremely valuable resource for those of us who are interested in the TOEFL and how our students are assessed.  There is little in here that will make someone a better TOEFL teacher, but many readers will appreciate how it demystifies the changes to the TOEFL speaking section that were implemented in August 2019 (that is, when the SpeechRater was put into use on the test).

I highly recommend that TOEFL teachers dive into chapter five of the book, which discusses the scoring models used in the development of SpeechRater.  Check out chapter four as well, which discusses how recorded input from students is converted into something that can actually be graded.

Chapters six, seven and eight will be the most useful for teachers.  These discuss, in turn:  features measuring fluency and pronunciation, features measuring vocabulary and grammar, and features measuring content and discourse coherence.  Experienced teachers will recognize that these three categories are quite similar to the published scoring rubrics for the TOEFL speaking section. 

In chapter six readers will learn about how the SpeechRater measures the fluency of a student by counting silences and disfluencies.  They will also learn about how it handles speed, chunking and self-corrections.  These are things that could influence how teachers prepare students for this section of the test, though I suspect that most teachers don’t need a book to tell them that silences in the middle of an answer are a bad idea.  There is also a detailed depiction of how the technology judges pronunciation, though that section was a bit too academic for me to grasp.
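
Just to make those fluency features concrete, here is a toy sketch of how pause and filler counts might be computed from speech recognizer output. To be clear, this is my own invention for illustration: the filler list, the half-second pause threshold, and the feature names are all hypothetical and have nothing to do with ETS’s actual implementation.

```python
# Toy illustration of the fluency features described in chapter six.
# This is NOT SpeechRater's code: the filler list, the 0.5-second pause
# threshold, and the feature names are all invented for this example.

FILLERS = {"um", "uh", "er"}  # hypothetical disfluency tokens

def fluency_features(words):
    """words: list of (token, start_sec, end_sec) tuples from a recognizer."""
    total_time = words[-1][2] - words[0][1]
    rate = len(words) / total_time  # words per second
    # Count noticeable silences between consecutive words.
    long_pauses = sum(
        1
        for (_, _, prev_end), (_, next_start, _) in zip(words, words[1:])
        if next_start - prev_end > 0.5
    )
    fillers = sum(1 for token, _, _ in words if token.lower() in FILLERS)
    return {"words_per_sec": round(rate, 2),
            "long_pauses": long_pauses,
            "fillers": fillers}

sample = [("well", 0.0, 0.3), ("um", 0.4, 0.6), ("I", 1.5, 1.6),
          ("agree", 1.7, 2.1), ("with", 2.2, 2.4), ("the", 2.4, 2.5),
          ("professor", 2.5, 3.1)]
print(fluency_features(sample))
# {'words_per_sec': 2.26, 'long_pauses': 1, 'fillers': 1}
```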

Chapter seven discusses the grammar and vocabulary features that SpeechRater checks for.  Impressively, the book simply presents them in a list.  A diligent teacher might turn that list into a checklist to provide to students. Finally, chapter eight discusses how the software assesses topic development in student answers.

Sadly, this book was finished just before ETS started using automated speaking scoring on a high-stakes assessment.  Chapter nine discusses how the technology is used to grade TOEFL practice tests (low-stakes testing), but nothing is mentioned about its use on the actual TOEFL.  I would really love to hear more about that, particularly about its ongoing relationship with the human raters who grade the same responses.

Schools Accepting TOEFL MyBest Scores

Important Update from 2020: ETS is now maintaining its own list of schools and organizations that accept TOEFL MyBest Scores.  I probably won’t update my own list anymore.  You can find the official list as a PDF file right here.

The following institutions have stated publicly that they will accept TOEFL MyBest Scores. Note that this list could be out of date. It is best to contact the school you are interested in directly.

Yale Graduate School of Arts and Sciences. Source: “If you wish to send us ‘MyBest Scores’, we will accept them. All TOEFL scores we receive will be made available to the program reviewing your application.”

Miami University. Source: “We accept MyBest scores for the TOEFL. This means that the highest scores for each section from different TOEFL exams will determine a combined highest sum score.”

Carnegie Mellon School of Design. Source: “the School of Design also accepts MyBest scores for TOEFL iBT.”

Shoreline Community College. Source: “MyBest scores are accepted.”

University of British Columbia College of Graduate Studies. Source: “The College of Graduate Studies accepts MyBest Scores.”

Northwestern (Graduate School). Source: “GS accepts the “MyBest scores”. A new reporting structure released by ETS in August 2019. These scores may be entered in the TOEFL section on the “Test Scores” page of the application form.”

University of Arizona (Graduate College). Source: “Individual MyBest scores must also be dated within 2 years of the enrollment term to be considered valid.”

University at Buffalo. Source.

CalArts. Source: “CalArts accepts “MyBest” scores delivered directly from ETS.”

San Francisco Conservatory of Music. Source: “SFCM will consider accepting the MyBest scores. We must have all score reports the MyBest scores are from submitted with the application, and the scores must be from within the past two years.”

 

This week I was lucky enough to again have an opportunity to attend a workshop hosted by ETS for TOEFL teachers.  Here is a quick summary of some of the questions that were asked by attendees of the workshop.  Note that the answers are not direct quotes, unless indicated.

 

Q:  Are scores adjusted statistically for difficulty each time the test is given?

A: Yes.  This means that there is no direct conversion from raw to scaled scores in the reading and listening sections.  The conversion depends on the performance of all students that week.
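
To illustrate why that rules out a fixed conversion table, here is a deliberately crude sketch of a cohort-dependent rescaling. ETS has not published its actual equating procedure, so everything below (the percentile approach, the numbers, the function name) is my own invention, meant only to show why the same raw score can land differently in different weeks.

```python
# A crude illustration of cohort-dependent scaling. ETS's real method is
# not public; this percentile approach is invented for demonstration only.

def scaled_score(raw, cohort, max_scaled=30):
    """Map a raw score to the 0-30 scale based on its rank within the cohort."""
    percentile = sum(1 for r in cohort if r < raw) / len(cohort)
    return round(percentile * max_scaled)

easy_week = [18, 20, 22, 24, 25, 26, 27, 28]  # most test takers scored high
hard_week = [10, 12, 14, 15, 16, 18, 20, 22]  # the same raw 22 ranks higher here

print(scaled_score(22, easy_week))  # 8  -> a raw 22 is unremarkable this week
print(scaled_score(22, hard_week))  # 26 -> the identical raw 22 scales much higher
```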

 

Q: Do all the individual reading and listening questions have equal weight?

A: Yes.

 

Q:  When will new editions of the Official Guide and Official iBT Test books be published?

A:  There is no timeline.

 

Q:  Are accents from outside of North America now used when the question directions are given on the test?

A: Yes.

 

Q:  How are the scores from the human raters and the SpeechRater combined?

A:  “Human scores and machine scores are optimally weighted to produce raw scores.”  This means ETS isn’t really going to answer this question.

 

Q: Can the human rater override the SpeechRater if he disagrees with its score?

A: Yes.

 

Q:  How many different human raters will judge a single student’s speaking section?

A:  Each question will be judged by a different human.

 

Q:  Will students get a penalty for using the same templates as many other students?

A:   Templates “are not a problem at all.”

 

Q: Why were the question-specific levels removed from the score reports?

A: That information was deemed unnecessary.

 

Q:  Is there a “maximum” word count in the writing section?

A:  No.

 

Q:  Is it always okay to pick more than one choice in multiple choice writing prompts?

A:  Yes.

I was able to ask a few more questions at an ETS webinar. Here’s what I learned (the answers are not direct quotes):

Q: Will results come back in six calendar days or six business days now?
A: Six calendar days.

Q: How significant are pauses when students are answering questions in the speaking section?
A: They can be very significant and can affect the score a lot.

Q: Could the same human grader score all four speaking responses?
A: No.

Q: Will a new Official Guide be published in 2019?
A: No. That has not been prioritized.

Q: Could students get only NINE reading questions with a specific reading passage?
A: Yes. This will happen if a fill-in-a-table question is given.

Q: Is it okay to mention the reading first in integrated essay body paragraphs?
A: The order “does not matter.” The scoring rubric is “not that structured.”

At the 2019 TOEFL iBT Seminar in Seoul on September 5, ETS announced details of the new “Enhanced Speaking Scoring” for the TOEFL, which has actually been in place since August 1, 2019.

In the past, speaking responses were graded by two human graders. Now, however, speaking responses are graded by one human grader along with the SpeechRater software. This software is a sort of AI that can evaluate human speech, and has been used by ETS for various tasks since about 2008. Most notably, it provided score estimates for the “TOEFL Practice Online” tests they sell to students.

According to ETS:

“From August 1, 2019, all TOEFL iBT Speaking responses are rated by both a human rater and the SpeechRater scoring engine.”

They also note:

“Human raters evaluate content, meaning, and language in a holistic manner. Automated scoring by the SpeechRater service evaluates linguistic features in an analytic manner.”

To elaborate (and this is not a quote), ETS indicated that the human scorer will check for meaning, content and language use, while the SpeechRater will check pronunciation, accent and intonation.

It is presently unknown how the human and computer scores will be combined to create a single overall score, but looking at the speaking rubric could provide a few hints. Note that in the past the human raters would assess three categories of equal weight: delivery, language use, and topic development. If the above information is accurate, the SpeechRater now assesses delivery, while the human rater now assesses language use and topic development. It is possible, then, that the SpeechRater provides 1/3 of the score, and that the human rater provides the other 2/3.
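
If that speculation is correct, combining the two scores would be simple weighted arithmetic. The sketch below is entirely my own guesswork: ETS has confirmed neither the weights nor the scale on which they are applied.

```python
# Back-of-the-envelope version of the 1/3 - 2/3 speculation above. The
# weights and the 0-4 rubric scale reflect my guess, not a confirmed formula.

SPEECHRATER_WEIGHT = 1 / 3  # delivery only
HUMAN_WEIGHT = 2 / 3        # language use + topic development

def combined_score(human, machine):
    """Both inputs on the 0-4 speaking rubric scale; returns the blended score."""
    return HUMAN_WEIGHT * human + SPEECHRATER_WEIGHT * machine

# e.g. the human rater gives a 3, SpeechRater's delivery estimate is a 4:
print(round(combined_score(3, 4), 2))  # 3.33
```
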

I will provide more information as I get it. In the meantime, check out the following video for more news and speculation.