Well, I reported a few days ago on the impressive increase in the mean TOEFL score found in the data released by ETS.  I expressed some puzzlement at the increase, as it is pretty huge.  I’m still not entirely certain why it happened, but after talking it out with some experts, my conclusions are:

  1. The change is mostly due to the shorter test.  I guess the shorter version is “easier.”  While the reported mean score did not change in 2019, that was partly because of rounding and partly because of a drop in the mean writing score.  A careful look at the 2019 data shows fractional increases in the other sections that hint at this trend.
  2. ETS may have adjusted the e-rater, which scores essays.  That is a normal thing; I believe they are on iteration 19 or something like that.  I suspect the adjustment caused writing scores to increase, which accounts for about 25% of the overall increase… and the shorter test should have had no effect on writing, since that section was not shortened.  Perhaps ETS wanted to address the long-term drop in average writing scores.
  3. The increase is caused in large part by China (presumably the number one TOEFL market) and Korea (presumably the number two TOEFL market). Increases in the mean score probably reflect advances in preparation techniques in those countries. Coincidentally I spent the month before the score data release reporting on those advances.

Let me know what you think in the comments.

TOEFL score data for 2020 is available!  As regular readers of the blog will know, this is my favorite day of the year!  You can download your copy from ETS.

Scores are way up this year.  I don’t know why.

The overall mean (average) score is now 87.  That is an increase of four points, which is quite a big jump.  Here’s the history of the average TOEFL score:

  • 2006: 79
  • 2007: 78
  • 2008: 79
  • 2009: 79
  • 2010: 80
  • 2011 (not available)
  • 2012 (not available)
  • 2013: 81
  • 2014: 80
  • 2015: 81
  • 2016: 82
  • 2017: 82
  • 2018: 83 
  • 2019: 83
  • 2020: 87

As you can see, it took thirteen years for the average score to increase from 79 to 83.  That four-point gain was matched in 2020 alone.
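
For anyone who wants to check the math, here is a quick Python sketch that computes the year-over-year changes from the list above (2011 and 2012 are skipped because ETS published no overall mean for those years):

  # Mean TOEFL iBT scores by year, copied from the list above.
  means = {2006: 79, 2007: 78, 2008: 79, 2009: 79, 2010: 80,
           2013: 81, 2014: 80, 2015: 81, 2016: 82, 2017: 82,
           2018: 83, 2019: 83, 2020: 87}

  years = sorted(means)
  for prev, curr in zip(years, years[1:]):
      print(f"{prev} -> {curr}: {means[curr] - means[prev]:+d}")

  print(means[2019] - means[2006])  # 4 points over thirteen years
  print(means[2020] - means[2019])  # 4 points in 2020 alone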

Obviously this year there are also large jumps in the section scores:

  • The mean reading score is now 22.2 (+1.0)
  • The mean listening score is now 22.3 (+1.4)
  • The mean speaking score is now 21.2 (+.6)
  • The mean writing score is now 20.5 (+1.0)

Last year, the section score changes were much smaller. They were (respectively): +.4, +.3, +.1, -.2.

The jumps in 2020 alone are comparable to the jumps I recorded in the nine years from 2010 to 2019.
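
Since the overall score is simply the sum of the four section scores, the section increases should account for the entire jump.  A quick sanity check using the 2020 numbers above (reported means are rounded, so small discrepancies are possible):

  # 2020 section score increases, as reported above.
  increases = {"reading": 1.0, "listening": 1.4, "speaking": 0.6, "writing": 1.0}
  print(sum(increases.values()))  # 4.0 -- matches the four-point jump in the overall mean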

As you guys know, I like to study geographic trends, particularly those in China, Korea and Japan.  Here’s what I spotted:

  • The mean score in Korea is now 86 (+3)
  • The mean score in China is now 87 (+6) !!!
  • The mean score in Japan is now 73 (+1)
  • The mean score in Taiwan is now 85 (+2)

I must point out that in the thirteen years between 2006 and 2019 the average score in China increased by five points.  In 2020 alone the increase was six points.

Scores in other key markets have increased as well:

  • The mean score in Brazil is now 90 (+3)
  • The mean score in India is now 96 (+1)
  • The mean score in the United States is now 93 (+2)

The top performing country this year is Austria, with an average score of 102 (+2).

It appears that China is driving much of the overall increase.  In case you are curious, the section increases there are: Reading +2, Listening +2, Speaking no change, Writing +2.  Those add up to the six-point jump in China’s overall mean.

I’m going to do some more digging and some more calling in the weeks ahead.  I want to know more about these dramatic changes.

ETS has just uploaded a chart to convert between TOEFL iBT and TOEFL Essentials scores.  I’ve copied it here for you, but be sure to check out the main TOEFL Essentials Page for more information, including conversion charts for each section of the test.

Soon I will start a list of schools that accept the test, and I will maintain it until ETS publishes their own list.

Hey, here’s something really amazing.

ETS has created a new subsidiary called EdAgree.  EdAgree is described as

…an advocate for international students providing a path to help students identify universities that will push them towards longer term success. We help you put your best foot forward during the admissions process and support you throughout your study abroad and beyond. 

As part of this mission, they provide free English speaking practice using the same SpeechRater technology that is used to grade the TOEFL!  

To access this opportunity, register for a free account on EdAgree.  After that, look for the “English Speaking Practice” button in the student dashboard.  The screenshot is from the desktop version, but it also works on mobile.

This section provides a complete set of four TOEFL speaking questions.  After you answer them, you’ll get a SpeechRater score in several different categories (pause frequency, distribution of pauses, repetitions, rhythm, response length, speaking rate, sustained speech, vocabulary depth, vocabulary diversity, vowels).  These categories are used on the real TOEFL to determine your score!  You can also listen to recordings of your answers.  Note that your responses are scored collectively, rather than individually.  That means, for example, that you get a “pause frequency” score for how you answered all four questions, and not a separate “pause frequency” score for each individual answer.

Update: The list of categories above has been revised a few times, as EdAgree has tweaked the tool.
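
To make the “scored collectively” point concrete, here is a rough Python sketch of the shape of a report.  The field names and values are my own invention, not EdAgree’s actual data format; the point is that there is one score per category for the whole four-question set, not one per answer:

  # Hypothetical shape of an EdAgree practice report (not their real format).
  # There is one score per category for the entire set of four responses,
  # rather than four separate per-question scores.
  report = {
      "responses": ["q1.wav", "q2.wav", "q3.wav", "q4.wav"],  # scored together
      "scores": {
          "pause_frequency": 3.2,       # invented example values
          "speaking_rate": 3.8,
          "vocabulary_diversity": 3.5,
          # ...one entry for each category listed above
      },
  }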

Note that you will get fresh questions every five days, though you can repeat the current set as many times as you wish.  I do not know how many unique sets there are in total, so keep visiting and let me know.

I took a set a few days ago, and the questions were pretty good.  They weren’t 100% the same as the real TOEFL, but they were better than what is found in most textbooks. 

It should also be noted that you could probably answer your own questions instead of the ones provided.  Since you are scored on technical features of your speech rather than on content, the scores should still be meaningful no matter what question you answer.

Let me know if you guys enjoy the tool.  Meanwhile, here is my first set of results.  I still have room for improvement, as you can see!

Note:  This screenshot does not include all of the categories mentioned above, as they were not available when the service started.

EdAgree SpeechRater

Here’s a mildly interesting article about student responses to speaking question three.  The authors have charted out the structure of two sample questions provided by ETS, and tracked how many of the main ideas students of various levels included in their answers (the student responses were also provided by ETS).

There is some good stuff in here for TOEFL teachers, particularly in how the authors map out the progression of “idea units” in the source materials.  They identified how test-takers of various levels represented these idea units in their answers, and in particular how many of them they included.  Fluent speakers (or, I guess, proficient test-takers) represented more of the idea units, and also presented them in about the same order as in the sources.

Something I found quite striking is that one of the question sets studied was much easier than the other, a difference the authors themselves describe.  I am left wondering how ETS deals with this sort of thing, since the rubric doesn’t really have room to adjust for question difficulty changing week by week.

There is also a podcast interview with one of the authors.

At least once a week, a student asks me if they should get a TOEFL score review (for speaking or writing).   The short answer is: “probably not.”  The long answer is:  “if the cost is no problem, you should do it.”

However, if you want more information, here’s what you should note:

  1. In my experience, a score review results in an increase about 10% of the time.  The rest of the time, the score stays the same.  In rare cases, the score goes down.  This rate is the same for both speaking and writing.
  2. However, I have heard from some students who have gotten large increases, up to four points.  I can’t really explain this.
  3. ETS says it takes 1 to 3 weeks.  Lately, though, score reviews seem to get finished in just two or three days.  But don’t plan on getting one back so quickly.
  4. During a score review, the e-rater (for writing) and SpeechRater (for speaking) are not used.  This isn’t published information, but ETS has confirmed it when asked.  This means that score reviews are done entirely by humans.  If you think that you were punished by the automated scoring systems, you might want to request a review.
  5. The person doing the score review will not see your original score.  They will not be biased by your old score, but they will probably be aware that they are doing a score review.

In terms of money: a score review is really expensive, at eighty dollars for one section.  However, if your score changes, this money will be returned to you.  This is not a published policy, though, so it could change at any time.
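
If you want to think about the fee in expected-value terms, here is the rough arithmetic, using my ~10% estimate from above and assuming the refund-on-change policy holds:

  # Rough expected cost of a score review, based on the numbers above.
  fee = 80.00        # dollars, for one section
  p_change = 0.10    # my rough estimate of how often the score changes
  # The fee is refunded if the score changes (unpublished policy -- it may change).
  expected_cost = (1 - p_change) * fee
  print(expected_cost)  # 72.0 dollars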

Note that most students are not able to request a score review, since it is not possible once your scores have been sent to institutions.  If you selected free score recipients when you signed up for the test, they will get your scores right away, and you will therefore not be able to request a review.

It’s a great day, everybody!  The TOEFL Test and Score Data Summary for 2019 is available!

These annual reports provide valuable data about test taker performance.  While this year’s figures are similar to last year’s figures, the following data points were mildly interesting to me:

  • The overall mean (average) score is still 83.  But that figure is rounded, and it looks like there was still a significant fractional increase this year.
  • The mean reading score is now 21.2 (+.4)
  • The mean listening score is now 20.9 (+.3)
  • The mean speaking score is now 20.6 (+.1)
  • The mean writing score is now 20.5 (-.2)

It is interesting that the writing score has decreased.  That may represent an ongoing trend.  Here are writing scores since 2010:

  • 2019: 20.5
  • 2018: 20.7
  • 2017: 20.8
  • 2016: 20.9
  • 2015: 20.6
  • 2014: 20.3
  • 2013: 20.6
  • 2012: No data
  • 2011: No data
  • 2010: 20.7

Some students do claim that the writing section has been getting more difficult in recent years.  They may be correct about that, but judging by the scores, the test was most challenging back in 2014.  And the mean is just about where it was a decade ago.

Interestingly, the other sections are all up since 2010.  Some by a lot:

  • Reading: 20.1 → 21.2
  • Listening: 19.5 → 20.9
  • Speaking: 20.0 → 20.6

It is also worth noting that the use of automated speaking scoring does not appear to have affected average speaking scores, but that technology was only used during the last five months of 2019.


As always, it seems like a lot of the overall increase in scores is coming from the test-prep powerhouses of East Asia.  Scores in China are +1 (to 81), scores in Japan are +1 (to 72) and scores in Taiwan are +1 (to 83).  However, scores in Korea are -1 (to 83).

Scores in the key markets of Brazil (87) and India (95) are unchanged.

I would love to see which countries have the most test-takers, but I suspect that information is confidential. 

The highest scoring country is now Austria, where the average score is 100.


Women still outperform men in listening, speaking and writing.

TOEFL score data for 2018 is now available.  Download a PDF right here.

The most notable bit of data is that the mean score of all test takers reached 83 points for the first time, after being stuck at 82 points for two years.

Here is a short history of mean score progression for a few selected years.  Note that the mean score of the TOEFL iBT has increased by four points over the life of the test.  It also seems to be increasing more rapidly these days than it did before.  That probably accounts for the “required score creep” that bugs a lot of students.

  • 2006: 79
  • 2007: 78
  • 2008: 79
  • 2009: 79
  • 2010: 80
  • 2013: 81
  • 2014: 80
  • 2015: 81
  • 2016: 82
  • 2017: 82
  • 2018: 83 

Note that the data summaries from 2011 and 2012 don’t contain an overall mean score, as far as I can tell.

Score recipients have revised their requirements to keep up with these increases, which represent a challenge for all students. 

What makes this a challenge for some students more than others is that this increase is likely driven by huge jumps in countries with well-developed test preparation industries (and tons of test-takers).  For example, the mean score in Korea has jumped twelve points since 2006.  Korea has the absolute best TOEFL preparation options in the world, and it shows.  Here are scores from Korea for a few selected years:

  • 2006: 72
  • 2007: 77
  • 2010: 81
  • 2014: 84
  • 2017: 83
  • 2018: 84 

Meanwhile, scores in Taiwan have jumped 11 points:

  • 2006: 71
  • 2007: 72
  • 2010: 76
  • 2014: 80
  • 2017: 81
  • 2018: 82 

It is worth noting that scores in China have increased less dramatically, rising only four points from 76 to 80 between 2006 and 2018.  As has been pointed out elsewhere, China has a consistency problem when it comes to the test prep industry.  They have some of the best options for students… but some of the worst as well.  It seems like things are improving for Chinese students, though, as China is likely the source of more recent increases to the overall mean score.  Note that most of China’s increase has come since 2014.

In contrast to China’s recent growth, it is worth noting that the mean score in Korea has remained about the same since 2014. This indicates that there is a limit to the benefits that students can gain from research into test design and scoring. I imagine that the mean score in Taiwan will probably top out around the same level in a few years.

Once China reaches that level as well, ETS should probably start developing the “next generation” TOEFL to replace the iBT.  If there are too many teachers around the world who can show students how to “beat” the test and score way above their actual level, the reliability of the iBT will be called into question.

For fun, here is the growth in a few notable countries from 2006 to 2018:

  • Germany: 96 to 98
  • Brazil: 85 to 87
  • Japan: 65 to 71
  • Russia: 85 to 87
  • Iran: 78 to 85
  • India: 91 to 95

In case you are curious, the top performing countries in 2018 were the Netherlands and Switzerland.  The mean score in both of those countries was 99 points.

Today I want to write a few words about an interesting new (December 2019) text from ETS.  “Automated Speaking Assessment” is the first book-length study of SpeechRater, the organization’s automated speaking scoring technology.  That makes it an extremely valuable resource for those of us who are interested in the TOEFL and how our students are assessed.  There is little in here that will make someone a better TOEFL teacher, but many readers will appreciate how it demystifies the changes to the TOEFL speaking section that were implemented in August of 2019 (that is, when the SpeechRater was put into use on the test).

I highly recommend that TOEFL teachers dive into chapter five of the book, which discusses the scoring models used in the development of SpeechRater.  Check out chapter four as well, which discusses how recorded input from students is converted into something that can actually be graded.

Chapters six, seven and eight will be the most useful for teachers.  These discuss, in turn:  features measuring fluency and pronunciation, features measuring vocabulary and grammar, and features measuring content and discourse coherence.  Experienced teachers will recognize that these three categories are quite similar to the published scoring rubrics for the TOEFL speaking section. 

In chapter six readers will learn about how the SpeechRater measures the fluency of a student by counting silences and disfluencies.  They will also learn about how it handles speed, chunking and self-corrections.  These are things that could influence how teachers prepare students for this section of the test, though I suspect that most teachers don’t need a book to tell them that silences in the middle of an answer are a bad idea.  There is also a detailed depiction of how the technology judges pronunciation, though that section was a bit too academic for me to grasp.

Chapter seven discusses the grammar and vocabulary features that SpeechRater checks for; helpfully, they are presented in a simple list, so a diligent teacher could turn them into a checklist to provide to students.  Finally, chapter eight discusses how the software assesses topic development in student answers.
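
To give a flavor of what features like these look like in practice, here is a toy Python sketch of two of them: a pause count based on word timestamps (a fluency feature, in the spirit of chapter six) and a type-token ratio (a crude vocabulary-diversity measure, in the spirit of chapter seven).  This is my own illustration, not SpeechRater’s actual code:

  # Toy versions of two SpeechRater-style features.  Illustrative only --
  # the book describes the real features; this is not ETS's implementation.

  def pause_count(word_timings, threshold=0.5):
      """Count silent gaps longer than `threshold` seconds between words.
      `word_timings` is a list of (start, end) times, one pair per word."""
      return sum(1 for (_, end), (start, _) in zip(word_timings, word_timings[1:])
                 if start - end > threshold)

  def type_token_ratio(transcript):
      """Distinct words divided by total words."""
      words = transcript.lower().split()
      return len(set(words)) / len(words) if words else 0.0

  print(pause_count([(0.0, 0.4), (0.5, 0.9), (1.8, 2.2)]))    # 1 (the 0.9 -> 1.8 gap)
  print(round(type_token_ratio("the test is the test"), 2))   # 0.6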

Sadly, this book was finished just before ETS started using automated speaking scoring on a high-stakes assessment.  Chapter nine discusses how the technology is used to grade TOEFL practice tests (low-stakes testing), but nothing is mentioned about its use on the actual TOEFL.  I would really love to hear more about that, particularly about its ongoing relationship with the human raters who grade the same responses.

Schools Accepting TOEFL MyBest Scores

Important Update from 2020: ETS is now maintaining its own list of schools and organizations that accept TOEFL MyBest Scores.  I probably won’t update my own list anymore.  You can find the official list as a PDF file right here.

The following institutions have stated publicly that they will accept TOEFL MyBest Scores. Note that this list could be out of date. It is best to contact the school you are interested in directly.

Yale Graduate School of Arts and Sciences. Source: “If you wish to send us “MyBest Scores”, we will accept them. All TOEFL scores we receive will be made available to the program reviewing your application. “

Miami University. Source: “We accept MyBest scores for the TOEFL. This means that the highest scores for each section from different TOEFL exams will determine a combined highest sum score.”

Carnegie Mellon School of Design. Source: “the School of Design also accepts MyBest scores for TOEFL iBT. “

Shoreline Community College. Source: “MyBest scores are accepted.”

University of British Columbia College of Graduate Studies. Source: “The College of Graduate Studies accepts MyBest Scores.”

Northwestern (Graduate School). Source: “GS accepts the “MyBest scores”. A new reporting structure released by ETS in August 2019. These scores may be entered in the TOEFL section on the “Test Scores” page of the application form.”

University of Arizona (Graduate College). Source: “Individual MyBest scores must also be dated within 2 years of the enrollment term to be considered valid.”

University of Buffalo. Source.

CalArts. Source: “CalArts accepts “MyBest” scores delivered directly from ETS.”

San Francisco Conservatory of Music. Source: “SFCM will consider accepting the MyBest scores. We must have all score reports the MyBest scores are from submitted with the application, and the scores must be from within the past two years.”

This week I was lucky enough to once again attend a workshop hosted by ETS for TOEFL teachers.  Here is a quick summary of some of the questions asked by attendees.  Note that the answers are not direct quotes, unless indicated.

Q:  Are scores adjusted statistically for difficulty each time the test is given?

A: Yes.  This means that there is no direct conversion from raw to scaled scores in the reading and listening sections.  The conversion depends on the performance of all students that week.
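
ETS does not publish its equating procedure, but to illustrate the general idea, here is a toy Python sketch of linear equating: raw scores from a given test form are mapped onto a reference scale by lining up that form’s score distribution with a reference distribution.  This is my own illustration of the concept, not ETS’s actual method.

  # Toy linear equating: map a raw score from one test form onto a reference
  # scale by matching the form's mean and standard deviation to reference values.
  # Purely illustrative -- ETS does not publish its actual equating method.
  def equate(raw, form_mean, form_sd, ref_mean, ref_sd):
      return ref_mean + (raw - form_mean) * (ref_sd / form_sd)

  # The same raw score on a harder form (lower mean) maps to a higher scaled value:
  print(equate(25, form_mean=20, form_sd=5, ref_mean=22, ref_sd=5))  # 27.0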

Q: Do all the individual reading and listening questions have equal weight?

A: Yes.

Q:  When will new editions of the Official Guide and Official iBT Test books be published?

A:  There is no timeline.

Q:  Are accents from outside of North America now used when the question directions are given on the test?

A: Yes.

Q:  How are the scores from the human raters and the SpeechRater combined?

A:  “Human scores and machine scores are optimally weighted to produce raw scores.”  This means ETS isn’t really going to answer this question.
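
We don’t know the weights, but “optimally weighted” presumably means something like a weighted average of the two ratings.  As a purely illustrative sketch (the 0.7 weight below is invented, not a disclosed figure):

  # Illustrative only: combine a human rating and a SpeechRater rating with a
  # fixed weight.  ETS has not disclosed its actual weighting scheme.
  def combine(human, machine, human_weight=0.7):  # the 0.7 is made up
      return human_weight * human + (1 - human_weight) * machine

  print(round(combine(3.0, 3.5), 2))  # 3.15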

Q: Can the human rater override the SpeechRater if they disagree with its score?

A: Yes.

Q:  How many different human raters will judge a single student’s speaking section?

A:  Each question will be judged by a different human.

Q:  Will students be penalized for using the same templates as many other students?

A:   Templates “are not a problem at all.”

Q: Why were the question-specific levels removed from the score reports?

A: That information was deemed unnecessary.

Q:  Is there a “maximum” word count in the writing section?

A:  No.

Q:  Is it always okay to pick more than one choice in multiple choice writing prompts?

A:  Yes.