As I reported yesterday, ETS (formerly the Educational Testing Service) is seeking a new executive director for the Office of Testing Integrity.  If I were to advise the incoming director, I would recommend the following changes.

1. Staff up. Staff way up. Administrative review for TOEFL tests is supposed to finish in 2-4 weeks. I often hear from students who have waited for much longer. One student who spoke to me recently waited for 102 days.  Update (September, 2022): the student is still waiting.  197 days now.

2. Help test-takers help themselves. I often hear from students who have experienced score cancellations due to unauthorized software running in the background. Remember that in the Windows 10+ era it is a lot harder to control what goes on in the background of our systems than it used to be. Needless to say, modern versions of Windows are built in a way that makes remote proctoring a challenge. Duolingo recently produced a little video showing students a few ways to avoid such problems. The OTI should have made the same sort of content two years ago.

3. Reconsider the use of statistical data as a justification for score cancellations. There are very valid reasons why a student might, for example, have a speaking score much lower than their listening score. Some of those reasons are cultural. Think about that for a moment.

As always, ETS, you know how to reach me. In lieu of a consulting fee I’m willing to accept meal vouchers for the ETS cafeteria.

Updated: March 10, 2023

Students often ask me why their TOEFL scores were canceled, and how they can reinstate them.  Here’s what you need to know.

When your scores are canceled, you’ll see something like “Scores Canceled” in your ETS account.  It will look like this:

TOEFL Scores Cancelled

There are several possible causes.

(Note that this is different from scores being “on hold” or “in administrative review.” If that is your problem, read this blog post.)

Scores Canceled Accidentally

Sometimes, scores are canceled because the test-taker accidentally clicked the “do not report scores” button at the end of the test.  This sounds silly, but I hear about it every week.  Seriously.  Scores will not be sent to score recipients if they are canceled, of course.

If you accidentally canceled your scores you can pay $20 to reinstate them via your account on the TOEFL website. It might take up to three weeks for your scores to be reinstated (source).  

Scores Canceled Because of Inappropriate Test-Taker Behavior

If you do something inappropriate during the test, your scores will be canceled.  You will probably not be given the chance to appeal, and I have never heard of this decision being reversed.  Rule violations might include touching your phone during the test (or break), running some inappropriate software in the background (see below), talking to someone, wearing jewelry, or even looking away from the screen too long.  You’d better follow the rules.

Sometimes, ETS detects inappropriate software running on your computer during the test.  Such software includes Microsoft Teams, Skype, Discord, Google Drive, Zoom… and many more. This is common on computers borrowed from an employer.

If this happens, ETS will probably send you an email about it.  You can also contact the Office of Testing Integrity for more information.

Scores Canceled For Statistical Reasons

Sometimes, your scores will be canceled because the ETS Office of Testing Integrity thinks your scores are not valid for statistical reasons. There are a few reasons I’ve seen:

  • There is a big difference in your performance on the scored questions vs the unscored questions in the reading or listening section.  This is called “inconsistent variable performance” by ETS.
  • There is a big difference in your performance in one of the sections vs one of the other sections.  This is called a “section score inconsistency” by ETS.
  • Your overall score increased dramatically between attempts.
  • There is something inconsistent about your use of time on the test (you got a high score in a section even though you finished it way too quickly). 

Usually more than one of these things needs to be detected at the same time to cause scores to be canceled.

If you took the test outside of the United States, your scores will be canceled and there will be no appeal.  You will not be given a refund. This is a new policy.

If you took the test in the United States you can appeal the decision in this way:

  1. Request a copy of the “Score Review Summary” for your test. Use those exact words. This document will summarize the statistical evidence against you.  
  2. You should ask ETS to assign an arbitrator from the American Arbitration Association to help with your case.  Use those exact words.  This person will help you challenge the case free of charge. Note that this will probably make it impossible to take legal action against ETS in the future.
  3. Feel free to contact me for assistance after you have requested the score review summary.  I will help you free of charge.

In any case, ETS will probably send you an email. You can also contact the Office of Testing Integrity for more information.

Scores Canceled Because of Plagiarism

ETS often cancels scores if they detect plagiarism in the writing and speaking sections.  Maybe they have a database of sample answers from the Internet, including the sample ones on this website.  It seems like ETS has some software called “AutoESD” that determines if essays are copied.  If ETS feels that you plagiarized, your test will be canceled and you will not get a refund.  You cannot appeal.

The e-mail from ETS will look something like this:

I am writing to advise that the test scores issued in your name for August 21, 2022 have been canceled. In the quality control process, the ETS Writing staff noticed that your response(s) to the integrated/independent Writing task did not reflect a response to the assigned task. This was noticeable since the responses for which you receive a score should be your own original and independent work. Further reviews determined that a portion of your Writing response(s) contains ideas, language and/or examples found in other test taker responses or from published sources.

Don’t plagiarize.

You can contact the Office of Testing Integrity for more information.

There is some fascinating new data about the TOEFL iBT Home Edition available from the International Education Association of Australia.  I’m leaving on a holiday in just a moment, but I want to quickly draw attention to a few tantalizing data points.  Please note:

  1. The Home Edition is even more popular than I thought.  At least among Australia-bound students, by June of 2021 it accounted for 40% of testing.  I bet it is even higher now.
  2. Note how the mean score of Australia-bound students was 93.4 in 2019.  That is a bit higher than I would have guessed, but only a little.  You can also see the mean scores for each section.
  3. Next, note how the mean score of Australia-bound students taking the test center version of the TOEFL iBT from January to June 2021 was 94.6.  That’s a healthy jump, but it is consistent with the fact that the mean increases almost every year in most countries.  This is our very first look at 2021 data, by the way.
  4. But note that the mean score of Australia-bound students taking the Home Edition of the TOEFL iBT from January to June 2021 was 96.9!  More than two points higher than people taking it at a test center.  That’s wild.
  5. For people taking the Home Edition reading scores were 0.8 higher, listening scores were 1.0 higher and writing scores were 1.2 higher.
  6. Interestingly, speaking scores on the Home Edition were 0.6 lower.  That’s curious, but I think it means my advice about getting a good microphone and testing it is solid.  I can say, from experience, that trying to assess a spoken answer recording with a crappy microphone can be a frustrating experience.  My “scores” tend to be lower when assessing students who decline to use a proper recording device.  This is worthy of further study by ETS, I think.
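
For anyone who wants to check the arithmetic in the points above, here is a minimal sketch. It only uses the figures quoted in this post; the variable names are my own. The section gaps add up to slightly more than the overall gap, which I assume is just rounding in the published figures.

```python
# Means quoted above for Australia-bound test takers, January to June 2021
test_center_mean = 94.6
home_edition_mean = 96.9

# Section differences (Home Edition minus test center), as listed above
section_gaps = {
    "reading": 0.8,
    "listening": 1.0,
    "speaking": -0.6,
    "writing": 1.2,
}

# Round to one decimal place to match the precision of the source figures
overall_gap = round(home_edition_mean - test_center_mean, 1)
sum_of_section_gaps = round(sum(section_gaps.values()), 1)

print(f"Overall gap: {overall_gap:+.1f}")
print(f"Sum of section gaps: {sum_of_section_gaps:+.1f}")
```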

Does this mean the TOEFL Home Edition is “easier”?  No, of course not.  It is the same test.  Does this mean that the TOEFL Home Edition is a more pleasant testing experience for test takers?  Probably.  I suspect that students who can test in a comfortable and quiet environment get higher scores.  Being able to test at a time of day when they have more energy likely helps as well.

It is worth noting that Chinese students were taking the test exclusively at test centers during this part of 2021, which might also account for the difference.   The mean score of Chinese students in 2020 was 87 points, the same as the worldwide mean.

Remember that we have worldwide data for 2020 which showed a massive increase (four points) to the worldwide mean score which, at the time, puzzled me.  I think this new report explains that jump and it makes me think there will be a small jump in the 2021 data… and another big one in the 2022 data that will reflect an environment where Chinese students have access to the Home Edition.

Well, I reported a few days ago on the impressive increase to the mean TOEFL score found in the data released by ETS.  I expressed some puzzlement at the increase, as it is pretty huge.  I’m still not entirely certain why it happened, but after talking it out with some experts, my conclusions are:

  1. The change is mostly due to the shorter test.  I guess the shorter version is “easier.” While the reported mean score did not change in 2019, that was partly because of rounding, and a drop in the mean writing score. If we look carefully there were fractional increases in 2019 which hint at a trend.
  2. ETS may have adjusted the e-rater which scores essays. That’s a normal thing. I think they are on iteration 19 or something like that. I suspect that caused writing scores to increase. That makes up 25% of the overall increase… but the shorter test should have no effect on it.  Perhaps they wanted to address the long-term drop in average writing scores.
  3. The increase is caused in large part by China (presumably the number one TOEFL market) and Korea (presumably the number two TOEFL market). Increases in the mean score probably reflect advances in preparation techniques in those countries. Coincidentally I spent the month before the score data release reporting on those advances.

Let me know what you think in the comments.

TOEFL Score data for 2020 is available!  As regular readers of the blog will know, this is my favorite day of the year!  You can download your copy from ETS.

Scores are way up this year.  I don’t know why.

The overall mean (average) score is now 87.  That is an increase of four points, which is quite a big jump.  Here’s the history of the average TOEFL score:

  • 2006: 79
  • 2007: 78
  • 2008: 79
  • 2009: 79
  • 2010: 80
  • 2011 (not available)
  • 2012 (not available)
  • 2013: 81
  • 2014: 80
  • 2015: 81
  • 2016: 82
  • 2017: 82
  • 2018: 83 
  • 2019: 83
  • 2020: 87

As you can see, it took thirteen years for the average score to increase from 79 to 83.  That jump was replicated in 2020 alone.
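
Here is a minimal sketch of that comparison, using only the rounded means listed above (2011 and 2012 are left out because no overall mean is available). The names are my own, and since the published means are rounded, the real changes could differ by a fraction of a point.

```python
# Rounded worldwide mean TOEFL iBT scores, as listed above
mean_scores = {
    2006: 79, 2007: 78, 2008: 79, 2009: 79, 2010: 80,
    2013: 81, 2014: 80, 2015: 81, 2016: 82, 2017: 82,
    2018: 83, 2019: 83, 2020: 87,
}

# Gain over the thirteen years from 2006 to 2019
gain_2006_to_2019 = mean_scores[2019] - mean_scores[2006]

# Gain in the single year from 2019 to 2020
gain_2020 = mean_scores[2020] - mean_scores[2019]

print(f"2006 to 2019: {gain_2006_to_2019:+d} points")
print(f"2019 to 2020: {gain_2020:+d} points")
```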

Obviously this year there are also large jumps in the section scores:

  • The mean reading score is now 22.2 (+1.0)
  • The mean listening score is now 22.3 (+1.4)
  • The mean speaking score is now 21.2 (+.6)
  • The mean writing score is now 21.5 (+1.0)

Last year, the section score changes were much smaller. They were (respectively): +.4, +.3, +.1, -.2.

The jumps in 2020 alone are comparable to the jumps I recorded in the nine years from 2010 to 2019.
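
To make that comparison concrete, here is a rough sketch. The 2020 means come from the list above; the 2010 and 2019 means are the ones I quoted when writing about the 2019 data summary. The dictionary and function names are just mine.

```python
# Mean section scores (2010 and 2019 from the 2019 summary, 2020 from above)
means = {
    2010: {"reading": 20.1, "listening": 19.5, "speaking": 20.0, "writing": 20.7},
    2019: {"reading": 21.2, "listening": 20.9, "speaking": 20.6, "writing": 20.5},
    2020: {"reading": 22.2, "listening": 22.3, "speaking": 21.2, "writing": 21.5},
}

def section_changes(start_year: int, end_year: int) -> dict:
    """Change in each section mean between two years, to one decimal place."""
    return {
        section: round(means[end_year][section] - means[start_year][section], 1)
        for section in means[start_year]
    }

nine_year_change = section_changes(2010, 2019)
one_year_change = section_changes(2019, 2020)

for section in nine_year_change:
    print(f"{section}: {nine_year_change[section]:+.1f} (2010 to 2019) "
          f"vs {one_year_change[section]:+.1f} (2020 alone)")
```

Writing is the odd one out: it slipped slightly over the nine years and then rose a full point in 2020.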

As you guys know, I like to study geographic trends, particularly those in China, Korea and Japan.  Here’s what I spotted:

  • The mean score in Korea is now 86 (+3)
  • The mean score in China is now 87 (+6) !!!
  • The mean score in Japan is now 73 (+1)
  • The mean score in Taiwan is now 85 (+2)

I must point out that in the thirteen years between 2006 and 2019 the average score in China increased by five points.  In 2020 alone the increase was six points.

Scores in other key markets have increased as well:

  • The mean score in Brazil is now 90 (+3)
  • The mean score in India is now 96 (+1)
  • The mean score in the United States is now 93 (+2)

The top performing country this year is Austria, with an average score of 102 (+2)

It appears that China is driving much of the overall increase.  In case you are curious, the section increases there are: Reading + 2, Listening + 2,  Speaking no change, Writing +2.

I’m going to do some more digging and some more calling in the weeks ahead.  I want to know more about these dramatic changes.

ETS has just uploaded a chart to convert between TOEFL iBT and TOEFL Essentials scores.  I’ve copied it here for you, but be sure to check out the main TOEFL Essentials Page for more information, including conversion charts for each section of the test.

Soon I will start a list of schools that accept the test, and I will maintain it until ETS publishes their own list.

 

Update (August 2024):  Sorry, EdAgree has removed the tool.  For now, I recommend My Speaking Score for SpeechRater practice.  It is a paid service, though.  Try the coupon code TESTRESOURCES for a 10% discount.

Hey, here’s something really amazing.

ETS has created a new subsidiary called EdAgree.  EdAgree is described as

…an advocate for international students providing a path to help students identify universities that will push them towards longer term success. We help you put your best foot forward during the admissions process and support you throughout your study abroad and beyond. 

As part of this mission, they provide free English speaking practice using the same SpeechRater technology that is used to grade the TOEFL!  

To access this opportunity, register for a free account on EdAgree.  After that, look for the “English Speaking Practice” button in the student dashboard.  The screenshot is from the desktop version, but it also works on mobile.

This section provides a complete set of four TOEFL speaking questions.  After you answer them, you’ll get a SpeechRater score in several different categories (pause frequency, distribution of pauses, repetitions, rhythm, response length, speaking rate, sustained speech, vocabulary depth, vocabulary diversity, vowels).  These categories are used on the real TOEFL to determine your score!  You can also listen to recordings of your answers.  Note that your responses are scored collectively, rather than individually.  That means, for example, that you get a “pause frequency” score for how you answered all four questions, and not a separate “pause frequency” score for each individual answer.

Update: The list of above categories has been revised a few times, as EdAgree has tweaked the tool.

Note that you will get fresh questions every five days.  I do not know how many unique sets there are in total.  Keep visiting and let me know.  However, you can repeat the same questions as many times as you wish.

I took a set a few days ago, and the questions were pretty good.  They weren’t 100% the same as the real TOEFL, but they were better than what is found in most textbooks. 

It should also be noted that you could probably just use your own questions instead of the ones provided.  Do you get what I mean?  You are being scored based on technical features, which means that the scores will still be relevant no matter what question you answer.

Let me know if you guys enjoy the tool.  Meanwhile, here is my first set of results.  I still have room for improvement, as you can see!

Note:  This screenshot does not include all of the categories mentioned above, as they were not available when the service started.

EdAgree SpeechRater

 

Here’s a mildly interesting article about student responses to speaking question three.  The authors have charted out the structure of two sample questions provided by ETS, and tracked how many of the main ideas students of various levels included in their answers (again, provided by ETS).

There is some good stuff in here for TOEFL teachers, particularly in how the authors map out the progression of “idea units” in the source materials.  They identified how test-takers of various levels represented these idea units in their answers, and in particular how many of them each answer included.  Fluent speakers (or, I guess, proficient test-takers) represented more of the idea units, but also presented them in about the same order as in the sources.

Something I found quite striking is that one of the question sets studied was much easier than the other one, something described by the authors of the report.  I am left wondering how ETS deals with this sort of thing.  The rubric doesn’t really have room to adjust for question difficulty changing week by week.

There is also a podcast interview with one of the authors.

At least once a week, a student asks me if they should get a TOEFL score review (for speaking or writing).   The short answer is: “probably not.”  The long answer is:  “if the cost is no problem, you should do it.”

However, if you want more information, here’s what you should note:

  1. In my experience, a score review results in an increase about 10% of the time.  The rest of the time, the score stays the same.  In rare cases, the score goes down.  This rate is the same for both speaking and writing.
  2. However, I have heard from some students who have gotten large increases, up to four points.  I can’t really explain this.
  3. ETS says it takes 1 to 3 weeks.  Lately, though, score reviews seem to get finished in just two or three days.  But don’t plan on getting one back so quickly.
  4. During a score review, the e-rater (for writing) and SpeechRater (for speaking) are not used.  This isn’t published information, but ETS has confirmed it when asked.  This means that score reviews are done entirely by humans.  If you think that you were punished by the automated scoring systems, you might want to request a review.
  5. The person doing the score review will not see your original score.  They will not be biased by your old score, but they will probably be aware that they are doing a score review.

In terms of money:

  1. The score review is really expensive.  Eighty dollars for one section.  However, if your score changes this money will be returned to you.  This is not a published policy, though, so it could change at any time.
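
If you want to weigh the cost, here is a back-of-the-envelope sketch. It assumes the fee really is refunded whenever the score changes (again, not a published policy) and that about 10% of reviews result in a change, which is only my own rough estimate and ignores the rare decreases. The function name is hypothetical.

```python
REVIEW_FEE_USD = 80      # fee for reviewing one section, as quoted above
P_SCORE_CHANGES = 0.10   # assumption: roughly 10% of reviews change the score

def expected_net_cost(fee: float, p_change: float) -> float:
    """Average amount you end up paying if the fee is refunded on any change."""
    return fee * (1 - p_change)

expected_cost = expected_net_cost(REVIEW_FEE_USD, P_SCORE_CHANGES)
print(f"Expected net cost per section: ${expected_cost:.2f}")
```

In other words, on average you should expect to pay most of the fee and get nothing back.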

Note that most students are not able to request a score review since it is not possible if your scores have been sent to institutions.  If you selected some free score recipients when you signed up for the test they will get the scores right away, and thus you will not be able to request the score review.

It’s a great day, everybody!  The TOEFL Test and Score Data Summary for 2019 is available!

These annual reports provide valuable data about test taker performance.  While this year’s figures are similar to last year’s figures, the following data points were mildly interesting to me:

  • The overall mean (average) score is still 83.  But that figure is rounded, and it looks like there was still a significant fractional increase this year.
  • The mean reading score is now 21.2 (+.4)
  • The mean listening score is now 20.9 (+.3)
  • The mean speaking score is now 20.6 (+.1)
  • The mean writing score is now 20.5 (-.2)

It is interesting that the writing score has decreased.  That may represent an ongoing trend.  Here are writing scores since 2010:

  • 2019: 20.5
  • 2018: 20.7
  • 2017: 20.8
  • 2016: 20.9
  • 2015: 20.6
  • 2014: 20.3
  • 2013: 20.6
  • 2012: No data
  • 2011: No data
  • 2010: 20.7

Some students do claim that the writing section has been getting more difficult in recent years.  They may be correct about that, but it looks like the test was really challenging back in 2014.  And the mean is close to where it was a decade ago (20.7 in 2010).

Interestingly, the other sections are all up since 2010.  Some by a lot:

  • Reading: 20.1 –> 21.2
  • Listening: 19.5 –> 20.9
  • Speaking: 20.0 –> 20.6

It is also worth noting that the use of automated speaking scoring does not appear to have affected average speaking scores, but that technology was only used during the last five months of 2019.


As always, it seems like a lot of the overall increase in scores is coming from the test-prep powerhouses of East Asia.  Scores in China are +1 (to 81), scores in Japan are +1 (to 72) and scores in Taiwan are +1 (to 83).  However, scores in Korea are -1 (to 83).

Scores in the key markets of Brazil (87) and India (95) are unchanged.

I would love to see which countries have the most test-takers, but I suspect that information is confidential. 

The highest scoring country is now Austria, where the average score is 100.


Women still outperform men in listening, speaking and writing. 

 

TOEFL score data for 2018 is now available.  Download a PDF right here.

The most notable bit of data is that the mean score of all test takers reached 83 points for the first time, after being stuck at 82 points for two years.

Here is a short history of mean score progression for a few selected dates.  Note that the mean score of the TOEFL iBT has increased by four points over the life of the test.  It also seems to be increasing more rapidly than before these days.  That probably accounts for the “required score creep” that bugs a lot of students.

  • 2006: 79
  • 2007: 78
  • 2008: 79
  • 2009: 79
  • 2010: 80
  • 2013: 81
  • 2014: 80
  • 2015: 81
  • 2016: 82
  • 2017: 82
  • 2018: 83 

Note that the data summaries from 2011 and 2012 don’t contain an overall mean score, as far as I can tell.

Score recipients have revised their requirements to keep up with these increases, which represent a challenge for all students. 

What makes this a challenge for some students more than others is that this increase is likely driven by huge jumps in countries with well-developed test preparation industries (and tons of test-takers).  For example, the mean score in Korea has jumped twelve points since 2006.  Korea has the absolute best TOEFL preparation options in the world, and it shows.  Here are scores from Korea for a few selected years:

  • 2006: 72
  • 2007: 77
  • 2010: 81
  • 2014: 84
  • 2017: 83
  • 2018: 84 

Meanwhile, scores in Taiwan have jumped 11 points:

  • 2006: 71
  • 2007: 72
  • 2010: 76
  • 2014: 80
  • 2017: 81
  • 2018: 82 

It is worth noting that scores in China have increased less dramatically, rising only four points from 76 to 80 between 2006 and 2018.  As has been pointed out elsewhere, China has a consistency problem when it comes to the test prep industry.  They have some of the best options for students… but some of the worst as well. It seems like things are improving for Chinese students, though, as China is likely the source of more recent increases to the overall mean score. Note that most of China’s increase has come since 2014.

In contrast to China’s recent growth, it is worth noting that the mean score in Korea has remained about the same since 2014. This indicates that there is a limit to the benefits that students can gain from research into test design and scoring. I imagine that the mean score in Taiwan will probably top out around the same level in a few years.

Once China reaches that level as well,  ETS should probably start developing the “next generation” TOEFL to replace the iBT. If there are too many teachers around the world who can show students how to “beat” the test and score way above their actual level the reliability of the iBT will be called into question.

For fun, here is the growth in a few notable countries from 2006 to 2018:

  • Germany: 96 to 98
  • Brazil: 85 to 87
  • Japan: 65 to 71
  • Russia: 85 to 87
  • Iran: 78 to 85
  • India: 91 to 95

In case you are curious, the top performing countries in 2018 were the Netherlands and Switzerland.  The mean score in both of those countries was 99 points.  

Today I want to write a few words about an interesting new (December, 2019) text from ETS.  “Automated Speaking Assessment” is the first book-length study of SpeechRater, which is the organization’s automated speaking assessment technology.  That makes it an extremely valuable resource for those of us who are interested in the TOEFL and how our students are assessed.  There is little in here that will make someone a better TOEFL teacher, but many readers will appreciate how it demystifies the changes to the TOEFL speaking section that were implemented in August of 2019 (that is, when the SpeechRater was put into use on the test).

I highly recommend that TOEFL teachers dive into chapter five of the book, which discusses the scoring models used in the development of SpeechRater.  Check out chapter four as well, which discusses how recorded input from students is converted into something that can actually be graded.

Chapters six, seven and eight will be the most useful for teachers.  These discuss, in turn:  features measuring fluency and pronunciation, features measuring vocabulary and grammar, and features measuring content and discourse coherence.  Experienced teachers will recognize that these three categories are quite similar to the published scoring rubrics for the TOEFL speaking section. 

In chapter six readers will learn about how the SpeechRater measures the fluency of a student by counting silences and disfluencies.  They will also learn about how it handles speed, chunking and self-corrections.  These are actually things that could influence how they prepare students for this section of the test, though I suspect that most teachers don’t need a book to tell them that silences in the middle of an answer are a bad idea.  There is also a detailed depiction of how the technology judges pronunciation, though that section was a bit too academic for me to grasp.

Chapter seven discusses grammar and vocabulary features that SpeechRater checks for.  Impressively, it just sticks them in a list.  A diligent teacher might create a sort of checklist to provide to students. Finally, chapter eight discusses how the software assesses topic development in student answers.

Sadly, this book was finished just before ETS started using automated speaking scoring for high-stakes assessment.  Chapter nine discusses how the technology is used to grade TOEFL practice tests (low-stakes testing), but nothing is mentioned about its use on the actual TOEFL.  I would really love to hear more about that, particularly its ongoing relationship with the human raters who grade the same responses.