ETS has just published its “Test and Score Data Summary” for 2021. This document contains a ton of valuable information, including average scores (and section scores) overall and in specific countries.
The average TOEFL score is now 88 points. That’s an increase of one point since last year.
Here’s a history of the average score since the test began:
2011 (not available)
2012 (not available)
As you can see, this year’s jump is not as wild as in 2020, but a one-point increase is still significant.
Here’s how the section scores changed (compared to last year):
The average TOEFL reading score is now 22.4 (+0.2)
The average TOEFL listening score is now 22.6 (+0.3)
The average TOEFL speaking score is now 21.1 (-0.1)
The average TOEFL writing score is now 21.6 (+0.1)
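If you want to sanity-check these numbers, here is a quick sketch in Python. It simply adds up the four section averages listed above to see whether they agree with the reported overall average of 88. I am assuming that ETS rounds the overall figure to the nearest whole point; the report does not spell that out, and the variable names are my own.

```python
# Section means as listed above for this year's data summary.
section_means = {
    "reading": 22.4,
    "listening": 22.6,
    "speaking": 21.1,
    "writing": 21.6,
}

# Add the four section means and round to one decimal place
# to avoid floating-point noise.
total = round(sum(section_means.values()), 1)

# Assumption: the published overall mean is the nearest whole point.
rounded_total = round(total)

print("Sum of section means:", total)
print("Nearest whole point:", rounded_total)
```

The sum lands a little below 88 and rounds up to it, so the section averages and the overall average tell the same story.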
I pay special attention to trends in a few key markets. I noticed that in all of the countries I track, the average score is unchanged.
The average score in China is 87
The average score in Korea is 86
The average score in Japan is 73
The average score in Brazil is 90
The average score in India is 96
The average score in the USA is 93
At first glance, it seems like the overall score increase is due to smaller markets “catching up” to the increases in the rest of the world that were observed in last year’s numbers.
One of the presenters provided new details about how the speaking questions are scored. He prefaced these details by sharing a sample type 3 speaking question. Here is the reading part:
And here is a transcript of the listening part:
A standard question, right?
Then we were shown the “answer sheet” that is given to raters so they know how to assess topic development. This is new information. Here it is:
That’s interesting, right? My assessment of this is that for an answer to receive a full score for topic development it must explicitly or implicitly reference the term and its definition. It must also broadly summarize the example. And then it must include just two of the four main details given in the example. The last part is new to me. Generally, I push students to include all of the details. Perhaps I should reassess my teaching methods.
There you go, teachers. Some new information about the TOEFL… in 2022.
A few questions remain:
Is this always the case? Will there always be four main details in the example? Will we always need to include just two of them? Probably not. Surely there are cases where more than two details are required.
How does this work in lectures which have two totally unique examples? Often the reading is about a biological feature in animals, while the lecture describes two different animals that have this feature. Is it okay to ignore one of them? Probably not.
Can any of this learning be applied to TOEFL Speaking question four? Probably not.
The ETS research department has a report about whether essays rated on desktop computers get different scores than those rated on iPads. The result of the research: they get the same scores.
This is why I love ETS. Most organizations would just tell the raters to use whatever they want. ETS, though, studied the matter very carefully. I respect that.
As always, readers will find the most interesting details buried deep in the report.
In this case, we learn that the raters participating in the study scored GRE essays. We also learn that they scored twenty essays in one hour. That’s three minutes per essay… including the time needed to queue up each essay, read it, click a button to submit the score and blink a few times before moving on to the next one. The report also indicates:
The 20 essay ratings per device were only a fraction of the number of essays a typical rater would score in a day
That tracks with what I’ve heard from former ETS raters, but I don’t think I’ve ever seen it in print. Obviously, the time taken to score TOEFL essays could be longer, but I suspect the workflow is similar.
I don’t know if this is useful information, but it is always nice to get a peek behind the curtain. Now and then.
As I mentioned earlier, the TOEFL writing rubrics are notoriously difficult to understand. Perhaps the most difficult part is the requirement that score-five and score-four independent essays demonstrate “syntactic variety” and that score-three essays include a “limited range of syntactic structures.”
What the heck is syntactic variety? What is a syntactic structure?
Here’s what you should know:
Often I see essays that are quite long and have perfect grammar. But I still can’t give them a perfect score. This is because the sentences and clauses are all very similar. Sometimes the student just uses simple sentences. Sometimes they use too many compound sentences. Sometimes every sentence starts with a transitional adverb. Sometimes every sentence starts with a pronoun. That kind of writing is boring and lacks variety.
Syntax is the arrangement of words into sentences, clauses and phrases. We don’t just put words anywhere. They have to be arranged properly to convey meaning, and for our sentences to be considered correct. Of course you know that.
“Syntactic variety” refers to the use of various types of sentences, clauses and phrases.
The best way to ensure that your TOEFL essay has syntactic variety is to use the three main sentence types in English: simple, compound, and complex sentences. You may already be familiar with these. If not, start studying.
Simple sentences look like this:
Simon took the math test. He was totally unprepared for it.
Compound sentences look like this:
Simon took the math test, but he was totally unprepared for it.
Complex sentences look like this:
Even though Simon took the math test, he was totally unprepared for it.
Note that complex sentences seem to be most important for the purposes of establishing syntactic variety and complexity.
You can further increase your syntactic variety through the use of noun, adverb and adjective clauses.
A noun clause is a group of words that functions like a noun. They often start with “how” or a “wh-” word. Like:
Why she didn’t call me is a mystery.
What I did that day surprised my family.
She listened to whatever I suggested.
These demonstrate more variety and complexity than writing:
That is a mystery.
This surprised my family.
She listened to my ideas.
Placing a noun clause in the subject position of a sentence may be considered a sign of more mature and complex writing.
An adverb clause (or adverbial phrase) is a group of words that functions as an adverb. Like adverbs, they usually describe how, when or why we do things. Like:
With great enthusiasm, I finished the project.
Before doing anything else, Matthew turned on his computer.
These are a bit more impressive than:
“Quickly, I finished the project.”
“Eagerly, Matthew turned on his computer.”
An adjective clause (also called a relative clause) is a group of words that functions like an adjective. It describes a noun in a sentence. Like:
“The test, which I have taken five times, is extremely difficult.”
“My friend Simone, who is three years older than me, is currently a university freshman.”
Don’t Go Crazy
Remember that your essay might only be 20 sentences in total. You don’t have to do all of these things. Just include a few compound sentences and a few complex sentences. Try to work in a few of the above clauses along the way.
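One rough way to spot the repetition problem described above is to count how your sentences begin. The Python sketch below is my own illustration, not anything ETS uses: the function names are made up, the sentence splitting is deliberately naive, and the last two sentences of the sample are invented for the demonstration.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def opener_counts(text):
    """Count the first word of every sentence, ignoring case."""
    openers = []
    for sentence in split_sentences(text):
        words = re.findall(r"[A-Za-z']+", sentence)
        if words:
            openers.append(words[0].lower())
    return Counter(openers)


# The first two sentences come from the examples above;
# the last two are invented for this demonstration.
sample = (
    "Simon took the math test. He was totally unprepared for it. "
    "He failed the test. He was upset."
)

counts = opener_counts(sample)
for word, count in counts.most_common():
    print(word, count)
```

If one opener dominates the tally (here, three of the four sentences start with "he"), that is a hint to rework a few sentences into compound or complex ones. It is only a hint, though. A count like this knows nothing about clause types.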
There are other ways to achieve syntactic variety. Standardized tests that have a more human touch explicitly mention some of them in their grading rubrics. Consider the ALP Essay Test from Columbia University, which specifically mentions such techniques as:
The TOEFL writing rubrics are famously difficult to understand. Even experienced teachers have a hard time turning them into something that students can actually make use of. Today’s blog post will kick off a series that attempts to explain what the rubrics actually refer to. Starting with…
References in the rubric to “idiomaticity” and “idiomatic language” are particularly difficult to grasp. The rubric says that a score-five independent essay should “display appropriate word choice and idiomaticity.” Meanwhile, it notes that a score-four essay should have only “minor errors” in its “use of idiomatic language.”
But what does this actually mean?
Many students (and teachers) think that ETS wants test-takers to use idioms like “it was raining cats and dogs last week” or “I won’t beat around the bush.” That is not correct. That’s a different matter.
“Idiomaticity” is tough to define, but the dictionary definition is best. It says that idiomaticity is “the extent to which a learner’s language resembles that of a native speaker.”
This is what your teachers are hinting at when they change one of your sentences not because of a specific grammar error, but because they think some of your word choices don’t seem natural.
Here’s a sentence I recently read:
“Business owners want employees to make quick decisions, which renders stress for those who take their time.”
There aren’t any grammar errors in that sentence. But “renders” sounds weird to me. Changing that to “causes” or “creates” will increase the idiomaticity of the sentence.
Here’s another one:
“When the shopping mall opened, many local shops ceased their business.”
That’s a lot more subtle. “Ceased their business” is pretty good, but it is a little bit awkward. A native speaker would probably say something like “went out of business.”
I would even complain about something like:
“I strongly think that children should attend all of their classes.”
My preferred phrasing would be something like “I strongly believe…”.
You might think I’m being needlessly picky, but to get a perfect score (5 on the rubric, 30 scaled) you need to use the best possible words at all times.
In TOEFL essays, problems related to idiomaticity seem to come from two sources:
Inexperience with the language.
A desire to shove a lot of fancy words into the essays to get a higher score.
The second source is not normal. Ignore advice from inexperienced teachers who think that using obscure words will help you. It won’t. Some of the essays I’ve read come pretty close to Noam Chomsky’s famous “colorless green ideas sleep furiously”. That’s a beautiful sentence, but no meaning can be derived from it.
As I reported yesterday, ETS (formerly the Educational Testing Service) is seeking a new executive director for the Office of Testing Integrity. If I were to advise the incoming director, I would recommend the following changes.
1. Staff up. Staff way up. Administrative review for TOEFL tests is supposed to finish in 2-4 weeks. I often hear from students who have waited for much longer. One student who spoke to me recently waited for 102 days. Update (September, 2022): the student is still waiting. 197 days now.
2. Help test-takers help themselves. I often hear from students who have experienced score cancellations due to unauthorized software running in the background. Remember that in the Windows 10+ era it is a lot harder to control what goes on in the background of our systems than it used to be. Needless to say, modern versions of Windows are built in a way that makes remote proctoring a challenge. Duolingo recently produced a little video showing students a few ways to avoid such problems. The OTI should have made the same sort of content two years ago.
3. Reconsider the use of statistical data as a justification for score cancellations. There are very valid reasons why a student might, for example, have a speaking score much lower than their listening score. Some of those reasons are cultural. Think about that for a moment.
As always, ETS, you know how to reach me. In lieu of a consulting fee I’m willing to accept meal vouchers for the ETS cafeteria.
Students often ask me why their TOEFL scores were canceled, and how they can reinstate them. Here’s what you need to know.
When your scores are canceled, you’ll see something like “Scores Canceled” in your ETS account. It will look like this:
There are several possible causes.
(Note that this is different from scores being “on hold” or “in administrative review.” If that is your problem, read this blog post.)
Scores Canceled Accidentally
Sometimes, scores are canceled because the test-taker accidentally clicked the “do not report scores” button at the end of the test. This sounds silly, but I hear about it every week. Seriously. Scores will not be sent to score recipients if they are canceled, of course.
If you accidentally canceled your scores you can pay $20 to reinstate them via your account on the TOEFL website. It might take up to three weeks for your scores to be reinstated (source).
Scores Canceled Because of Inappropriate Test-Taker Behavior
If you do something inappropriate during the test your scores will be canceled. You will probably not be given the chance to appeal, and I have never heard of this decision being reversed. Rule violations might include touching your phone during the test (or break), running some inappropriate software in the background (see below), talking to someone, wearing jewelry, or even looking away from the screen for too long. You’d better follow the rules.
Sometimes, ETS detects inappropriate software running on your computer during the test. Such software includes Microsoft Teams, Skype, Discord, Google Drive, Zoom… and many more. This is common on computers borrowed from an employer.
Sometimes, your scores will be canceled because the ETS Office of Testing Integrity thinks your scores are not valid for statistical reasons. There are a few reasons I’ve seen:
There is a big difference in your performance on the scored questions vs the unscored questions in the reading or listening section. This is called “inconsistent variable performance” by ETS.
There is a big difference in your performance in one of the sections vs one of the other sections. This is called a “section score inconsistency” by ETS.
Your overall score increased dramatically between attempts.
There is something inconsistent about your use of time on the test (you got a high score in a section even though you finished it way too quickly).
Usually more than one of these things needs to be detected at the same time to cause scores to be canceled.
If you took the test outside of the United States your scores will be canceled and there will be no appeal. You will not be given a refund. This is a new policy.
If you took the test in the United States you can appeal the decision in this way:
Request a copy of the “Score Review Summary” for your test. Use those exact words. This document will summarize the statistical evidence against you.
You should ask ETS to assign an arbitrator from the American Arbitration Association to help with your case. Use those exact words. This person will help you challenge the case free of charge. Note that this will probably make it impossible to take legal action against ETS in the future.
Feel free to contact me for assistance after you have requested the score review summary. I will help you free of charge.
ETS often cancels scores if they detect plagiarism in the writing and speaking sections. I’m pretty sure they have a database of sample answers from the Internet, including the sample ones on this website. It seems like ETS has some software called “AutoESD” that determines if essays are copied. If this happens your test will be canceled and you will not get a refund. You cannot appeal.
The Home Edition is even more popular than I thought. At least among Australia-bound students, by June of 2021 it accounted for 40% of testing. I bet it is even higher now.
Note how the mean score of Australia-bound students was 93.4 in 2019. That is a bit higher than I would have guessed, but only a little. You can also see the mean scores for each section.
Next, note how the mean score of Australia-bound students taking the test center version of the TOEFL iBT from January to June 2021 was 94.6. That’s a healthy jump, but it is consistent with the fact that the mean increases almost every year in most countries. This is our very first look at 2021 data, by the way.
But note that the mean score of Australia-bound students taking the Home Edition of the TOEFL iBT from January to June 2021 was 96.9! More than two points higher than people taking it at a test center. That’s wild.
For people taking the Home Edition, reading scores were 0.8 higher, listening scores were 1.0 higher and writing scores were 1.2 higher.
Interestingly, speaking scores on the Home Edition were 0.6 lower. That’s curious, but I think it means my advice about getting a good microphone and testing it is solid. I can say, from experience, that trying to assess a spoken answer recording with a crappy microphone can be a frustrating experience. My “scores” tend to be lower when assessing students who decline to use a proper recording device. This is worthy of further study by ETS, I think.
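For anyone who wants to check my arithmetic, here is a small Python sketch using only the figures quoted above. It compares the overall gap between the Home Edition and the test center version with the sum of the four section gaps. The variable names are mine, and I am assuming that each published mean was rounded to one decimal place on its own, which would explain a small mismatch.

```python
# Overall means for Australia-bound test takers, January to June 2021.
test_center_mean = 94.6
home_edition_mean = 96.9

# Section gaps (Home Edition minus test center), as quoted above.
section_gaps = {
    "reading": 0.8,
    "listening": 1.0,
    "writing": 1.2,
    "speaking": -0.6,
}

# Round to one decimal place to avoid floating-point noise.
overall_gap = round(home_edition_mean - test_center_mean, 1)
summed_gap = round(sum(section_gaps.values()), 1)
mismatch = round(summed_gap - overall_gap, 1)

print("Overall gap:", overall_gap)
print("Sum of section gaps:", summed_gap)
print("Mismatch:", mismatch)
```

The two figures differ by a tenth of a point, which is about what you would expect if the section means and the overall mean were each rounded separately.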
Does this mean the TOEFL Home Edition is “easier”? No, of course not. It is the same test. Does this mean that the TOEFL Home Edition is a more pleasant testing experience for test takers? Probably. I suspect that students who can test in a comfortable and quiet environment get higher scores. Being able to test at a time of day when they have more energy likely helps as well.
It is worth noting that Chinese students were taking the test exclusively at test centers during this part of 2021, which might also account for the difference. The mean score of Chinese students in 2020 was 87 points, the same as the worldwide mean.
Remember that we have worldwide data for 2020 which showed a massive increase (four points) to the worldwide mean score which, at the time, puzzled me. I think this new report explains that jump and it makes me think there will be a small jump in the 2021 data… and another big one in the 2022 data that will reflect an environment where Chinese students have access to the Home Edition.
Well, I reported a few days ago on the impressive increase to the mean TOEFL score found in the data released by ETS. I expressed some puzzlement at the increase, as it is pretty huge. I’m still not entirely certain why it happened, but after talking it out with some experts, my conclusions are:
The change is mostly due to the shorter test. I guess the shorter version is “easier.” While the reported mean score did not change in 2019, that was partly because of rounding, and a drop in the mean writing score. If we look carefully there were fractional increases in 2019 which hint at a trend.
ETS may have adjusted the e-rater which scores essays. That’s a normal thing. I think they are on iteration 19 or something like that. I suspect that caused writing scores to increase. That makes up 25% of the overall increase… but the shorter test should have no effect on it. Perhaps they wanted to address the long-term drop in average writing scores.
The increase is caused in large part by China (presumably the number one TOEFL market) and Korea (presumably the number two TOEFL market). Increases in the mean score probably reflect advances in preparation techniques in those countries. Coincidentally I spent the month before the score data release reporting on those advances.
ETS has just uploaded a chart to convert between TOEFL iBT and TOEFL Essentials scores. I’ve copied it here for you, but be sure to check out the main TOEFL Essentials Page for more information, including conversion charts for each section of the test.
Soon I will start a list of schools that accept the test, and I will maintain it until ETS publishes their own list.
ETS has created a new subsidiary called EdAgree. EdAgree is described as
…an advocate for international students providing a path to help students identify universities that will push them towards longer term success. We help you put your best foot forward during the admissions process and support you throughout your study abroad and beyond.
As part of this mission, they provide free English speaking practice using the same SpeechRater technology that is used to grade the TOEFL!
To access this opportunity, register for a free account on EdAgree. After that, look for the “English Speaking Practice” button in the student dashboard. The screenshot is from the desktop version, but it also works on mobile.
This section provides a complete set of four TOEFL speaking questions. After you answer them, you’ll get a SpeechRater score in several different categories (pause frequency, distribution of pauses, repetitions, rhythm, response length, speaking rate, sustained speech, vocabulary depth, vocabulary diversity, vowels). These categories are used on the real TOEFL to determine your score! You can also listen to recordings of your answers. Note that your responses are scored collectively, rather than individually. That means, for example, that you get a “pause frequency” score for how you answered all four questions, and not a separate “pause frequency” score for each individual answer.
Update: The above list of categories has been revised a few times, as EdAgree has tweaked the tool.
Note that you will get fresh questions every five days. I do not know how many unique sets there are in total. Keep visiting and let me know. However, you can repeat the same questions as many times as you wish.
I took a set a few days ago, and the questions were pretty good. They weren’t 100% the same as the real TOEFL, but they were better than what is found in most textbooks.
It should also be noted that you could probably just use your own questions instead of the ones provided. Do you get what I mean? You are being scored based on technical features, which means that the scores will still be relevant no matter what question you answer.
Let me know if you guys enjoy the tool. Meanwhile, here is my first set of results. I still have room for improvement, as you can see!
Note: This screenshot does not include all of the categories mentioned above, as they were not available when the service started.
Here’s a mildly interesting article about student responses to speaking question three. The authors have charted out the structure of two sample questions provided by ETS, and tracked how many of the main ideas students of various levels included in their answers (again, provided by ETS).
There is some good stuff in here for TOEFL teachers, particularly in how the authors map out the progression of “idea units” in the source materials. They identified how test-takers of various levels represented these idea units in their answers, and in particular how many of them they included. Fluent speakers (or, I guess, proficient test-takers) represented more of the idea units, but also presented them in about the same order as in the sources.
Something I found quite striking is that one of the question sets studied was much easier than the other one, something described by the authors of the report. I am left wondering how ETS deals with this sort of thing. The rubric doesn’t really have room to adjust for question difficulty changing week by week.