The ETS research department has released a report about whether essays rated on desktop computers get different scores than those rated on iPads. The result of the research: they get the same scores.
This is why I love ETS. Most organizations would just tell the raters to use whatever they want. ETS, though, studied the matter very carefully. I respect that.
As always, readers will find the most interesting details buried deep in the report.
In this case, we learn that the raters participating in the study scored GRE essays. We also learn that they scored twenty essays in one hour. That’s three minutes per essay… including the time needed to queue up each essay, read it, click a button to submit the score and blink a few times before moving on to the next one. The report also indicates:
The 20 essay ratings per device were only a fraction of the number of essays a typical rater would score in a day
That tracks with what I’ve heard from former ETS raters, but I don’t think I’ve ever seen it in print. Obviously, the time taken to score TOEFL essays could be longer, but I suspect the workflow is similar.
I don’t know if this is useful information, but it is always nice to get a peek behind the curtain now and then.
As I mentioned earlier, the TOEFL writing rubrics are notoriously difficult to understand. Perhaps the most difficult part is the requirement that score-five and score-four independent essays demonstrate “syntactic variety” and that score-three essays include a “limited range of syntactic structures.”
What the heck is syntactic variety? What is a syntactic structure?
Here’s what you should know:
Often I see essays that are quite long and have perfect grammar. But I still can’t give them a perfect score. This is because the sentences and clauses are all very similar. Sometimes the student just uses simple sentences. Sometimes they use too many compound sentences. Sometimes every sentence starts with a transitional adverb. Sometimes every sentence starts with a pronoun. That kind of writing is boring and lacks variety.
Syntax is the arrangement of words into sentences, clauses and phrases. We don’t just put words anywhere. They have to be arranged properly to convey meaning, and for our sentences to be considered correct. Of course you know that.
“Syntactic variety” refers to the use of various types of sentences, clauses and phrases.
The best way to ensure that your TOEFL essay has syntactic variety is to use the three main sentence types in English: simple, compound, and complex sentences. You may already be familiar with these. If not, start studying.
Simple sentences look like this:
Simon took the math test. He was totally unprepared for it.
Compound sentences look like this:
Simon took the math test, but he was totally unprepared for it.
Complex sentences look like this:
Even though Simon took the math test, he was totally unprepared for it.
Note that complex sentences seem to be most important for the purposes of establishing syntactic variety and complexity.
You can further increase your syntactic variety through the use of noun, adverb and adjective clauses.
A noun clause is a group of words that functions like a noun. Noun clauses often start with “how” or a “wh-” word. Like:
Why she didn’t call me is a mystery.
What I did that day surprised my family.
She listened to whatever I suggested.
These demonstrate more variety and complexity than writing:
That is a mystery.
This surprised my family.
She listened to my ideas.
Placing a noun clause in the subject position of a sentence may be considered a sign of more mature and complex writing.
An adverb clause is a group of words, with its own subject and verb, that functions as an adverb. Adverb clauses usually describe how, when, or why something happens. Like:
As quickly as I could, I finished the project.
Before he did anything else, Matthew turned on his computer.
These are a bit more impressive than:
“Quickly, I finished the project.”
“Eagerly, Matthew turned on his computer.”
An adjective clause (also called a relative clause) is a group of words that functions like an adjective. It describes a noun in a sentence. Like:
“The test, which I have taken five times, is extremely difficult.”
“My friend Simone, who is three years older than me, is currently a university freshman.”
Don’t Go Crazy
Remember that your essay might only be 20 sentences in total. You don’t have to do all of these things. Just include a few compound sentences and a few complex sentences. Try to work in a few of the above clauses along the way.
There are other ways to achieve syntactic variety. Standardized tests that have a more human touch explicitly mention some of them in their grading rubrics. Consider the ALP Essay Test from Columbia University, which specifically names several such techniques in its rubric.
The TOEFL writing rubrics are famously difficult to understand. Even experienced teachers have a hard time turning them into something that students can actually make use of. Today’s blog post will kick off a series that attempts to explain what the rubrics actually refer to. Starting with…
References in the rubric to “idiomaticity” and “idiomatic language” are particularly difficult to grasp. The rubric says that a score-five independent essay should “display appropriate word choice and idiomaticity.” Meanwhile, it notes that a score-four essay should have only “minor errors” in its “use of idiomatic language.”
But what does this actually mean?
Many students (and teachers) think that ETS wants test-takers to use idioms like “it was raining cats and dogs last week” or “I won’t beat around the bush.” That is not correct; idioms are a different matter entirely.
“Idiomaticity” is tough to define, but the dictionary definition is best. It says that idiomaticity is “the extent to which a learner’s language resembles that of a native speaker.”
This is what your teachers are hinting at when they change one of your sentences not because of a specific grammar error, but because they think some of your word choices don’t seem natural.
Here’s a sentence I recently read:
“Business owners want employees to make quick decisions, which renders stress for those who take their time.”
There aren’t any grammar errors in that sentence. But “renders” sounds weird to me. Changing that to “causes” or “creates” will increase the idiomaticity of the sentence.
Here’s another one:
“When the shopping mall opened, many local shops ceased their business.”
That’s a lot more subtle. “Ceased their business” is pretty good, but it is a little bit awkward. A native speaker would probably say something like “went out of business.”
You might think I’m being needlessly picky, but to get a perfect score (5 on the rubric, 30 scaled) you need to use the best possible words at all times.
In TOEFL essays, problems related to idiomaticity seem to come from two sources:
Inexperience with the language.
A desire to shove a lot of fancy words into the essays to get a higher score.
The second source is entirely avoidable. Ignore advice from inexperienced teachers who think that using obscure words will help you. It won’t. Some of the essays I’ve read come pretty close to Noam Chomsky’s famous “colorless green ideas sleep furiously.” That’s a beautiful sentence, but no meaning can be derived from it.
As I reported yesterday, ETS (formerly the Educational Testing Service) is seeking a new executive director for the Office of Testing Integrity. If I were to advise the incoming director, I would recommend the following changes.
1. Staff up. Staff way up. Administrative review for TOEFL tests is supposed to finish in 2-4 weeks. I often hear from students who have waited for much longer. One student who spoke to me recently waited for 102 days.
2. Help test-takers help themselves. I often hear from students who have experienced score cancellations due to unauthorized software running in the background. Remember that in the Windows 10+ era it is a lot harder to control what goes on in the background of our systems than it used to be. Needless to say, modern versions of Windows are built in a way that makes remote proctoring a challenge. Duolingo recently produced a little video showing students a few ways to avoid such problems. The OTI should have made the same sort of content two years ago.
3. Reconsider the use of statistical data as a justification for score cancellations. There are very valid reasons why a student might, for example, have a speaking score much lower than their listening score. Some of those reasons are cultural. Think about that for a moment.
As always, ETS, you know how to reach me. In lieu of a consulting fee I’m willing to accept meal vouchers for the ETS cafeteria.
The Home Edition is even more popular than I thought. Among Australia-bound students, at least, it accounted for 40% of testing by June of 2021. I bet it is even higher now.
Note how the mean score of Australia-bound students was 93.4 in 2019. That is a bit higher than I would have guessed, but only a little. You can also see the mean scores for each section.
Next, note how the mean score of Australia-bound students taking the test center version of the TOEFL iBT from January to June 2021 was 94.6. That’s a healthy jump, but it is consistent with the fact that the mean increases almost every year in most countries. This is our very first look at 2021 data, by the way.
But note that the mean score of Australia-bound students taking the Home Edition of the TOEFL iBT from January to June 2021 was 96.9! That’s more than two points higher than the mean for people taking it at a test center. Wild.
For people taking the Home Edition, reading scores were 0.8 higher, listening scores were 1.0 higher, and writing scores were 1.2 higher.
Interestingly, speaking scores on the Home Edition were 0.6 lower. That’s curious, but I think it means my advice about getting a good microphone and testing it is solid. I can say, from experience, that trying to assess a spoken answer recording with a crappy microphone can be a frustrating experience. My “scores” tend to be lower when assessing students who decline to use a proper recording device. This is worthy of further study by ETS, I think.
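Summing the per-section gaps above is a quick way to sanity-check the overall Home Edition vs. test-center difference. Here is a back-of-the-envelope sketch in Python, using only the figures reported in this post; the small mismatch with the overall gap is presumably rounding in the published per-section numbers:

```python
# Per-section Home Edition minus test-center differences reported above
section_gaps = {
    "reading": 0.8,    # Home Edition higher
    "listening": 1.0,  # higher
    "writing": 1.2,    # higher
    "speaking": -0.6,  # lower
}

# The TOEFL total is the sum of the four section scores, so the
# per-section gaps should add up to roughly the overall gap.
total_from_sections = round(sum(section_gaps.values()), 1)
overall_gap = round(96.9 - 94.6, 1)  # Home Edition mean minus test-center mean

print(total_from_sections)  # 2.4
print(overall_gap)          # 2.3
```

The 0.1-point discrepancy between the two figures is consistent with each section mean having been rounded to one decimal place before publication.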
Does this mean the TOEFL Home Edition is “easier”? No, of course not. It is the same test. Does this mean that the TOEFL Home Edition is a more pleasant testing experience for test takers? Probably. I suspect that students who can test in a comfortable and quiet environment get higher scores. Being able to test at a time of day when they have more energy likely helps as well.
It is worth noting that Chinese students were taking the test exclusively at test centers during this part of 2021, which might also account for the difference. The mean score of Chinese students in 2020 was 87 points, the same as the worldwide mean.
Remember that we have worldwide data for 2020 which showed a massive increase (four points) to the worldwide mean score which, at the time, puzzled me. I think this new report explains that jump and it makes me think there will be a small jump in the 2021 data… and another big one in the 2022 data that will reflect an environment where Chinese students have access to the Home Edition.
Well, I reported a few days ago on the impressive increase to the mean TOEFL score found in the data released by ETS. I expressed some puzzlement at the increase, as it is pretty huge. I’m still not entirely certain why it happened, but after talking it out with some experts, my conclusions are:
The change is mostly due to the shorter test. I guess the shorter version is “easier.” While the reported mean score did not change in 2019, that was partly because of rounding and a drop in the mean writing score. If we look carefully, there were fractional increases in 2019 that hint at a trend.
ETS may have adjusted the e-rater, which scores essays. That’s a normal thing; I think they are on iteration 19 or something like that. I suspect that caused writing scores to increase, and that increase makes up 25% of the overall increase… but the shorter test should have no effect on the writing section. Perhaps they wanted to address the long-term drop in average writing scores.
The increase is caused in large part by China (presumably the number one TOEFL market) and Korea (presumably the number two TOEFL market). Increases in the mean score probably reflect advances in preparation techniques in those countries. Coincidentally I spent the month before the score data release reporting on those advances.
ETS has just uploaded a chart to convert between TOEFL iBT and TOEFL Essentials scores. I’ve copied it here for you, but be sure to check out the main TOEFL Essentials Page for more information, including conversion charts for each section of the test.
Soon I will start a list of schools that accept the test, and I will maintain it until ETS publishes their own list.
ETS has created a new subsidiary called EdAgree. EdAgree is described as
…an advocate for international students providing a path to help students identify universities that will push them towards longer term success. We help you put your best foot forward during the admissions process and support you throughout your study abroad and beyond.
As part of this mission, they provide free English speaking practice using the same SpeechRater technology that is used to grade the TOEFL!
To access this opportunity, register for a free account on EdAgree. After that, look for the “English Speaking Practice” button in the student dashboard. The screenshot is from the desktop version, but it also works on mobile.
This section provides a complete set of four TOEFL speaking questions. After you answer them, you’ll get a SpeechRater score in several different categories (pause frequency, distribution of pauses, repetitions, rhythm, response length, speaking rate, sustained speech, vocabulary depth, vocabulary diversity, vowels). These categories are used on the real TOEFL to determine your score! You can also listen to recordings of your answers. Note that your responses are scored collectively, rather than individually. That means, for example, that you get a “pause frequency” score for how you answered all four questions, and not a separate “pause frequency” score for each individual answer.
Update: The list of above categories has been revised a few times, as EdAgree has tweaked the tool.
Note that you will get fresh questions every five days. I do not know how many unique sets there are in total. Keep visiting and let me know. However, you can repeat the same questions as many times as you wish.
I took a set a few days ago, and the questions were pretty good. They weren’t 100% the same as the real TOEFL, but they were better than what is found in most textbooks.
It should also be noted that you could probably just use your own questions instead of the ones provided. Do you get what I mean? You are being scored based on technical features, which means that the scores will still be relevant no matter what question you answer.
Let me know if you guys enjoy the tool. Meanwhile, here is my first set of results. I still have room for improvement, as you can see!
Note: This screenshot does not include all of the categories mentioned above, as they were not available when the service started.
Here’s a mildly interesting article about student responses to speaking question three. The authors have charted out the structure of two sample questions provided by ETS, and tracked how many of the main ideas students of various levels included in their answers (again, provided by ETS).
There is some good stuff in here for TOEFL teachers, particularly in how the authors map out the progression of “idea units” in the source materials. They identified how test-takers of various levels represented these idea units in their answers, particularly how many of them they included. Fluent speakers (or, I guess, proficient test-takers) represented more of the idea units, and also presented them in about the same order as in the sources.
Something I found quite striking is that one of the question sets studied was much easier than the other, a difference the authors themselves describe. I am left wondering how ETS deals with this sort of thing. The rubric doesn’t really have room to adjust for question difficulty changing week by week.
At least once a week, a student asks me if they should get a TOEFL score review (for speaking or writing). The short answer is: “probably not.” The long answer is: “if the cost is no problem, you should do it.”
However, if you want more information, here’s what you should note:
In my experience, a score review results in an increase about 10% of the time. The rest of the time, the score stays the same. In rare cases, the score goes down. This rate is the same for both speaking and writing.
However, I have heard from some students who have gotten large increases, up to four points. I can’t really explain this.
ETS says it takes 1 to 3 weeks. Lately, though, score reviews seem to get finished in just two or three days. But don’t plan on getting one back so quickly.
During a score review, the e-rater (for writing) and SpeechRater (for speaking) are not used. This isn’t published information, but ETS has confirmed it when asked. This means that score reviews are done entirely by humans. If you think that you were punished by the automated scoring systems, you might want to request a review.
The person doing the score review will not see your original score. They will not be biased by your old score, but they will probably be aware that they are doing a score review.
In terms of money:
The score review is really expensive: eighty dollars for one section. However, if your score changes, this money will be returned to you. This is not a published policy, though, so it could change at any time.
Note that most students are not able to request a score review, since it is not possible once your scores have been sent to institutions. If you selected free score recipients when you signed up for the test, they will get your scores right away, and thus you will not be able to request a review.
These annual reports provide valuable data about test taker performance. While this year’s figures are similar to last year’s figures, the following data points were mildly interesting to me:
The overall mean (average) score is still 83. But that figure is rounded, and it looks like there was still a significant fractional increase this year.
The mean reading score is now 21.2 (+.4)
The mean listening score is now 20.9 (+.3)
The mean speaking score is now 20.6 (+.1)
The mean writing score is now 20.5 (-.2)
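Those section means make the “rounded” claim easy to check. Here is a quick Python sketch, using only the figures listed above; it assumes the overall mean is simply the sum of the four section means, which holds because the TOEFL total is the sum of its four sections:

```python
# This year's section means and their year-over-year changes, from above
sections = {"reading": 21.2, "listening": 20.9, "speaking": 20.6, "writing": 20.5}
changes  = {"reading": 0.4,  "listening": 0.3,  "speaking": 0.1,  "writing": -0.2}

# Mean total = sum of section means; last year's means = this year's minus the changes
this_year = round(sum(sections.values()), 1)
last_year = round(sum(sections[k] - changes[k] for k in sections), 1)

print(this_year, last_year)                # 83.2 82.6
print(round(this_year), round(last_year))  # 83 83
```

Both 83.2 and 82.6 round to 83, which is why the reported overall mean looks unchanged despite a fractional increase of roughly 0.6 points.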
It is interesting that the writing score has decreased. That may represent an ongoing trend. Here are writing scores since 2010:
2012: No data
2011: No data
Some students do claim that the writing section has been getting more difficult in recent years. They may be correct about that, but it looks like the test was really challenging back in 2014. And the mean writing score is exactly where it was a decade ago.
Interestingly, the other sections are all up since 2010. Some by a lot:
Reading: 20.1 –> 21.2
Listening: 19.5 –> 20.9
Speaking: 20.0 –> 20.6
It is also worth noting that the use of automated speaking scoring does not appear to have affected average speaking scores, but that technology was only used during the last five months of 2019.
As always, it seems like a lot of the overall increase in scores is coming from the test-prep powerhouses of East Asia. Scores in China are +1 (to 81), scores in Japan are +1 (to 72) and scores in Taiwan are +1 (to 83). However, scores in Korea are -1 (to 83).
Scores in the key markets of Brazil (87) and India (95) are unchanged.
I would love to see which countries have the most test-takers, but I suspect that information is confidential.
The highest scoring country is now Austria, where the average score is 100.
Women still outperform men in listening, speaking and writing.
Hey, I wrote an article on the main page about converting raw TOEFL scores to scaled TOEFL scores. It even has a couple of handy charts you can use when you take practice tests. Check it out over here.