In an interview with the Free Press Journal, ETS India head Sachin Jain noted that 90% of people who take the TOEFL opt to study in the USA.  According to Jain:

“Approximately 90% of TOEFL test takers choose to study in the United States, highlighting its status as the most sought-after destination for higher education. The remaining 10% diversify their academic paths by pursuing opportunities in other leading countries, including the United Kingdom, Canada, Australia, and Germany.”

The figure is a bit higher than I supposed it would be.  Perhaps efforts to crack the lucrative Canadian and Australian markets remain a work in progress.

With such a high percentage of test takers heading to the USA, the test could be at risk of losing market share to competing products that have gained more acceptance in that country in recent years.

It may be worth mentioning that while the USA is the most sought-after destination for higher education, it is only just barely so.  There were 1.1 million international students in the USA in 2023/24, compared to 1.04 million in Canada (2023) and 825,000 in Australia (2024).  For testing firms, appealing to students heading to a range of destinations is the key to financial success and security.

After I return from my holiday, I will probably take the Duolingo English Test. Let me know if there is anything I should keep an eye out for. I’ve taken this test in the past, but not since the secondary camera requirement was introduced. I’m curious to see how that feels. I haven’t experienced the latest round (several rounds?) of item revisions either.

I’d like to take the TOEFL Essentials Test. A few days ago I was caught flat-footed when someone asked me to help them prep for it. And I fear that one day the test will disappear and I’ll miss my chance.

I’m curious about how Pearson does at-home testing.

Leave a comment if there are any other tests I should try to check out. In 2024, I took the following tests:

  • Password Plus
  • Skills for English SELT
  • TOEFL iBT
  • PTE Core
  • PTE Academic
  • EnglishScore (all 3)
  • MET
  • LANGUAGECERT Academic

According to reports that rolled in last week, the Educational Testing Service (ETS) has begun training individuals from outside the USA to score TOEFL test taker responses and to serve as scoring leaders.

This seems to represent something of a shift as far as the TOEFL scoring process goes.  To date, responses have been scored solely by individuals physically located in the USA (and in possession of a degree from an American university).  It is unclear at this time which countries the new raters will be located in.

IDP Education has joined the discussion on “templated responses.” Australasia/Japan head Michael James noted in an article shared to LinkedIn that:

“AI’s role in high-stakes language testing has gained attention recently, particularly after a computer-marked test revised its scoring process to include human evaluators. This change has ignited a debate on this platform about a computer’s ability to identify templated responses.”

James points out that:

“The importance of human marking in high-stakes English language assessment cannot be overstated. IELTS examiners are highly trained language experts who bring a nuanced understanding and contextual awareness that AI systems lack. They can discern not only the grammatical correctness and structural integrity of a response, but also the underlying intent, creativity, and coherence of the content. This real-time, human-centred approach aims to reveal a student’s true abilities and potential.”

His article refers to the “cautiously curious approach” that the IELTS partnership has used in the past to describe its stance on AI.

There is more worth quoting here, but it is probably best to check it out yourself at the link above.

Moving forward, I would love to hear more about the humans who do this sort of work. Not just the humans who rate IELTS responses, but those who rate responses in all sorts of tests. Who are they? What makes them “highly trained experts”? How do they discern X, Y, Z? Are they under pressure to work quickly? These are questions asked not only by score users, but (more importantly and more frequently) by test takers themselves.

Wrapping up this series on “templated responses,” I want to share a few paragraphs recently added to the ETS website (via the new FAQ for TOEFL):

Some test takers use templates in the speaking or writing sections of the TOEFL iBT test. It can be considered a kind of “blueprint” that helps test takers organize their thoughts and write systematically within the limited time when responding to the speaking section or composing an essay.

However, there are risks to using templates. They can be helpful to establish a general structure for your response, but if they do more than that, you’re probably violating ETS testing policies. The test rules are very strict that you must be using your own words in your responses, and not those from others, or passages that you have previously memorized.

ETS uses detection software to identify writing passages that are similar to outside sources or other test takers’ responses. If you use a template, there is a high probability of providing responses similar to those of other test-takers. So we strongly recommend that you produce your own responses during the test itself.

This is significant, as it is perhaps the first time ETS has directly referenced “templates” in a communication to test takers.  The TOEFL Bulletin has long contained references to “memorized content,” but that’s not quite the same thing.

The verbiage may be tricky for some to fully grasp.  Templates are “helpful to establish a general structure” but test takers “must be using [their] own words” in their responses.  When does a template cross the line from being helpful to being a violation of ETS testing policies?  That’s not immediately clear.

However, John Healy reminded me to check out “Challenges and Innovations in Speaking Assessment,” recently published by ETS via Routledge.  In an insightful chapter, Xiaoming Xi, Pam Mollaun and Larry Davis categorize potential uses of “formulaic responses” in speaking answers and how they should be viewed by raters.  Those categories are:

  1. Practiced lexical and grammatical chunks
  2. Practiced generic discourse markers
  3. Practiced task type-specific organizational frames
  4. Rehearsed generic response for a task type
  5. Heavily rehearsed content
  6. Rehearsed response

I think these categories are self-explanatory, but here are a few quick notes about what they mean:

  1. Formulaic expressions (chunks of sentences) stored in the test taker’s memory.
  2. Stuff like “in conclusion” and “another point worth mentioning…”
  3. Stuff like “The university will ___ because ____ and ____.  The man disagrees because ___ and ___.”
  4. Like category three, but without blanks to be filled in.
  5. Like number 1, but the content “[differs] from formulaic expressions present in natural language use.”  This content is “produced with little adaptation to suit the real task demands.”
  6. A response that is “identical or nearly identical to a known-source text.”

Dive into the chapter for more detailed descriptions.

It is recommended that categories 2 and 3 be scored on merit, that category 1 be scored on merit if the chunks do not match known templates, that spontaneous content in categories 4 and 5 be scored on merit (if any exists), and that category 6 be scored as a zero.
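Stated as a decision rule, those recommendations might be sketched like this. The category numbers follow the Xi, Mollaun and Davis taxonomy above, but the function and flag names are my own inventions, and cases the guidelines leave open are deferred to the chapter itself:

```python
# Toy sketch of the chapter's rating guidelines. Category numbers follow
# the taxonomy above; everything else here is invented for illustration.

def rating_guideline(category: int, matches_known_template: bool = False,
                     has_spontaneous_content: bool = False) -> str:
    """Return the recommended rater action for a formulaic-response category."""
    if category in (2, 3):
        return "score on merit"
    if category == 1:
        # Chunks are scored on merit only if they don't match known templates.
        return ("score on merit" if not matches_known_template
                else "see chapter guidance")
    if category in (4, 5):
        # Only spontaneous content (if any exists) is scored on merit.
        return ("score spontaneous content on merit" if has_spontaneous_content
                else "see chapter guidance")
    if category == 6:
        return "score as zero"
    raise ValueError(f"unknown category: {category}")
```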

Very sensible guidelines.  But putting them into use?  The chapter notes:

“Unfortunately, the guidelines did not remove the challenge of detecting what particular language has been prepared in advance. Certain types of delivery features may be suggestive of memorized content, such as pausing or other fluency features, or use of content that appears ill-fitting or otherwise inappropriate. Nonetheless, it can be quite challenging to detect formulaic responses, especially if the response references a previously unknown source text.”

According to the article, ETS experimented with supplying raters with examples of memorized content to refer to while making decisions, but that didn’t work out for a variety of reasons: raters became overly sensitive to memorized content, and the rating process became too slow.

The authors wrap up the article by making a few suggestions for future study, including redesigned item types and AI tools.

To me, AI tools are a must in 2024, both in terms of correctly identifying overly gamed responses and avoiding false positives.  A quick glance at Glassdoor reviews suggests that response scorers (of a variety of tests) are often low-paid workers who sometimes feel pressure to work quickly.  Tools that help them work more efficiently, accurately and swiftly seem like a good idea.

It is worth sharing a few notes from a long article published by Pearson in the summer.  It provides more information about the topics discussed in Jarrad Merlo’s webinar about the introduction of human raters to the PTE tests.

The article describes how a “gaming detection system” has been developed to aid in the evaluation of two of the speaking questions and one of the writing questions on the PTE. This system gives each response a numerical score from 0 to 1, with 0 indicating the complete absence of gaming and 1 indicating significant evidence of gaming.

This numerical approach seems wise, as the lines between acceptable and unacceptable use of “templated responses” in a response are sometimes blurred.  In a future post, I’ll summarize some research from ETS that discusses this topic.

Meanwhile, Pearson’s article notes that:

“While more rudimentary systems may rely on a simple count of words matching known templates, PTE Academic’s gaming detection system has been designed to consider a number of feature measurements that quantify the similarity of the response to known templates, the amount of authentic content present, the density of templated content, and the coherence of the response”
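The article doesn’t say how those feature measurements are combined, but the basic idea can be sketched. To be clear, the four inputs below are my own stand-ins for the feature families Pearson names, and the equal weights are arbitrary; this is not Pearson’s actual model:

```python
# Illustrative only: Pearson has not published its model. Similarity and
# density push the score up; authentic content and coherence pull it down.

def gaming_score(template_similarity: float, authentic_ratio: float,
                 templated_density: float, coherence: float) -> float:
    """Combine feature measurements (each in 0-1) into a single 0-1 gaming
    score: 0 = no evidence of gaming, 1 = significant evidence."""
    raw = (template_similarity + templated_density
           + (1 - authentic_ratio) + (1 - coherence)) / 4
    return max(0.0, min(1.0, raw))
```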

The article goes on to describe how the results of the AI checks are passed along to human raters, to aid in their decision-making regarding the content of responses.  It notes that the newly-implemented system:

“enables raters to make better informed content scoring decisions by leveraging the Phase I gaming detection systems to provide them with information about the about the [sic] extent of gaming behaviours detected in the response.”

That’s fascinating.  I’m not aware of another system where human raters can make use of AI-generated data when making decisions about test taker responses.

The article notes that human raters will not check all written responses for templated content.  Checks of most responses will be done entirely by AI that has been trained on a regularly-updated database of templates discovered via crawls of the web and social media.
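One common way to implement that kind of check is word n-gram overlap against the template database. The sketch below is my own toy version, not a description of Pearson’s actual system:

```python
# Toy n-gram overlap check against a database of known templates.

def ngrams(text: str, n: int = 5) -> set:
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def max_template_overlap(response: str, templates: list[str], n: int = 5) -> float:
    """Fraction of the response's n-grams that appear in the closest
    matching known template (0.0 = no overlap, 1.0 = fully templated)."""
    resp = ngrams(response, n)
    if not resp:
        return 0.0
    return max((len(resp & ngrams(t, n)) / len(resp) for t in templates),
               default=0.0)
```

Note that a check like this can only ever be as good as the template database it runs against, which is exactly the weakness discussed below.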

A challenge with this approach that goes unmentioned is the difficulty of detecting templates that don’t show up on the public web.  In my neck of the woods, students pay high fees to “celebrity” test prep experts who create personalized templates that are neither shared publicly nor repeated by future test takers.  This came up in an article by Sugene Kim which I’ll share in the comments.

Perhaps Pearson should go whole hog and bring in human raters for some or all responses in the writing section as well.

More on this in the days ahead.

Moving along, here’s a quick list of changes to Chapter 3 (Listening) and Chapter 4 (Reading) in the new Official Guide to the TOEFL.  Again, I’m focusing on stuff other than the major changes to the test that started back in July.

You can read the whole blog series on changes at the following links: chapter one, chapter two, chapter three and four, chapter five, the tests.

Chapter 3

Pages 122-123:

“Painters and Painting” is added as a potential lecture topic.

“Computer Science” is removed as a potential lecture topic.

“TV/Radio as mass communication” is now “media broadcasting and digital media as mass communication.”

Chapter 4

Page 171:  Again, the length of the reading passage in question #2 is listed as 90-115 words.

Page 177:  Same as above, for question #3.

Page 178:  The sample reading for question #3 is now a single paragraph (same content, though).

Page 189:  Again, the reading passages are listed as 90-115 words.

Our friends from My Speaking Score were cool enough to provide a discount code exclusive to readers of this blog.  Register here and at the time of purchase use the code TESTRESOURCES to save 10% on your purchase of SpeechRater credits.

My Speaking Score makes use of the same SpeechRater AI used by ETS to score the real TOEFL.  You can submit your practice responses and get an accurate score prediction, along with specific scores for metrics like fluency, pronunciation, coherence and grammar.  I use it with all my students.

The site provides questions you can answer or you can just use your microphone to record responses to your own practice questions (like, say, from the TPO sets or Official Guide).

Teachers who work with students in Japan will appreciate this new article in “Language Testing in Asia.” It uses results from the TOEFL (and other tests) to create profiles of language learners in that country. Not surprisingly, the profiles match what our experience tells us.

I like this section at the end:

“When the uneven profiles come from true skill imbalance, learners and teachers may need to decide whether to focus on weaker or stronger skills for further study based on their contexts and needs. While one direction is to improve a weaker skill, a stronger skill could be further improved to compensate for the weaker one.”

Note the last part.

About a third of the students I work with nowadays are from Japan. Most of them come to me for help with the writing section. After looking at their score history I usually offer to help with the speaking section as well. The response is often something like “It is impossible for a student from Japan to score more than 23 in the speaking section, so I’m not going to work on it any more.” Even though their target is 110 overall, they’d rather just max out the other sections than “waste time” on the speaking prep. It’s an interesting approach.  It may be a correct one.

Gary J Ockey and Evgeny Chukharev-Hudilainen published an interesting article in Applied Linguistics. It highlights Ockey’s earlier research suggesting that the asynchronous tasks used in the TOEFL iBT speaking section “may not sufficiently assess interactional competence.”

More importantly, it compares the use of a human interviewer (a la IELTS) to a Speech Delivery System (like a chatbot) to elicit spoken English from test-takers. It seems to suggest that “the computer partner condition was found to be more dependable than the human partner condition for assessing interactional competence.” And that both were equal in areas like pronunciation, grammar and vocabulary.

Aha! This information could be used to create a better TOEFL test or a better IELTS test. Someone should let the test makers know.

No need, though, as I read at the end that “this research was funded by the ETS under a Committee of Examiners and the Test of English as a Foreign Language research grant.”

Implement it right away, I say.

I mention this now because the research will be presented tomorrow at an event hosted by the University of Melbourne.

Over the last couple of days I have been playing with ChatGPT to create TOEFL Integrated Writing questions.  I’ve had some success.  My creations aren’t perfect, but in only 30 minutes I can easily put together something that is better than what most major American publishers put in their best-selling books.  That’s remarkable.

Can I do the same with the speaking section?  Yeah, I can. 

Today I will share a couple of AI-generated TOEFL speaking questions.  These are both “type 3” questions, which include a short reading and a brief lecture on the same topic.   In both cases I probably spent about 30 minutes revising them to be more “TOEFL-like”.

Note that eventually I will stick these questions onto pages that students can more easily use to practice for the test.

First up, I generated one about a unique animal feature, which is a fairly common topic on this part of the TOEFL test.

Here’s the reading:

Transparency in Animals

Transparency is the quality of being able to see through an object. While transparency is commonly associated with glass and other transparent materials, it is also found in a number of different animal species. Transparency in animals is typically achieved through the use of specialized cells, tissues, or structures that allow light to pass through their body. This can provide a number of benefits to the animal, such as improved camouflage, enhanced communication, and reduced drag while swimming. Studying the mechanisms of transparency in animals could potentially lead to the development of new materials and technologies that are inspired by the natural world.

And here is me reading the lecture:

 

And here is a transcript of the lecture:

Okay, so I’ve got an example of transparency in the wild. The glass squid is a type of deep-sea squid that is known for its transparent body and long, thin tentacles. Glass squids are found in the deep waters of the ocean… ah… I’d say…ah… they are typically found at depths of 1000 meters or more.

The transparency of the glass squid’s body is thought to be a form of deep-sea camouflage, as it allows the squid to blend in with the surrounding water and avoid being detected by predators. The transparency of the glass squid’s body also helps it to avoid being seen by its prey, allowing it to sneak up on unsuspecting fish that it wants to eat. The ability to use transparency in both of these ways is thought to be extremely important for the survival and success of the glass squid in its challenging deep-sea habitat.

Now, in addition to using transparency for camouflage, the glass squid also uses its transparent body for communication. This is interesting. It has light-emitting organs, called photophores, that are located inside of its body and tentacles. The glass squid uses its photophores to flash patterns of light, which it uses to communicate with other glass squids. This allows it to signal its presence to other members of its species and it also allows it to coordinate its movements with other squids in its group.

Okay, so sometimes the lecture in the type three question relates a sort of anecdote from the speaker’s life.  Can the AI produce one of those?  Yes.

Here is a reading:

Regret Aversion

Regret aversion is a psychological phenomenon in which people are more likely to avoid taking risks or making decisions that may result in regret. This is because people tend to experience negative emotions, such as regret or disappointment, more strongly than positive emotions, such as happiness or satisfaction. As a result, people may avoid taking risks or making decisions that may result in regret, even if those risks or decisions could potentially lead to better outcomes. Regret aversion can affect people’s decision-making in a variety of contexts, including financial decisions, personal relationships, and career choices.

And here is me reading the lecture:

 

And a transcript of the lecture:

Okay, so, I have a perfect example of regret aversion. I had this friend. Alex. Now, Alex was always very cautious when it came to making decisions. He was constantly worried about making the wrong choice and regretting it later on. This tendency became especially pronounced whenever he was faced with a difficult decision.

One time, Alex was considering whether to quit his job and start his own business. He had been working at the same company for several years, but he had always dreamed of being his own boss. The idea of starting his own business was exciting, but it was also risky. If the business failed, Alex could lose a lot of money and damage his reputation. At first, Alex was hesitant to follow through on his plans. He was afraid of the potential consequences if it failed. He was worried that if that happened, he would regret his decision and be disappointed in himself. He was also concerned that he would have to go back to working for someone else, which he didn’t wanna do.

As a result of his regret aversion, Alex decided not to quit his job and start his own business. He continued working at the same company, even though he wasn’t happy there. He missed out on the opportunity to pursue his dream of being his own boss, and he continued to feel unfulfilled and unhappy. Much later, Alex realized that his regret aversion had held him back. He had been so afraid of regretting his decision that he had avoided making a decision altogether. He had missed out on a potentially rewarding opportunity because of his fear of regret.