A new research report from ETS describes how “similar speech responses” are flagged in the TOEFL. “Similar” means responses that are overly similar to those of another test taker, or overly similar to a given question prompt. I’ll post a link to the report in the comments.

Basically, test taker responses are transcribed using speech recognition software (Amazon Transcribe) and run through three similarity checks.  Responses that exceed a certain threshold in all three checks are flagged for human review. The particulars of each check are described, but remain beyond my comprehension.
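
I can't reproduce the report's actual checks, but the "flag only when every check exceeds its threshold" logic is easy to sketch. In the Python below, the three measures (word overlap, word-pair overlap and a character-level sequence ratio), the 0.8 thresholds and all of the function names are my own assumptions for illustration. They are not the methods ETS uses.

```python
from difflib import SequenceMatcher


def word_jaccard(a: str, b: str) -> float:
    """Share of distinct words that two transcripts have in common."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def bigram_jaccard(a: str, b: str) -> float:
    """Same idea as word_jaccard, but over adjacent word pairs."""
    def bigrams(text: str) -> set:
        words = text.lower().split()
        return set(zip(words, words[1:]))

    big_a, big_b = bigrams(a), bigrams(b)
    if not big_a and not big_b:
        return 0.0
    return len(big_a & big_b) / len(big_a | big_b)


def sequence_ratio(a: str, b: str) -> float:
    """Character-level similarity from the standard library (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical (check, threshold) pairs; the real thresholds are unknown to me.
CHECKS = [(word_jaccard, 0.8), (bigram_jaccard, 0.8), (sequence_ratio, 0.8)]


def flag_for_review(response: str, other: str) -> bool:
    """Flag only when ALL three checks exceed their thresholds."""
    return all(check(response, other) > threshold for check, threshold in CHECKS)
```

Requiring all three checks to agree, rather than any one of them, is what keeps a response with merely similar vocabulary from being sent to a human reviewer.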

If my understanding is correct, in the study at hand 29% of flagged responses in the independent speaking task were subject to cancellation after human review, as were 48% of the flagged integrated task responses.

As you may have guessed from the terms above, the article summarizes a study carried out during the period of the old TOEFL iBT.  Actually, it was carried out during the period of the old-old TOEFL (even before the 2023 revisions).

This is relevant because answer similarity, I think, was probably more of an issue in the old TOEFL than it is in the new TOEFL.

The central conceit of the speaking section of that test was that it required the use of integrated skills and asked test takers to combine details from multiple information sources. Accordingly, test takers around the world were trained to use the same “scaffolding” to structure most of their responses (“According to the reading… the lecturer illustrates this concept with a couple of examples… First, he notes…”). This situation may have been exacerbated by a lack of variety in the structure and flow of those sources (recall the announcement and matching conversation in Q2, for example). On top of that, test takers would inevitably paraphrase the sources in similar ways.

The old independent speaking task was a bit better, but false positives are easy to imagine. For instance, there are only so many ways to say that you prefer to work on projects in groups rather than alone. If you ask 50,000 people this question you are bound to hear some eerily similar responses even when everyone is on the level.

The new test is probably a bit better. There are no integrated questions and no information summarization questions, so scaffolding is less of an issue.  And items in the “take an interview” task seem to have more variety and nuance than the old “independent speaking” items. On top of that, the reduction in prep time (0 seconds vs 15 seconds) on the new item turns it into more of a stream of consciousness thing than an overly rehearsed thing.

The design of the new test suggests other issues, of course. I can’t help but think that ETS will need to hire raters to flag every test taker response as on-topic or off-topic, as I think Pearson now does on the PTE.  But that’s something for a later discussion. 

The “Listen and Repeat” task on the new TOEFL attracted much attention soon after ETS published sample test forms. Some commentators pointed to it as evidence that the test would be “easier” than the old TOEFL. I suppose ETS didn’t do themselves any favors by selecting a list of sentences about visiting a zoo for inclusion in the first sample form. I certainly snickered when I first saw it.

In light of this, it is interesting to note that chatter on social media by actual test takers now paints this task as one of the most challenging parts of the new test. While working through practice sets on my own, I’ve enjoyed examining the mental processes involved in repeating longer sentences (~25 syllables) with two clauses. I find myself mentally replaying the first clause while the second is still being played aloud. And mentally repeating the whole thing once again before speaking. A lot goes on in my brain in the 15 seconds it takes to complete all this. Imagine doing it in an L2.

One could therefore argue that this sort of task, while of limited utility overall, can provide insights that we don’t get from a traditional constructed response item or an in-person interview. And since it only takes two minutes to administer, perhaps it can be included without taking time away from more traditional tasks. We could say that this is using the technology of 2026 to our advantage.

Notably, the PTE test also includes a “listen and repeat” task… but the PTE Express does not. I think the Duolingo English Test included one until about 2023, though alcohol consumption has made my memory of that period somewhat foggy. To me, this suggests that if a testmaker has 90 or 120 minutes to play with, they can go ahead and include some shorter items that poke and prod at narrow aspects of language use… even if they don’t exactly resemble real life encounters.

On the other hand, the TOEFL deep dive recently published by the IELTS partners examined the task and noted that:

“…the ‘Listen and Repeat’ task type appears to tap only minimally into higher-level cognitive processing and is weakly aligned with meaning-oriented, authentic oral communication.”

This mirrors comments in a review of the Versant English Speaking and Listening Test (from Pearson) published in Language Assessment Quarterly last month. Regarding the L & R task in that test, it notes that:

“[a]lthough the ‘Repeat the sentence’ task appears to measure some elements of the operational Listening and Speaking sub-constructs, it appears to have minimal relevance to the widely accepted oral communication construct.”

In a response, Pearson’s Bill Bonk and Jooyoung Lee said that:

“…the ability to accurately comprehend and reproduce sentence-level utterances is a foundational prerequisite for communication. Without reliable sentence-level processing – encompassing phonological decoding, lexical access, syntactic parsing, and short-term retention – higher-level discourse processing cannot occur.”

Food for thought.

I often get questions about how timers work on the TOEFL test.  So here’s a quick summary.  These details will be accurate until the TOEFL Test changes on January 21, 2026.

Reading

  • There is one 36-minute timer for the whole reading section.  You will have 36 minutes to read both of the articles and answer all of the questions.

Listening

  • There are two separate timers in the listening section.
  • One of the timers is 10 minutes.  You will have 10 minutes to answer 17 questions about two lectures and one conversation.  The timer only counts down when you are answering questions.  It does not move while you are listening to the lectures and conversation.
  • The other timer is 6.5 minutes.  You will have 6.5 minutes to answer 11 questions about one lecture and one conversation.  The timer only counts down when you are answering questions.  It does not move when you are listening to the lecture and conversation.
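
For a rough sense of pacing, the listening figures above work out to about 35 seconds per question in both parts. A quick Python sketch of that arithmetic, using only the numbers listed above (the variable and function names are mine):

```python
# (timer length in minutes, number of questions), taken from the list above
listening_parts = {
    "two lectures + one conversation": (10, 17),
    "one lecture + one conversation": (6.5, 11),
}


def seconds_per_question(minutes: float, questions: int) -> float:
    """Average answering time per question, in seconds."""
    return minutes * 60 / questions


for name, (minutes, questions) in listening_parts.items():
    pace = seconds_per_question(minutes, questions)
    print(f"{name}: {pace:.1f} seconds per question")
```

Remember that this is answering time only, since the timers do not move while the audio is playing.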

Speaking

  • Question One:  After you hear the question you will have 15 seconds to prepare and 45 seconds to speak.
  • Question Two:  You will have 45 or 50 seconds to read the announcement.  Then you will listen to a conversation.  Then you will have 30 seconds to prepare and 60 seconds to speak.
  • Question Three:  You will have 45 or 50 seconds to read the article.  Then you will listen to a lecture.  Then you will have 30 seconds to prepare and 60 seconds to speak.
  • Question Four:  You will first listen to a lecture.  Then you will have 20 seconds to prepare and 60 seconds to speak.

Writing

  • Question One: First, you will have 3 minutes to read the article.  Then you will listen to a lecture.  Then you will have 20 minutes to write your response.  The article will be visible as you write.
  • Question Two:  You will have 10 minutes to read everything and write your response.

In all sections, timers only start after instructions have been given.  There are no breaks.

I finally uploaded the 2025 version of my guide to TOEFL Speaking Question One.  It includes new sample questions, a new sample answer, a couple of templates (one reason & two reasons), some general tips and some comments about AI.  It also discusses how fast you should speak when giving an answer.  Check it out below, and stay tuned for the 2025 updates to questions two through four.

Express scoring for the TOEFL is available again.  Test takers who pay a fee of $149 will receive their scores within 24 hours of taking the test.  Otherwise, scores are reported in 4-8 days. This option first appeared near the end of 2024, but was quickly withdrawn. At that time the fee was $99.

Interestingly, this option only appears when the test is to be taken at a test center. I don’t see it when attempting to book an at-home test.

Having this option is better than not having it.  But note that IELTS scores are now delivered in 1-2 days without an extra fee and that Pearson promises to deliver PTE scores in 2 days without asking for any additional payment either.

There is a wonderful new article in Language Testing Journal by Emma Bruce, Karen Dunn and Tony Clark which explores test score validity periods for high-stakes tests.  It isn’t in open access, though, so you’ll need institutional access or a healthy billfold to read it.

As most readers know, institutions and regulatory bodies generally won’t accept scores from tests taken more than two years ago.  This is based on research and advice from test makers, though the authors note that:

“While the role of test providers and language testing researchers is not to set the policy for test score use, it is becoming apparent that the messaging surrounding validity periods may benefit from consideration through a contemporary lens. While it is certain that test developers have a responsibility to communicate the idea that the fidelity of a test score in reflecting test-takers’ language proficiency may change over time depending on the circumstances of the test-taker in the period between taking the test and using the score, it is of equal import to communicate–especially to policymakers–the possibility of adapting the 2-year requirement according to risk or need in any given setting.”

Unmentioned is the fact that even if institutions desire to accept scores that are older than two years, it can be exceptionally difficult to actually receive those scores.  Correct me if I’m wrong, but I believe that none of the big four tests (TOEFL, IELTS, PTE and DET) allow test takers to send scores to recipients more than two years after a test date. In this way, it seems like the test makers are semi-enforcing a two-year validity period. I can’t even view the scores from my 2022 attempt at the TOEFL within my account on the ETS website.

After I return from my holiday, I will probably take the Duolingo English Test. Let me know if there is anything I should keep an eye out for. I’ve taken this test in the past, but not since the secondary camera requirement was introduced. I’m curious to see how that feels. I haven’t experienced the latest round (several rounds?) of item revisions either.

I’d like to take the TOEFL Essentials Test. A few days ago I was caught flat-footed when someone asked me to help them prep for it. And I fear that one day the test will disappear and I’ll miss my chance.

I’m curious about how Pearson does at-home testing.

Leave a comment if there are any other tests I should try to check out. In 2024, I took the following tests:

  • Password Plus
  • Skills for English SELT
  • TOEFL iBT
  • PTE Core
  • PTE Academic
  • EnglishScore (all 3)
  • MET
  • LANGUAGECERT Academic

According to reports that rolled in last week, the Educational Testing Service (ETS) has begun training individuals from outside the USA to score TOEFL test taker responses and to serve as scoring leaders.

This seems to represent something of a shift as far as the TOEFL scoring process goes.  To date, responses have been scored solely by individuals physically located in the USA (and in possession of a degree from an American university).  It is unclear at this time which countries the new raters will be located in.

Update from May 2025:  It appears that all of the new raters and scoring leaders are based in India and employed by an outsourcing firm called Firstsource.

Update:  For a little more confirmation, head over to the ETS Glassdoor page.

IDP Education has joined the discussion on “templated responses.” Australasia/Japan head Michael James noted in an article shared to LinkedIn that:

“AI’s role in high-stakes language testing has gained attention recently, particularly after a computer-marked test revised its scoring process to include human evaluators. This change has ignited a debate on this platform about a computer’s ability to identify templated responses.”

James points out that:

“The importance of human marking in high-stakes English language assessment cannot be overstated. IELTS examiners are highly trained language experts who bring a nuanced understanding and contextual awareness that AI systems lack. They can discern not only the grammatical correctness and structural integrity of a response, but also the underlying intent, creativity, and coherence of the content. This real-time, human-centred approach aims to reveal a student’s true abilities and potential.”

His article refers to the “cautiously curious approach,” a phrase the IELTS partnership has used in the past to describe its stance on AI.

There is more worth quoting here, but it is probably best to check it out yourself at the link above.

Moving forward, I would love to hear more about the humans who do this sort of work. Not just the humans who rate IELTS responses, but those who rate responses in all sorts of tests. Who are they? What makes them “highly trained experts”? How do they discern X, Y, Z? Are they under pressure to work quickly? These are questions asked not only by score users, but (more importantly and more frequently) by test takers themselves.

Wrapping up this series on “templated responses” I want to share a few paragraphs recently added to the ETS website (via the new FAQ for TOEFL):

Some test takers use templates in the speaking or writing sections of the TOEFL iBT test. It can be considered a kind of “blueprint” that helps test takers organize their thoughts and write systematically within the limited time when responding to the speaking section or composing an essay.

However, there are risks to using templates. They can be helpful to establish a general structure for your response, but if they do more than that, you’re probably violating ETS testing policies. The test rules are very strict that you must be using your own words in your responses, and not those from others, or passages that you have previously memorized.

ETS uses detection software to identify writing passages that are similar to outside sources or other test takers’ responses. If you use a template, there is a high probability of providing responses similar to those of other test-takers. So we strongly recommend that you produce your own responses during the test itself.

This is significant, as it is perhaps the first time ETS has directly referenced “templates” in a communication to test takers.  The TOEFL Bulletin has long contained references to “memorized content,” but that’s not quite the same thing.

The verbiage may be tricky for some to fully grasp.  Templates are “helpful to establish a general structure” but test takers “must be using [their] own words” in their responses.  When does a template cross the line from being helpful to being a violation of ETS testing policies?  That’s not immediately clear.

However, John Healy reminded me to check out “Challenges and Innovations in Speaking Assessment,” recently published by ETS via Routledge.  In an insightful chapter, Xiaoming Xi, Pam Mollaun and Larry Davis categorize potential uses of “formulaic responses” in speaking answers and how they should be viewed by raters.  Those categories are:

  1. Practiced lexical and grammatical chunks
  2. Practiced generic discourse markers
  3. Practiced task type-specific organizational frames
  4. Rehearsed generic response for a task type
  5. Heavily rehearsed content
  6. Rehearsed response

I think these categories are self-explanatory, but here are a few quick notes about what they mean:

  1. Formulaic expressions (chunks of sentences) stored in the test taker’s memory.
  2. Stuff like “in conclusion” and “another point worth mentioning…”
  3. Stuff like “The university will ___ because ____ and ____.  The man disagrees because ___ and ___.”
  4. Like category three, but without blanks to be filled in.
  5. Like number 1, but the content “[differs] from formulaic expressions present in natural language use.”  This content is “produced with little adaptation to suit the real task demands.”
  6. A response that is “identical or nearly identical to a known-source text.”

Dive into the chapter for more detailed descriptions.

It is recommended that categories 2 and 3 be scored on merit, that category 1 be scored on merit if the chunks do not match known templates, that spontaneous content in categories 4 and 5 be scored on merit (if any exists), and that category 6 be scored as a zero.
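
To make those recommendations easier to scan, here they are as a small Python lookup. This is only my paraphrase of the summary above. The function name, the parameter and the return strings are mine, and the chapter does not present its guidelines as code.

```python
from typing import Optional


def scoring_action(category: int, matches_known_template: Optional[bool] = None) -> str:
    """Return the recommended handling for a formulaic-response category (1-6)."""
    if category in (2, 3):
        return "score on merit"
    if category == 1:
        # Scored on merit only if the chunks do not match known templates.
        if matches_known_template:
            return "not covered by the summary above"
        return "score on merit"
    if category in (4, 5):
        return "score any spontaneous content on merit"
    if category == 6:
        return "score as zero"
    raise ValueError("category must be between 1 and 6")


for number in range(1, 7):
    print(number, scoring_action(number, matches_known_template=False))
```

Note the gap in category 1: my summary does not say what happens when the chunks do match a known template, so the sketch says so rather than guessing.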

Very sensible guidelines.  But putting them into use?  The chapter notes:

“Unfortunately, the guidelines did not remove the challenge of detecting what particular language has been prepared in advance. Certain types of delivery features may be suggestive of memorized content, such as pausing or other fluency features, or use of content that appears ill-fitting or otherwise inappropriate. Nonetheless, it can be quite challenging to detect formulaic responses, especially if the response references a previously unknown source text.”

According to the article, ETS experimented with supplying raters with examples of memorized content to refer to while making decisions, but that didn’t work out for a variety of reasons. Raters became overly sensitive to memorized content. The rating process became too slow.

The authors wrap up the article by making a few suggestions for future study, including redesigned item types and AI tools.

To me, AI tools are a must in 2024, both in terms of correctly identifying overly gamed responses and avoiding false positives.  A quick glance at Glassdoor reviews suggests that response scorers (of a variety of tests) are often low-paid workers who sometimes feel pressure to work quickly.  Tools that help them work more efficiently, accurately and swiftly seem like a good idea.

It is worth sharing a few notes from a long article published by Pearson in the summer.  It provides more information about the topics discussed in Jarrad Merlo’s webinar about the introduction of human raters to the PTE tests.

The article describes how a “gaming detection system” has been developed to aid in the evaluation of two of the speaking questions and one of the writing questions on the PTE. This system gives each response a numerical score from 0 to 1, with 0 indicating the complete absence of gaming and 1 indicating significant evidence of gaming.

This numerical approach seems wise, as the line between acceptable and unacceptable use of templates in a response is sometimes blurred.  In a future post, I’ll summarize some research from ETS that discusses this topic.

Meanwhile, Pearson’s article notes that:

“While more rudimentary systems may rely on a simple count of words matching known templates, PTE Academic’s gaming detection system has been designed to consider a number of feature measurements that quantify the similarity of the response to known templates, the amount of authentic content present, the density of templated content, and the coherence of the response”
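
For contrast, here is roughly what the “more rudimentary” kind of system mentioned in that quote might look like: count the words that fall inside known template phrases and express the result on a 0 to 1 scale. This Python sketch is my own guess at such a baseline. The function name and the sample phrases are hypothetical, and it is not Pearson’s gaming detection system.

```python
from typing import List


def templated_word_share(response: str, template_phrases: List[str]) -> float:
    """Fraction of response words that sit inside a known template phrase.

    Matching is exact, on lowercased, whitespace-separated words;
    punctuation is not stripped.
    """
    words = response.lower().split()
    if not words:
        return 0.0
    covered = [False] * len(words)
    for phrase in template_phrases:
        phrase_words = phrase.lower().split()
        n = len(phrase_words)
        if n == 0:
            continue
        # Slide the phrase across the response and mark every matching word.
        for start in range(len(words) - n + 1):
            if words[start:start + n] == phrase_words:
                for i in range(start, start + n):
                    covered[i] = True
    return sum(covered) / len(words)


# Hypothetical template phrases, for illustration only.
known = ["in conclusion", "another point worth mentioning"]
print(templated_word_share("in conclusion i like dogs", known))
```

A count like this says nothing about how much authentic content surrounds the template or whether the response is coherent, which is presumably why the quote describes it as rudimentary.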

The article goes on to describe how the results of the AI checks are passed along to human raters, to aid in their decision-making regarding the content of responses.  It notes that the newly-implemented system:

“enables raters to make better informed content scoring decisions by leveraging the Phase I gaming detection systems to provide them with information about the about the [sic] extent of gaming behaviours detected in the response.”

That’s fascinating.  I’m not aware of another system where human raters can make use of AI-generated data when making decisions about test taker responses.

The article notes that human raters will not check all written responses for templated content.  Checks of most responses will be done entirely by AI that has been trained on a regularly-updated database of templates discovered via crawls of the web and social media.

A challenge with this approach that goes unmentioned is the difficulty of detecting templates that don’t show up on the public web.  In my neck of the woods, students pay high fees to “celebrity” test prep experts who create personalized templates that are neither shared publicly nor repeated by future test takers.  This came up in an article by Sugene Kim which I’ll share in the comments.

Perhaps Pearson should go whole hog and bring in human raters for some or all responses in the writing section as well.

More on this in the days ahead.