Express scoring for the TOEFL is available again.  Test takers who pay a fee of $149 will receive their scores within 24 hours of taking the test.  Otherwise, scores are reported in 4-8 days. This option first appeared near the end of 2024, but was quickly withdrawn. At that time the fee was $99.

Interestingly, this option only appears when the test is to be taken at a test center. I don’t see it when attempting to book an at-home test.

Having this option is better than not having it.  But note that IELTS scores are now delivered in 1-2 days at no extra charge, and that Pearson promises to deliver PTE scores in 2 days without any additional payment either.

There is a wonderful new article in Language Testing Journal by Emma Bruce, Karen Dunn and Tony Clark that explores test score validity periods for high-stakes tests.  It isn’t open access, though, so you’ll need institutional access or a healthy billfold to read it.

As most readers know, institutions and regulatory bodies generally won’t accept scores from tests taken more than two years ago.  This is based on research and advice from test makers, though the authors note that:

“While the role of test providers and language testing researchers is not to set the policy for test score use, it is becoming apparent that the messaging surrounding validity periods may benefit from consideration through a contemporary lens. While it is certain that test developers have a responsibility to communicate the idea that the fidelity of a test score in reflecting test-takers’ language proficiency may change over time depending on the circumstances of the test-taker in the period between taking the test and using the score, it is of equal import to communicate–especially to policymakers–the possibility of adapting the 2-year requirement according to risk or need in any given setting.”

Unmentioned is the fact that even if institutions desire to accept scores that are older than two years, it can be exceptionally difficult to actually receive those scores.  Correct me if I’m wrong, but I believe that none of the big four tests (TOEFL, IELTS, PTE and DET) allow test takers to send scores to recipients more than two years after a test date. In this way, it seems like the test makers are semi-enforcing a two-year validity period. I can’t even view the scores from my 2022 attempt at the TOEFL within my account on the ETS website.

After I return from my holiday, I will probably take the Duolingo English Test. Let me know if there is anything I should keep an eye out for. I’ve taken this test in the past, but not since the secondary camera requirement was introduced. I’m curious to see how that feels. I haven’t experienced the latest round (several rounds?) of item revisions either.

I’d like to take the TOEFL Essentials Test. A few days ago I was caught flat-footed when someone asked me to help them prep for it. And I fear that one day the test will disappear and I’ll miss my chance.

I’m curious about how Pearson does at-home testing.

Leave a comment if there are any other tests I should try to check out. In 2024, I took the following tests:

  • Password Plus
  • Skills for English SELT
  • TOEFL iBT
  • PTE Core
  • PTE Academic
  • EnglishScore (all 3)
  • MET
  • LANGUAGECERT Academic

According to reports that rolled in last week, the Educational Testing Service (ETS) has begun training individuals from outside the USA to score TOEFL test taker responses and to serve as scoring leaders.

This seems to represent something of a shift as far as the TOEFL scoring process goes.  To date, responses have been scored solely by individuals physically located in the USA (and in possession of a degree from an American university).  It is unclear at this time which countries the new raters will be located in.

Update:  For a little more confirmation, head over to the ETS Glassdoor page.

IDP Education has joined the discussion on “templated responses.” Australasia/Japan head Michael James noted in an article shared to LinkedIn that:

“AI’s role in high-stakes language testing has gained attention recently, particularly after a computer-marked test revised its scoring process to include human evaluators. This change has ignited a debate on this platform about a computer’s ability to identify templated responses.”

James points out that:

“The importance of human marking in high-stakes English language assessment cannot be overstated. IELTS examiners are highly trained language experts who bring a nuanced understanding and contextual awareness that AI systems lack. They can discern not only the grammatical correctness and structural integrity of a response, but also the underlying intent, creativity, and coherence of the content. This real-time, human-centred approach aims to reveal a student’s true abilities and potential.”

His article refers to the “cautiously curious approach” that the IELTS partnership has used in the past to describe its stance on AI.

There is more worth quoting here, but it is probably best to check it out yourself at the link above.

Moving forward, I would love to hear more about the humans who do this sort of work. Not just the humans who rate IELTS responses, but those who rate responses in all sorts of tests. Who are they? What makes them “highly trained experts”? How do they discern X, Y, Z? Are they under pressure to work quickly? These are questions asked not only by score users, but (more importantly, and more frequently) by test takers themselves.

Wrapping up this series on “templated responses,” I want to share a few paragraphs recently added to the ETS website (via the new FAQ for TOEFL):

Some test takers use templates in the speaking or writing sections of the TOEFL iBT test. It can be considered a kind of “blueprint” that helps test takers organize their thoughts and write systematically within the limited time when responding to the speaking section or composing an essay.

However, there are risks to using templates. They can be helpful to establish a general structure for your response, but if they do more than that, you’re probably violating ETS testing policies. The test rules are very strict that you must be using your own words in your responses, and not those from others, or passages that you have previously memorized.

ETS uses detection software to identify writing passages that are similar to outside sources or other test takers’ responses. If you use a template, there is a high probability of providing responses similar to those of other test-takers. So we strongly recommend that you produce your own responses during the test itself.

This is significant, as it is perhaps the first time ETS has directly referenced “templates” in a communication to test takers.  The TOEFL Bulletin has long contained references to “memorized content,” but that’s not quite the same thing.

The verbiage may be tricky for some to fully grasp.  Templates are “helpful to establish a general structure” but test takers “must be using [their] own words” in their responses.  When does a template cross the line from being helpful to being a violation of ETS testing policies?  That’s not immediately clear.

However, John Healy reminded me to check out “Challenges and Innovations in Speaking Assessment,” recently published by ETS via Routledge.  In an insightful chapter, Xiaoming Xi, Pam Mollaun and Larry Davis categorize potential uses of “formulaic responses” in speaking answers and describe how raters should view them.  Those categories are:

  1. Practiced lexical and grammatical chunks
  2. Practiced generic discourse markers
  3. Practiced task type-specific organizational frames
  4. Rehearsed generic response for a task type
  5. Heavily rehearsed content
  6. Rehearsed response

I think these categories are self-explanatory, but here are a few quick notes about what they mean:

  1. Formulaic expressions (chunks of sentences) stored in the test taker’s memory.
  2. Stuff like “in conclusion” and “another point worth mentioning…”
  3. Stuff like “The university will ___ because ____ and ____.  The man disagrees because ___ and ___.”
  4. Like category three, but without blanks to be filled in.
  5. Like number 1, but the content “[differs] from formulaic expressions present in natural language use.”  This content is “produced with little adaptation to suit the real task demands.”
  6. A response that is “identical or nearly identical to a known-source text.”

Dive into the chapter for more detailed descriptions.

It is recommended that categories 2 and 3 be scored on merit, that category 1 be scored on merit if the chunks do not match known templates, that spontaneous content in categories 4 and 5 be scored on merit (if any exists), and that category 6 be scored as a zero.
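Purely as an illustration (my own sketch, not anything from the chapter or from ETS), those recommendations map onto a simple decision rule like the one below. The category numbers follow the list above; the input flags are hypothetical.

```python
# Illustrative sketch of the chapter's rater guidelines, not ETS code.
# Category numbers follow the list above; the input flags are hypothetical.

def scoring_decision(category: int,
                     matches_known_template: bool = False,
                     has_spontaneous_content: bool = False) -> str:
    if category in (2, 3):      # generic discourse markers, organizational frames
        return "score on merit"
    if category == 1:           # practiced lexical and grammatical chunks
        # My reading of the guideline: chunks matching known templates get no credit.
        return "score on merit" if not matches_known_template else "do not credit matching chunks"
    if category in (4, 5):      # rehearsed generic response, heavily rehearsed content
        return "score spontaneous content on merit" if has_spontaneous_content else "little to credit"
    if category == 6:           # identical or near-identical to a known source
        return "score as zero"
    raise ValueError("category must be 1-6")

print(scoring_decision(6))                                # score as zero
print(scoring_decision(1, matches_known_template=True))   # do not credit matching chunks
```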

Very sensible guidelines.  But putting them into use?  The chapter notes:

“Unfortunately, the guidelines did not remove the challenge of detecting what particular language has been prepared in advance. Certain types of delivery features may be suggestive of memorized content, such as pausing or other fluency features, or use of content that appears ill-fitting or otherwise inappropriate. Nonetheless, it can be quite challenging to detect formulaic responses, especially if the response references a previously unknown source text.”

According to the article, ETS experimented with supplying raters with examples of memorized content to refer to while making decisions, but that didn’t work out for a variety of reasons: raters became overly sensitive to memorized content, and the rating process became too slow.

The authors wrap up the article by making a few suggestions for future study, including redesigned item types and AI tools.

To me, AI tools are a must in 2024, both for correctly identifying overly gamed responses and for avoiding false positives.  A quick glance at Glassdoor reviews suggests that response scorers (of a variety of tests) are often low-paid workers who sometimes feel pressure to work quickly.  Tools that help them work more accurately and efficiently seem like a good idea.

It is worth sharing a few notes from a long article published by Pearson in the summer.  It provides more information about the topics discussed in Jarrad Merlo’s webinar about the introduction of human raters to the PTE tests.

The article describes how a “gaming detection system” has been developed to aid in the evaluation of two of the speaking questions and one of the writing questions on the PTE. This system gives each response a numerical score from 0 to 1, with 0 indicating the complete absence of gaming and 1 indicating significant evidence of gaming.

This numerical approach seems wise, as the lines between acceptable and unacceptable use of “templated responses” in a response are sometimes blurred.  In a future post, I’ll summarize some research from ETS that discusses this topic.

Meanwhile, Pearson’s article notes that:

“While more rudimentary systems may rely on a simple count of words matching known templates, PTE Academic’s gaming detection system has been designed to consider a number of feature measurements that quantify the similarity of the response to known templates, the amount of authentic content present, the density of templated content, and the coherence of the response”
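To make that concrete, here’s a toy sketch of how feature measurements like those might be rolled up into a single 0-1 gaming score. The feature names, the weights and the simple weighted sum are all my assumptions; Pearson doesn’t describe the actual model.

```python
# Toy illustration, not Pearson's system: combine a few feature measurements
# into a single 0-1 "gaming" score (0 = no evidence, 1 = strong evidence).

def gaming_score(template_similarity: float,   # 0-1, similarity to known templates
                 authentic_ratio: float,       # 0-1, share of content judged authentic
                 templated_density: float,     # 0-1, density of templated content
                 coherence: float) -> float:   # 0-1, coherence of the response
    # Hypothetical weights; a real system would learn these from labelled data.
    score = (0.4 * template_similarity
             + 0.3 * templated_density
             + 0.2 * (1 - authentic_ratio)     # less authentic content -> more suspicious
             + 0.1 * (1 - coherence))          # incoherent patchwork -> more suspicious
    return max(0.0, min(1.0, score))

print(round(gaming_score(0.9, 0.2, 0.8, 0.5), 2))  # heavily templated response -> 0.81
```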

The article goes on to describe how the results of the AI checks are passed along to human raters to aid in their decision-making regarding the content of responses.  It notes that the newly implemented system:

“enables raters to make better informed content scoring decisions by leveraging the Phase I gaming detection systems to provide them with information about the about the [sic] extent of gaming behaviours detected in the response.”

That’s fascinating.  I’m not aware of another system where human raters can make use of AI-generated data when making decisions about test taker responses.

The article notes that human raters will not check all written responses for templated content.  Checks of most responses will be done entirely by AI that has been trained on a regularly-updated database of templates discovered via crawls of the web and social media.
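For what it’s worth, the simplest version of that kind of check looks something like the sketch below: flag a response when enough of its word n-grams overlap with any template in the database. This is my own toy example, not Pearson’s detector, which (per the quote above) weighs several other features as well.

```python
# Toy template-overlap check, not Pearson's detector.

def ngrams(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def template_overlap(response: str, template: str, n: int = 5) -> float:
    resp = ngrams(response, n)
    if not resp:
        return 0.0
    return len(resp & ngrams(template, n)) / len(resp)

def flag_response(response: str, template_db: list, threshold: float = 0.3) -> bool:
    # template_db stands in for the regularly updated collection of templates
    # harvested from the web and social media that the article describes.
    return any(template_overlap(response, t) >= threshold for t in template_db)

db = ["the lecture and the reading passage discuss an important topic which is"]
print(flag_response("the lecture and the reading passage discuss an important topic which is climate change", db))
```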

One challenge with this approach that goes unmentioned is the difficulty of detecting templates that don’t show up on the public web.  In my neck of the woods, students pay high fees to “celebrity” test prep experts who create personalized templates that are neither shared publicly nor repeated by future test takers.  This came up in an article by Sugene Kim which I’ll share in the comments.

Perhaps Pearson should go whole hog and bring in human raters for some or all responses in the writing section as well.

More on this in the days ahead.

Moving along, here’s a quick list of changes to Chapter 3 (Listening) and Chapter 4 (Reading) in the new Official Guide to the TOEFL.  Again, I’m focusing on stuff other than the major changes to the test that started back in July.

You can read the whole blog series on changes at the following links: chapter one, chapter two, chapter three and four, chapter five, the tests.

Chapter 3:

Pages 122-123:

“Painters and Painting” is added as a potential lecture topic.

“Computer Science” is removed as a potential lecture topic.

“TV/Radio as mass communication” is now “media broadcasting and digital media as mass communication.”

Chapter 4:

Page 171:  Again, the length of the reading passage in question #2 is listed as 90-115 words.

Page 177:  Same as above, for question #3.

Page 178:  The sample reading for question #3 is now a single paragraph (same content, though).

Page 189:  Again, the reading passages are listed as 90-115 words.

Our friends from My Speaking Score were cool enough to provide a discount code exclusive to readers of this blog.  Register here and at the time of purchase use the code TESTRESOURCES to save 10% on your purchase of SpeechRater credits.

My Speaking Score makes use of the same SpeechRater AI used by ETS to score the real TOEFL.  You can submit your practice responses and get an accurate score prediction, along with specific scores for metrics like fluency, pronunciation, coherence and grammar.  I use it with all my students.

The site provides questions you can answer, or you can use your microphone to record responses to your own practice questions (like, say, from the TPO sets or the Official Guide).

Teachers who work with students in Japan will appreciate this new article in “Language Testing in Asia.” It uses results from the TOEFL (and other tests) to create profiles of language learners in that country. Not surprisingly, the profiles match what our experience tells us.

I like this section at the end:

“When the uneven profiles come from true skill imbalance, learners and teachers may need to decide whether to focus on weaker or stronger skills for further study based on their contexts and needs. While one direction is to improve a weaker skill, a stronger skill could be further improved to compensate for the weaker one.”

Note the last part.

About a third of the students I work with nowadays are from Japan. Most of them come to me for help with the writing section. After looking at their score history I usually offer to help with the speaking section as well. The response is often something like “It is impossible for a student from Japan to score more than 23 in the speaking section, so I’m not going to work on it any more.” Even though their target is 110 overall, they’d rather just max out the other sections than “waste time” on the speaking prep. It’s an interesting approach.  It may be a correct one.

Gary J Ockey and Evgeny Chukharev-Hudilainen published an article in Applied Linguistics that suggests a few interesting things. It highlights Ockey’s earlier research suggesting that the asynchronous tasks used in the TOEFL iBT speaking section “may not sufficiently assess interactional competence.”

More importantly, it compares the use of a human interviewer (a la IELTS) to a Speech Delivery System (like a chatbot) to elicit spoken English from test-takers. It seems to suggest that “the computer partner condition was found to be more dependable than the human partner condition for assessing interactional competence” and that both were equal in areas like pronunciation, grammar and vocabulary.

Aha! This information could be used to create a better TOEFL test or a better IELTS test. Someone should let the test makers know.

No need, though, as I read at the end that “this research was funded by the ETS under a Committee of Examiners and the Test of English as a Foreign Language research grant.”

Implement it right away, I say.

I mention this now because the research will be presented tomorrow at an event hosted by the University of Melbourne.