Wrapping up this series on “templated responses,” I want to share a few paragraphs recently added to the ETS website (via the new TOEFL FAQ):
Some test takers use templates in the speaking or writing sections of the TOEFL iBT test. It can be considered a kind of “blueprint” that helps test takers organize their thoughts and write systematically within the limited time when responding to the speaking section or composing an essay.
However, there are risks to using templates. They can be helpful to establish a general structure for your response, but if they do more than that, you’re probably violating ETS testing policies. The test rules are very strict that you must be using your own words in your responses, and not those from others, or passages that you have previously memorized.
ETS uses detection software to identify writing passages that are similar to outside sources or other test takers’ responses. If you use a template, there is a high probability of providing responses similar to those of other test-takers. So we strongly recommend that you produce your own responses during the test itself.
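ETS doesn’t say how that detection software works. But the basic idea (flagging a response that shares unusually long stretches of wording with known sources or with other test takers’ responses) is easy enough to sketch. Here’s a toy version in Python; the five-word window and the 0.3 threshold are numbers I made up for illustration, not anything ETS has published:

```python
def ngrams(text: str, n: int = 5) -> set[str]:
    """Return the set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a_text: str, b_text: str, n: int = 5) -> float:
    """Jaccard similarity between the n-gram sets of two texts (0 = no shared phrasing, 1 = identical)."""
    a, b = ngrams(a_text, n), ngrams(b_text, n)
    return len(a & b) / len(a | b) if a and b else 0.0


def looks_templated(response: str, known_texts: list[str], threshold: float = 0.3) -> bool:
    """Flag a response whose shared phrasing with any known source or prior response crosses the threshold."""
    return any(overlap(response, known) >= threshold for known in known_texts)
```

Whatever ETS actually runs is surely far more sophisticated, but even something this crude would catch two test takers who filled in the same memorized frame with slightly different details.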
The new FAQ language is significant, as it is perhaps the first time ETS has directly referenced “templates” in a communication to test takers. The TOEFL Bulletin has long contained references to “memorized content,” but that’s not quite the same thing.
The wording may be tricky for some to parse. Templates are “helpful to establish a general structure,” but test takers “must be using [their] own words” in their responses. When does a template cross the line from helpful structure to a violation of ETS testing policies? That’s not immediately clear.
However, John Healy reminded me to check out “Challenges and Innovations in Speaking Assessment,” recently published by ETS via Routledge. In an insightful chapter, Xiaoming Xi, Pam Mollaun and Larry Davis categorize potential uses of “formulaic responses” in speaking answers and describe how raters should treat each category. Those categories are:
1. Practiced lexical and grammatical chunks
2. Practiced generic discourse markers
3. Practiced task type-specific organizational frames
4. Rehearsed generic response for a task type
5. Heavily rehearsed content
6. Rehearsed response
I think these categories are self-explanatory, but here are a few quick notes about what they mean:
1. Formulaic expressions (chunks of sentences) stored in the test taker’s memory.
2. Stuff like “in conclusion” and “another point worth mentioning…”
3. Stuff like “The university will ___ because ____ and ____. The man disagrees because ___ and ___.”
4. Like category 3, but without blanks to be filled in.
5. Like category 1, but the content “[differs] from formulaic expressions present in natural language use.” This content is “produced with little adaptation to suit the real task demands.”
6. A response that is “identical or nearly identical to a known-source text.”
Dive into the chapter for more detailed descriptions.
The chapter recommends that categories 2 and 3 be scored on merit, that category 1 be scored on merit if the chunks do not match known templates, that the spontaneous content (if any) in categories 4 and 5 be scored on merit, and that category 6 be scored as a zero.
Very sensible guidelines. But putting them into use? The chapter notes:
“Unfortunately, the guidelines did not remove the challenge of detecting what particular language has been prepared in advance. Certain types of delivery features may be suggestive of memorized content, such as pausing or other fluency features, or use of content that appears ill-fitting or otherwise inappropriate. Nonetheless, it can be quite challenging to detect formulaic responses, especially if the response references a previously unknown source text.”
According to the chapter, ETS experimented with supplying raters with examples of memorized content to refer to while making decisions, but that didn’t work out, for a couple of reasons: raters became overly sensitive to memorized content, and the rating process became too slow.
The authors wrap up the chapter by making a few suggestions for future study, including redesigned item types and AI tools.
To me, AI tools are a must in 2024, both for correctly identifying overly gamed responses and for avoiding false positives. A quick glance at Glassdoor reviews suggests that response scorers (across a variety of tests) are often low-paid workers who sometimes feel pressure to work quickly. Tools that help them work more quickly and accurately seem like a good idea.
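To make that concrete, here’s the kind of lightweight rater-assist tool I have in mind. It’s a rough sketch, not anything ETS has described: it brackets the words in a response that sit inside a four-word stretch also found in a known template, so a rater can see at a glance how much spontaneous content is left to score on its merits.

```python
def highlight_template_overlap(response: str, template: str, n: int = 4) -> str:
    """Put [brackets] around any word that sits inside a word n-gram shared with the template,
    so a human rater can focus on scoring the unbracketed (spontaneous) content."""
    resp_words = response.split()
    tmpl_words = template.lower().split()
    tmpl_grams = {" ".join(tmpl_words[i:i + n]) for i in range(len(tmpl_words) - n + 1)}

    flagged = [False] * len(resp_words)
    for i in range(len(resp_words) - n + 1):
        gram = " ".join(w.lower() for w in resp_words[i:i + n])
        if gram in tmpl_grams:
            flagged[i:i + n] = [True] * n

    return " ".join(f"[{w}]" if hit else w for w, hit in zip(resp_words, flagged))
```

Real tooling would need punctuation handling and fuzzier matching, but even this would speed up the “score the spontaneous content on merit” step the chapter recommends.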