Pearson’s Sarah Hughes gave a great webinar yesterday called “The Role of Automated, AI, and Human Scoring” (link). She discussed the ways in which Pearson spots what they call “Topic Templates” in the writing section of the PTE and what is done after they are detected. It was the first time I had heard the term “topic templates” (as opposed to just “templates”), which is meant to distinguish between the memorization of typical discourse phrases and the memorization of a whole lot of generic junk into which a few topic-related phrases are plugged.
Pearson’s primary line of defense against topic templates seems to be a database of such templates created by a bot that crawls the web now and then. When signs of a template are detected in a student response, that response is sent to a human rater (or raters?) to determine if the answer is acceptable or should be given a score of zero. Note that human raters are not normally used to score the PTE writing section, which is otherwise handled entirely by AI.
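Pearson hasn’t described the matching step in any detail, so treat this as a rough illustration rather than their actual method: a crude database-driven detector might compute word n-gram overlap between each response and every crawled template, and route anything above a threshold to a human. All of the function names, the n-gram length, and the threshold below are my own inventions.

```python
# Minimal sketch of database-driven template detection.
# NOT Pearson's actual method -- the names, n-gram length, and
# threshold are all assumptions made for illustration.

def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def template_overlap(response: str, template: str, n: int = 5) -> float:
    """Fraction of the template's n-grams that also appear in the response."""
    template_grams = ngrams(template, n)
    if not template_grams:
        return 0.0
    return len(template_grams & ngrams(response, n)) / len(template_grams)

def flag_for_human_review(response: str, template_db: list[str],
                          threshold: float = 0.3) -> bool:
    """Route the response to a human rater if it resembles any known template."""
    return any(template_overlap(response, tpl) >= threshold
               for tpl in template_db)
```

Whatever the real matching logic looks like, the point stands: the database is the heart of the approach, and the web crawler is what keeps it stocked.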
I know I sound like a broken record every time this comes up, but I’d love to hear more about the detection of templates that aren’t widely circulated online. I’ve linked to it a million times, but Sugene Kim’s article about preparing for the TOEFL test “Gangnam Style” is required reading for anyone interested in this topic.
The article describes how, here in Korea, students prepping for a test hire a hotshot tutor – one mentioned in Kim’s article is nicknamed “The Writing Sniper” – to craft a handful of bespoke templates just for them. The students don’t just get the templates, of course; they also receive weeks of lessons on how to use them most effectively. Needless to say, the templates don’t show up in any databases held by the testing firms.
When Kim’s article was published I had some fun creating my own topic templates for this task. They fooled a lot of people with experience in the industry. You can read about my experiments in a four-part series of blog posts starting over here.
The Gangnam approach somewhat suited the old TOEFL independent writing task, which had a lot of pretty formulaic questions like:
“Do you agree or disagree with the following statement? It is better to use printed materials such as books and articles to do research than it is to use the internet. Use specific reasons and examples to support your answer.”
It’s a bit less suitable for the IELTS, where the questions are sometimes (but not always) more intricate. Like:
“In spite of the advances made in agriculture, many people around the world still go hungry. Why is this the case? What can be done about this problem?”
The PTE also has somewhat intricate questions, like:
“Tobacco, mainly in the form of cigarettes, is one of the most widely-used drugs in the world. Over a billion adults legally smoke tobacco every day. The long term health costs are high – for smokers themselves, and for the wider community in terms of health care costs and lost productivity. Do governments have a legitimate role to legislate to protect citizens from the harmful effects of their own decisions to smoke, or are such decisions up to the individual?”
Obviously the effectiveness of a topic template is blunted by the various aspects that the PTE and IELTS prompts ask the test taker to touch on. The PTE adds a firm word-count limit to the mix, which could further complicate things.
But the tests are still vulnerable. The hotshots are good at what they do. As they should be, considering the sky-high fees they sometimes command.
With that said, one is left wondering how the test makers detect templates that are not widely shared online and are not reused by multiple students across numerous test administrations. Without a relevant database and without human raters who might notice their stilted nature, Pearson likely requires an AI solution specifically designed to spot them. Is that part of the mix?
The continued use of human raters to score the IELTS might be an advantage, but I haven’t heard much from the IELTS partnership on this topic, even when it comes to widely circulated templates. Do they supplement the expertise of their humans with a database of templates each essay is compared to? Or are they entirely dependent on humans? Humans are good… but are they good enough to beat The Writing Sniper? That’s unclear.
I think it is also worth mentioning that such templates are best used by students with intermediate language skills who want to pass themselves off as advanced. They have the ability to “fill in the blanks” of the templates not just with a topic keyword but with decent clauses or sentences of their own. This makes detection trickier than you might expect.
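To see why, reuse the template_overlap sketch from earlier. Compare a crawled, filled-in specimen of a template against a new response built on the same skeleton but with whole clauses swapped into the slots (both “essays” below are invented for illustration):

```python
# Both passages are invented; they share only the template's connective tissue.
crawled = ("Some people argue that smoking should be banned. However, I firmly "
           "believe that individuals must decide for themselves because "
           "personal freedom matters.")
response = ("Some people argue that printed materials are superior. However, I "
            "firmly believe that the internet is better because it is easier "
            "to search.")

print(template_overlap(response, crawled))  # ~0.06 -- well under the threshold
```

Only one of the eighteen 5-grams in the crawled specimen survives in the new response, so the overlap lands around 0.06, far below any plausible threshold. The memorized scaffolding is real, but a student who fills the slots with full clauses chops it into pieces too short for naive matching to catch.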
Questions of equity linger in the back of my mind, as well. Do current detection methods focus on low-hanging fruit? Are they good at detecting low- or no-budget test preppers who use stuff they find on social media, but poor at stopping the techniques favored by preppers with a few thousand bucks to spend before taking a test?
Do share your own thoughts, if you have a moment. More on this in the future.
