I took the PTE Academic yesterday at the Herald Test Center near Apgujeong Station! A few notes while everything is fresh in my mind:

  1. Check-in was quick and efficient. The center’s equipment is modern and clean, and there are dividers between the computers. The staff know what they are doing. A bathroom is across the hall. That said, it doesn’t really compare to the ultra-lux “Pearson Professional Center” near Seoul City Hall. If you have a choice, I recommend that center instead.

  2. I appreciate how Pearson sends detailed instructions to help test takers find the test center. I received general directions, directions from both city airports, and directions from the main train station.

  3. I also remain a fan of the laminated “helpful suggestions” sheets passed out to test takers as they wait to enter the testing space. The sheets point out that it is never necessary to yell.

  4. The test center was full, just like when I took the PTE Core last month. This speaks to the growing popularity of the test in this market. It also suggests that the city could support a third test center.

  5. I took my time reading the pre-test instructions and let the clock run out while the speaking instructions were displayed. This made it possible to answer the most challenging speaking questions without hearing chatter from other test takers.  I’ll tell my students to do the same.

  6. Is the speaking section adaptive? As when I took the PTE Core, I didn’t get any “describe this picture” questions. I was only asked to describe graphs (sometimes tricky ones). I think I overheard one of the other test takers describing an actual picture.

  7. Again: I urge test takers to stay within the recommended word counts when writing responses.  There are penalties (sometimes serious ones) for exceeding them.  This sets the PTE apart from the TOEFL and IELTS.

  8. I know that the use of real-life audio snippets (from TV, radio, lecture halls, etc.) is a selling point of the test, but I’m still not a fan of how the audio files are sometimes potato quality. Also: a couple of the recordings I heard, presumably taken from a TV broadcast, were accompanied by background music.

  9. I was happy to encounter a couple of very challenging reading and listening questions. Some of my answers were guesses.

  10. One of the reading questions was an “ultra-Britishism.” I’m fairly certain it depended on knowing one weird difference between British and North American English. I swore under my breath and chose the answer 99% of Americans would pick. I probably got that one wrong.

  11. Tutors: teach your students time management techniques.  Test takers will encounter both timers for individual tasks and timers that cover multiple tasks. I went in mostly blind and really had no idea how much time I could spend on each task.

  12. Interestingly, a message at the end of the test said my results would arrive in 5 days. Pearson generally provides results in 2 days.

I’ll share my scores in a future post.

 

IDP Education has joined the discussion on “templated responses.” Australasia/Japan head Michael James noted in an article shared to LinkedIn that:

“AI’s role in high-stakes language testing has gained attention recently, particularly after a computer-marked test revised its scoring process to include human evaluators. This change has ignited a debate on this platform about a computer’s ability to identify templated responses.”

James points out that:

“The importance of human marking in high-stakes English language assessment cannot be overstated. IELTS examiners are highly trained language experts who bring a nuanced understanding and contextual awareness that AI systems lack. They can discern not only the grammatical correctness and structural integrity of a response, but also the underlying intent, creativity, and coherence of the content. This real-time, human-centred approach aims to reveal a student’s true abilities and potential.”

His article refers to the “cautiously curious approach” that the IELTS partnership has used in the past when describing its stance on AI.

There is more worth quoting here, but it is probably best to check it out yourself at the link above.

Moving forward, I would love to hear more about the humans who do this sort of work. Not just the humans who rate IELTS responses, but those who rate responses in all sorts of tests. Who are they? What makes them “highly trained experts”? How do they discern X, Y, Z? Are they under pressure to work quickly? These are questions asked not only by score users, but (more importantly and more frequently) by test takers themselves.

Wrapping up this series on “templated responses,” I want to share a few paragraphs recently added to the ETS website (via the new FAQ for TOEFL):

Some test takers use templates in the speaking or writing sections of the TOEFL iBT test. It can be considered a kind of “blueprint” that helps test takers organize their thoughts and write systematically within the limited time when responding to the speaking section or composing an essay.

However, there are risks to using templates. They can be helpful to establish a general structure for your response, but if they do more than that, you’re probably violating ETS testing policies. The test rules are very strict that you must be using your own words in your responses, and not those from others, or passages that you have previously memorized.

ETS uses detection software to identify writing passages that are similar to outside sources or other test takers’ responses. If you use a template, there is a high probability of providing responses similar to those of other test-takers. So we strongly recommend that you produce your own responses during the test itself.

This is significant, as it is perhaps the first time ETS has directly referenced “templates” in a communication to test takers.  The TOEFL Bulletin has long contained references to “memorized content,” but that’s not quite the same thing.

The verbiage may be tricky for some to fully grasp.  Templates are “helpful to establish a general structure” but test takers “must be using [their] own words” in their responses.  When does a template cross the line from being helpful to being a violation of ETS testing policies?  That’s not immediately clear.

However, John Healy reminded me to check out “Challenges and Innovations in Speaking Assessment,” recently published by ETS via Routledge.  In an insightful chapter, Xiaoming Xi, Pam Mollaun and Larry Davis categorize the ways “formulaic responses” can show up in speaking answers and describe how raters should treat each category.  Those categories are:

  1. Practiced lexical and grammatical chunks
  2. Practiced generic discourse markers
  3. Practiced task type-specific organizational frames
  4. Rehearsed generic response for a task type
  5. Heavily rehearsed content
  6. Rehearsed response

I think these categories are self-explanatory, but here are a few quick notes about what they mean:

  1. Formulaic expressions (chunks of sentences) stored in the test taker’s memory.
  2. Stuff like “in conclusion” and “another point worth mentioning…”
  3. Stuff like “The university will ___ because ____ and ____.  The man disagrees because ___ and ___.”
  4. Like category three, but without blanks to be filled in.
  5. Like number 1, but the content “[differs] from formulaic expressions present in natural language use.”  This content is “produced with little adaptation to suit the real task demands.”
  6. A response that is “identical or nearly identical to a known-source text.”

Dive into the chapter for more detailed descriptions.

It is recommended that categories 2 and 3 be scored on merit, that category 1 be scored on merit if the chunks do not match known templates, that spontaneous content in categories 4 and 5 be scored on merit (if any exists), and that category 6 be scored as a zero.
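To make the logic of those recommendations concrete, here is how I might sketch them as a simple decision rule. This is just my own toy Python (the function, its inputs, and the return strings are invented for illustration), not anything taken from the chapter:

```python
# My own sketch of the chapter's scoring recommendations, expressed as a
# decision rule. Category numbers refer to the six categories listed above.

def recommended_treatment(category: int,
                          matches_known_template: bool = False,
                          has_spontaneous_content: bool = False) -> str:
    """Return the recommended treatment for a response containing formulaic content."""
    if category in (2, 3):
        # generic discourse markers and task-type organizational frames
        return "score on merit"
    if category == 1:
        # practiced chunks earn merit scoring only if they don't match known templates
        return "score on merit" if not matches_known_template else "no merit scoring for matched chunks"
    if category in (4, 5):
        # rehearsed generic responses / heavily rehearsed content:
        # only spontaneous content (if any) earns credit
        return "score spontaneous content on merit" if has_spontaneous_content else "no spontaneous content to credit"
    if category == 6:
        # identical or nearly identical to a known source text
        return "score of zero"
    raise ValueError(f"unknown category: {category}")
```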

Very sensible guidelines.  But putting them into use?  The chapter notes:

“Unfortunately, the guidelines did not remove the challenge of detecting what particular language has been prepared in advance. Certain types of delivery features may be suggestive of memorized content, such as pausing or other fluency features, or use of content that appears ill-fitting or otherwise inappropriate. Nonetheless, it can be quite challenging to detect formulaic responses, especially if the response references a previously unknown source text.”

According to the chapter, ETS experimented with supplying raters with examples of memorized content to refer to while making decisions, but that didn’t work out: raters became overly sensitive to memorized content, and the rating process became too slow.

The authors wrap up the chapter by making a few suggestions for future study, including redesigned item types and AI tools.

To me, AI tools are a must in 2024, both for correctly identifying heavily gamed responses and for avoiding false positives.  A quick glance at Glassdoor reviews suggests that response scorers (across a variety of tests) are often low-paid workers who sometimes feel pressure to work quickly.  Tools that help them work more quickly and accurately seem like a good idea.

It is worth sharing a few notes from a long article published by Pearson in the summer.  It provides more information about the topics discussed in Jarrad Merlo’s webinar about the introduction of human raters to the PTE tests.

The article describes how a “gaming detection system” has been developed to aid in the evaluation of two of the speaking questions and one of the writing questions on the PTE. This system gives each response a numerical score from 0 to 1, with 0 indicating the complete absence of gaming and 1 indicating significant evidence of gaming.

This numerical approach seems wise, as the line between acceptable and unacceptable use of templated content in a response is sometimes blurry.  In a future post, I’ll summarize some research from ETS that discusses this topic.

Meanwhile, Pearson’s article notes that:

“While more rudimentary systems may rely on a simple count of words matching known templates, PTE Academic’s gaming detection system has been designed to consider a number of feature measurements that quantify the similarity of the response to known templates, the amount of authentic content present, the density of templated content, and the coherence of the response”
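Pearson doesn’t publish the actual model, but to picture what “a number of feature measurements” folded into a single 0–1 score might look like, here is a toy sketch. The feature names and weights are entirely my own invention; the only point is that several signals get squashed into one gaming score rather than any single feature deciding the outcome:

```python
import math

# Toy illustration only: these feature names and weights are invented, not Pearson's.
def gaming_score(template_similarity: float,   # 0-1, similarity to known templates
                 authentic_ratio: float,       # 0-1, share of apparently original content
                 templated_density: float,     # 0-1, proportion of templated content
                 coherence: float) -> float:   # 0-1, coherence of the response
    """Combine several feature measurements into a single 0-1 gaming score."""
    # Similarity and density push the score up; authenticity and coherence pull it down.
    z = (4.0 * template_similarity
         + 3.0 * templated_density
         - 3.0 * authentic_ratio
         - 2.0 * coherence)
    return 1.0 / (1.0 + math.exp(-z))  # squash to the 0-1 range

# A response that closely matches a known template with little original content:
#   gaming_score(0.9, 0.1, 0.8, 0.4)   -> about 0.99
# A mostly original, coherent response:
#   gaming_score(0.05, 0.9, 0.05, 0.9) -> about 0.02
```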

The article goes on to describe how the results of the AI checks are passed along to human raters to aid their decision-making about the content of responses.  It notes that the newly implemented system:

“enables raters to make better informed content scoring decisions by leveraging the Phase I gaming detection systems to provide them with information about the about the [sic] extent of gaming behaviours detected in the response.”

That’s fascinating.  I’m not aware of another system where human raters can make use of AI-generated data when making decisions about test taker responses.

The article notes that human raters will not check all written responses for templated content.  Checks of most responses will be done entirely by AI that has been trained on a regularly updated database of templates discovered via crawls of the web and social media.
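The article doesn’t say how the matching itself works, but a very simple version of the idea (my own sketch, not Pearson’s method) could be built on word n-gram overlap against that database of known templates:

```python
# A simple template check (my own sketch): word n-gram overlap against a
# database of known templates.

def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def max_template_overlap(response: str, known_templates: list[str], n: int = 5) -> float:
    """Return the highest share of the response's n-grams found in any known template."""
    response_grams = ngrams(response, n)
    if not response_grams:
        return 0.0
    return max(
        (len(response_grams & ngrams(template, n)) / len(response_grams)
         for template in known_templates),
        default=0.0,
    )

# A value near 1 means most of the response's phrasing appears in a known
# template; a value near 0 suggests mostly original wording.
```

A real system would presumably also need to catch lightly paraphrased templates, which is harder than literal n-gram matching.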

One challenge the article doesn’t mention is the difficulty of detecting templates that never show up on the public web.  In my neck of the woods, students pay high fees to “celebrity” test prep experts who create personalized templates that are neither shared publicly nor reused by future test takers.  This came up in an article by Sugene Kim, which I’ll share in the comments.

Perhaps Pearson should go whole hog and bring in human raters for some or all responses in the writing section as well.

More on this in the days ahead.

I have updated my TOEFL writing templates for 2021. In the attached video, you’ll find templates for both the independent and integrated essays.  I’ve adjusted them only slightly for this year… but I think they are a bit better than the 2020 versions.  I’ll probably make a video containing all of the 2021 speaking templates as well, so keep an eye on the channel.

Over the next few days I will adjust all of the static webpage articles so that they include the new templates.

Ha ha.  I am a TOEFL essay machine now.  This took about three minutes to create using my fake essay template, and I think it looks pretty decent.

The prompt is:

Do you agree or disagree with the following statement? It is better for children to grow up in the countryside than in a large city. Use specific reasons and examples to develop your essay.

The essay is:

A lot of people today think that we should live in the city.  However, I strongly believe that it is much better for kids to live in the country for two reasons.  First, it leads to a lot of great job opportunities.  Second, it vastly improves our health and wellbeing, which a lot of people are struggling with nowadays.  To be fair, a lot of older people have the traditional view that cities are the best place for young people to live.  That said, I think this viewpoint is outdated and quite useless in today’s society.

First, life in the countryside can improve our range of job opportunities in the future.  As I implied above, people my parent’s age (and older) think that living in the countryside is actually quite dangerous.  When I was young and they had a lot of influence over my world view, I actually had the same opinion.  At that time, I thought the lack of businesses in the country would actually make it harder for me to get a job, and so I was hostile toward it.  However, after I entered college and my social network broadened, I realized the unique benefits of rural life.  Now I realize that the presence of agriculture can help us find employment in high paying fields.  For example, my young cousin makes a lot of money because he works in a field related to growing organic crops.  His experience changed my perspective, and now I am focusing on farming at university in the hope of achieving the same thing.

Second, life in the countryside has a noticeable effect on our physical health and maybe even our mental health.  I actually read a story about this in the Village Voice Newspaper a few months ago.  It pointed out that if we properly use hiking trails we can avoid the poor health that a lot of people are dealing with nowadays.  The article claimed that 75% of Americans think that the best way of staying fit is making use of rural sports.  Medical experts who reviewed the study results agreed, and suggested that rural lifestyles will have an even greater impact in the future because of the clean air in the countryside.  Consequently, I strongly feel that benefiting from life away from crowded cities is a fantastic way to stay healthy.

In conclusion, I think that it is best for young people to live in the countryside.  This is because it can lead to gainful employment, and because it has a positive impact on our minds and bodies.

(you can also read parts one, two, and four of this series!)

Okay, I’m having fun with the Gangnam style TOEFL template I generated yesterday.  This time I tackled the second prompt in my collection.  Obviously it has a lot of overlap, since both deal with the Internet.  Next time I think I will delete the final sentence from the introduction. It lays the template on a bit too thick.  I’ll replace it with nothing, and just jump to the body after the thesis statement.

The prompt is:

Do you agree or disagree with the following statement? It is better to use printed materials such as books and articles to do research than it is to use the internet. Use specific reasons and examples to support your answer.

The “fake essay” is:

A lot of people today think that using online materials for research is a bad idea.  However, I strongly believe that using the Internet for research is wise for two reasons.  First, it leads to a lot of great job opportunities.  Second, it vastly improves our health and wellbeing, which a lot of people are struggling with nowadays.  To be fair, a lot of older people have the traditional view that websites are unreliable.  That said, I think this viewpoint is outdated and quite useless in today’s society.

First, using the Internet for researching topics can improve our range of job opportunities in the future.  As I implied above, people my parent’s age (and older) think that the web is actually quite dangerous.  When I was young and they had a lot of influence over my world view, I actually had the same opinion.  At that time, I thought relying on unreliable online sources would actually make it harder for me to get a job, and so I was hostile toward it.  However, after I entered college and my social network broadened, I realized the unique benefits of cutting edge research that is published online.  Now I realize that learning about the latest academic developments online can help us find employment in high paying fields.  For example, my young cousin makes a lot of money because he works in a field related to crypto-currency.  His experience changed my perspective, and now I am focusing on emerging web-based technologies at university in the hope of achieving the same thing.

Second, medical websites have a noticeable effect on our physical health and maybe even our mental health.  I actually read a story about this in the Village Voice Newspaper a few months ago.  It pointed out that if we properly use websites that report on health trends we can avoid the poor health that a lot of people are dealing with nowadays.  The article claimed that 75% of Americans think that the best way of staying fit is making use of the Internet.  Medical experts who reviewed the study results agreed, and suggested that websites will have an even greater impact in the future because of the number of doctors who are online.  Consequently, I strongly feel that benefiting from online research is a fantastic way to stay healthy.

In conclusion, I think that researching online is beneficial.  This is because it can lead to gainful employment, and because it has a positive impact on our minds and bodies.

(you can also read parts one, three, and four of this series!)

 

This week I was lucky enough to once again attend a workshop hosted by ETS for TOEFL teachers.  Here is a quick summary of some of the questions asked by attendees.  Note that the answers are not direct quotes unless indicated.

 

Q:  Are scores adjusted statistically for difficulty each time the test is given?

A: Yes.  This means that there is no direct conversion from raw to scaled scores in the reading and listening sections.  The conversion depends on the performance of all students that week.

 

Q: Do all the individual reading and listening questions have equal weight?

A: Yes.

 

Q:  When will new editions of the Official Guide and Official iBT Test books be published?

A:  There is no timeline.

 

Q:  Are accents from outside of North America now used when the question directions are given on the test?

A: Yes.

 

Q:  How are the scores from the human raters and the SpeechRater combined?

A:  “Human scores and machine scores are optimally weighted to produce raw scores.”  This means ETS isn’t really going to answer this question.
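For what it’s worth, “optimally weighted” presumably just means some kind of weighted combination of the two ratings. A purely illustrative sketch (ETS does not disclose the actual weights or method, so the numbers below are placeholders):

```python
# Purely illustrative: ETS does not disclose how human and SpeechRater
# scores are combined, so the 0.6/0.4 split is an invented placeholder.

def combined_raw_score(human_score: float, speechrater_score: float,
                       human_weight: float = 0.6) -> float:
    """Weighted average of a human rating and a SpeechRater rating (0-4 scale)."""
    return human_weight * human_score + (1.0 - human_weight) * speechrater_score

# e.g. combined_raw_score(3.0, 3.5) -> 3.2
```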

 

Q: Can the human rater override the SpeechRater if they disagree with its score?

A: Yes.

 

Q:  How many different human raters will judge a single student’s speaking section?

A:  Each question will be judged by a different human.

 

Q:  Will students get a penalty for using the same templates as many other students?

A:   Templates “are not a problem at all.”

 

Q: Why were the question-specific levels removed from the score reports?

A: That information was deemed unnecessary.

 

Q:  Is there a “maximum” word count in the writing section?

A:  No.

 

Q:  Is it always okay to pick more than one choice in multiple choice writing prompts?

A:  Yes.