IDP IELTS is open for business in China!  Test takers can now register for and take the IELTS through IDP Education, which means the British Council no longer has a monopoly on IELTS testing there.  Remember how I mentioned that it is hard to book seats to take the IELTS (and TOEFL) in China?  Apparently all of IDP’s test spots between now and the end of the year sold out in a matter of days.

A few notes, culled from the Chinese blogosphere:

  1. The fee is 2170 RMB, which is about $298 USD.  That’s the exact same price as taking the test through the British Council.
  2. Testing is currently available in 14 cities across 10 provinces.
  3. There is no interaction with the NEEA when it comes to registering for the test.
  4. IDP apparently offers students one free reschedule.  People like that a lot.
  5. Testing is done all in one day, for the most part.  People like that a lot as well.

The overall reaction seems extremely positive.  Tutoring center staff I’ve talked to report that they’ve already sent dozens of students to register for an IDP-administered IELTS.  Good for IDP. I think this is a big win for that firm. Just what they need after a challenging year (for the whole industry).

I haven’t been able to track down the name of the test center operator that IDP has partnered with.

Language Testing has just published a review of the CELPIP Test by Coral Yiwei Qin and Beverly Baker.

The CELPIP is mostly used for Canadian immigration and is owned by Prometric, an American company.  The review indicates that it can be taken at 97 test centers in Canada and about 43 in the rest of the world.

The review provides a detailed overview of the test’s content before appraising it.  A few points stood out when I read the review:

  1. According to the authors, the test lacks assessment of interactional competence. They note:  “the purpose of CELPIP-General is to evaluate test takers’ English abilities in everyday situations, which inevitably encompasses interactional competence as a part of the target language use domain. Nevertheless, interactional competence is only captured in a limited way in CELPIP-General.”
  2. The authors feel that the test lacks content that can assess test takers at advanced levels.  They note that while the test assigns scores which correspond to CLB levels 10 and above, they “did not find sufficient evidence suggesting substantial content coverage at these advanced levels of performance.”  The authors note that this could be a political consideration, as IRCC desires that tests match CLB levels all the way to level 12.
  3. The authors note that much of the content “was distinctly Canadian in nature.”  Canadian situations are referenced, there is evidence of a Canadian accent, and the test includes lines like “That was good, eh?”

There is more in the review, both positive and negative.  Do check it out; it is open access.

Continuing along with the Norton Library Podcast, this month I read Bram Stoker’s “Dracula.”  You can check out the podcast episodes starting here.

One unique feature of this classic horror novel is that it is an epistolary novel.  That refers to how the book is presented as a collection of letters, diary entries, phonograph transcripts, newspaper articles and telegrams written by characters in the novel.  While the topic of vampires has likely never appeared on the TOEFL, I am quite certain that at some point there has been an article or lecture about this kind of writing.  So instead of reaching for Dracula, perhaps take a moment to sharpen your reading skills by exploring the Wikipedia article on this topic.

Next, I read the 16 November 2023 issue of the London Review of Books.  Yeah… I have another pile of unread magazines.  Fortunately, this is one of my favorite publications and I look forward to working through all of the old issues on my shelf.  A few stories stood out in this issue:

  • “Kettle of Vultures” is a quick look at the history of interest: the charging of it, the collecting of it, the religious implications of it, and more.
  • “Red Flag, Green Light” is the story of famous fraudster John Ackah Blay-Miezah.  He may have invented the notorious “Nigerian Prince” scam.
  • “I Thought You Were Incredible” is a fun read for film fans.  It’s a quick overview of the life of Elizabeth Taylor, with special consideration paid to her relationship with Montgomery Clift.

More of this sort of thing in 30 days.

Below is a copy of my PTE Academic score report.  For my full report on taking the test, click here.

I really like that the score report doesn’t contain any personal information (mailing address, passport number, etc.).  That makes it easy to share online.  My TOEFL report contains my home address and part of my passport number, so I have to manually censor it before sharing it.

My scores arrived about 45 hours after I completed the test.  Interestingly, I got the standard “You’ve just completed your PTE Test” email about five minutes before the scores arrived.

Note that PTE scores can be sent to an unlimited number of institutions at any time at no cost. That makes the PTE a more attractive option than the TOEFL test, which charges a hefty fee to send scores to institutions after the test has been taken (four score recipients can be selected for free before the test has been taken).

I summarized my experience taking the test a few days ago, but a few more things come to mind:

  1. I didn’t get any unscored “summarize the conversation” or “respond to a situation” speaking questions.  I guess those have been cycled out.  The latter is used in the PTE Core test, if I remember correctly.
  2. I didn’t use any “templated responses” when responding to test questions.
  3. I didn’t get any videos in the listening section.

Let me know if you have any questions about the test.

I hope to take a different test in the next week or two.  More on that later.

I took the PTE Academic yesterday at the Herald Test Center near Apgujeong Station! A few notes while everything is fresh in my mind:

  1. Check-in was efficient and quick. The center’s equipment is modern and clean, and there are dividers between each computer. The staff know what they are doing. A bathroom is across the hall. That said, it doesn’t really compare to the ultra-lux “Pearson Professional Center” near Seoul City Hall. If you have a choice, I do recommend that testing center.
  2. I appreciate how Pearson sends detailed instructions to help test takers find the test center. I received general directions, directions from both city airports, and directions from the main train station.
  3. I also remain a fan of the laminated “helpful suggestions” sheets passed out to test takers as they wait to enter the testing space. The sheets point out that it is never necessary to yell.
  4. The test center was full, just like when I took the PTE Core last month. This speaks to the growing popularity of the test in this market. It also suggests that the city could support a third test center.
  5. I took my time reading the pre-test instructions and let the clock run out while the speaking instructions were displayed. This made it possible to answer the most challenging speaking questions without hearing chatter from other test takers.  I’ll tell my students to do the same.
  6. Is the speaking section adaptive? As when I took the Core test, I didn’t get any “describe this picture” questions. I was only asked to describe graphs (sometimes tricky ones). I think I overheard one of the other testers describing an actual picture.
  7. Again: I warn test takers to keep within the recommended word counts when writing responses.  There are penalties (sometimes serious ones) for exceeding them.  This makes the PTE unlike the TOEFL and IELTS tests.
  8. I know that the use of real-life audio snippets (from TV, radio, lecture halls, etc.) is a selling point of the test, but I’m still not a fan of how the audio files are sometimes potato quality. Also: a couple of the recordings I heard, presumably taken from a TV broadcast, were accompanied by background music.
  9. I was happy to encounter a couple of very challenging reading and listening questions. Some of my answers were guesses.
  10. One of the reading questions was an “ultra-Britishism.” I’m fairly certain it depended on knowing one weird difference between British and North American English. I swore under my breath and chose the answer 99% of Americans would pick. I probably got that one wrong.
  11. Tutors: teach your students time management techniques.  Test takers will encounter both timers for individual tasks and timers covering multiple tasks. I went in mostly blind and really had no idea how much time I could spend on each task.
  12. Interestingly, a message at the end of the test said my results would arrive in 5 days. Pearson generally provides results in 2 days.

I’ll share my scores in a future post.

Shares of IDP Education finished the day at $12.29 apiece.  That’s the lowest they’ve been since March of 2020.  They’re down 43% in the past 12 months, and down 68% from their pandemic high of $38.88.

IDP’s ticker price has always been my personal canary in the coal mine when it comes to legacy English tests.  A few things are worth noting:

  1. Given the most recent regulatory changes in Canada, I feel pretty certain that Duolingo will come to dominate testing for that country in the very near future. I’ve written about this topic at length in recent weeks. To me, this seems like a “break glass in case of emergency” moment for the legacy test makers (and anyone else who depends on Canada-bound students to keep up their volumes).  And yet it appears that the glass remains unbroken.  Tellingly, I still get messages along the lines of “so what do you think of this Duolingo offering?” from people who ought to be better informed. 
  2. The defeat of the caps in Australia has led to increased uncertainty.  Test watchers outside of that country might not understand that the opposition, responsible for the defeat, has argued that the government is not going far enough on this file.  Check out ICEF Monitor’s current lead story for more on this.

Tough times for some testers.  Good times for others.  I’m happy to have something to write about every day.

IDP Education has joined the discussion on “templated responses.” Australasia/Japan head Michael James noted in an article shared to LinkedIn that:

“AI’s role in high-stakes language testing has gained attention recently, particularly after a computer-marked test revised its scoring process to include human evaluators. This change has ignited a debate on this platform about a computer’s ability to identify templated responses.”

James points out that:

“The importance of human marking in high-stakes English language assessment cannot be overstated. IELTS examiners are highly trained language experts who bring a nuanced understanding and contextual awareness that AI systems lack. They can discern not only the grammatical correctness and structural integrity of a response, but also the underlying intent, creativity, and coherence of the content. This real-time, human-centred approach aims to reveal a student’s true abilities and potential.”

His article refers to the “cautiously curious approach” that the IELTS partnership has used in the past to describe its stance on AI.

There is more worth quoting here, but it is probably best to check it out yourself at the link above.

Moving forward, I would love to hear more about the humans who do this sort of work. Not just the humans who rate IELTS responses, but those who rate responses in all sorts of tests. Who are they? What makes them “highly trained experts”? How do they discern X, Y, Z? Are they under pressure to work quickly? These are questions asked not only by score users, but (more importantly, and more frequently) by test takers themselves.

Scores Not Available

Students often ask what it means when their TOEFL account says something like:

“Scores not available”

Or:

“Tested – Scores Not Available”

This is totally normal.  Everyone gets something like this while waiting for scores.  If your status says “scores not available,” just keep waiting.  According to ETS, it takes 4-8 calendar days for the scores to arrive.

If your scores take longer than that, you could contact customer support.

Scores “Pending”

Sometimes your TOEFL account says that scores are “pending.”  This is also totally normal.  It happens about one or two days after the test is complete.  You should keep waiting for the scores to arrive.  It is not a sign of trouble.

Status: “Scheduled” and “Checked In”

Sometimes students are confused, because after the test the status says:

“Scheduled”

or

“Checked In”

Those are also normal.  Don’t worry about them.  Sometimes you will be “checked in” for months after the test.  Just wait 4-8 days for your score.  If it takes longer than that, you can call ETS.

Scores “On Hold”

Sometimes your status says something like:

“On Hold”

Or:

“Tested – Scores On Hold”

This status is not good.  It means your TOEFL scores are in “administrative review” and you will have to wait longer than normal.  According to ETS, the review can take 2 to 4 weeks from the time they send you an email about the hold.

This usually happens during the TOEFL Home Edition if there was some technical problem, or if the proctoring system detected something abnormal.  Sometimes students think everything was totally fine… but the scores still get put on hold.  I’ve got an entire blog post about this issue.  Basically, though, you can call the office of testing security at ETS if you want more information.

If you guys see any other statuses, please let me know and I will add them to the list.

Wrapping up this series on “templated responses,” I want to share a few paragraphs recently added to the ETS website (via the new FAQ for TOEFL):

Some test takers use templates in the speaking or writing sections of the TOEFL iBT test. It can be considered a kind of “blueprint” that helps test takers organize their thoughts and write systematically within the limited time when responding to the speaking section or composing an essay.

However, there are risks to using templates. They can be helpful to establish a general structure for your response, but if they do more than that, you’re probably violating ETS testing policies. The test rules are very strict that you must be using your own words in your responses, and not those from others, or passages that you have previously memorized.

ETS uses detection software to identify writing passages that are similar to outside sources or other test takers’ responses. If you use a template, there is a high probability of providing responses similar to those of other test-takers. So we strongly recommend that you produce your own responses during the test itself.

This is significant, as it is perhaps the first time ETS has directly referenced “templates” in a communication to test takers.  The TOEFL Bulletin has long contained references to “memorized content,” but that’s not quite the same thing.

The verbiage may be tricky for some to fully grasp.  Templates are “helpful to establish a general structure” but test takers “must be using [their] own words” in their responses.  When does a template cross the line from being helpful to being a violation of ETS testing policies?  That’s not immediately clear.

However, John Healy reminded me to check out “Challenges and Innovations in Speaking Assessment,” recently published by ETS via Routledge.  In an insightful chapter, Xiaoming Xi, Pam Mollaun and Larry Davis categorize the potential uses of “formulaic responses” in speaking answers and describe how raters should treat each category.  Those categories are:

  1. Practiced lexical and grammatical chunks
  2. Practiced generic discourse markers
  3. Practiced task type-specific organizational frames
  4. Rehearsed generic response for a task type
  5. Heavily rehearsed content
  6. Rehearsed response

I think these categories are self-explanatory, but here are a few quick notes about what they mean:

  1. Formulaic expressions (chunks of sentences) stored in the test taker’s memory.
  2. Stuff like “in conclusion” and “another point worth mentioning…”
  3. Stuff like “The university will ___ because ____ and ____.  The man disagrees because ___ and ___.”
  4. Like category three, but without blanks to be filled in.
  5. Like number 1, but the content “[differs] from formulaic expressions present in natural language use.”  This content is “produced with little adaptation to suit the real task demands.”
  6. A response that is “identical or nearly identical to a known-source text.”

Dive into the chapter for more detailed descriptions.

It is recommended that categories 2 and 3 be scored on merit, that category 1 be scored on merit if the chunks do not match known templates, that spontaneous content in categories 4 and 5 be scored on merit (if any exists), and that category 6 be scored as a zero.
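
Since I had to read those recommendations twice to keep them straight, here they are restated as a little decision function. To be clear, this is just my own paraphrase of the chapter’s guidelines in Python form; the category numbers follow the list above, and nothing here is official ETS code:

    # My paraphrase of the chapter's rating guidelines; not official ETS code.
    def rating_guideline(category: int, matches_known_template: bool = False) -> str:
        """Return the recommended treatment for a formulaic-response category."""
        if category == 1:
            # Chunks are scored on merit only if they don't match known templates.
            # (The chapter doesn't spell out the alternative; presumably the
            # matched chunks simply aren't credited.)
            return "score on merit" if not matches_known_template else "do not credit"
        if category in (2, 3):
            return "score on merit"
        if category in (4, 5):
            return "score spontaneous content on merit (if any exists)"
        if category == 6:
            return "score as zero"
        raise ValueError(f"unknown category: {category}")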

Very sensible guidelines.  But putting them into use?  The chapter notes:

“Unfortunately, the guidelines did not remove the challenge of detecting what particular language has been prepared in advance. Certain types of delivery features may be suggestive of memorized content, such as pausing or other fluency features, or use of content that appears ill-fitting or otherwise inappropriate. Nonetheless, it can be quite challenging to detect formulaic responses, especially if the response references a previously unknown source text.”

According to the chapter, ETS experimented with supplying raters with examples of memorized content to refer to while making decisions, but that didn’t work out, for a couple of reasons: raters became overly sensitive to memorized content, and the rating process became too slow.

The authors wrap up the chapter by making a few suggestions for future study, including redesigned item types and AI tools.

To me, AI tools are a must in 2024, both in terms of correctly identifying overly gamed responses and avoiding false positives.  A quick glance at Glassdoor reviews suggests that response scorers (of a variety of tests) are often low-paid workers who sometimes feel pressure to work quickly.  Tools that help them work more accurately and more efficiently seem like a good idea.

It is worth sharing a few notes from a long article published by Pearson in the summer.  It provides more information about the topics discussed in Jarrad Merlo’s webinar about the introduction of human raters to the PTE tests.

The article describes how a “gaming detection system” has been developed to aid in the evaluation of two of the speaking questions and one of the writing questions on the PTE. This system gives each response a numerical score from 0 to 1, with 0 indicating the complete absence of gaming and 1 indicating significant evidence of gaming.

This numerical approach seems wise, as the lines between acceptable and unacceptable use of “templated responses” in a response are sometimes blurred.  In a future post, I’ll summarize some research from ETS that discusses this topic.

Meanwhile, Pearson’s article notes that:

“While more rudimentary systems may rely on a simple count of words matching known templates, PTE Academic’s gaming detection system has been designed to consider a number of feature measurements that quantify the similarity of the response to known templates, the amount of authentic content present, the density of templated content, and the coherence of the response”
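
To make that a little more concrete, here is a toy sketch of how a detector along those lines might combine a few of the named feature measurements into a single 0-to-1 gaming score. The features are the ones Pearson lists; the n-gram similarity measure, the weights, and everything else are my own invention, not Pearson’s actual system:

    # A toy sketch, NOT Pearson's system: combine template similarity,
    # templated-content density, and coherence into a single 0-1 gaming score.

    def ngram_overlap(response: str, template: str, n: int = 4) -> float:
        """Fraction of the response's word n-grams that also appear in the template."""
        def ngrams(text: str) -> set:
            words = text.lower().split()
            return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        resp = ngrams(response)
        if not resp:
            return 0.0
        return len(resp & ngrams(template)) / len(resp)

    def gaming_score(response: str, known_templates: list[str], coherence: float) -> float:
        """0.0 = no evidence of gaming, 1.0 = significant evidence.

        `coherence` stands in for a model-derived estimate in [0, 1]; the
        article doesn't say how Pearson measures it. A real system would
        also measure authentic content directly, which this sketch skips.
        """
        # Similarity of the whole response to its nearest known template.
        similarity = max(
            (ngram_overlap(response, t) for t in known_templates), default=0.0)

        # Density: the share of sentences that closely match some template.
        sentences = [s.strip() for s in response.split(".") if s.strip()]
        matched = sum(
            1 for s in sentences
            if any(ngram_overlap(s, t, n=3) > 0.5 for t in known_templates))
        density = matched / len(sentences) if sentences else 0.0

        # Invented weights; a production system would learn these from data.
        score = 0.5 * similarity + 0.3 * density + 0.2 * (1.0 - coherence)
        return min(max(score, 0.0), 1.0)

Pearson’s real detector is presumably a trained model rather than hand-set weights like these, but the shape of the problem is the same: turn several similarity and coherence measurements into one number a rater can glance at.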

The article goes on to describe how the results of the AI checks are passed along to human raters, to aid in their decision-making regarding the content of responses.  It notes that the newly-implemented system:

“enables raters to make better informed content scoring decisions by leveraging the Phase I gaming detection systems to provide them with information about the about the [sic] extent of gaming behaviours detected in the response.”

That’s fascinating.  I’m not aware of another system where human raters can make use of AI-generated data when making decisions about test taker responses.

The article notes that human raters will not check all written responses for templated content.  Checks of most responses will be done entirely by AI that has been trained on a regularly-updated database of templates discovered via crawls of the web and social media.
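
Reading between the lines, the flow for written responses seems to be roughly the sketch below. This is my own reconstruction, with invented names and an invented threshold, not anything Pearson has published:

    # My reconstruction of the written-response flow: most responses are
    # checked by AI alone against a template database refreshed from web
    # crawls; suspicious ones go to a human. Names and threshold invented.
    from dataclasses import dataclass

    @dataclass
    class RoutingDecision:
        gaming_score: float  # 0.0-1.0, e.g. from a detector like the sketch above
        route: str           # "ai_only" or "human_review"

    def route_written_response(response: str, known_templates: list[str],
                               coherence: float,
                               threshold: float = 0.5) -> RoutingDecision:
        """Flag high-gaming responses for human review; leave the rest to AI."""
        g = gaming_score(response, known_templates, coherence)  # defined above
        route = "human_review" if g >= threshold else "ai_only"
        return RoutingDecision(gaming_score=g, route=route)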

A challenge with this web-crawling approach that goes unmentioned is the difficulty of detecting templates that don’t show up on the public web.  In my neck of the woods, students pay high fees to “celebrity” test prep experts who create personalized templates that are neither shared publicly nor repeated by future test takers.  This came up in an article by Sugene Kim, which I’ll share in the comments.

Perhaps Pearson should go whole hog and bring in human raters for some or all responses in the writing section as well.

More on this in the days ahead.

The PTE Academic now has human raters!  I learned this in a webinar hosted by Pearson’s Jarrad Merlo.  Moving forward, a human rater (or more than one?) will check all test taker responses to the “Describe Image” and “Retell Lecture” questions.

Human raters will only grade responses in terms of content.  Other features (pronunciation, etc.) will still be graded entirely by AI.

Previously, human raters were only used when the AI scoring engine determined that responses were anomalous.

This change seems to be part of an effort to reduce the use of “templated responses,” a term that Jarrad used at least 37 times in the presentation.

The last part of the webinar discussed what is meant by “templated responses” in the context of the PTE, but I had to teach someone how to take (some other) test, so I missed it.

Below are some images, taken from the presentation, that demonstrate how this works. You can catch a replay of the webinar right here.

The folks at IELTS published a wonderful article last week which “benchmarks and examines the scoring and equating practices across different language tests… for the purposes of professional registration and entry into university study leading to professional registration.”

The article is an interesting read.  It describes how institutions often fail to follow the recommendations of test makers when setting score requirements (that is, they set requirements that are too low), and how institutions don’t seem to make use of the score concordance tables provided by test makers when setting requirements across different tests.

The authors note that:

“It is worth considering the mixed messages that differing equivalence scores across tests and institutions sends to international students and education agents. Poor equivalence affects candidates with borderline proficiency, because people can simply locate an easier option via a poorly set equivalence score on another test without needing to improve their underlying English skills. Furthermore, poor test score equivalency means that some tests will appear better to take than others.”

And:

“To set a score too low may have consequences, because it can mislead individuals into believing they have the right skillsets to complete their study or to safely practice as professionals.”

The article goes on to talk about how low score requirements can impact the families and communities of test takers.  It discusses how public safety can be affected by low test scores.  It also mentions how low score requirements might cause people to lose confidence in their nation’s public institutions.

Indeed, the authors note that when institutions fail to reliably use language test scores, “social order may be undermined.”

Can this be avoided?  Yes, of course.  The researchers note that:

“There is an opportunity for IELTS to lead the way, by linking with other test providers to provide a united front on what test scores to use and how to equate scores on different tests, and to produce the same equivalence table which would be displayed on each test-developer’s website. The equivalence could be jointly reviewed annually to ensure agreement between tests is met.”

More interestingly, the authors suggest that IELTS be switched to a 0-90 scale (increasing at 10-point increments), which would be easier for institutions to understand and equate to other tests.

I’m scheduled to take the PTE Academic Test on November 21 in Seoul. Let me know if there is anything I should keep an eye out for. I’ll take the test at the “Herald Test Center” in Gangnam, which I have never been to.

A few notes about the registration process:

  1. This test center offers the PTE-A four days a week, three times each day.
  2. There are two places to take the PTE in Seoul. The other is the Pearson Professional Center, where I took the PTE-Core last month.
  3. It took about five minutes to complete the registration process.
  4. I checked, and saw that there is no late booking fee in Korea. That’s nice. Just note that booking is halted 24 hours before the scheduled test time.
  5. The fee in Korea is 286,000 KRW, or about $204 USD. That makes it a little bit cheaper than the TOEFL (which costs $220 USD) and the IELTS (which costs 299,000 KRW). That said, I think the British Council has a November/December sale on right now, so you can take the IELTS for 279,000 KRW.

Big thanks to Pearson for providing a test voucher.

There are three more tests I would like to take before the end of 2024. But time moves at an unimaginable speed.