Teachers who work with students in Japan will appreciate this new article in “Language Testing in Asia.” It uses results from the TOEFL (and other tests) to create profiles of language learners in that country. Not surprisingly, the profiles match what our experience tells us.

I like this section at the end:

“When the uneven profiles come from true skill imbalance, learners and teachers may need to decide whether to focus on weaker or stronger skills for further study based on their contexts and needs. While one direction is to improve a weaker skill, a stronger skill could be further improved to compensate for the weaker one.”

Note the last part.

About a third of the students I work with nowadays are from Japan. Most of them come to me for help with the writing section. After looking at their score history I usually offer to help with the speaking section as well. The response is often something like “It is impossible for a student from Japan to score more than 23 in the speaking section, so I’m not going to work on it any more.” Even though their target is 110 overall, they’d rather just max out the other sections than “waste time” on the speaking prep. It’s an interesting approach.  It may be a correct one.

Gary J. Ockey and Evgeny Chukharev-Hudilainen published an article in Applied Linguistics that makes a few interesting points. It highlights Ockey’s earlier research suggesting that the asynchronous tasks used in the TOEFL iBT speaking section “may not sufficiently assess interactional competence.”

More importantly, it compares the use of a human interviewer (a la IELTS) to a Speech Delivery System (like a chatbot) to elicit spoken English from test-takers. It seems to suggest that “the computer partner condition was found to be more dependable than the human partner condition for assessing interactional competence,” and that both were equal in areas like pronunciation, grammar and vocabulary.

Aha! This information could be used to create a better TOEFL test or a better IELTS test. Someone should let the test makers know.

No need, though, as I read at the end that “this research was funded by the ETS under a Committee of Examiners and the Test of English as a Foreign Language research grant.”

Implement it right away, I say.

I mention this now because the research will be presented tomorrow at an event hosted by the University of Melbourne.

Over the last couple of days I have been playing with ChatGPT to create TOEFL Integrated Writing questions.  I’ve had some success.  My creations aren’t perfect, but in only 30 minutes I can easily put together something that is better than what most major American publishers put in their best-selling books.  That’s remarkable.

Can I do the same with the speaking section?  Yeah, I can. 

Today I will share a couple of AI-generated TOEFL speaking questions.  These are both “type 3” questions, which include a short reading and a brief lecture on the same topic.   In both cases I probably spent about 30 minutes revising them to be more “TOEFL-like”.

Note that eventually I will stick these questions onto pages that students can more easily use to practice for the test.

First up, I generated one about a unique animal feature, which is a fairly common topic on this part of the TOEFL test.

Here’s the reading:

Transparency in Animals

Transparency is the quality of being able to see through an object. While transparency is commonly associated with glass and other transparent materials, it is also found in a number of different animal species. Transparency in animals is typically achieved through the use of specialized cells, tissues, or structures that allow light to pass through their body. This can provide a number of benefits to the animal, such as improved camouflage, enhanced communication, and reduced drag while swimming. Studying the mechanisms of transparency in animals could potentially lead to the development of new materials and technologies that are inspired by the natural world.

And here is me reading the lecture:

 

And here is a transcript of the lecture:

Okay, so I’ve got an example of transparency in the wild. The glass squid is a type of deep-sea squid that is known for its transparent body and long, thin tentacles. Glass squids are found in the deep waters of the ocean… ah… I’d say…ah… they are typically found at depths of 1000 meters or more.

The transparency of the glass squid’s body is thought to be a form of deep-sea camouflage, as it allows the squid to blend in with the surrounding water and avoid being detected by predators. The transparency of the glass squid’s body also helps it to avoid being seen by its prey, allowing it to sneak up on unsuspecting fish that it wants to eat. The ability to use transparency in both of these ways is thought to be extremely important for the survival and success of the glass squid in its challenging deep-sea habitat.

Now, in addition to using transparency for camouflage, the glass squid also uses its transparent body for communication. This is interesting. It has light-emitting organs, called photophores, that are located inside of its body and tentacles. The glass squid uses its photophores to flash patterns of light, which it uses to communicate with other glass squids. This allows it to signal its presence to other members of its species and it also allows it to coordinate its movements with other squids in its group.

Okay, so sometimes the lecture in the type three question relates a sort of anecdote from the speaker’s life.  Can the AI produce one of those?  Yes.

Here is a reading:

Regret Aversion

Regret aversion is a psychological phenomenon in which people are more likely to avoid taking risks or making decisions that may result in regret. This is because people tend to experience negative emotions, such as regret or disappointment, more strongly than positive emotions, such as happiness or satisfaction. As a result, people may avoid taking risks or making decisions that may result in regret, even if those risks or decisions could potentially lead to better outcomes. Regret aversion can affect people’s decision-making in a variety of contexts, including financial decisions, personal relationships, and career choices.

And here is me reading the lecture:

 

And a transcript of the lecture:

Okay, so, I have a perfect example of regret aversion. I had this friend. Alex. Now, Alex was always very cautious when it came to making decisions. He was constantly worried about making the wrong choice and regretting it later on. This tendency became especially pronounced whenever he was faced with a difficult decision.

One time, Alex was considering whether to quit his job and start his own business. He had been working at the same company for several years, but he had always dreamed of being his own boss. The idea of starting his own business was exciting, but it was also risky. If the business failed, Alex could lose a lot of money and damage his reputation. At first, Alex was hesitant to follow through on his plans. He was afraid of the potential consequences if it failed. He was worried that if that happened, he would regret his decision and be disappointed in himself. He was also concerned that he would have to go back to working for someone else, which he didn’t wanna do.

As a result of his regret aversion, Alex decided not to quit his job and start his own business. He continued working at the same company, even though he wasn’t happy there. He missed out on the opportunity to pursue his dream of being his own boss, and he continued to feel unfulfilled and unhappy. Much later, Alex realized that his regret aversion had held him back. He had been so afraid of regretting his decision that he had avoided making a decision altogether. He had missed out on a potentially rewarding opportunity because of his fear of regret.

Today I want to pass along a few details from the Virtual Seminar for English Language Teachers hosted by ETS last night.

One of the presenters provided new details about how the speaking questions are scored. He prefaced these details by sharing a sample type 3 speaking question.  Here is the reading part:

And here is a transcript of the listening part:

 

A standard question, right?

Then we were shown the “answer sheet” that is given to raters so they know how to assess topic development.  This is new information.  Here it is:

That’s interesting, right?  My assessment of this is that for an answer to receive a full score for topic development it must explicitly or implicitly reference the term and its definition.  It must also broadly summarize the example.  And then it must include just two of the four main details given in the example.  The last part is new to me.  Generally, I push students to include all of the details.  Perhaps I should reassess my teaching methods.

There you go, teachers.  Some new information about the TOEFL… in 2022.

A few questions remain:

  • Is this always the case?  Will there always be four main details in the example?  Will we always need to include just two of them?  Probably not.  Surely there are cases where more than two details are required.
  • How does this work in lectures which have two totally unique examples?  Often the reading is about a biological feature in animals, while the lecture describes two different animals that have this feature.  Is it okay to ignore one of them?  Probably not.
  • Can any of this learning be applied to TOEFL Speaking question four?  Probably not.
  • Does order matter?  Probably not.


This time I tried to use the SpeechRater implementation at My Speaking Score to produce the highest scoring “short” answer I could come up with.  Once again, my answer is just 74 words in total.  You can hear it below:

 

Here’s a transcript:

Clearly, it’s advantageous to study various subjects while we are at university. This is because it makes it possible to find our true passion. For instance, when I was a freshman, I took courses in chemistry, psychology and history. While I was enamored with all of them, eventually I discovered that I was most fascinated by chemistry. As a result, I ended up majoring in that. Moreover, I gained interdisciplinary insights throughout my journey.

This is almost the same as the last answer.  I’ve improved the vocabulary a little by using slightly less common words.  I added an -ly adverb to the beginning.

The main difference is that I delivered this in a natural speaking voice at a normal pace.  That means I finished after only 30 seconds.  As you will recall, last time I slowed it down to an unnatural pace and finished at 45 seconds.
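
To put numbers on that difference in pace, here is a minimal Python sketch. It uses only the figures mentioned above (a 74-word answer delivered in roughly 30 or 45 seconds). The function name is made up for illustration, and this is plain arithmetic, not how SpeechRater itself computes its speaking rate metric.

```python
def speaking_rate(word_count, seconds):
    """Return (words per second, words per minute) for one response."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    words_per_second = word_count / seconds
    return words_per_second, words_per_second * 60


# The same 74-word answer at the two delivery speeds described above.
for label, seconds in [("natural pace", 30), ("slowed down", 45)]:
    wps, wpm = speaking_rate(74, seconds)
    print(f"{label}: {wps:.2f} words/sec, {wpm:.0f} words/min")
```

That works out to roughly 148 words per minute at the natural pace versus about 99 words per minute for the slowed-down delivery, even though the word count is identical.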

The result… a score of 3.46, or 27/30.  Wow.  This isn’t a fluke.  I recorded a similar (but slightly different) answer of 80 words and got a score of 3.6.

Fluency

The fluency markers are all fantastic:

But how can that be?  I scored in the 87th percentile for response length, but scored in the 14th percentile for response rate last time… with the same number of words.  My guess is that the SpeechRater compares the number of words in the answer only to responses of the same length.  Do you get what I mean?  My answer was compared only to other 30 second answers and I included more words than most of them.   Again, I want to stress that this isn’t a fluke.  My 80-word answer got a similar response length score.

Perhaps we as teachers should stress pace rather than word count.

Pronunciation and Vocabulary

Moving on, here are the pronunciation and vocabulary scores:

The pronunciation is better this time, but SpeechRater still didn’t really like my rhythm.  That’s a flukier score, I think.  My 80-word answer was in the 98th percentile.  I think the SpeechRater isn’t perfect when it comes to measuring that dimension.  Perhaps I will make 20 unique recordings of the same answer one day and track the scores.

The vocabulary scores are okay.  Perhaps my low word count limited my ability to use a lot of uncommon words in the answer. That makes sense.

Grammar and Coherence

My grammar and coherence scores are as follows:

Final Words

So there you have it.  It is possible to get a good SpeechRater score on an independent response with a “short” answer.  This does track with what ETS says about how the human raters check answers.  ETS speakers have said again and again that it is not necessary to speak quickly on the test.  I don’t know if this can be applied to the integrated speaking tasks, but I’ll check my notes about how many actual details need to be included in the answer. At the last presentation I attended it was implied that only half of the details from the lecture of a type three speaking question need to be included.  Maybe thirty seconds is enough.

If any of my faithful readers want to sponsor a few test attempts I’ll march down to the test center and run these experiments in an actual testing environment.

Students often ask me how important it is to speak quickly in the TOEFL speaking section.  Keen students even ask how many words they should include.

I’ve always said that speaking rate is really important.  I’ve urged students to practice speaking quickly, as long as that doesn’t mess with their pronunciation or intonation.  But that’s always just been my gut feeling, stated without solid evidence. In an effort to gather some real data on the issue, I submitted a few of my own practice answers to SpeechRater, the automated scoring software used (together with human raters) by ETS to grade the TOEFL.  I was able to do this by uploading my answers to My Speaking Score, which has licensed SpeechRater.  I encourage both teachers and students to make use of that site.  It is comprehensive and fairly affordable.  A monthly subscription gives you a bunch of credits to upload answers (or if you prefer you can just record them in your browser).  Teacher and student accounts can be linked to facilitate reviews and personal feedback.

I must mention a few disclaimers before I get into the data:

  • The real test uses both a human rater and the SpeechRater.  That means we cannot use SpeechRater alone to completely predict how a given answer would score.
  • While My Speaking Score is meant to be as close to the real TOEFL as possible, it is a third party implementation, so it cannot be perfect.
  • This is a third party blog, not associated with ETS. My interpretation of the numbers below might be totally wrong.

My Sample Answer

Now that I’ve got that out of the way, you should listen to my sample answer:

 

Here’s a transcript:

I think it is much better to study various subjects while we are at university. This is because it helps us to find our true passion. For example, when I was a freshman, I took courses in chemistry, psychology and history. While I loved all of them, eventually I discovered that I was most interested in chemistry. As a result, I ended up majoring in that. Moreover, I gained interdisciplinary insights along the way.

My speaking rate is very low.  It is just 74 words (per the word counter in Google Docs).  I generally recommend about 120 words if students want a perfect score, and about 100 words if they want an “average” score.

However, everything else is pretty good.  My pronunciation and intonation are at a native level.  I’ve included a few fancy words like “passion” and “interdisciplinary” and “insights”.  I’ve also used transitions like “moreover” and “as a result.”  I even included a few conjunctions like “while” and “when.”  There aren’t any “umm” breaks, self-corrections or stutters. 

When I asked experienced TOEFL tutors how this answer would score, I got responses ranging from 24 points to a perfect 30 points.  As you can see, a few of the tutors really liked it!

Meanwhile, SpeechRater gave this one a score of 2.92/4.  That converts to 23 points out of 30.  Good… but not great.

What did SpeechRater think of my Fluency?

First up, here’s the score for speaking rate.

Not surprisingly, the answer is all the way down in the 14th percentile.  That means I spoke slowly.  There is a penalty for that.  But speaking rate is just one metric.  It can’t account for the entire seven-point penalty from the SpeechRater.

Here’s a look at the rest of the fluency metrics:

They aren’t good either.  As you can see, the SpeechRater gave me a poor score for “sustained speech.”  It identified a bunch of disfluencies and I ended up in the 27th percentile.  In this case the disfluencies are silent pauses, but in other answers they might include “uhh” breaks.  It also gave me a fairly poor score for metrics specifically related to pauses, as you can see.  Slowness and pauses usually go hand in hand, as you might expect. 

What did SpeechRater think of my Pronunciation?

My pronunciation scores were a mixed bag:

Despite my slowness, my rhythm was pretty good.  However, my pronunciation of vowels was merely average.  But how can that be?  I’m a native speaker.  Well, another source of slowness is the way I sometimes draw out my vowel sounds. Notice my pronunciation of “and history” and “all of them” and “most interested”.  The awkwardness is subtle, but noticeable if you are listening for it.  The penalty for doing this is likely small, but I think it added up since I did it multiple times in every sentence.

What did SpeechRater think of my Vocabulary?

Vocabulary was another mixed bag:

As I mentioned above, I think my answer has a few good words in it.  However, I’m stuck in the 25th percentile for vocabulary depth.  And even though I didn’t really repeat words, my vocabulary diversity score is merely average.  Why?  Well, my guess is that since my total word count is quite low, it is almost impossible for me to include a lot of “uncommon” words.  I mentioned three words I suppose are “uncommon,” but that’s not really enough.  In a more quickly delivered answer I might have had time for seven or eight such words, and earned a better score in that domain.  Likewise, a more quickly delivered answer would have almost automatically included a more diverse vocabulary… and earned a higher score.

A future experiment might involve jamming as many fancy words as possible into an answer of the same length in an attempt to produce the best possible 75-word answer.

What did SpeechRater think of my Grammar?

Ooof.  My grammar score is not good:

I’m all the way down in the 9th percentile.  Again, this is despite the fact that my grammar is flawless.  Again, I think that the brevity of my answer means that I didn’t have the opportunity to use any advanced grammatical structures.  I have a couple of subordinating conjunctions, but that’s about it.  I don’t have any coordinating conjunctions, I don’t have any adverbs and I’m short on adjectives.  There are no conditionals in the answer, either.  Most of the answer is in the past tense. Some people might be able to fit a lot of grammatical conventions into just 75 words, but it isn’t easy. I think my limited use of grammar is common in answers that are delivered slowly.

What did SpeechRater think of my Coherence?

SpeechRater didn’t like my coherence either:

Again, my impression is that my answer was too short to include enough connective devices to please SpeechRater.  There are three obvious transitional phrases in my answer… but an answer with more words overall would naturally have more than that. Likewise, it would probably have a few compound sentences (my answer has none).

Final Words

The point I’m trying to make here is an obvious one, but it is important.  A TOEFL response delivered slowly may draw a low score from the SpeechRater.  In addition to being short overall, it will likely be missing some of the key features the SpeechRater wants to see.  Be careful on test day. 

As I indicated above, a future experiment will be to create the best possible 74 word answer, to see the best-case result for a slow answer.

Part of your TOEFL speaking score comes from the SpeechRater engine, which is an AI application that scores your responses during the speaking section of the TOEFL.  Basically, every one of your answers is graded by one human scorer, and by the SpeechRater engine.  These scores are combined to produce your final score for each response. We don’t know how the human rater and the SpeechRater are weighted.  I assume that the human rater is given greater weight, but I don’t have any evidence to support that claim. 

How does the SpeechRater engine work?  It is hard to answer this question with any certainty, since ETS doesn’t provide all of the details we want to read.  However, an article published recently in Assessment in Education provides some helpful information.

The article describes the twelve features used to score the delivery of a TOEFL response, and the six features used to score the language use of a TOEFL response in one study. It also describes the relative impact of each feature on the final score.

It is really important to note that the article only describes how the SpeechRater engine was used in a specific study.  Remember: when the SpeechRater engine is used to grade real TOEFL tests the feature set and impact of each feature might be different from this study.

So.  Let’s dig into those features and their relative impact. First, the 12 delivery features:

  • stretimemean (15% impact). This feature measures the average distance between stressed syllables. Researchers believe that people with fewer stressed syllables overall are less expressive in using stress to mark important information (source). SpeechRater measures this variable in time between stressed syllables, rather than in syllables themselves.  I would like to experiment with this using implementations of the SpeechRater (EdAgree, My Speaking Score) but I find it difficult to eliminate stresses from my own speech.
  • wpsecutt (15% impact).  This is your speaking rate in words per second.  If you say more words, you get a better score.  This has been confirmed by my experiments with the above implementations.
  • wdpchk (13% impact).  This is the average length of uninterrupted speech (chunks) in words.  A chunk is a word or group of words that is separated by pauses (source).  Note that other implementations of SpeechRater have measured chunks in seconds rather than words (source). 
  • wdpchkmeandev (13% impact).  This is the “mean absolute deviation of chunk length in words.” The absolute deviation is important because obviously the average mentioned above can be skewed by the presence of one really long chunk and a bunch of short chunks.  This feature seems to reward people who give answers containing chunks of sensible lengths.
  • conftimeavg (12% impact).  This one is described as “mean automated speech recogniser confidence score; confidence score is a fit statistic to a NNS reference pronunciation model.” I don’t know what that means.  But the article says that it relates to your pronunciation of segmentals, so I suppose it measures how well you pronounce vowel and consonant sounds.
  • repfreq (8% impact). This measures the repetition of one or more words in sequence.  As in:  “I like like ice cream.”  I have experimented with this a bit, and was able to reduce my score by about two points (out of 30) by inserting a bunch of such repetitions.
  • silpwd (6% impact).  This measures the number of silences in your answer of more than 0.15 seconds.  Pauses hurt scores!  Note that I’ve also seen this referred to as measuring pauses of greater than 0.20 seconds, but don’t ask me for a citation.
  • ipc (6% impact).  This is said to measure the “number of interruption points (IP) per clause, where a repetition or repair is initiated.” I’m not quite sure what that means. Obviously, though, it has something to do with moments when the speaker backtracks to correct an error in grammar or usage (like:  “Yesterday, I go… went to school.”)
  • stresyllmdev (5% impact).  This is “the mean deviation of distances between stressed syllables.”  Again, it encourages the speaker to have sensible distances between stressed syllables, rather than merely having a nice average.  I think.  I’m not much of a mathematician. 
  • L6 (3% impact).  This is described as “normalised acoustic Model (AM) score, where pronunciation is compared to a NS reference model.”  I am not sure how this differs from “conftimeavg” above.  Again, though, it relates to your pronunciation of segmentals.  Teachers and students need to know that proper pronunciation is probably a good thing.
  • longpfreq (3% impact).  This measures the number of silences greater than 0.5 seconds.  It is interesting that the SpeechRater engine has a separate category for really long pauses.  Some implementations seem to combine these into a single reported result, while others provide two separate pause-related results.  This certainly warrants some experimentation.
  • dpsec (1% impact).  This measures all of the “umm” and “eer” disfluencies.  Interestingly, these don’t seem to matter at all!  I suppose, though, there is a risk that disfluencies can impact the pause and chunk related features.  I will experiment.
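
To make a few of these delivery features more concrete, here is a rough Python sketch of how numbers like the speaking rate, chunk length, chunk-length deviation, silence counts and repetition count could be derived from word timings. This is only an illustration and not SpeechRater's actual implementation. The input format (a list of word, start time, end time tuples), the function name and the dictionary keys are all hypothetical. I am also assuming that a chunk ends at any gap longer than 0.15 seconds and that the speaking rate is measured from the first word to the last. The 0.15 and 0.5 second thresholds come from the list above.

```python
from statistics import mean

SILENCE = 0.15     # seconds; shortest gap counted as a silence (from the list above)
LONG_PAUSE = 0.5   # seconds; gap counted as a long pause (from the list above)


def delivery_sketch(words):
    """words: list of (text, start_sec, end_sec) tuples in spoken order."""
    if not words:
        raise ValueError("need at least one word")

    # Silent gap between the end of each word and the start of the next.
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(words, words[1:])]

    # Assumption: a chunk ends wherever the gap is longer than SILENCE.
    chunks, current = [], 1
    for gap in gaps:
        if gap > SILENCE:
            chunks.append(current)
            current = 1
        else:
            current += 1
    chunks.append(current)

    avg_chunk = mean(chunks)
    duration = words[-1][2] - words[0][1]   # first word start to last word end
    texts = [w[0].lower() for w in words]

    return {
        "words_per_sec": len(words) / duration,
        "mean_chunk_words": avg_chunk,
        "chunk_mean_abs_dev": mean(abs(c - avg_chunk) for c in chunks),
        "silences": sum(g > SILENCE for g in gaps),
        "long_pauses": sum(g > LONG_PAUSE for g in gaps),
        # Immediate repeats only, as in "I like like ice cream".
        "repetitions": sum(a == b for a, b in zip(texts, texts[1:])),
    }


# Made-up timings for the repetition example above, with one long pause.
sample = [("I", 0.0, 0.2), ("like", 0.25, 0.5), ("like", 1.2, 1.5),
          ("ice", 1.55, 1.8), ("cream", 1.85, 2.5)]
print(delivery_sketch(sample))
```

In this toy example the single 0.7 second gap counts as both a silence and a long pause, and it also splits the answer into two short chunks. That shows how one hesitation can drag down several features at once.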

Next, the 6 language use features:

  • types (35% impact).  This measures “the number of word types used in the response.”  There is no definition for “word types” but we can assume it refers to: adjectives, adverbs, conjunctions, determiners, nouns, prepositions, pronouns and verbs.  (Another possibility is that “types” is meant in the linguistics sense of distinct words, as opposed to “tokens,” in which case this feature simply counts how many different words are used.)  I guess the SpeechRater engine rewards answers that include all of those. Most of those will be used naturally in an answer, but it is easy for students to forget about adjectives and adverbs.  And, obviously, lower-level students will not be able to use conjunctions properly. I don’t really know if SpeechRater is looking for a certain distribution of types.
  • poscvamax (18% impact).  Oh, dammit, this is another hard one.  It is described as “comparison of part-of-speech bigrams in the response with responses receiving the maximum score.”  It is touted as measuring the accuracy and complexity of the grammar in an answer. A bigram is a sequence of two units (source). I would assume, in this case, that it is two adjacent words.  Perhaps SpeechRater purports to measure grammar by comparing how you paired words together to how other high scoring answers paired words together.  Yes… you are being compared to other people who answered TOEFL questions.  In my experience, SpeechRater’s grammar results have been wonky and some implementations don’t bother showing them to students.  I think EdAgree removed this from their results recently.
  • logfreq (15% impact).  This measures how frequently the words in your answer appear in a reference corpus (the corpus is not named).  It purports to measure the sophistication of the vocabulary in the response.  I guess this means that the use of uncommon words is rewarded… but surely there is a limit to this.  I don’t think one can get a fantastic score by using extremely uncommon words (as they would sound awkward).
  • lmscore (11% impact).  This “compares the response to a reference model of expected word sequences.”  I’m not sure what this means, but it seems like you will be rewarded for stuff like proper subject-verb agreement.  One imagines that “Most cats like cheese” is a more expected sequence than “Most cats likes cheese.”  Teachers and students should probably just assume that proper grammar is rewarded, and improper grammar is penalized.
  • tpsec (11% impact).  This measures the “number of word types per second.”  Again, we don’t have an official definition of “word types” but my assumption is that students are rewarded for using a greater variety of word types in the answer.  That is to say, the SpeechRater may not be looking for a specific distribution, but rewards a simple variety of types.
  • cvamax (10% impact).  This compares the number of words in the given answer with the number of words in other answers that got the best possible score.  Popular wisdom seems to be that the best scoring answers are 130 words in the independent task and 170 words in the integrated tasks.
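
Since the article does not define “word types,” here is a small Python sketch of one possible reading of the types and tpsec features. It assumes “types” means distinct words (the usual type/token distinction in corpus linguistics) rather than parts of speech. That is an assumption, not something the article confirms. The function name, the dictionary keys and the tokenizing rule are all hypothetical.

```python
import re


def type_counts(transcript, seconds):
    """Count running words (tokens) and distinct words (types) in a transcript."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Lowercase, then keep runs of letters, allowing straight or curly apostrophes.
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", transcript.lower())
    types = set(tokens)
    return {
        "tokens": len(tokens),
        "types": len(types),
        "types_per_sec": len(types) / seconds,
    }


# "like" is repeated, so five tokens yield only four types.
print(type_counts("I like like ice cream", 2.5))
```

Under this reading, repeating words adds to the word count without adding types, and a slow delivery lowers the types-per-second figure even when the vocabulary itself is varied.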

I think I will leave it at that, but please consider this post a work in progress.  I’ll add to it as I continue to carry out research. 

 

 

Hey, here’s something really amazing.

ETS has created a new subsidiary called EdAgree.  EdAgree is described as

…an advocate for international students providing a path to help students identify universities that will push them towards longer term success. We help you put your best foot forward during the admissions process and support you throughout your study abroad and beyond. 

As part of this mission, they provide free English speaking practice using the same SpeechRater technology that is used to grade the TOEFL!  

To access this opportunity, register for a free account on EdAgree.  After that, look for the “English Speaking Practice” button in the student dashboard.  The screenshot is from the desktop version, but it also works on mobile.

This section provides a complete set of four TOEFL speaking questions.  After you answer them, you’ll get a SpeechRater score in several different categories (pause frequency, distribution of pauses, repetitions, rhythm, response length, speaking rate, sustained speech, vocabulary depth, vocabulary diversity, vowels).  These categories are used on the real TOEFL to determine your score!  You can also listen to recordings of your answers.  Note that your responses are scored collectively, rather than individually.  That means, for example, that you get a “pause frequency” score for how you answered all four questions, and not a separate “pause frequency” score for each individual answer.

Update: The above list of categories has been revised a few times, as EdAgree has tweaked the tool.

Note that you will get fresh questions every five days.  I do not know how many unique sets there are in total.  Keep visiting and let me know.  However, you can repeat the same questions as many times as you wish.

I took a set a few days ago, and the questions were pretty good.  They weren’t 100% the same as the real TOEFL, but they were better than what is found in most textbooks. 

It should also be noted that you could probably just use your own questions instead of the ones provided.  Do you get what I mean?  You are being scored based on technical features, which means that the scores will still be relevant no matter what question you answer.

Let me know if you guys enjoy the tool.  Meanwhile, here is my first set of results.  I still have room for improvement, as you can see!

Note:  This screenshot does not include all of the categories mentioned above, as they were not available when the service started.

EdAgree SpeechRater

 

Hey, I’ve been uploading a bunch of stuff to the YouTube channel without really mentioning it here.  One of the more popular videos is the 2021 version of my guide to the independent speaking task. Check it out!

If you are taking the TOEFL Home Edition, make sure to check your microphone.  Don’t just use the ProctorU website test, but actually make a recording and listen to it.

I often get sample answers from students that sound horrible.  They sound like they were recorded using Thomas Edison’s wax tube machine.  I can barely understand what they are saying.  The worst part is that the TOEFL raters will have the same challenge!  This could affect your score… or result in a score hold.

Internal microphones (like in your laptop) are often terrible.  If yours is bad, consider getting an external microphone.

Just remember that you cannot use a headset microphone.  Nothing can cover your ears during the test.  Therefore,  you should use either an internal laptop microphone or one that sits on your desk.

I’m not a microphone expert, but I really like the Samson SAGOMIC Go Mic.  It is pretty cheap, and I use it regularly in my life.  It makes clear recordings.

My favorite “expensive” microphone is the Blue Yeti Nano.