Part of your TOEFL speaking score comes from the SpeechRater engine, an AI system that scores your responses in the speaking section of the test.  Each of your answers is graded by one human rater and by the SpeechRater engine, and these scores are combined to produce your final score for each response.  We don’t know how the human rater and the SpeechRater are weighted.  I assume that the human rater is given greater weight, but I don’t have any evidence to support that claim.

How does the SpeechRater engine work?  It is hard to answer this question with any certainty, since ETS doesn’t provide all of the details we want to read.  However, an article published recently in Assessment in Education provides some helpful information.

The article describes the twelve features used to score the delivery of a TOEFL response, and the six features used to score its language use, in one particular study. It also describes the relative impact of each feature on the final score.

It is important to note that the article only describes how the SpeechRater engine was configured in one specific study.  Remember: when the SpeechRater engine is used to grade real TOEFL tests, the feature set and the impact of each feature might be different.

So.  Let’s dig into those features and their relative impact. First, the 12 delivery features:

  • stretimemean (15% impact). This feature measures the average distance between stressed syllables. Researchers believe that people with fewer stressed syllables overall are less expressive in using stress to mark important information (source). SpeechRater measures this as the time between stressed syllables, rather than the number of syllables between them.  I would like to experiment with this using implementations of the SpeechRater (EdAgree, My Speaking Score), but I find it difficult to eliminate stresses from my own speech.
  • wpsecutt (15% impact).  This is your speaking rate in words per second.  If you say more words per second, you get a better score.  This has been confirmed by my experiments with the implementations mentioned above.
  • wdpchk (13% impact).  This is the average length, in words, of your uninterrupted runs of speech (chunks).  A chunk is a word or group of words separated from the rest of the response by pauses (source).  Note that other implementations of SpeechRater have measured chunks in seconds rather than words (source).  There is a rough sketch after this list showing how the chunk and pause features might be computed.
  • wdpchkmeandev (13% impact).  This is the “mean absolute deviation of chunk length in words.” The absolute deviation matters because the average mentioned above can obviously be skewed by one really long chunk among a bunch of short ones.  This feature seems to reward answers whose chunks are of consistent, sensible lengths.
  • conftimeavg (12% impact).  This one is described as “mean automated speech recogniser confidence score; confidence score is a fit statistic to a NNS reference pronunciation model.” I don’t know exactly what that means, but “NNS” presumably stands for “non-native speaker,” and a recognizer confidence score is basically how sure the speech recognizer is about what it heard.  The article says that it relates to your pronunciation of segmentals, so I suppose it measures how well you pronounce vowel and consonant sounds.
  • repfreq (8% impact). This measures the repetition of one or more words in sequence.  As in:  “I like like ice cream.”  I have experimented with this a bit, and was able to reduce my score by about two points (out of 30) by inserting a bunch of such repetitions.
  • silpwd (6% impact).  This measures the number of silences longer than 0.15 seconds in your answer.  Pauses hurt scores!  Note that I’ve also seen this referred to as measuring pauses of greater than 0.20 seconds, but don’t ask me for a citation.
  • ipc (6% impact).  This is said to measure the “number of interruption points (IP) per clause, where a repetition or repair is initiated.” I’m not quite sure what that means. Presumably, though, it has something to do with moments when the speaker backtracks to correct an error in grammar or usage (like: “Yesterday, I go… went to school.”)
  • stresyllmdev (5% impact).  This is “the mean deviation of distances between stressed syllables.”  Again, it seems to reward consistent spacing between stressed syllables, rather than merely a nice average.  I think.  I’m not much of a mathematician.
  • L6 (3% impact).  This is described as “normalised acoustic Model (AM) score, where pronunciation is compared to a NS reference model.”  I am not sure how this differs from “conftimeavg” above, though the visible difference is that this one is compared to a native-speaker (NS) model while conftimeavg is compared to a non-native-speaker model.  Again, it relates to your pronunciation of segmentals.  Teachers and students just need to know that proper pronunciation is probably a good thing.
  • longpfreq (3% impact).  This measures the number of silences greater than 0.5 seconds.  It is interesting that the SpeechRater engine has a separate category for really long pauses.  Some implementations seem to combine these into a single reported result, while others provide two separate pause-related results.  This certainly warrants some experimentation.
  • dpsec (1% impact).  This measures all of the “umm” and “eer” disfluencies.  Interestingly, these seem to barely matter at all!  I suppose, though, there is a risk that disfluencies can affect the pause- and chunk-related features.  I will experiment.
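To make the pause- and chunk-related features above a bit more concrete, here is a rough sketch of how metrics like words per second, mean chunk length, mean absolute deviation of chunk length, and pause counts could be computed from a time-aligned transcript.  This is purely my own illustration, not ETS code: the data structure, names, and thresholds are all assumptions (the 0.15 and 0.5 second cut-offs come from the article).

```python
# Rough illustration (not ETS code) of a few delivery-style features,
# computed from a list of words with start/end times in seconds.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds


def delivery_features(words: List[Word],
                      pause_threshold: float = 0.15,
                      long_pause_threshold: float = 0.5) -> Dict[str, float]:
    """Approximate a few SpeechRater-style delivery features (my guesses)."""
    total_time = words[-1].end - words[0].start
    words_per_second = len(words) / total_time  # rough analogue of "wpsecutt"

    # Split the response into "chunks": a new chunk starts whenever the gap
    # between two consecutive words exceeds the pause threshold.
    chunks = [[words[0]]]
    pauses = 0        # rough analogue of "silpwd"
    long_pauses = 0   # rough analogue of "longpfreq"
    for prev, curr in zip(words, words[1:]):
        gap = curr.start - prev.end
        if gap > pause_threshold:
            pauses += 1
            if gap > long_pause_threshold:
                long_pauses += 1
            chunks.append([curr])
        else:
            chunks[-1].append(curr)

    chunk_lengths = [len(c) for c in chunks]
    mean_chunk = sum(chunk_lengths) / len(chunk_lengths)  # analogue of "wdpchk"
    # Mean absolute deviation of chunk length: analogue of "wdpchkmeandev"
    chunk_mad = sum(abs(n - mean_chunk) for n in chunk_lengths) / len(chunk_lengths)

    return {
        "words_per_second": words_per_second,
        "mean_chunk_length_words": mean_chunk,
        "chunk_length_mean_abs_dev": chunk_mad,
        "pauses_over_0.15s": pauses,
        "pauses_over_0.5s": long_pauses,
    }


# Toy usage with made-up timings:
# words = [Word("I", 0.0, 0.2), Word("like", 0.25, 0.5), Word("dogs", 1.2, 1.6)]
# print(delivery_features(words))
```

Again, this is only meant to show the arithmetic behind the feature names; the real engine presumably works from an automatic speech recognizer’s output rather than a hand-labelled transcript.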

Next, the 6 language use features:

  • types (35% impact).  This measures “the number of word types used in the response.”  There is no definition of “word types,” but we can assume it refers to: adjectives, adverbs, conjunctions, determiners, nouns, prepositions, pronouns and verbs.  (In corpus linguistics, a “type” can also simply mean a unique word form, as opposed to a token, in which case this would just be counting vocabulary variety.)  I guess the SpeechRater engine rewards answers that include all of those. Most of those will be used naturally in an answer, but it is easy for students to forget about adjectives and adverbs.  And, obviously, lower-level students will not be able to use conjunctions properly. I don’t really know if SpeechRater is looking for a certain distribution of types.  There is a rough sketch after this list showing how a couple of these counts might be computed.
  • poscvamax (18% impact).  Oh, dammit, this is another hard one.  It is described as “comparison of part-of-speech bigrams in the response with responses receiving the maximum score.”  It is touted as measuring the accuracy and complexity of the grammar in an answer. A bigram is a sequence of two units (source); in this case the units are the part-of-speech tags of adjacent words.  So SpeechRater seems to measure grammar by comparing how you paired parts of speech together to how other high-scoring answers paired them together.  Yes… you are being compared to other people who answered TOEFL questions.  In my experience, SpeechRater’s grammar results have been wonky, and some implementations don’t bother showing them to students.  I think EdAgree removed this from its results recently.
  • logfreq (15% impact).  This measures how frequently the words in your answer appear in a reference corpus (the corpus is not named).  It purports to measure the sophistication of the vocabulary in the response.  I guess this means that the use of uncommon words is rewarded… but surely there is a limit to this.  I don’t think one can get a fantastic score by using extremely uncommon words (as they would sound awkward).
  • lmscore (11% impact).  This “compares the response to a reference model of expected word sequences.”  I’m not sure exactly what this means, but it seems like you will be rewarded for things like proper subject-verb agreement.  One imagines that “Most cats like cheese” is a more expected sequence than “Most cats likes cheese.”  Teachers and students should probably just assume that proper grammar is rewarded, and improper grammar is penalized.
  • tpsec (11% impact).  This measures the “number of word types per second.”  Again, we don’t have an official definition of “word types,” but my assumption is that students are rewarded for using a greater variety of word types in the answer.  That is to say, the SpeechRater may not be looking for a specific distribution, but simply rewards variety.
  • cvamax (10% impact).  This compares the number of words in the given answer with the number of words in answers that got the best possible score.  Popular wisdom seems to be that the best-scoring answers are about 130 words for the independent task and 170 words for the integrated tasks.
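Here is a similarly rough sketch of how a few of the language-use features might be approximated.  Again, this is only my own illustration under heavy assumptions: the reference bigram set and word-frequency list are hypothetical stand-ins (ETS does not publish its reference data), the real bigram feature reportedly uses part-of-speech tags rather than plain words, and I am guessing that a “type” means a unique word form.

```python
# Rough illustration (my guesses, not ETS code) of a few language-use features.
import math


def language_features(transcript: str,
                      duration_seconds: float,
                      reference_bigrams: set = None,
                      word_frequencies: dict = None) -> dict:
    """Approximate 'types', 'tpsec', and crude bigram/frequency features.

    reference_bigrams: hypothetical set of word pairs drawn from top-scoring
        responses (the real feature compares part-of-speech bigrams).
    word_frequencies: hypothetical {word: corpus count} table for "logfreq".
    """
    tokens = transcript.lower().split()

    # "types": here interpreted as unique word forms (the type/token sense).
    types = set(tokens)
    features = {
        "types": len(types),                                # analogue of "types"
        "types_per_second": len(types) / duration_seconds,  # analogue of "tpsec"
    }

    # Crude stand-in for the bigram-comparison feature ("poscvamax"):
    # what fraction of this response's bigrams also appear in the reference set.
    bigrams = list(zip(tokens, tokens[1:]))
    if reference_bigrams and bigrams:
        features["bigram_overlap"] = (
            sum(b in reference_bigrams for b in bigrams) / len(bigrams)
        )

    # Crude stand-in for "logfreq": average log frequency of words found in the
    # reference list (rarer vocabulary pulls this value down).
    if word_frequencies:
        logs = [math.log(word_frequencies[t]) for t in tokens if t in word_frequencies]
        if logs:
            features["mean_log_frequency"] = sum(logs) / len(logs)

    return features


# Toy usage with made-up reference data:
# print(language_features(
#     "most cats like cheese", 3.0,
#     reference_bigrams={("most", "cats"), ("like", "cheese")},
#     word_frequencies={"most": 900, "cats": 120, "like": 2000, "cheese": 80}))
```

The point is just to show what “counting types,” “comparing bigrams,” and “log frequency” amount to mechanically, not to claim this is how ETS actually computes anything.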

I think I will leave it at that, but please consider this post a work in progress.  I’ll add to it as I continue to carry out research. 

 

 

Hey, here’s something really amazing.

ETS has created a new subsidiary called EdAgree.  EdAgree is described as

…an advocate for international students providing a path to help students identify universities that will push them towards longer term success. We help you put your best foot forward during the admissions process and support you throughout your study abroad and beyond. 

As part of this mission, they provide free English speaking practice using the same SpeechRater technology that is used to grade the TOEFL!  

To access this opportunity, register for a free account on EdAgree.  After that, look for the “English Speaking Practice” button in the student dashboard.  The screenshot is from the desktop version, but it also works on mobile.

This section provides a complete set of four TOEFL speaking questions.  After you answer them, you’ll get a SpeechRater score in several different categories (pause frequency, distribution of pauses, repetitions, rhythm, response length, speaking rate, sustained speech, vocabulary depth, vocabulary diversity, vowels).  These categories are used on the real TOEFL to determine your score!  You can also listen to recordings of your answers.  Note that your responses are scored collectively, rather than individually.  That means, for example, that you get a “pause frequency” score for how you answered all four questions, and not a separate “pause frequency” score for each individual answer.

Update: The list of categories above has been revised a few times as EdAgree has tweaked the tool.

Note that you will get fresh questions every five days, though you can repeat the same questions as many times as you wish.  I do not know how many unique sets there are in total.  Keep visiting and let me know.

I took a set a few days ago, and the questions were pretty good.  They weren’t 100% the same as the real TOEFL, but they were better than what is found in most textbooks. 

It should also be noted that you could probably just use your own questions instead of the ones provided.  Do you get what I mean?  You are being scored based on technical features, which means that the scores will still be relevant no matter what question you answer.

Let me know if you guys enjoy the tool.  Meanwhile, here is my first set of results.  I still have room for improvement, as you can see!

Note:  This screenshot does not include all of the categories mentioned above, as they were not available when the service started.

EdAgree SpeechRater

 

Hey, I’ve been uploading a bunch of stuff to the YouTube channel without really mentioning it here.  One of the more popular videos is the 2021 version of my guide to the independent speaking task. Check it out!

If you are taking the TOEFL Home Edition, make sure to check your microphone.  Don’t just use the ProctorU website test, but actually make a recording and listen to it.

I often get sample answers from students that sound horrible.  They sound like they were recorded on one of Thomas Edison’s wax cylinders.  I can barely understand what they are saying.  The worst part is that the TOEFL raters will have the same challenge!  G-d only knows how this problem affects the automated scoring engine used by ETS nowadays.

Internal microphones (like in your laptop) are often terrible.  If yours is bad, consider getting an external microphone to use on the test.  Just remember that you cannot use headphones.  You should use one that sits on your desk.

I’m not a microphone expert, but my favorite cheap and tiny model is this one, from Samson.

The other day, someone asked:

I’ve got twelve months to prepare for the TOEFL, and I need 100 points.  What should I do?

The good news for that student is that they have time to really improve their English fluency instead of just learning TOEFL tricks and strategies.  I know it sounds crazy, but the best way to increase your TOEFL score is to become more fluent in English.

 

Here’s how I responded:

  1.  Get a good grammar book like “English Grammar in Use” (also called “Grammar in Use – Intermediate” in some countries).  I read about a dozen TOEFL essays every day, and I see that most students suffer from grammar and language use problems.   Reduce your error rate and your writing score will go up.
  2. Find someone to practice speaking with.  To improve your score you need to speak fluidly.  You need to eliminate pauses, “umm breaks”, and repetitions.  You need to pronounce vowels and consonants properly.  You need to reduce the effort required to understand what you are saying.  Regular practice will help with this.  You don’t necessarily have to pay big bucks for a special TOEFL teacher to do this.  You can probably find an affordable tutor on a service like italki.
  3. Take accurate practice TOEFL tests.  There are 15 official ETS practice tests available (Official Guide x 4, Official iBT Tests x 10, website x 1) plus some PDF junk on the website.  You should work through all of those.  Fortunately, you have time to buy all of the books!  Switch to unofficial material only when you run out.
  4. If you have a year to prepare, you can also improve your reading and listening skills in a general sense.  Spend some time reading good non-fiction books and articles (I like Science News and National Geographic).  Make use of your local library, if it has an English section.  For listening, try Khan Academy, or podcasts like 60 Second Science.
  5. Towards the end of your preparation period, take one of the scored practice tests from ETS to gauge your current level and see how to use the last few months most effectively.

 

And, yes, along the way you should devote some time to becoming familiar with the test.  Read the Official Guide cover to cover (a few times).  Read some of the guides on this website and watch some YouTube videos.  Review sample writing and speaking responses.  Just don’t get bogged down in “strategies” if the test is still a year away.

Today I want to talk a little bit about increasing your TOEFL speaking score by giving persuasive rather than descriptive responses in TOEFL speaking question one.

Descriptive responses merely describe something, while persuasive responses try to persuade the grader that your argument is a good one.

Note that since you have so little time to speak in this response (just 45 seconds!), the difference between a persuasive answer and a descriptive answer is very small.  But I think there is a real difference.

Here’s what I mean.

Imagine you’ve been asked if you prefer taking online classes or in-person classes and you’ve picked online classes.  This supporting reason is descriptive:

“First, we can take online classes at any time.  I am a mom and the best time for me to study is at night, and in-person classes are usually during the day.  Moreover, I can take a class at night while watching my kids.” 

This is descriptive, as I’m merely describing some of the features of online classes.  The grader might be wondering: so what?  Why are these good things?

In comparison, here is a persuasive reason:

“First, we can take online classes at any time.  I am a mom and the best time for me to study is at night, and in-person classes are usually during the day.  Moreover, I can take a class at night while watching my kids.  This flexibility allows busy parents to improve their lives by getting university degrees.”

That is a bit more persuasive.  It describes what an online class is, but also mentions a reason why these things matter.  Hopefully I’ve persuaded the grader that the stuff I’ve mentioned is important. As you can see, it is possible to turn a descriptive reason into a persuasive reason just by adding a universal long-term benefit.  Like I did here.

This is part of what the speaking rubric means when it talks about a “clear progression of ideas,” I believe.

I think there are a few things to mention about this strategy:

  • If you include two reasons, you probably only have room to do it in one of them.  That means one descriptive and one persuasive reason.
  • This whole article can be summed up as “mention a long-term benefit of one of the reasons.”
  • I do want to emphasize that in such a tiny little argument (three sentences!) the difference between persuasive and descriptive is very slight.  Don’t get too hung up on terminology.
  • Since this technique involves adding more content, it does require the student to speak at a natural pace and without a lot of pauses.
  • DON’T PANIC

 

 

Here’s a mildly interesting article about student responses to speaking question three.  The authors have charted out the structure of two sample questions provided by ETS, and tracked how many of the main ideas students of various levels included in their answers (the answers were also provided by ETS).

There is some good stuff in here for TOEFL teachers, particularly in how the authors map out the progression of “idea units” in the source materials.  They identified how test-takers of various levels represented these idea units in their answers, and how many of them they included.  Fluent speakers (or, I guess, proficient test-takers) represented more of the idea units, and also presented them in about the same order as in the sources.

Something I found quite striking is that one of the question sets studied was much easier than the other, a point the authors themselves describe.  I am left wondering how ETS deals with this sort of thing.  The rubric doesn’t really have room to adjust for question difficulty changing week by week.

There is also a podcast interview with one of the authors.

 

Update from April 2021:  The app was removed from the Play Store.  I don’t know what’s up with that.

Earlier this month, ETS quietly released a new language learning app to the Google Play Store and the App Store.  It’s called ELAI.  It seems to use their “SpeechRater” technology to grade sample speaking responses recorded using the app.  This makes it a very valuable tool for TOEFL prep, since student answers on the TOEFL test are partially graded by that particular technology.

Of course the app isn’t specifically designed for TOEFL prep, so it won’t give you actual TOEFL scores, but it will give you feedback based on word repetition, vocabulary level, pauses and filler words.  It will also tell you your words per minute.

There are some sample questions that look like TOEFL questions and some that don’t.  You decide how long you want to speak in your answer, so you can easily stop after 45 seconds to simulate the test.  I suppose you could actually ignore the given questions and just record an answer to a question you’ve gotten elsewhere and still get valuable feedback.

Note, though, that this seems to be a sort of beta test, which means it isn’t available in all countries or for all devices.  Don’t complain if you can’t download it.

Here are the links:

If you are able to try it out, leave a comment down below.

Note:  This website is not endorsed by ETS.

If you are going to take the at-home version, make sure to TEST YOUR MICROPHONE. And I don’t mean just using the ProctorU website. I mean making a whole lot of test recordings.  And actually listening to them carefully.

I can’t prove it, but I think a lot of students are getting low speaking scores (and cancelled scores) because of bad microphones.

Moreover, I can state that about 50% of the recordings that students make at home and send to me for evaluation sound like garbage.  Like they were made on one of Thomas Edison’s wax cylinders.

Today I want to write a few words about an interesting new (December 2019) text from ETS.  “Automated Speaking Assessment” is the first book-length study of SpeechRater, the organization’s automated speaking assessment technology.  That makes it an extremely valuable resource for those of us who are interested in the TOEFL and in how our students are assessed.  There is little in here that will make someone a better TOEFL teacher, but many readers will appreciate how it demystifies the changes to the TOEFL speaking section that were implemented in August of 2019 (that is, when the SpeechRater was put into use on the test).

I highly recommend that TOEFL teachers dive into chapter five of the book, which discusses the scoring models used in the development of SpeechRater.  Check out chapter four as well, which discusses how recorded input from students is converted into something that can actually be graded.

Chapters six, seven and eight will be the most useful for teachers.  These discuss, in turn:  features measuring fluency and pronunciation, features measuring vocabulary and grammar, and features measuring content and discourse coherence.  Experienced teachers will recognize that these three categories are quite similar to the published scoring rubrics for the TOEFL speaking section. 

In chapter six, readers will learn how the SpeechRater measures the fluency of a student by counting silences and disfluencies.  They will also learn how it handles speed, chunking and self-corrections.  These are things that could influence how teachers prepare students for this section of the test, though I suspect that most teachers don’t need a book to tell them that silences in the middle of an answer are a bad idea.  There is also a detailed depiction of how the technology judges pronunciation, though that section was a bit too academic for me to grasp.

Chapter seven discusses the grammar and vocabulary features that SpeechRater checks for.  Impressively, the book just presents them all in a list.  A diligent teacher might create a sort of checklist to provide to students. Finally, chapter eight discusses how the software assesses topic development in student answers.

Sadly, this book was finished just before ETS started using automated speaking scoring on high-stakes assessment.  Chapter nine discusses how the technology is used to grade TOEFL practice tests (low-stakes testing), but nothing is mentioned about its use on the actual TOEFL.  I would really love to hear more about that, particularly its ongoing relationship with the human raters who grade the same responses.