Part of your TOEFL speaking score comes from the SpeechRater engine, an AI application developed by ETS that scores your responses in the speaking section of the test. Basically, every one of your answers is graded by one human scorer and by the SpeechRater engine, and these two scores are combined to produce your final score for each response. We don't know how the human rater and the SpeechRater are weighted. I assume that the human rater is given greater weight, but I don't have any evidence to support that claim.
How does the SpeechRater engine work? It is hard to answer this question with any certainty, since ETS doesn't publish all of the details we would like to read. However, an article published recently in Assessment in Education provides some helpful information.
The article describes the twelve features used to score the delivery of a TOEFL response and the six features used to score its language use in one particular study. It also describes the relative impact of each feature on the final score.
It is really important to note that the article only describes how the SpeechRater engine was used in a specific study. Remember: when the SpeechRater engine is used to grade real TOEFL tests, the feature set and the impact of each feature might differ from what is reported here.
So. Let’s dig into those features and their relative impact. First, the 12 delivery features:
- stretimemean (15% impact). This feature measures the average distance between stressed syllables. Researchers believe that people who use fewer stressed syllables overall are less expressive in using stress to mark important information (source). SpeechRater measures this variable as the time between stressed syllables, rather than as a count of intervening syllables. I would like to experiment with this using implementations of the SpeechRater (EdAgree, My Speaking Score), but I find it difficult to eliminate stresses from my own speech.
- wpsecutt (15% impact). This is your speaking rate in words per second. If you say more words in the time available, you get a better score. This has been confirmed by my experiments with the above implementations. (A rough sketch of how this and the pause- and chunk-related features might be computed appears after this list.)
- wdpchk (13% impact). This is the average length, in words, of uninterrupted runs of speech ("chunks"). A chunk is a word or group of words that is set off by pauses (source). Note that other implementations of SpeechRater have measured chunks in seconds rather than in words (source).
- wdpchkmeandev (13% impact). This is the "mean absolute deviation of chunk length in words." The absolute deviation matters because the average mentioned above can obviously be skewed by one really long chunk surrounded by a bunch of short ones. This feature seems to reward answers whose chunks are consistently of a sensible length.
- conftimeavg (12% impact). This one is described as the "mean automated speech recogniser confidence score; confidence score is a fit statistic to a NNS reference pronunciation model." I don't know exactly what that means (NNS presumably stands for non-native speaker). But the article says that it relates to your pronunciation of segmentals, so I suppose it measures how well you pronounce vowel and consonant sounds.
- repfreq (8% impact). This measures the repetition of one or more words in sequence. As in: “I like like ice cream.” I have experimented with this a bit, and was able to reduce my score by about two points (out of 30) by inserting a bunch of such repetitions.
- silpwd (6% impact). This measures the number of silences longer than 0.15 seconds in your answer. Pauses hurt scores! Note that I've also seen this referred to as measuring pauses longer than 0.20 seconds, but don't ask me for a citation.
- ipc (6% impact). This is said to measure the “number of interruption points (IP) per clause, where a repetition or repair is initiated.” I’m not quite sure what that means. Obviously, though, it has something to do with moments when the speaker backtracks to correct an error in grammar or usage (like: “Yesterday, I go… went to school.”)
- stresyllmdev (5% impact). This is "the mean deviation of distances between stressed syllables." Again, it rewards the speaker for keeping consistently sensible distances between stressed syllables, rather than merely having a nice average. I think. I'm not much of a mathematician.
- L6 (3% impact). This is described as a "normalised acoustic Model (AM) score, where pronunciation is compared to a NS reference model." I am not sure how this differs from "conftimeavg" above, other than that the comparison here is to a native-speaker (NS) model rather than a non-native (NNS) one. Again, though, it relates to your pronunciation of segmentals. The takeaway for teachers and students is simply that proper pronunciation is probably a good thing.
- longpfreq (3% impact). This measures the number of silences greater than 0.5 seconds. It is interesting that the SpeechRater engine has a separate category for really long pauses. Some implementations seem to combine these into a single reported result, while others provide two separate pause-related results. This certainly warrants some experimentation.
- dpsec (1% impact). This measures all of the "umm" and "err" disfluencies. Interestingly, these barely seem to matter on their own! I suppose, though, there is a risk that disfluencies can affect the pause- and chunk-related features. I will experiment.
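To make some of this more concrete, here is a rough sketch of how the rate-, pause-, and chunk-related delivery features might be approximated from a time-aligned transcript. Everything in it is my own assumption based on the descriptions above: the word timings are invented, and I have simply reused the 0.15-second silence threshold to delimit chunks. The real SpeechRater internals, thresholds, and normalisations are not public.

```python
# Toy approximations of a few delivery features, based only on the
# descriptions above. Treat this as a sketch, not as SpeechRater's logic.

from statistics import mean

# Hypothetical input: each word with its start/end time in seconds,
# as a forced aligner might produce. The data is invented.
words = [
    ("I", 0.00, 0.20), ("think", 0.25, 0.55), ("that", 0.60, 0.80),
    ("students", 1.60, 2.10), ("should", 2.15, 2.40), ("live", 2.45, 2.75),
    ("on", 2.80, 2.95), ("campus", 3.00, 3.50),
]

total_time = words[-1][2] - words[0][1]
wpsec = len(words) / total_time              # rough analogue of wpsecutt

# Pauses are the gaps between consecutive words.
pauses = [nxt[1] - cur[2] for cur, nxt in zip(words, words[1:])]
sil_count = sum(p > 0.15 for p in pauses)    # rough analogue of silpwd
longp_count = sum(p > 0.5 for p in pauses)   # rough analogue of longpfreq

# Chunks: runs of words not interrupted by a pause above the threshold.
chunk_lengths, current = [], 1
for p in pauses:
    if p > 0.15:
        chunk_lengths.append(current)
        current = 1
    else:
        current += 1
chunk_lengths.append(current)

wdpchk = mean(chunk_lengths)                                  # mean chunk length in words
wdpchkmeandev = mean(abs(c - wdpchk) for c in chunk_lengths)  # mean absolute deviation

print(f"words/sec={wpsec:.2f}  pauses>0.15s={sil_count}  pauses>0.5s={longp_count}")
print(f"mean chunk length={wdpchk:.2f}  mean abs deviation={wdpchkmeandev:.2f}")
```

If you wanted to play with this yourself, the output of a forced aligner (something like Gentle or the Montreal Forced Aligner) could be fed into code like this, and you could watch how deliberate pauses or filler words change the numbers.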
Next, the 6 language use features:
- types (35% impact). This measures "the number of word types used in the response." There is no definition of "word types," but we can assume it refers to parts of speech: adjectives, adverbs, conjunctions, determiners, nouns, prepositions, pronouns and verbs. I guess the SpeechRater engine rewards answers that include all of those. Most of them will appear naturally in an answer, but it is easy for students to forget about adjectives and adverbs. And, obviously, lower-level students will not be able to use conjunctions properly. I don't really know if SpeechRater is looking for a certain distribution of types.
- poscvamax (18% impact). Oh, dammit, this is another hard one. It is described as a "comparison of part-of-speech bigrams in the response with responses receiving the maximum score." It is touted as measuring the accuracy and complexity of the grammar in an answer. A bigram is a sequence of two units (source); in this case, presumably two adjacent part-of-speech tags. Perhaps SpeechRater purports to measure grammar by comparing how you paired words together with how other high-scoring answers paired words together. Yes… you are being compared to other people who answered TOEFL questions. In my experience, SpeechRater's grammar results have been wonky, and some implementations don't bother showing them to students. I think EdAgree removed this from their results recently. (A toy version of this kind of comparison is sketched after this list.)
- logfreq (15% impact). This measures how frequently the words in your answer appear in a reference corpus (the corpus is not named). It purports to measure the sophistication of the vocabulary in the response. I guess this means that the use of uncommon words is rewarded… but surely there is a limit to this. I don’t think one can get a fantastic score by using extremely uncommon words (as they would sound awkward).
- lmscore (11% impact). This "compares the response to a reference model of expected word sequences." I'm not sure exactly what this means (the name suggests a language-model score), but it seems like you will be rewarded for things like proper subject-verb agreement. One imagines that "Most cats like cheese" is a more expected sequence than "Most cats likes cheese." Teachers and students should probably just assume that proper grammar is rewarded and improper grammar is penalized.
- tpsec (11% impact). This measures the "number of word types per second." Again, we don't have an official definition of "word types," but my assumption is that students are rewarded for packing a greater variety of word types into the time available. That is to say, the SpeechRater may not be looking for a specific distribution; it simply rewards variety.
- cvamax (10% impact). This compares the number of words in the given answer with the number of words in other answers that got the best possible score. Popular wisdom seems to be that the best-scoring answers run about 130 words in the independent task and 170 words in the integrated tasks.
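Here is a similar sketch for a few of the language-use features. It is even more speculative than the one above: purely for illustration I read "types" as unique word forms (the standard type/token distinction), I use plain word bigrams where the article describes part-of-speech bigrams, and the "reference corpus" frequencies and the "top-scoring" answer are invented. Treat it as a toy model of the comparisons being described, not as the actual scoring logic.

```python
# Toy approximations of a few language-use features. All reference data,
# thresholds, and interpretations here are my own assumptions.

import math
from collections import Counter

response = ("i think students should live on campus because campus life "
            "helps students make friends and campus housing is cheap").split()
duration_sec = 8.0                       # hypothetical response length

# "types": read here as unique word forms (type/token distinction).
types = set(response)
tpsec = len(types) / duration_sec        # rough analogue of tpsec

# logfreq: average log frequency of the response words in a reference
# corpus. These counts are invented; a real corpus would be used.
reference_freq = {"i": 50000, "think": 8000, "students": 3000, "should": 9000,
                  "live": 4000, "on": 40000, "campus": 500, "because": 12000,
                  "life": 6000, "helps": 1500, "make": 10000, "friends": 2500,
                  "and": 60000, "housing": 400, "is": 70000, "cheap": 900}
logfreq = sum(math.log(reference_freq.get(w, 1)) for w in response) / len(response)

# poscvamax-like idea: compare bigrams in the response with bigrams from a
# top-scoring response (cosine similarity over bigram counts). The real
# feature uses part-of-speech bigrams; word bigrams keep this self-contained.
def bigram_counts(tokens):
    return Counter(zip(tokens, tokens[1:]))

top_scoring = "students should live on campus because it helps them make friends".split()
a, b = bigram_counts(response), bigram_counts(top_scoring)
dot = sum(a[k] * b[k] for k in a)
cosine = dot / (math.sqrt(sum(v * v for v in a.values())) *
                math.sqrt(sum(v * v for v in b.values())))

# cvamax-like idea: compare the response length to the length of top answers.
length_ratio = len(response) / len(top_scoring)

print(f"unique types={len(types)}  types/sec={tpsec:.2f}")
print(f"avg log freq={logfreq:.2f}  bigram similarity={cosine:.2f}  length ratio={length_ratio:.2f}")
```

Swapping the word bigrams for part-of-speech tags (for example, via a POS tagger) would bring the sketch closer to what the article describes, at the cost of an extra dependency.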
I think I will leave it at that, but please consider this post a work in progress. I’ll add to it as I continue to carry out research.