Some problems with pronunciation evaluation

    #1020904
    gump
    Participant

    Hi all,

    I’m having some problems with pronunciation evaluation.
    For a sentence, e.g. “how are you?”, my app can get a hypothesis by recognizing the speaker’s speech with OpenEars.
    The hypothesis may be “how are”, “who are you”, “how are your baby”, etc.
    I wonder whether OpenEars or PocketSphinx can give a score for every word in the hypothesis, and a score for the whole hypothesis, to evaluate the speaker’s pronunciation.

    Also, what language model is required?

    Any thoughts on how to do that?

    #1020908
    Halle Winkler
    Politepix

    Welcome,

    There are a number of issues with attempting this. The first is that in Pocketsphinx you can’t get scoring for each word (although this is an option with RapidEars). But the bigger issue is that the impact of pronunciation on scoring is much less than the impact of other factors. One big factor is environmental, i.e., background noises, which mic is used, mic distance, mic differences across different devices, room characteristics. Another is whether the acoustic characteristics of the speaker are well-represented in the acoustic model. For instance, I’ve received reports that very high female voices aren’t as accurately recognized as other male and female voices, and this will manifest in scores to a subtler extent with other underrepresented vocal characteristics. Another factor is that the scoring accuracy is heavily impacted by the size of the vocabulary. The problem is trying to separate these large factors from the relatively small factor of pronunciation quality.

    When developers ask about using the scores I always have the same advice: they can only be used to compare the same utterances within a single session. Meaning that you can gather scores about a number of utterances of the same statement from a single speaker in a single environment in a single app session and say that some scores are better than others and give that feedback to the user if you want. But you can never compare scores across multiple speakers, multiple devices, and multiple usage sessions in order to derive meaningful information, such as saying that a word utterance with a score of -2000 should always be understood to be a correct pronunciation.
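
    As a concrete sketch of that advice: plain Pocketsphinx returns an utterance-level score alongside each hypothesis, and those scores can be compared to each other within a session. The snippet below is only an illustration, not an OpenEars API; it assumes the two-argument ps_get_hyp() of newer Pocketsphinx releases (older ones also take an utterance-id argument), and the feedback messages are made up. Scores are log values, so a less negative score is better.

    #include <pocketsphinx.h>

    /* Track the best attempt at the same phrase within one session.
       These scores are only comparable to each other, never across
       speakers, devices, or app sessions. */
    static int32 best_score;
    static int have_best = 0;

    void report_attempt(ps_decoder_t *ps)
    {
        int32 score;
        const char *hyp = ps_get_hyp(ps, &score); /* hypothesis and its path score */

        if (hyp == NULL)
            return;
        if (!have_best || score > best_score) { /* less negative = better */
            best_score = score;
            have_best = 1;
            printf("Best attempt this session: \"%s\" (%d)\n", hyp, score);
        } else {
            printf("\"%s\" scored %d; your best this session is %d\n",
                   hyp, score, best_score);
        }
    }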

    #1021072
    gump
    Participant

    Thank you for your quick reply and good advice.

    Another question:

    #include <pocketsphinx.h>

    static ps_decoder_t *ps; /* initialized elsewhere via ps_init() */

    static void print_word_times(int32 start)
    {
        ps_seg_t *iter = ps_seg_iter(ps, NULL);
        while (iter != NULL) {
            int32 sf, ef, pprob;
            float conf;

            /* Start and end frames of this word (one frame = 10 ms). */
            ps_seg_frames(iter, &sf, &ef);
            /* Log posterior probability of the word... */
            pprob = ps_seg_prob(iter, NULL, NULL, NULL);
            /* ...converted out of the log domain to a 0..1 confidence. */
            conf = logmath_exp(ps_get_logmath(ps), pprob);
            printf("%s %f %f %f\n", ps_seg_word(iter),
                   (sf + start) / 100.0, (ef + start) / 100.0, conf);
            iter = ps_seg_next(iter);
        }
    }

    Why is conf always 1.0?
    Can you tell me the reason?
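
    For reference, here is a minimal driver for that function. The model paths and the raw file name are placeholders, and the exact ps_decode_raw()/ps_start_utt() signatures vary between Pocketsphinx releases, so please treat it as a sketch:

    #include <pocketsphinx.h>

    int main(void)
    {
        /* Placeholder model paths; point these at real files. */
        cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
            "-hmm", "/path/to/acoustic-model",
            "-lm", "/path/to/language-model.lm",
            "-dict", "/path/to/dictionary.dic",
            NULL);
        ps = ps_init(config); /* the file-scope decoder used by print_word_times() */

        FILE *fh = fopen("utterance.raw", "rb"); /* hypothetical 16 kHz raw audio */
        ps_decode_raw(ps, fh, -1);               /* -1: decode the whole file */
        print_word_times(0);                     /* utterance starts at frame 0 */

        fclose(fh);
        ps_free(ps);
        return 0;
    }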

    #1021073
    Halle Winkler
    Politepix

    That looks like a question for the Sphinx project.
