Some problems about pronunciation evaluation


    Hi all,

    I’m having some trouble with pronunciation evaluation.
    For a sentence such as “how are you?”, my app uses OpenEars to recognize the speaker’s speech and obtain a hypothesis.
    The hypothesis may be “how are”, “who are you”, “how are your baby”, and so on.
    I wonder whether OpenEars or Pocketsphinx can provide a score for each word in the hypothesis, as well as a score for the whole hypothesis, so that I can evaluate the speaker’s pronunciation.

    Also, what kind of language model is required?

    Any thoughts on how to do that?

    Halle Winkler


    There are a number of issues with attempting this. The first is that in Pocketsphinx you can’t get scoring for each word (although this is an option with RapidEars). But the bigger issue is that the impact of pronunciation on scoring is much less than the impact of other factors. One big factor is environmental, i.e., background noises, which mic is used, mic distance, mic differences across different devices, room characteristics. Another is whether the acoustic characteristics of the speaker are well-represented in the acoustic model. For instance, I’ve received reports that very high female voices aren’t as accurately recognized as other male and female voices, and this will manifest in scores to a subtler extent with other underrepresented vocal characteristics. Another factor is that the scoring accuracy is heavily impacted by the size of the vocabulary. The problem is trying to separate these large factors from the relatively small factor of pronunciation quality.

    When developers ask about using the scores I always have the same advice: they can only be used to compare the same utterances within a single session. Meaning that you can gather scores about a number of utterances of the same statement from a single speaker in a single environment in a single app session and say that some scores are better than others and give that feedback to the user if you want. But you can never compare scores across multiple speakers, multiple devices, and multiple usage sessions in order to derive meaningful information, such as saying that a word utterance with a score of -2000 should always be understood to be a correct pronunciation.
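As a rough illustration of that advice, here is a minimal sketch in C of comparing scores only among repeated attempts at the same phrase within one session. The helper names `best_attempt` and `attempts_beaten` are hypothetical and not part of OpenEars or Pocketsphinx, and the sketch assumes the scores are log-domain values where a higher value (closer to zero) is better.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not part of OpenEars or Pocketsphinx.
 * scores[i] is the recognition score for attempt i at the SAME phrase,
 * all gathered from one speaker, on one device, in one app session.
 * Assumption: scores are log-domain values, so higher means better. */

/* Return the index of the best-scoring attempt in this session. */
size_t best_attempt(const int32_t *scores, size_t n)
{
    assert(scores != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (scores[i] > scores[best]) {
            best = i;
        }
    }
    return best;
}

/* Count how many attempts in this session scored strictly worse than
 * attempt idx. This is a relative rank only; it deliberately avoids
 * comparing against any fixed "good pronunciation" threshold. */
size_t attempts_beaten(const int32_t *scores, size_t n, size_t idx)
{
    assert(scores != NULL && idx < n);
    size_t beaten = 0;
    for (size_t i = 0; i < n; i++) {
        if (scores[i] < scores[idx]) {
            beaten++;
        }
    }
    return beaten;
}
```

With this approach an app could tell the user "your third attempt was your best so far", but the score array should be discarded whenever the speaker, device, environment, or session changes.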


    Thank you for your prompt reply and good advice.

    Another question:

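    static void print_word_times(int32 start)
    {
        /* ps is assumed to be the ps_decoder_t * declared elsewhere in the program. */
        ps_seg_t *iter = ps_seg_iter(ps, NULL);
        while (iter != NULL) {
            int32 sf, ef, pprob;
            float conf;

            ps_seg_frames(iter, &sf, &ef);                  /* start and end frame of the word */
            pprob = ps_seg_prob(iter, NULL, NULL, NULL);    /* log posterior probability of the word */
            conf = logmath_exp(ps_get_logmath(ps), pprob);  /* convert from log domain to linear */
            printf("%s %f %f %f\n", ps_seg_word(iter), (sf + start) / 100.0, (ef + start) / 100.0, conf);
            iter = ps_seg_next(iter);
        }
    }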

    Why is conf always 1.0?
    Can you tell me what is actually happening?

    Halle Winkler

    That looks like a question for the Sphinx project.
