When is the best time to call "changeLanguageModelToFile"?

This topic has 4 replies, 2 voices, and was last updated 8 years, 4 months ago by evilliam.

Viewing 5 posts - 1 through 5 (of 5 total)

December 1, 2015 at 2:45 am #1027443
evilliam
Participant
Hello

I was wondering if anyone could shed a little light on when OpenEars changes its language model.

The comment above the method states that after you make the call, the “model will be changed at the earliest opportunity”.

My question: is there anything I can do to make this happen even earlier?

Some context:
I’m really stress-testing the framework to see what I can do with it, and I’m currently attempting to get multiple users to recite the theme to The Fresh Prince of Bel-Air and have OpenEars catch every single word as it is being said.

The best solution I’ve found in terms of accuracy and speed is to construct a grammar composed of an array of expanding strings of words, keyed with OneOfTheseWillBeSaidOnce.

e.g.:
```
NSDictionary *grammar = @{
    ThisWillBeSaidOnce : @[
        // Each entry adds one more word of the line, so whichever prefix has been spoken so far can match.
        @{ OneOfTheseWillBeSaidOnce : @[@"NOW", @"NOW THIS", @"NOW THIS IS", @"NOW THIS IS A", @"NOW THIS IS A STORY", @"NOW THIS IS A STORY ALL", @"NOW THIS IS A STORY ALL ABOUT", @"NOW THIS IS A STORY ALL ABOUT HOW"]}
    ]
};
```
etc…
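
Rather than typing each prefix by hand, the expanding array can be generated from the text of each line. Below is a minimal Objective-C sketch (ARC, Foundation only). The two string constants are hypothetical stand-ins for the OpenEars grammar keys so that it compiles without the framework, the helper names are made up for illustration, and each input line is assumed to have had its punctuation stripped already.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-ins for the OpenEars grammar keys, so this compiles
// without the framework. In a real project, use the keys OpenEars defines.
static NSString *const kThisWillBeSaidOnce = @"ThisWillBeSaidOnce";
static NSString *const kOneOfTheseWillBeSaidOnce = @"OneOfTheseWillBeSaidOnce";

// Splits a line into upper-case words, skipping the empty tokens left by repeated spaces.
static NSArray<NSString *> *WordsInLine(NSString *line) {
    NSMutableArray<NSString *> *words = [NSMutableArray array];
    NSCharacterSet *separators = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    for (NSString *token in [line componentsSeparatedByCharactersInSet:separators]) {
        if (token.length > 0) {
            [words addObject:token.uppercaseString];
        }
    }
    return words;
}

// Returns every leading run of words: "NOW", "NOW THIS", "NOW THIS IS", ...
static NSArray<NSString *> *ExpandingPrefixes(NSArray<NSString *> *words) {
    NSMutableArray<NSString *> *prefixes = [NSMutableArray arrayWithCapacity:words.count];
    for (NSUInteger length = 1; length <= words.count; length++) {
        NSArray<NSString *> *head = [words subarrayWithRange:NSMakeRange(0, length)];
        [prefixes addObject:[head componentsJoinedByString:@" "]];
    }
    return prefixes;
}

// Wraps the prefixes in the same dictionary shape as the grammar above.
static NSDictionary *ExpandingGrammarForLine(NSString *line) {
    return @{ kThisWillBeSaidOnce : @[ @{ kOneOfTheseWillBeSaidOnce : ExpandingPrefixes(WordsInLine(line)) } ] };
}
```

For example, `ExpandingGrammarForLine(@"Now this is a story all about how")` should produce the same eight-entry array as the hand-written grammar.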

But this causes problems with a halting, stuttered delivery, so I have a follow-up “mop-up” array that builds up in the same way and uses the key OneOfTheseCanBeSaidOnce: once with an array built up from the midpoint of the line, and again with the last couple of words, both as a phrase and individually.
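
The mop-up array can be generated in the same way. This is a hypothetical helper, not part of OpenEars or RuleORama: it takes a line already split into upper-case words, expands from the midpoint (assumed here to be `count / 2`), then appends the last `tailCount` words as a phrase and individually, dropping duplicates. The key constant is again a stand-in for the real OpenEars key.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in for the OpenEars optional-phrase key.
static NSString *const kOneOfTheseCanBeSaidOnce = @"OneOfTheseCanBeSaidOnce";

// Joins words[start ..< end] with single spaces.
static NSString *JoinWords(NSArray<NSString *> *words, NSUInteger start, NSUInteger end) {
    NSArray<NSString *> *slice = [words subarrayWithRange:NSMakeRange(start, end - start)];
    return [slice componentsJoinedByString:@" "];
}

// Builds the mop-up phrases for one line (already split into upper-case words):
// phrases expanding from the midpoint, then the last `tailCount` words as a
// phrase and individually. An ordered set keeps the order and drops repeats.
static NSArray<NSString *> *MopUpPhrases(NSArray<NSString *> *words, NSUInteger tailCount) {
    NSMutableOrderedSet<NSString *> *phrases = [NSMutableOrderedSet orderedSet];
    NSUInteger mid = words.count / 2;
    for (NSUInteger end = mid + 1; end <= words.count; end++) {
        [phrases addObject:JoinWords(words, mid, end)];
    }
    NSUInteger tailStart = words.count > tailCount ? words.count - tailCount : 0;
    if (tailStart < words.count) {
        [phrases addObject:JoinWords(words, tailStart, words.count)];
        for (NSUInteger i = tailStart; i < words.count; i++) {
            [phrases addObject:words[i]];
        }
    }
    return phrases.array;
}

// Wraps the phrases in an optional rule that can follow the main expanding rule.
static NSDictionary *MopUpRule(NSArray<NSString *> *words, NSUInteger tailCount) {
    return @{ kOneOfTheseCanBeSaidOnce : MopUpPhrases(words, tailCount) };
}
```

For the eight-word line “NOW THIS IS A STORY ALL ABOUT HOW” with `tailCount` 2, this yields “STORY” through “STORY ALL ABOUT HOW”, then “ABOUT HOW”, “ABOUT” and “HOW”.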

And this all works great, but… it understandably doesn’t scale well at all.

It maxes out the RAM on the device when creating a four-line grammar. It is essentially too large a grammar to be useful at all, with huge performance losses and a big memory footprint.

So I create a whole bunch of grammars ahead of time and switch between them on the fly as people are reading. This works pretty well, as long as I’m careful about when I switch and keep track of the word position in the previous grammar (as there’s overlap).
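
Keeping track of the word position across overlapping grammars comes down to simple bookkeeping. A sketch under stated assumptions: `GrammarSpan` and the helpers are hypothetical, each pre-built grammar is assumed to cover a known run of words in the full text, and each hypothesis is assumed to be one of the active grammar’s expanding prefixes, so its word count tells you how far the reader has got.

```objc
#import <Foundation/Foundation.h>

// Hypothetical description of one pre-built grammar: the run of words in the
// full text that its longest expanding phrase covers. Consecutive spans may overlap.
typedef struct {
    NSUInteger firstWord; // index of the grammar's first word in the full text
    NSUInteger wordCount; // number of words in the grammar's longest phrase
} GrammarSpan;

// Counts the words in a hypothesis such as "NOW THIS IS".
static NSUInteger HypothesisWordCount(NSString *hypothesis) {
    NSUInteger count = 0;
    NSCharacterSet *separators = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    for (NSString *token in [hypothesis componentsSeparatedByCharactersInSet:separators]) {
        if (token.length > 0) {
            count++;
        }
    }
    return count;
}

// Position in the full text just past the last recognised word, assuming the
// hypothesis is one of the active grammar's expanding prefixes.
static NSUInteger GlobalWordPosition(GrammarSpan active, NSString *hypothesis) {
    return active.firstWord + MIN(HypothesisWordCount(hypothesis), active.wordCount);
}

// YES once the reader is at or past the word where the next grammar begins,
// so the switch can be requested at the next pause.
static BOOL ShouldSwitchToNextGrammar(GrammarSpan active, GrammarSpan next, NSString *hypothesis) {
    return GlobalWordPosition(active, hypothesis) >= next.firstWord;
}
```

For instance, with the active grammar covering words 0 to 7 and the next one starting at word 6, a hypothesis of “NOW THIS IS A STORY ALL” puts the reader at position 6, which is the cue to switch.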

So I’m essentially down to a UX issue, not a code issue.

But… can someone tell me a bit more about what the earliest opportunity would be for OpenEars to switch?

A period of silence?

Because I’d love it to be mid-sentence, but it doesn’t seem to be *quite* that slick**

(** Pretty damn slick though, I have loads of ideas!)
December 1, 2015 at 12:17 pm #1027448

Halle Winkler
Politepix

Hi,

Is this with RuleORama or stock grammars?

December 1, 2015 at 12:51 pm #1027449

evilliam
Participant

This is with RuleORama

December 1, 2015 at 9:09 pm #1027455

Halle Winkler
Politepix

Is the bigger issue the memory overhead or the time to return a hypothesis? A language model switch has to occur after some kind of final hypothesis return since it’s like a micro-restart, so a pause has to have occurred. Just to verify, have you tried this using a statistical model rather than a grammar?

December 1, 2015 at 11:05 pm #1027456
evilliam
Participant
Is the bigger issue the memory overhead or the time to return a hypothesis?

The bigger issue was the time to return a hypothesis.

I figured I could get around the memory issue, as it was only a problem while the grammar was being created: I could create the grammars in the simulator and ship them to the device pre-created. Once the grammar was created, memory use went right back down from 300 MB+ to ~30 MB.

But speed is a huge part of the experience. In my app the voice is the *only* UI, but I am providing visual feedback all the time, so even a slightly sluggish response is undesirable.

have you tried this using a statistical model rather than a grammar?

Using a statistical model is certainly faster, but it comes with its own drawbacks: low-phoneme words produce a lot of false positives, and utterances with internal rhymes or lots of alliteration can all trip it up.
For example, ‘air’, ‘here’, ‘there’ and ‘her’ are all pretty close together in the Fresh Prince rap, and in the end I had to filter out very-low-phoneme words: not just ‘a’, ‘the’, etc., but also words that become low-phoneme when said in a range of accents (dropped ‘h’, ‘t’, etc.).
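
The filtering step can be sketched as a pass over the recognised words that counts the phonemes in each word’s pronunciation while ignoring any phoneme an accent might drop. Assumptions: the lexicon is a plain dictionary from word to phoneme list (in a real app it might be read from the pronunciation dictionary used for the language model), the phoneme symbols are CMU-style without stress marks, and the threshold and droppable set are hypothetical tuning knobs.

```objc
#import <Foundation/Foundation.h>

// Keeps only the words that still have at least `minPhonemes` phonemes after
// removing the phonemes some accents drop (e.g. "HH" for a dropped 'h').
// Words missing from the lexicon are dropped too, since their length is unknown.
static NSArray<NSString *> *FilterLowPhonemeWords(NSArray<NSString *> *words,
                                                  NSDictionary<NSString *, NSArray<NSString *> *> *lexicon,
                                                  NSSet<NSString *> *droppablePhonemes,
                                                  NSUInteger minPhonemes) {
    NSMutableArray<NSString *> *kept = [NSMutableArray array];
    for (NSString *word in words) {
        NSArray<NSString *> *phonemes = lexicon[word];
        if (phonemes == nil) {
            continue;
        }
        // Count only the phonemes that survive in every accent we care about.
        NSUInteger robustCount = 0;
        for (NSString *phoneme in phonemes) {
            if (![droppablePhonemes containsObject:phoneme]) {
                robustCount++;
            }
        }
        if (robustCount >= minPhonemes) {
            [kept addObject:word];
        }
    }
    return kept;
}
```

With a lexicon where AIR is `EH R` and HERE is `HH IY R`, a threshold of 3 with `HH` droppable removes both, while THERE (`DH EH R`) survives.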

Right now I’m using a grammar constructed the way you advised at the bottom of this question: https://www.politepix.com/forums/topic/ruleorama-will-not-work-with-rapidears/

I construct one grammar per sentence, with an array containing the phrase expanding word by word.
Then, at a natural punctuation point or at the end of the sentence, I switch to the pre-created grammar for the next sentence.

This works really nicely, until it doesn’t. The UI has to display the paragraph (or group of sentences) on screen for the user to read, so my two failure points are as follows:
- When somebody speaks so fast that they are a word or two into the next sentence before the language model switch is complete. At this point the new grammar is useless, as you cannot (I think) start a language model with a hypothesis that you set yourself. I would love it if you could.
- When somebody speaks so slowly that RapidEars detects a period of silence and then restarts the hypothesis midway through the expanding string array.

I have a degree of control over, and understanding of, the context in which I receive the hypothesis, so I propose to get around this by switching to a statistical model when in either failure state, and then switching back to a grammar at a natural boundary (the end of the sentence or another punctuation point).

The statistical model should be small enough to give good-enough results until I switch back to a grammar.
It should feel okay to the user, and not as though I’ve missed anything, I hope.
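
One way to keep the fallback statistical model small is to build its vocabulary only from the words around the reader’s current position, up to the next natural boundary. A hypothetical sketch: the helper only produces the word list, generating the actual language model from it is left out, and the `lookbehind` parameter is an assumption meant to tolerate a reader who repeats a few words. Positions are word indices into the full text, with `boundary` being the index just past the last word of the sentence.

```objc
#import <Foundation/Foundation.h>

// Distinct words from a little before `position` up to (but not including)
// `boundary`, in text order, to serve as the vocabulary of a small fallback
// statistical model. `lookbehind` tolerates a reader who repeats a few words.
static NSArray<NSString *> *FallbackVocabulary(NSArray<NSString *> *textWords,
                                               NSUInteger position,
                                               NSUInteger boundary,
                                               NSUInteger lookbehind) {
    NSUInteger end = MIN(boundary, textWords.count);
    NSUInteger start = position > lookbehind ? position - lookbehind : 0;
    NSMutableOrderedSet<NSString *> *vocabulary = [NSMutableOrderedSet orderedSet];
    for (NSUInteger i = start; i < end; i++) {
        [vocabulary addObject:textWords[i]];
    }
    // Empty when the reader is already past the boundary.
    return vocabulary.array;
}
```

For the fast-reader case, where someone is already into the next sentence, the boundary passed in would be the one at the end of that next sentence.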

A language model switch has to occur after some kind of final hypothesis return since it’s like a micro-restart

Thanks! That’s what I thought. What I’ll do, then, is play around with the `secondsOfSilenceToDetect` property until it feels about right, and use the `pocketsphinxDidDetectFinishedSpeech` callback to figure out whether I’m in an ‘error state’ or not.
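
The error-state check can be kept independent of the framework callback, so that the `pocketsphinxDidDetectFinishedSpeech` handler only needs to compute the current word position and ask which model to load next. A sketch, assuming the app records the word index just past each sentence end or punctuation point in an `NSIndexSet`; the enum and function names are hypothetical, and the actual switch would still go through the `changeLanguageModelToFile` call from the thread title. This only covers the slow-reader case, since a grammar hypothesis cannot run past the end of its sentence.

```objc
#import <Foundation/Foundation.h>

typedef NS_ENUM(NSInteger, RecognitionMode) {
    RecognitionModeGrammar,     // the pre-built grammar for the next sentence
    RecognitionModeStatistical  // the small statistical fallback model
};

// Decides which model to load when speech finishes (e.g. from the app's
// pocketsphinxDidDetectFinishedSpeech handler). `position` is the index just
// past the last recognised word; `boundaries` holds the same kind of index for
// each sentence end or punctuation point.
static RecognitionMode ModeAfterFinishedSpeech(NSUInteger position, NSIndexSet *boundaries) {
    // Pausing exactly at a boundary is the happy path: move on to the next grammar.
    // Pausing anywhere else would restart the expanding grammar mid-phrase,
    // so treat it as an error state and fall back to the statistical model.
    return [boundaries containsIndex:position] ? RecognitionModeGrammar : RecognitionModeStatistical;
}

// The first boundary at or after `position`: where the grammar can take over again.
// Returns NSNotFound when the reader is past the last boundary.
static NSUInteger NextResumePoint(NSUInteger position, NSIndexSet *boundaries) {
    return [boundaries indexGreaterThanOrEqualToIndex:position];
}
```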

If you’re still reading this, Halle, thank you! I got my office to buy licenses for RuleORama and RapidEars today.

Cheers
Liam