Forum Replies Created
@darasan – this looks like you’re using Unity – did you find this problematic at all?
I do this and it works pretty well. I do find that switching the language model isn’t free, and the time it takes isn’t entirely predictable: roughly 0.7s or so. That can be tricky for a grammar, because the grammar can miss what the person is saying during the switch.
I kinda fixed it through UI but if there’s a way of making sure it’s seamless, or processing speech that was uttered while the language model was switching then that would be awesome!
Nice one Halle,
Looking forward to the update…
Liam, January 20, 2016 at 6:11 pm, in reply to: Rule-O-Rama with Microphone, Rejecto at Arms Length? #1027738
I’m using RuleORama with RapidEars, sorry for using the wrong term. I switch to a probabilistic model when not using a microphone, but I am always using RapidEars.
I haven’t tried playing with the VAD threshold yet. I’ve seen it here in the forums, but I assumed this value was controlled by OpenEars and changed based on your environment.
Would it be sensible to change the VAD threshold for arm’s-length mode? If so, can you explain a little bit about how this works?
In a quiet environment at arm’s length, for instance, would it be better to have a lower VAD threshold?
Thanks again for your time. If I can get this prototype working I will get my office to buy some email support. I know I ask a lot of questions!
It actually seems OK in practice; my assumptions about blocking were due to the comments in the sample app. As long as I perform the volume check at a suitable point in the render loop it’s mostly fine. I’ve only tested with SceneKit / SpriteKit (Metal-backed), but I can keep the frame rate pretty consistent.
December 1, 2015 at 11:05 pm, in reply to: When is the best time to call "changeLanguageModelToFile"? #1027456
Is the bigger issue the memory overhead or the time to return a hypothesis?
The bigger issue was the time to return a hypothesis.
I figured I could get around the memory issue, as it was only a problem at the time the grammar was created: I could create the grammars on the simulator and then ship them pre-created on the device. Once the grammar was created, the memory consumed went right back down from 300MB+ to ~30MB.
But speed is a huge part of the experience. In my app the voice is the *only* UI, but I am providing visual feedback all the time, so even a slightly sluggish response is undesirable.
Have you tried this using a statistical model rather than a grammar?
Using a statistical model is certainly faster – but it comes with its own drawbacks: a lot of false positives on low-phoneme words, and utterances with internal rhymes or lots of alliteration can all trip it up.
E.g. Air, Here, There, Her are all pretty close together in the Fresh Prince rap, and in the end I had to filter out very low-phoneme words: not just ‘a’, ‘the’ etc. but also words that become low-phoneme when said in a range of accents (dropped ‘h’, ‘t’ etc.).
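The filtering I describe can be sketched roughly like this. It is only an illustration in Python rather than the app's own code: the `PHONEMES` table, the `DROPPABLE` set and the `min_phonemes` cutoff are all hypothetical stand-ins, and a real app would read pronunciations from its own dictionary.

```python
# Hypothetical pronunciation table (ARPAbet-style); illustrative only.
PHONEMES = {
    "A": ["AH"],
    "THE": ["DH", "AH"],
    "AIR": ["EH", "R"],
    "HERE": ["HH", "IY", "R"],
    "THERE": ["DH", "EH", "R"],
    "HER": ["HH", "ER"],
    "FRESH": ["F", "R", "EH", "SH"],
    "PRINCE": ["P", "R", "IH", "N", "S"],
}

# Sounds that some accents drop; this selection is an assumption.
DROPPABLE = {"HH", "T"}


def effective_phoneme_count(word):
    """Count the phonemes left once accent-droppable sounds are removed.

    Returns None for words missing from the table.
    """
    phones = PHONEMES.get(word.upper())
    if phones is None:
        return None
    return sum(1 for p in phones if p not in DROPPABLE)


def filter_hypothesis(hypothesis, min_phonemes=3):
    """Drop words whose effective phoneme count is below min_phonemes.

    Words that are not in the table are kept, since nothing is known
    about them.
    """
    kept = []
    for word in hypothesis.split():
        count = effective_phoneme_count(word)
        if count is None or count >= min_phonemes:
            kept.append(word)
    return kept
```

With this table, ‘HERE’ counts as only two phonemes once the ‘h’ is dropped, so it gets filtered out along with ‘AIR’ and ‘HER’, while ‘THERE’ survives.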
A grammar constructed the way you advised at the bottom of this question: https://www.politepix.com/forums/topic/ruleorama-will-not-work-with-rapidears/
I construct one grammar per sentence, with an array containing the phrase, expanding sequentially.
Then at a natural punctuation point or the end of the sentence I switch to the next pre-created grammar for the next sentence.
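A minimal sketch of what I mean, assuming that "expanding sequentially" means one entry per growing prefix of the sentence. The function names are hypothetical, and this only builds the string arrays; turning them into an actual grammar is left to the library.

```python
def expanding_phrases(sentence):
    """Return every prefix of the sentence, one word longer each time."""
    words = sentence.split()
    return [" ".join(words[:i]) for i in range(1, len(words) + 1)]


def phrases_per_sentence(sentences):
    """Build one expanding phrase array per sentence, in reading order."""
    return [expanding_phrases(s) for s in sentences]


def sentence_is_complete(hypothesis, phrases):
    """True when the hypothesis matches the full (last) phrase."""
    return bool(phrases) and hypothesis.strip() == phrases[-1]
```

For example, `expanding_phrases("NOW THIS IS")` gives `["NOW", "NOW THIS", "NOW THIS IS"]`, and `sentence_is_complete` is the point at which I would move on to the next pre-created grammar.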
This works really nicely, until it doesn’t. It’s necessary for the UI to display the paragraph, or group of sentences on screen for the user to read, and so my two failure points are as follows:
- When somebody speaks so fast that they are a word or two into the next sentence before the language model switch is complete. At this point the new grammar is useless, as you cannot (I think) start a language model with a hypothesis that you set yourself. I would love it if you could.
- When somebody speaks so slowly that RapidEars detects a period of silence and then restarts the hypothesis midway through the expanding string array.
I have a degree of control over, and understanding of, the context in which I receive the hypothesis. So my proposed workaround is to switch to a statistical model when in one of the fail states, and then at a natural boundary (end of the sentence or other punctuation point) switch back to using a grammar.
The statistical model should be small enough that I can get good enough results from it, before switching back to a grammar.
It should feel okay to the user, and not like I’ve missed anything, I hope.
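Roughly, the switching logic I have in mind looks like the sketch below. The `Mode` and `ModeSelector` names and the event strings are made up for illustration; the real app would call the recognizer's language model switch wherever the mode changes.

```python
from enum import Enum


class Mode(Enum):
    GRAMMAR = "grammar"
    STATISTICAL = "statistical"


# Hypothetical event names for the two fail states and the recovery point.
RAN_AHEAD = "ran_ahead"                    # speaker is already into the next sentence
RESTARTED_MID_SENTENCE = "restarted_mid"   # silence restarted the hypothesis
BOUNDARY = "boundary"                      # end of sentence or punctuation point

FAIL_EVENTS = {RAN_AHEAD, RESTARTED_MID_SENTENCE}


class ModeSelector:
    """Tracks which kind of language model should currently be active."""

    def __init__(self):
        self.mode = Mode.GRAMMAR

    def on_event(self, event):
        """Update the mode for an event and return the mode to use next."""
        if self.mode is Mode.GRAMMAR and event in FAIL_EVENTS:
            # Fall back to the small statistical model while in a fail state.
            self.mode = Mode.STATISTICAL
        elif self.mode is Mode.STATISTICAL and event == BOUNDARY:
            # Back to the next pre-created grammar at a natural boundary.
            self.mode = Mode.GRAMMAR
        return self.mode
```

A boundary while already in grammar mode leaves the mode unchanged; that case is just the normal switch to the next sentence's grammar.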
A language model switch has to occur after some kind of final hypothesis return since it’s like a micro-restart
Thanks! That’s what I thought. What I will do then is play around with the secondsOfSilenceToDetect property until it feels about right to me – and use the ‘pocketsphinxDidDetectFinishedSpeech’ callback to figure out if I’m in an ‘error state’ or not.
If you’re still reading this, Halle, thank you – I got my office to buy licenses today for RuleORama and RapidEars.
Liam, December 1, 2015 at 12:51 pm, in reply to: When is the best time to call "changeLanguageModelToFile"? #1027449
This is with RuleORama
Of course, I’m still just playing around with this stuff. It would be nice if the API handled the threading for you but it’s not a big deal at all to do it myself.
I think that hacking Pocketsphinx to piggyback on its sample data *might* make it possible to detect changes in tone / voice – and this might be a cool thing to have.
I meant piggybacking on Sphinxbase to get data for FFTs. I was just curious, but I’ll give it a try in any case.
Understood, thanks Halle.
I think that if I check the volume each time a word is detected (regardless of a match) and use a high-pass filter, I can detect fluctuations in volume on a per-word basis.
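As a rough sketch of the idea, assuming I already have one volume reading per detected word (the dB values, `alpha` and `threshold` below are made-up numbers, and the function names are hypothetical). It uses a simple first-order high-pass filter, so slow drifts in overall loudness are suppressed and sudden jumps stand out.

```python
def high_pass(levels, alpha=0.8):
    """First-order high-pass filter over a sequence of per-word levels.

    y[0] = 0 and y[i] = alpha * (y[i-1] + x[i] - x[i-1]), so a steady
    level decays towards zero and only changes come through.
    """
    if not levels:
        return []
    out = [0.0]
    for i in range(1, len(levels)):
        out.append(alpha * (out[-1] + levels[i] - levels[i - 1]))
    return out


def emphasized_words(words, levels, alpha=0.8, threshold=6.0):
    """Return the words whose filtered level rises above the threshold."""
    filtered = high_pass(levels, alpha)
    return [w for w, y in zip(words, filtered) if y > threshold]
```

Because the first output is pinned to zero, this cannot flag the very first word of an utterance; a real version would need a baseline level from somewhere else for that.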
@Halle – in the example it’s running from an NSTimer – doesn’t this mean it’s running on the main runloop?
If not – does running this check on a background thread mean I might not have a very accurate impression of how loudly a particular word is being said?
I was hoping to be able to use a high-pass filter to figure out if someone was speaking loudly on a certain word.
Thanks for making this library, I’m really enjoying playing with it.