Reply To: [Resolved] Noise problem even setting setVadThreshold openEars 2.0


Halle Winkler


The maximum value for vadThreshold is 4.0.

I would expect 2.0-3.0 to work fine for most applications as long as the secondsOfSilenceToDetect has a realistic relationship with user pauses.
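Since out-of-range values are a common source of confusion, here is a minimal sketch of validating a vadThreshold before handing it to OpenEars. The 4.0 maximum and the 2.0–3.0 working range come from this post; the lower bound of 0.0 and the helper function itself are illustrative assumptions, not part of the OpenEars API.

```python
# Illustrative helper (not an OpenEars API): clamp a requested
# vadThreshold into the supported range before configuring the engine.
# The 4.0 maximum is from the post; the 0.0 floor is an assumption.
def clamped_vad_threshold(requested, minimum=0.0, maximum=4.0):
    """Return the requested threshold clamped into [minimum, maximum]."""
    return max(minimum, min(maximum, requested))

print(clamped_vad_threshold(5.0))  # 4.0 (over the maximum)
print(clamped_vad_threshold(2.5))  # 2.5 (in the recommended 2.0-3.0 band)
```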

Regarding secondsOfSilenceToDetect, the smallest value I would expect to begin to give a good user experience is .3, since that approaches the average duration of a speaker's pause between utterances. In practice I would expect it to need to be a bit larger, at .4 or perhaps .5, since the intention isn't to catch only the speech of speakers whose pauses are shorter than average.

If you are ever trying to recognize more than one word at a time, this value should be distinctly longer than a normal inter-word pause. Setting it to the same length or shorter means you are ensuring that the user's speech will be interrupted and recognized while they are still in the process of active app-directed speech, and it also means their subsequent speech will not be heard by the engine while it is performing recognition on the partial utterance it interrupted, which is a big downside risk for a speech UX. That's what the default value of .7 is meant to help with. A value like .5 is probably still reasonable here to split the difference between good UX in terms of getting all the speech and good UX in terms of reactivity.
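The guidance above can be condensed into a small picker. The specific values (the .3 floor, the .5 compromise, the .7 default) come from this post; the function itself is an illustrative sketch, not part of the OpenEars API.

```python
# Illustrative sketch: choose a secondsOfSilenceToDetect value from
# the advice in the post. Not an OpenEars API.
def recommended_silence_seconds(multi_word, prefer_reactivity=False):
    """Suggest a pause length based on the kind of utterances expected."""
    if not multi_word:
        # Single-word commands: a bit above the .3 floor is safer.
        return 0.4
    # Multi-word utterances: the .7 default avoids mid-utterance cutoffs;
    # .5 splits the difference when reactivity matters more.
    return 0.5 if prefer_reactivity else 0.7

print(recommended_silence_seconds(multi_word=True))   # 0.7
print(recommended_silence_seconds(multi_word=True, prefer_reactivity=True))  # 0.5
```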

I don’t think there is a big gain in setting it to a tiny value purely for the sense of reactiveness, because that latency is only a small percentage of the overall interaction time. That time includes the duration of the user's speech, their actual intended period of silence, and the time the engine takes to process the entire finalized utterance, which is likely to be a minimum of 3 seconds overall (assuming that everything else that happens in your UI as a result of the speech is instantaneous). That makes a question of tenths of a second one way or another not the biggest part of the puzzle in terms of UX.
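A quick back-of-envelope calculation makes the point above concrete. The ~3-second overall interaction is from the post; the specific speech and processing durations used here are assumptions chosen so the totals land near that figure.

```python
# Back-of-envelope: the silence-detection window is a small share of
# the total interaction time. Speech and processing durations are
# assumed values for illustration.
speech = 1.5       # assumed seconds of user speech
processing = 1.2   # assumed engine processing time
for silence in (0.3, 0.7):
    total = speech + silence + processing
    share = silence / total
    print(f"{silence}s window -> {share:.0%} of a {total:.1f}s interaction")
```

Even the default .7 window is roughly a fifth of the total interaction, and a .3 window only a tenth, so shaving tenths of a second buys little perceived reactivity.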

In some cases I think the idea behind setting it very low is to approximate a RapidEars-like lack of latency, but RapidEars returns the user's speech continuously while it is still in progress. That is a very different UX from analyzing complete utterances after they have finished, regardless of the secondsOfSilenceToDetect value.

In the absence of a real-time approach like that in RapidEars, I think it’s likely to make speech UI users happier when secondsOfSilenceToDetect corresponds to their intention to denote that their app-directed speech is finished, versus other events such as an inter-word pause or hesitation.