Forum Replies Created
Thanks, I had a chance to try it now and it works as expected.
The API is trivial, so no questions there… I don’t know if I am experiencing a bug, because I could not find a description of what I am supposed to get in the wav file (thus my original question).
I am trying to understand how I can interpret what is captured via the plugin and how it relates to the events fired via Pocketsphinx. I think making things a bit more transparent on that front would help. I am not asking about the details of implementation; I am happy to pay for the plugin — which I did — and use it, but it would be nice to know what I’m getting in those wav files.
To be more specific: when I look at the time between pocketsphinxDidDetectSpeech and pocketsphinxDidDetectFinishedSpeech, taking into account the secondsOfSilenceToDetect value (0.4 s), I can’t quite understand how a signal of about 250 ms (determined by inspecting the wav file: I see silence, a very short word, then silence again, and the word portion is 250 ms) that triggered VAD sometimes ends up reporting 400–450 ms between those two events. And when I look at the corresponding wav file saved via the SaveThatWav plugin, I get something longer, with some leading and trailing silence (the trailing silence sometimes seems to correspond to secondsOfSilenceToDetect, but not always)…
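In case it is useful context for what I am measuring: here is a minimal sketch of the duration check I have in mind, done offline in Python against the wav files SaveThatWav produces. It assumes 16-bit mono PCM output; the frame size, energy threshold, and minimum-duration cutoff are all hypothetical tuning values, not anything from the plugin itself:

```python
import wave
import array

def speech_span_ms(path, frame_ms=10, energy_threshold=500):
    """Estimate the duration of the non-silent portion of a wav file.

    Assumes 16-bit mono PCM. frame_ms and energy_threshold are
    hypothetical tuning values; adjust against your own recordings.
    """
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        samples = array.array("h", w.readframes(w.getnframes()))
    frame_len = max(1, rate * frame_ms // 1000)
    voiced = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        # mean absolute amplitude as a crude per-frame energy measure
        energy = sum(abs(s) for s in frame) / len(frame)
        voiced.append(energy > energy_threshold)
    if not any(voiced):
        return 0.0
    first = voiced.index(True)
    last = len(voiced) - 1 - voiced[::-1].index(True)
    return (last - first + 1) * frame_ms

MIN_WORD_MS = 150  # hypothetical cutoff for "too short to be a real word"

def looks_like_real_speech(path):
    """Gate for rejecting hypotheses whose audio is implausibly short."""
    return speech_span_ms(path) >= MIN_WORD_MS
```

The idea would be to run a check like this on the saved wav when a hypothesis comes in, and discard hypotheses whose voiced span is shorter than any word in the grammar could plausibly be.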
And why am I doing this: because the VAD in Pocketsphinx doesn’t do a very good job with mouth noise, clicks, etc., and sometimes (even when using Rejecto) still ends up mapping those onto something from the grammar… So I was hoping I could filter out some of those false positives by looking at the relevant durations. Makes sense?

December 9, 2014 at 6:48 pm in reply to: [Resolved] Clarifications on the improved background noise cancellation feature #1023328
To add to this thread: in my (somewhat limited) testing with both my app and the sample app, a threshold between 2.5 and 3 works well. The default value of 1.5 seems to be too low.
(My testing was with Apple’s headset; with the default level, ‘speech’ gets detected even with minor noise that’s far from the mic.)
The main issue with lower values is end-pointing, which can affect the flow of the application. In other words, even if there is a false speech-detection trigger (e.g. noise, background speech, etc.), the decoder will typically handle it without any problems. But the noise levels that triggered VAD and started recognition will also prevent recognition from ending after the user has said what they were supposed to say.
This may not be an issue with RapidEars, where decoding is done in real time (so the decoder is effectively doing the VAD).
By the same token, is it possible to use JSGF grammars with Rejecto? (Although I could pre-process these and create a corresponding network/LM, so it’s really the first question that I’m most interested in.)