The API is trivial, so no questions there… I don’t know if I am experiencing a bug, because I could not find a description of what I am supposed to get in the wav file (thus my original question).
I am trying to understand how to interpret what is captured via the plugin and how it relates to the events fired by Pocketsphinx. I think making things a bit more transparent on that front would help. I am not asking about implementation details; I am happy to pay for the plugin (which I did) and use it, but it would be nice to know what I'm getting in those wav files.
To be more specific: when I look at the time between pocketsphinxDidDetectSpeech and pocketsphinxDidDetectFinishedSpeech, and take the secondsOfSilenceToDetect value (0.4 s) into account, I can't quite understand how a signal of roughly 250 ms (measured by inspecting the wav file: I see silence, a very short word, then silence again, and the word itself spans about 250 ms) that triggered VAD sometimes ends up reporting 400-450 ms between those two events. And when I look at the corresponding wav file saved via the SaveThatWav plugin, I get something longer, with some leading and trailing silence (the trailing silence sometimes seems to correspond to secondsOfSilenceToDetect, but not always)…
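For reference, this is roughly how I'm measuring those durations from the saved files. It's a crude Python sketch, not part of the plugin: it assumes the SaveThatWav output is 16-bit mono PCM (which is what I appear to be getting, but I haven't seen that documented), and it segments by a simple peak-amplitude threshold, which is only a stand-in for whatever VAD Pocketsphinx actually runs:

```python
import struct
import wave

def voiced_span(path, frame_ms=10, threshold=500):
    """Return (lead_silence_s, voiced_s, trail_silence_s) for a 16-bit mono wav.

    Crude energy-based segmentation: a frame counts as voiced when its peak
    amplitude exceeds `threshold`. This is NOT Pocketsphinx's VAD, just a
    rough way to compare the file's contents against the callback timing.
    """
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        n = w.getnframes()
        samples = struct.unpack("<%dh" % n, w.readframes(n))
    frame_len = max(1, rate * frame_ms // 1000)
    # One boolean per frame: did this frame exceed the amplitude threshold?
    flags = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        flags.append(max(abs(s) for s in frame) > threshold)
    if not any(flags):
        return (n / rate, 0.0, 0.0)  # whole file is "silence"
    first = flags.index(True)
    last = len(flags) - 1 - flags[::-1].index(True)
    sec = frame_len / rate
    return (first * sec, (last - first + 1) * sec, (len(flags) - 1 - last) * sec)
```

With this I can see, for a given saved file, how much of it is leading/trailing silence versus the short word in the middle, and compare that against the time between the two callbacks.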
And why am I doing this: because the VAD in Pocketsphinx doesn't handle mouth noise, clicks, etc. very well, and sometimes (even when using Rejecto) still ends up mapping those onto something from the grammar… So I was hoping I could filter out some of those false positives by looking at the relevant durations. Does that make sense?
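To illustrate the filtering idea, here is the kind of check I have in mind, as a language-neutral Python sketch. The timestamps would be taken in the app when the two callbacks fire; the 0.15 s threshold is a made-up starting point I'd tune against real recordings, and subtracting secondsOfSilenceToDetect assumes the finished-speech event fires right after that silence window elapses, which is exactly the behavior I'm unsure about:

```python
def estimated_utterance_s(t_speech, t_finished, seconds_of_silence=0.4):
    """Rough voiced-duration estimate from the timestamps at which
    pocketsphinxDidDetectSpeech and pocketsphinxDidDetectFinishedSpeech fired.

    Assumption (unverified): the finished-speech event fires only after
    secondsOfSilenceToDetect of trailing silence, so subtract it back out.
    """
    return max(0.0, (t_finished - t_speech) - seconds_of_silence)

def reject_as_noise(t_speech, t_finished,
                    min_voiced_s=0.15, seconds_of_silence=0.4):
    """Heuristic: treat very short estimated utterances as clicks/mouth noise.

    `min_voiced_s` is a hypothetical threshold; in practice it should be
    tuned to just below the shortest real word in the grammar.
    """
    return estimated_utterance_s(t_speech, t_finished, seconds_of_silence) < min_voiced_s
```

So a hypothesis whose estimated voiced duration comes out shorter than any word in my grammar would simply be discarded. But for that to work reliably I need to know what the callback timing and the saved wav actually represent, hence the question.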