SaveThatWav and VAD

This topic has 3 replies, 2 voices, and was last updated 9 years, 2 months ago by Halle Winkler.

February 10, 2015 at 5:24 am #1024796

OT
Participant

Hi Halle,

Could you please shed a bit more light on how the output of SaveThatWave relates to the pocketsphinxDidDetectSpeech and pocketsphinxDidDetectFinishedSpeech events and to the secondsOfSilenceToDetect attribute?

More specifically:
– Can I assume there is always (approximately) secondsOfSilenceToDetect seconds of silence at the end of the wav file? (Silence as determined by the Pocketsphinx VAD.)

– What does the pocketsphinxDidDetectFinishedSpeech event correspond to in the wav file? Can I (roughly) assume that the event was fired secondsOfSilenceToDetect seconds before the end of the wav file?

– What is the padding at the beginning of the wav file? (The wavs always seem much longer than the time between the detected start and end of speech.)

– Given the answers to the above questions, what is fed to the decoder?

If there is some extra “padding” at the beginning, how does that affect the situation where multiple wav files are recorded almost immediately one after the other?

February 10, 2015 at 8:23 am #1024798

Halle Winkler
Politepix

Hello,

Are you experiencing a bug or do you have a question about how to use SaveThatWave’s API in an app? I can’t help with questions about how the plugins are implemented, sorry.

February 10, 2015 at 8:55 am #1024800

OT
Participant

Halle,

The API is trivial, so no questions there… I don’t know if I am experiencing a bug, because I could not find a description of what I am supposed to get in the wav file (thus my original question).

I am trying to understand how I can interpret what is captured via the plugin and how it relates to the events fired via Pocketsphinx. I think making things a bit more transparent on that front would help. I am not asking about the details of implementation; I am happy to pay for the plugin (which I did) and use it, but it would be nice to know what I’m getting in those wav files.

To be more specific: I look at the time between pocketsphinxDidDetectSpeech and pocketsphinxDidDetectFinishedSpeech and take the secondsOfSilenceToDetect value (0.4 s) into account. Sometimes a signal of 250 ms that triggered the VAD ends up being reported as 400-450 ms between those two events, and I can’t quite understand why. (The 250 ms comes from looking at the wav file: I see silence, a very short word of 250 ms, and then silence again.) The corresponding wav file saved via the SaveThatWave plugin is longer still, with some leading and trailing silence; the trailing silence sometimes seems to correspond to secondsOfSilenceToDetect, but not always.

And why am I doing this? Because the VAD in Pocketsphinx doesn’t do a very good job with mouth noise, clicks, etc., and sometimes (even when using Rejecto) still ends up mapping those to something from the grammar. So I was hoping to filter out some of those false positives by looking at the relevant durations. Does that make sense?
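
To make the duration idea concrete, here is a rough offline sketch in Python of the kind of check I have in mind. It is not part of OpenEars or SaveThatWave, and the energy threshold is only a stand-in for the Pocketsphinx VAD, not a reimplementation of it. The function names (measure_wav, looks_like_click), the 10 ms frame size, the -35 dB threshold and the 300 ms cutoff are all my own assumptions, and it assumes the saved file is 16-bit mono PCM.

```python
import array
import math
import sys
import wave


def measure_wav(path, frame_ms=10, threshold_db=-35.0):
    """Return (total_ms, active_ms, leading_ms, trailing_ms) for a WAV file.

    Assumes 16-bit mono PCM. "Active" means the span from the first to the
    last frame whose RMS level exceeds threshold_db (relative to full scale).
    This is a plain energy threshold, not the Pocketsphinx VAD.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    frame_len = max(1, rate * frame_ms // 1000)
    active = []
    # Only whole frames are analysed; a short tail at the end is ignored.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        level_db = 20 * math.log10(rms / 32768.0) if rms > 0 else -120.0
        active.append(level_db > threshold_db)

    total_ms = len(active) * frame_ms
    if True not in active:
        return total_ms, 0, total_ms, 0
    first = active.index(True)
    last = len(active) - 1 - active[::-1].index(True)
    active_ms = (last - first + 1) * frame_ms
    leading_ms = first * frame_ms
    trailing_ms = (len(active) - 1 - last) * frame_ms
    return total_ms, active_ms, leading_ms, trailing_ms


def looks_like_click(path, min_speech_ms=300):
    """Flag a recording whose active span is shorter than min_speech_ms."""
    _, active_ms, _, _ = measure_wav(path)
    return active_ms < min_speech_ms
```

With something like this I could compare the measured leading and trailing silence against secondsOfSilenceToDetect for each saved wav, and discard hypotheses whose active span is too short to be a real word. Whether that comparison is meaningful depends on what the saved wav actually contains, which is exactly what I am asking about.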

February 10, 2015 at 9:06 am #1024801

Halle Winkler
Politepix

I appreciate your purchase! I don’t think that SaveThatWave can be usefully put to task as a tool for analyzing the Sphinx project’s VAD.