December 23, 2011 at 11:13 am #8255
Hi. I would like to have a method called for essentially each word or phrase detected, or, say, every second.
Basically, my problem is that I have very long audio inputs that wind up being watered down to a short sentence near the end of the input. If I could force the program to report its hypothesis every word or every couple of seconds, I feel it would be more accurate (plus, it could then show the input “live” on the screen as it is spoken). Is there any way to do this? I have tried changing the “kSecondsOfSilenceToDetect” value to something much lower and it doesn’t seem to do anything. Any help would be appreciated. Thanks!

December 23, 2011 at 12:59 pm #8258
Sorry, this isn’t currently a feature of OpenEars. Generally, for long dictation tasks you are probably going to see the best results with a server-based service. Even with live recognition, the final recognition result would probably be the same, since it would be resolving the overall utterance using the same engine, acoustic model and language model. Based on my experimentation, the advantage for long dictation tasks would be better UI feedback (though doing speech-based correction of a single error in a long dictation is not fun for the user at all), not notably different recognition results. The advantages of this kind of recognition are much bigger for shorter command-and-control models, because reaction time will be faster for whatever you are controlling, in addition to the UI advantage.

December 23, 2011 at 9:55 pm #8259
Well, you see, I am attempting to have the audio processed as it comes in, because I need a reaction/action taken while the rest of the audio is still being input. For example, with a long sentence that has a key word at the beginning, nothing happens until the sentence is fully finished; I need something done within a reasonable time after the word is said. I just want to make sure you understood me. Is this at all possible with this framework?
Thanks again!

December 24, 2011 at 12:09 am #8260
OK, I don’t think I quite understand yet why recognition is needed both for the full sentence and for just a keyword while the sentence is underway. Can you give an example? The framework performs recognition after the user has paused in speech for the duration set in kSecondsOfSilenceToDetect. That’s the only trigger for beginning recognition that its API supports.

December 24, 2011 at 7:38 am #8261
A good example would be, I guess, reading a book. If you were to read a book out loud, I would want each word shown on the screen as it is read (or soon after); that way I could detect when a key word is read and perhaps play a sound effect or something. But I still want every word processed, for dynamic reasons. Like I said, I tried lowering kSecondsOfSilenceToDetect to as little as 0.1 and it still seems to wait until a paragraph is entirely read before processing, due to some background noise (which can’t be avoided, and which should also be scanned for key words). I know it’s hard to understand. Sorry if my explanation is poor. Thanks for helping!

December 24, 2011 at 10:33 am #8262
OK, that makes sense. This just isn’t a feature of OpenEars.

December 24, 2011 at 10:37 am #8263
Is there any way to modify the code to process the input after a set number of seconds instead of after a specified period of silence? Thanks again!

December 24, 2011 at 11:06 am #8265
Only insofar as any code can be modified to do new things, but what you are talking about isn’t a quick drop-in of a few code snippets; it’s a nontrivial restructuring across more than one class, using more than one language. Search for “partial hypothesis” on the CMU Sphinx forum to get some good ideas about where to start. You might find that it simplifies things to use the old 0.902 version of OpenEars for experiments with this behavior, so that you don’t also need to worry about testing the ring buffer in the audio driver.
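For reference, the silence-based trigger discussed in this thread can be sketched like this. This is an illustrative model, not OpenEars source; the frame rate and energy threshold are assumed values:

```c
/* A minimal sketch (not OpenEars source) of how a silence-duration
 * trigger like kSecondsOfSilenceToDetect typically works: decoding
 * starts only once an unbroken run of low-energy frames spans the
 * configured number of seconds. */

#define FRAME_RATE 100          /* analysis frames per second (assumed) */
#define ENERGY_THRESHOLD 0.05f  /* below this, a frame counts as silence (assumed) */

/* Returns the index of the frame at which recognition would be
 * triggered, or -1 if the silence requirement is never met. */
int silence_trigger_frame(const float *energies, int n_frames,
                          float seconds_of_silence)
{
    int needed = (int)(seconds_of_silence * FRAME_RATE);
    int run = 0;

    for (int i = 0; i < n_frames; i++) {
        if (energies[i] < ENERGY_THRESHOLD)
            run++;              /* extend the current silence run */
        else
            run = 0;            /* any loud frame resets the run */
        if (run >= needed)
            return i;           /* enough continuous silence: decode now */
    }
    return -1;                  /* utterance still considered in progress */
}
```

This also suggests why lowering kSecondsOfSilenceToDetect alone may not help in the situation described above: if background noise keeps frame energies above the threshold, the silence run keeps resetting and the trigger never fires, no matter how short the required duration is.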
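To make the “partial hypothesis” direction mentioned in the last answer concrete, here is a hedged sketch of the kind of restructuring involved. Every name in it (POLL_INTERVAL, stub_decoder, the functions) is hypothetical; it only illustrates the idea of polling for partial results on a fixed schedule instead of waiting for end-of-utterance silence:

```c
#include <string.h>

/* A hypothetical restructuring sketch (not OpenEars or CMU Sphinx
 * code): instead of decoding only after end-of-utterance silence,
 * poll for a partial hypothesis every fixed number of frames and
 * report it through a callback, so the UI can update live and key
 * words can be reacted to while the user is still speaking. The
 * "decoder" here is a stub that accumulates words; a real port would
 * query the engine's partial-hypothesis support instead. */

#define POLL_INTERVAL 100  /* frames between partial-hypothesis polls (assumed) */

typedef struct {
    char hyp[256];         /* running hypothesis text */
} stub_decoder;

/* Stub stand-in for feeding one frame of audio to the decoder;
 * word_heard is NULL for frames that carry no new word. */
static void decoder_feed(stub_decoder *d, const char *word_heard)
{
    if (!word_heard)
        return;
    if (d->hyp[0])
        strncat(d->hyp, " ", sizeof(d->hyp) - strlen(d->hyp) - 1);
    strncat(d->hyp, word_heard, sizeof(d->hyp) - strlen(d->hyp) - 1);
}

/* Feed n_frames of input, polling every POLL_INTERVAL frames.
 * Returns the frame index at which the keyword first appeared in a
 * partial hypothesis, or -1 if it never did. */
int run_with_partial_hyps(stub_decoder *d, const char **words, int n_frames,
                          const char *keyword,
                          void (*on_partial)(const char *hyp))
{
    int hit = -1;
    for (int i = 0; i < n_frames; i++) {
        decoder_feed(d, words[i]);
        if ((i + 1) % POLL_INTERVAL == 0) {
            on_partial(d->hyp);               /* live UI update */
            if (hit < 0 && strstr(d->hyp, keyword) != NULL)
                hit = i;                      /* react mid-utterance */
        }
    }
    return hit;
}
```

In OpenEars itself this would presumably touch both the Objective-C audio layer and the underlying C decoding loop, which matches the answer above about it being a nontrivial restructuring in more than one class and more than one language.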