Using SaveThatWav to capture an "audio only" utterance with no hypothesis?


Topic Resolution: Resolved


    Hello Halle,

    My app uses the ARPA models to drive the app's UI, but then hands off the more difficult recognition (dictation against a large scientific database of terms) to a cloud recognition service – Nuance. At the moment I am using their recognizer code, but this approach causes some problems:

    1. Nuance only works online, and I would like to be able to capture the audio bits for archival purposes and possibly to process later when we are online.
    2. Transitioning back and forth between OpenEars and Nuance is a royal pain in the butt. The Nuance recognizer hijacks all of the audio routes and AV profiles and then releases them automatically once a recognition is completed.
    3. The transition back and forth wrecks the user experience on a Bluetooth device, since the BT device thinks the app has released it – and then immediately grabs it again. Some devices even say things like “call ended” or make a “boop boop” noise when this happens. Not very friendly.

    I would like to use the OpenEars recognizer (with the SaveThatWave plugin) perhaps with a null dictionary to just capture the audio and not fire a hypothesis.

    Is this possible? And if so – what’s the recommended way to go about it?

    I have OpenEars, SaveThatWave and Rejecto all running well in my project – so I’m really just looking for a way to do this that won’t cause unexpected behavior later since this probably isn’t an expected use case for the SDK.

    Thanks in advance.


    Halle Winkler

    Hi Erick,

    I think it might be possible to handle the part about just capturing the speech without trying to get a correct hypothesis out of it – perhaps a Rejecto-only model would handle this requirement. The problem is that the audio session interference is likely to have a negative effect on the behavior of audio buffering and transfer into the engine. That is, if the Nuance SDK is preventing OpenEars from doing recognition due to the session changes, it will probably also affect whether it can save WAVs which have correct data in them. Have you given it a try? It might be worth some experimentation.


    I’m not trying to run both concurrently – I just want to be able to use SaveThatWave to capture the dictation I would like to submit to Nuance. I am willing to use their HTTPS API to submit the bits rather than suffer through their dreadful “SpeechKit” SDK. It just behaves so erratically. I would like to have the benefit of streaming to them since it replies faster. And I would love the convenience of their recognizer if it actually worked – but it’s full of bugs.

    The latest trick it pulls is to belatedly fire a ton of audio route changes after it claims to have finished with recognition. Of course this causes all sorts of problems with the speech synthesizer and with OpenEars.

    At least OpenEars is really, actually done when pocketsphinxDidStopListening fires…

    So anyway – are you saying that if I pass an empty dictionary to the rejecto language model generator, it would still generate a wave file reliably?

    Halle Winkler

    It might work – not an empty dictionary, but a dictionary that only has Rejecto syllables in it may give you the desired results. I’d recommend putting at least one word in there (like “Ah” or a similar syllable sound) to avoid any error checking against blank entries, then use Rejecto to generate the model and see what SaveThatWave picks up for you. You should probably make secondsOfSilenceToDetect a bit longer so SaveThatWave doesn’t attempt to separate the utterances by syllable.
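    To make the suggestion concrete, here is a rough Objective-C sketch of that setup. This is not from the thread – the Rejecto generator method is paraphrased from memory of the OpenEars 1.x-era plugin API, and the model name, silence value, and controller instances are placeholders, so check everything against the headers that ship with your installed versions:

    ```objc
    // Sketch only: a minimal Rejecto model plus a longer silence threshold,
    // so SaveThatWave captures whole utterances without producing usable hypotheses.

    LanguageModelGenerator *generator = [[LanguageModelGenerator alloc] init];

    // One throwaway syllable keeps the generator from rejecting an empty word list;
    // Rejecto fills out the rest of the model with its own rejection syllables.
    NSError *error = [generator generateRejectingLanguageModelFromArray:@[@"AH"]
                                                         withFilesNamed:@"CaptureOnlyModel"
                                                 forAcousticModelAtPath:[AcousticModel pathToModel:@"AcousticModelEnglish"]];

    // A longer silence threshold so a natural pause between words doesn't end the
    // utterance early and split the recording into many tiny per-syllable WAVs.
    pocketsphinxController.secondsOfSilenceToDetect = 1.5; // placeholder value, tune by ear

    // SaveThatWave then writes each completed utterance out as a WAV file.
    [saveThatWaveController start];
    ```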


    Can you point me to something that explains how “Rejecto syllables” work?

    Will I need to set up to receive null hypotheses?

    I’ll try the “Ah” dictionary and see what happens.

    For what it’s worth, I’ve got this kind of working with a fairly straightforward HTTP POST to Nuance.
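    For anyone following along, the “straightforward HTTP POST” approach can be sketched like this with Foundation’s NSURLSession. The endpoint URL and headers below are placeholders for illustration, not Nuance’s actual API:

    ```objc
    // Hypothetical sketch: POST a WAV that SaveThatWave produced to a cloud
    // recognizer over HTTPS. Replace the URL and headers with the real service's.

    NSData *wavData = [NSData dataWithContentsOfFile:pathToSavedWav];

    NSMutableURLRequest *request =
        [NSMutableURLRequest requestWithURL:[NSURL URLWithString:@"https://example.com/recognize"]];
    request.HTTPMethod = @"POST";
    [request setValue:@"audio/wav" forHTTPHeaderField:@"Content-Type"];
    request.HTTPBody = wavData;

    NSURLSessionDataTask *task =
        [[NSURLSession sharedSession] dataTaskWithRequest:request
                                        completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
            if (error) {
                // Offline or failed: keep the WAV on disk and retry later,
                // which also covers the archival requirement from post #1.
                return;
            }
            // Parse the recognizer's reply here.
        }];
    [task resume];
    ```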

    I don’t suppose there’s a way to tap into your recognizer to get the audio bits as they arrive? The performance would be a lot better if I could stream the bits to Nuance on the fly rather than sending the whole file at the end. (That’s what their recognizer did – which was the only nice thing about it.)

    So now all my Nuance recognizer gremlins are solved – but the reco is noticeably slower… :( I just can’t seem to hit a home run.

    Thanks again.

    Halle Winkler

    Can you point me to something that explains how “Rejecto syllables” work?

    No, sorry. My suggestion should be understood as just using Rejecto normally but with the smallest possible vocabulary. It isn’t necessary to have null hyps returned since you aren’t doing anything with returned hypotheses.

    I don’t suppose there’s a way to tap into your recognizer to get the audio bits as they arrive?

    Sorry, there’s no hook for using the buffers directly. However, secondsOfSilenceToDetect will affect how much silence/statement completion is listened for before an utterance is considered ‘done’ so that might be worth some experimentation.
