[Resolved] Clarifications on the improved background noise cancellation feature

    #1023310
    wfilleman
    Participant

    Hi Halle,

    Overall this update is a nice big step forward for voice recognition. I thought the 1.x series was good, this is just that much better. Great work.

    Something I'm now running into is the .vadThreshold setting. In my testing, if this is left at the default or even at 2.0, my 6 Plus constantly enters the speech-detection loop at the slightest noise. That's not really a problem, as I'm seemingly getting MUCH better recognition across the room in a quiet environment, which is going to be excellent for some of my customers.

    If I use the low end of 1.5-2.0 for vadThreshold, then when there is noise in the room the recognition engine gets pretty unreliable: it's flooded with the noise and has a really hard time detecting speech. If I bump this up to the max of 3.9 (it looks like the framework has an upper limit of 4.0), then I can have lots of noise in the room without OE entering its speech-detection loop until I say something directly into the mic. Again, this is pretty good behavior considering all the noise/music I've got playing while seeing it actually work.

    The problem comes in when the environment has variable noise levels: quiet at some parts of the day and noisy during others. (My customers use my app in a wall-mounted scenario.) Maybe I'm wrong on this, but I thought that the OE 1.x series did an auto-tune (calibration) for background noise and would make continual internal adjustments based on noise/recognition. Do I have that right? If so, is that something that can be turned on in 2.0, or is this behavior now up to us to implement by fine-tuning the .vadThreshold in real time?

    Thanks Halle!
    Wes

    #1023313
    Halle Winkler
    Politepix

    Hi Wes,

    Thank you for your kind words. It should still be reacting to the environment and updating itself – the vadThreshold isn't an absolute volume level but a signal-to-noise ratio setting, so it should continue to make sense as those values change. But you might not be able to use a value as aggressive as 3.9, since it will tend to reject real speech under normal circumstances.

    What is your experience under changing circumstances using a value like 3.0? Your feedback is appreciated on this, since even though I have a lot of different audio in my tests, there’s no substitute for real-world feedback and this is part of the new Pocketsphinx code.

    #1023315
    wfilleman
    Participant

    Thanks Halle,

    I was digging into the framework and saw that vadThreshold is, as you describe, a relative speech/silence threshold. That's good to know.

    I've set the vadThreshold to 3.0 and run a couple of tests with background music from the radio at different volume levels. Overall it seems to be a little better. Now that I know what I'm looking for, I can see that it is indeed adjusting to the various sound levels. When the music is louder than what I would call background level, it's really tough to get OE to process the speech, but then again, I'm asking a lot of the engine to throw out louder-than-background music and pull out my speech.

    You are right, it's a fine balance between upping the threshold and keeping it within speech-detection tolerance.

    I may offer my users an option to say whether they are installing this in a noisy room. If yes, I can set the vadThreshold to 3.0; if no, leave it at the default. What do you think?
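
    Concretely, I'm picturing something as simple as this at startup (a rough sketch – "NoisyRoomEnabled" is just a preference key I'd invent for my app, and I'm assuming vadThreshold is set on the shared OEPocketsphinxController before listening starts):

    // Rough sketch: "NoisyRoomEnabled" is my own hypothetical preference key.
    BOOL noisyRoom = [[NSUserDefaults standardUserDefaults] boolForKey:@"NoisyRoomEnabled"];
    if (noisyRoom) {
        [OEPocketsphinxController sharedInstance].vadThreshold = 3.0; // noisy-room value
    } // otherwise leave the framework default untouched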

    Wes

    #1023316
    wfilleman
    Participant

    I’m looking at this a little deeper and I *think* what I’m actually seeing is the OE framework adjusting to the different volume levels quite rapidly. For example, if I have a steady tone as background noise, OE pretty quickly sees this as noise and ignores it. I can then issue speech and it does pretty well.

    If I’m playing music with various beat levels, I see OE struggle a little bit trying to determine what to ignore as noise since it’s seeing the threshold cross all over the place with the beat of the music.

    I’m wondering if there’s a way to level this auto-adjustment out by increasing the number of frames OE considers for the “noise level” if that makes sense. For example, if OE only looks at a few frames, then the “noise” level would be rapidly changing from low to high and back. If OE looks at a larger group of frames as a moving average, then these intermediate spikes of noise could be leveled out and ignored.
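
    To sketch what I'm picturing (purely illustrative C – I have no idea how the actual Pocketsphinx internals are structured):

    // Purely illustrative, not the real Pocketsphinx code: a wider moving
    // average over per-frame energy, so single-frame spikes barely move
    // the noise estimate.
    #define NOISE_WINDOW_FRAMES 100 /* hypothetical window size */

    static float noise_window[NOISE_WINDOW_FRAMES];
    static int window_index = 0;

    float smoothed_noise_estimate(float frame_energy) {
        noise_window[window_index] = frame_energy;
        window_index = (window_index + 1) % NOISE_WINDOW_FRAMES;
        float sum = 0.0f;
        for (int i = 0; i < NOISE_WINDOW_FRAMES; i++) {
            sum += noise_window[i];
        }
        return sum / NOISE_WINDOW_FRAMES; // a one-frame music beat gets averaged out
    }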

    Just my guess, but I think that's what I'm actually seeing. Adjusting the vadThreshold works around this by forcing a larger discrepancy, but a single frame of louder noise (which is what I suspect is happening) still punches through the vadThreshold, since the low/high detection appears to operate on a small number of frames.

    There's no easy answer here, as what I'm suggesting would have its own tradeoffs, if I'm even close to the issue.

    Wes

    #1023319
    Halle Winkler
    Politepix

    I checked in with the CMU project and verified that this is correct (I'll probably post the response later if I get permission to quote) – recalibration is definitely happening as you've seen, and 3.0 is probably the highest value that is correct to use. It is designed to be adaptive to changing environments but expects stationary noise, i.e. no dramatic oscillations that need to be reacted to in very short timeframes (this was also the case that would get the old VAD stuck, so if there's no stuckness but recognition is sub-optimal, we have an improvement).

    It might be possible to change the VAD timeslice, although it's probably dangerous, or possibly pointless, to optimize in that area while it continues to be developed by the Sphinx project.

    If you feel like recompiling the framework, there are some config settings you can look at in OEPocketsphinxRunConfig.h related to VAD activity:

    // #define kVAD_PRESPEECH // "-vad_prespeech", int, default ARG_STRINGIFY(DEFAULT_PRESPCH_STATE_LEN), Num of speech frames to trigger vad from silence to speech.
    // #define kVAD_POSTSPEECH // "-vad_postspeech", int, default ARG_STRINGIFY(DEFAULT_POSTSPCH_STATE_LEN), Num of speech frames to trigger vad from speech to silence.

    Or if the issue is that recognition is getting stuck, you can also reduce this check for a stuck utterance in OEContinuousModel.m to something lower than 25:

    if(([NSDate timeIntervalSinceReferenceDate] - self.stuckUtterance) > 25.0)
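
    For example, to consider an utterance stuck after 10 seconds instead (10.0 is just an illustrative value):

    if(([NSDate timeIntervalSinceReferenceDate] - self.stuckUtterance) > 10.0)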

    Remember that the framework project has to be archived rather than just built, or it won't produce a universal framework and you'll get object errors on either a device or the simulator, depending on which architecture is missing.

    Question, does your app play back audio or does it just take in mic audio?

    #1023326
    wfilleman
    Participant

    Thanks Halle,

    Ok, that's good. That all makes sense with what I'm seeing. No problem with the framework rebuild – I already rebuilt it yesterday to add back in a custom feature I need in my app: the ability to disable the Bluetooth input option in OE via a BOOL variable on the PocketsphinxController.

    I’ll play around with these settings and post back with what I find. The 25 seconds needs to come down for my use cases. Thanks for pointing me in the right direction there.

    Yes, my app plays audio as well. One of the features is an IP Camera streaming option that can play video/audio from IP Cameras. While not in use 100% of the time, it’s possible a user could have voice recognition ON while watching their camera. I did look at this yesterday and it appeared to work like the 1.x framework. So, no concerns from me on that front.

    Wes

    #1023327
    Halle Winkler
    Politepix

    Great, I will be happy to hear about what you discover. This is a .0 version so as more info comes in from real-world usage there can be adjustments where it makes sense.

    #1023328
    OT
    Participant

    To add to this thread: in my (somewhat limited) testing with both my app and the sample app, a threshold between 2.5 and 3 works well. The default value of 1.5 seems to be too low.

    (My testing was with Apple's headset; with the default levels, 'speech' gets detected even with minor noise that's far from the mic.)

    The main issue with the lower values is the end-pointing, which can affect the flow of the application. In other words, even if there is a false speech-detection trigger (e.g. noise, background speech, etc.), the decoder will typically deal with it without any problems. But the noise levels that triggered the VAD and started recognition will also prevent it from ending after the user has said what they were supposed to say.

    This may not be an issue with RapidEars, where decoding is done in real time (so the decoder is effectively doing the VAD).

    #1023329
    Halle Winkler
    Politepix

    Yup, I’ve raised vadThreshold to 2.0 for the current version and we’ll see what the feedback on that is. Here is the CMU commentary on the VAD:

    Our VAD does track the noise level continuously; it updates the noise estimation every frame with a sliding average of about 5 seconds. [….] It tracks the noise level and signals speech when the signal in some frequency band is higher than threshold * noise.

    On the other hand, the VAD is designed to work with slowly changing colored noise (different levels in different bands). It is not supposed to deal with non-stationary noise. The recommended threshold is around the current value (2.0), or it could be 3.0 if you expect slightly more noise variation. Values over 3 are not very reasonable. The threshold value describes how the noise changes (within what bounds you consider a change to still be noise), not how the speech changes, so it should not be tuned to the speech.
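
    Paraphrasing that description in code form (my paraphrase only, not the actual Pocketsphinx source):

    // Per frame, per frequency band. noiseEstimate is itself a sliding
    // average over roughly the last 5 seconds of input.
    BOOL frameLooksLikeSpeech(float bandSignalLevel, float noiseEstimate, float vadThreshold) {
        return bandSignalLevel > (vadThreshold * noiseEstimate);
    }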

    #1023330
    wfilleman
    Participant

    Thanks Halle,

    Based on their response, how would you expect changing the pre- and post-speech values to impact overall speech detection? I'm not sure I'm following the link between the code-change suggestion and the CMU response. It sounds to me like their 5-second sliding average is fixed?

    Wes

    #1023331
    Halle Winkler
    Politepix

    I'd give it a try, since it will change how transient the speech/silence values need to be in order to trigger. It might not be useful, but it's the next thing I would personally investigate under the same circumstances.

    The reason I asked about audio is that I would also be curious to know whether switching to a voice-processing I/O audio unit would help you (this is something I'm going to investigate for 2.1 once I know everything has settled down from this big release). However, it will reduce the output levels of outgoing audio.

    #1023355
    Halle Winkler
    Politepix

    Oh, I think you might also be able to get the VPIO noise-suppression effect by using audioMode = @"VoiceChat", without having to do a recompile – not 100% sure. I'm curious whether combining hardware noise suppression and software noise suppression gives you the results you're looking for, and I think the new VAD probably won't object to receiving pre-noise-suppressed audio from the mic, unlike the old VAD.

    The audioMode settings are still essentially experimental from my perspective since they don’t really seem to offer what I’d call an API contract, but it would be very interesting to know your experiences trying @”VoiceChat” in your challenging environment.
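
    If you want to give it a try, my understanding is that it looks roughly like this (assuming the property is set on the shared OEPocketsphinxController before listening starts – please double-check placement against the docs for your version):

    // Set before starting listening; see the caveat above about the lack
    // of an API contract for the underlying modes.
    [OEPocketsphinxController sharedInstance].audioMode = @"VoiceChat";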

    #1023527
    wfilleman
    Participant

    Hi Halle,

    I’m actually using the VoiceChat mode almost exclusively since the app can also stream IP Camera video/audio. That seems to work as it did before.

    I tried adjusting the kVAD_PRESPEECH value from 10 to 20, 50, and 100, and it seems to really only change the time it takes to recognize speech. I didn't notice anything different with the noise suppression, but based on the notes in the header file, I didn't really expect to.

    This didn't help me, since I need to respond to short one-word commands, but for others who are looking to activate and track on longer voice patterns, it might help, since it would force OpenEars to listen longer for a voice before triggering. This could be beneficial for avoiding frequent "Listening" states.

    Also, I'm finding that 2.0 for the VAD is still too sensitive for me, as it's triggering on the slightest noise in the room. I'm starting at 2.5 and giving my users the option to scale up to 3.5 if they have more variable noise in their environments. This seems to be working better in my tests.

    I’ll post back if I discover anything else of interest.

    #1023532
    Halle Winkler
    Politepix

    I’m actually using the VoiceChat mode almost exclusively since the app can also stream IP Camera video/audio. That seems to work as it did before.

    Good to know – please mention things like this in troubleshooting. Interpreting reports of results with processed audio and an experimental setting is going to be different from interpreting reports of unprocessed audio and a default setting, and I'm in the process of evaluating the defaults, so I want to make sure I'm correlating experiences with those settings to common or uncommon use cases, as the case may be. Thanks!

    #1023562
    wfilleman
    Participant

    Thanks Halle, my apologies for not mentioning that earlier.

    Your comment did give me an idea to try. I went back to setting the audio mode to Default, but I didn't notice any difference with regard to the VAD or the root background-noise issue that I'm investigating.

    The reason I need to use the audioMode value of "VoiceChat" is so I can issue system sounds while the mic is on. When using the "Default" value for the audioMode, I have to use an AVAudioPlayer object, which causes a lag or delay in the issuance of the sound, and that won't work in my application.
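
    For reference, the fast "system sound path" I mean is the AudioToolbox one (a sketch with a hypothetical sound file name):

    #import <AudioToolbox/AudioToolbox.h>

    // Low-latency system sound playback -- in my testing this only works
    // alongside the live mic when OE is in the VoiceChat audioMode.
    NSURL *soundURL = [[NSBundle mainBundle] URLForResource:@"beep" withExtension:@"caf"]; // hypothetical file
    SystemSoundID soundID;
    AudioServicesCreateSystemSoundID((__bridge CFURLRef)soundURL, &soundID);
    AudioServicesPlaySystemSound(soundID); // near-instant, unlike AVAudioPlayer in my case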

    My IP Camera streaming code uses AudioUnits and that works regardless of which audioMode I use.

    Regardless of which option I use, the behavior of the voice recognition and treatment of the background noise seems to be the same to me.

    I ended up setting the VAD threshold to the default of 2.0 and giving the user the option to increase it up to 3.5 for more variable noise environments, based on feedback and my testing.

    I think I've exhausted the options at this point, and it's as good as it's going to get with the current implementation from CMU. Overall this is an improvement, especially since I can offer the user a way to tune for their variable noise environment – something we didn't have access to before.

    I'm considering this topic closed at this point. If I find anything of interest I'll post back. Otherwise, looking forward to any continued feature enhancements!

    Wes

    #1023563
    Halle Winkler
    Politepix

    Hi Wes,

    Thanks for hanging in there with the new setup – that sounds to me like a very good solution. I appreciate the update. I will definitely be considering your experiences and other developers’ feedback about vadThreshold, so there may be further developments on that front as I start to get some ideas about it. For now I’m still just taking in the feedback until there’s enough broad-based info to synthesize into a sense of what the developer experience and user experience with it is.

    #1023613
    Halle Winkler
    Politepix

    Hi Wes,

    I was giving this a bit of thought:

    The reason I need to use the audioMode value of “VoiceChat” is so I can issue system sounds while the mic is on. With using the “default” value for the audioMode, I have to use an AVAudioPlayer object which causes a lag or delay in the issuance of the sound which won’t work in my application.

    Out of curiosity, have you checked out whether this is still necessary for system sounds with iOS8 + OpenEars 2.0? I don’t know if iOS8 has any fixes for that Core Audio PlayAndRecord bug/mal-feature with system sounds, but OpenEars 2.0 sets a somewhat different and more-minimal audio session now in order to improve coexistence, so it’s probably worth a quick check to see if the default mode behavior there has improved (maybe you’ve already done so, in which case no worries).

    #1023618
    wfilleman
    Participant

    Hi Halle,

    Yup, I gave this a try again with iOS 8 + OE 2.0. Same problem I had before under OE 1.x and iOS 7: I couldn't play system sounds using the system sound player and had to fall back to AVAudioPlayer. This combo may be fine for some folks; I just couldn't use it because of the inherent delay in AVAudioPlayer. The only way I can get fast sound output (the system sound path) with OE is to place OE into the VoiceChat mode. This is true for OE 1.x and 2.0, under both iOS 7 and iOS 8.

    Wes

    #1023619
    Halle Winkler
    Politepix

    OK, thanks very much for confirming for me that it is the same and also that the VoiceChat mode fixes this symptom in the current version – it’s good info to have even if it isn’t the ideal situation.

    #1023620
    wfilleman
    Participant

    Sure thing Halle!

    Agreed, it's not ideal, but it does work. Hey, at the very least, 2.0 didn't break anything I did for the 1.x line. I can't say that for other libraries and their major upgrades I've had to work with ;)

    Wes

    #1024088
    Halle Winkler
    Politepix

    Hi Wes,

    Just wanted to let you know that the new version 2.01 out today has a fix for a VAD behavior bug in OpenEars that could affect accuracy and detection under some circumstances, so although your question wasn’t specifically about accuracy, you might want to check it out and see if it gives you improved results with any of the things you’ve mentioned in this thread.

    #1024098
    wfilleman
    Participant

    Thanks Halle,

    I’ll check it out!

    Wes

    #1024223
    wfilleman
    Participant

    Hi Halle,

    Just had a chance to check out 2.03. Looks good to me! In fact, this seems to fix a problem I was seeing where the voice-recognition loop was taking a while to exit and then failing to deliver a valid hypothesis. I'm seeing better recognition overall, and in those conditions the recognition now pauses for a bit and then exits with a more reliable hypothesis.

    Could be in my head (and limited testing), but it seems like an improvement. I'll be rolling this out to my users over the next couple of weeks in the next update.

    Thanks Halle!
    Wes

    #1024224
    Halle Winkler
    Politepix

    That's great! Yes, the fixes up to and including 2.03 would fix exactly the issue you described above, along with some subtler ones. In my own testing it also did better in noisy environments with lower vadThreshold settings, as a result of fixing the bugs that caused stuck VAD behavior (which sometimes looked like oversensitivity). Thank you for the feedback.

    #1024225
    wfilleman
    Participant

    Ah, excellent. Ok, then glad I’m not just “hearing” what I wanted to hear.

    BTW: I went back to "Default" for the audioMode because it looks like VoiceChat was killing the volume output too much for my users' taste (more so than the older 1.7x OE release). Not sure if there's a way to adjust this, but thought I'd mention it.

    Keep up the great work!
    Wes

    #1024226
    Halle Winkler
    Politepix

    Most of the audio session settings are the same in both versions, so if it isn't just a new device difference or a new iOS SDK version difference masquerading as an OpenEars difference (volume levels vary a lot from SDK version to SDK version and from device to device, in my experience), I think it will tend to come down to the new version of OpenEars defaulting to audio session mixing, since that's the only dramatic change. If you wanted to verify that, you could remove the option 'AVAudioSessionCategoryOptionMixWithOthers' from the audio session initialization in OEContinuousAudioUnit.m (sketched below) and archive, and see if it returns to the behavior you saw in 1.7. At the moment there is a bug in archiving which means you need to keep your original acoustic models and then drag them back over the archiving-created models after you've made your archive; otherwise you can get an error when uploading your app.
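
    The kind of change I mean looks roughly like this (a sketch assuming the session is configured through AVAudioSession's setCategory:withOptions:error: – the actual initialization in OEContinuousAudioUnit.m may be organized differently):

    // With mixing (the 2.x default behavior in question):
    [[AVAudioSession sharedInstance] setCategory:AVAudioSessionCategoryPlayAndRecord
                                     withOptions:AVAudioSessionCategoryOptionMixWithOthers
                                           error:nil];

    // Without mixing, for comparison against the 1.7 behavior:
    [[AVAudioSession sharedInstance] setCategory:AVAudioSessionCategoryPlayAndRecord
                                     withOptions:0
                                           error:nil];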

    A near-future OpenEars version will have the ability to opt out of mixing since I’ve had a case reported where it was desirable to suppress, but I wanted to square away the accuracy/VAD issues first before making any API additions.

    #1024227
    wfilleman
    Participant

    Thanks Halle. I actually tried that, and ducking as well, to see if it made any difference. Unfortunately there wasn't any difference in volume – still very low.

    I agree, I suspect it's the SDK that's doing this. It looks like the volume output is more consistent across devices, but it's just VERY low with VoiceChat. So it looks like I'll need to go back and stick with the Default setting, unless you have any other ideas to try.

    Since you are going to be adding new APIs here, can I request an API to disable Bluetooth? I have to add my custom code into the OE source each time there's a new OE release. It's just a BOOL that sets whether the audio session should include the AVAudioSessionCategoryOptionAllowBluetooth flag.
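
    What my patch boils down to is roughly this (disableBluetooth is my own hypothetical BOOL, and OE's real session setup is spread across several layers):

    AVAudioSessionCategoryOptions options = AVAudioSessionCategoryOptionMixWithOthers;
    if (!self.disableBluetooth) { // my own hypothetical flag
        options |= AVAudioSessionCategoryOptionAllowBluetooth;
    }
    [[AVAudioSession sharedInstance] setCategory:AVAudioSessionCategoryPlayAndRecord
                                     withOptions:options
                                           error:nil];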

    Wes

    #1024228
    Halle Winkler
    Politepix

    Sorry, no other ideas – in my experience, when it comes down to these SDK-related volume fluctuations, it's good to go with the defaults. It's an area where there is basically no API contract from Apple (if you read the audio mode and audio unit type descriptions, they usually describe categories of apps, use the word "may", and refer to "signal processing" rather than describing concrete behaviors, DSPs, or expectations). I had an unexpected development with really low output volume and VPIO in a very early version of OpenEars, which led me to the general conclusion that default audio gets the most QA coverage and has the most "contract" to it, and even when it isn't ideal for the application, it's often the prudent call for that reason.

    Since you are going to be adding in new API’s here, can I request an API to disable Bluetooth?

    Sure, nothing speaks against it now since it should be semantically and behaviorally very similar to the mixing opt-out. I’ll add a ticket and barring any unexpected discoveries related to the feature that would cause me to revise my idea that it is straightforward, I will add it at the same time.

    #1024229
    wfilleman
    Participant

    Sure, nothing speaks against it now since it should be semantically and behaviorally very similar to the mixing opt-out. I’ll add a ticket and barring any unexpected discoveries related to the feature that would cause me to revise my idea that it is straightforward, I will add it at the same time.

    Thanks Halle! My code to add the BOOL in is super straightforward. The painful part for me is that I have to touch several layers of OE code to get the BOOL where it needs to go each time. Glad I won't have to do that anymore and can use the OE code as-is once this feature is added.

    Wes
