Reducing the calibration time

This topic has 4 voices, contains 10 replies, and was last updated by  Joseph S. Wisniewski 204 days ago.

October 9, 2011 at 7:56 pm #7666

jnorton

Is it possible to reduce the length of the calibration? Four seconds is an awfully long delay for my app to start up. Especially since this doesn’t begin until after pocket sphinx is initialized, which takes a while.

I would be willing to reduce the accuracy a bit to speed this up. Ideally, I’d like to have a parameter or two to fiddle with to find the sweet spot for my app.

I have a very small vocabulary consisting of one and two word phrases, if that helps.

Thanks,

James

October 9, 2011 at 8:37 pm #7667

Halle

Hi James,

Not currently — reducing the calibration time in the current code base would result in crashes. I’ll take it under advisement for future features.

I’m surprised to hear that you particularly notice the initialization time. How long does it take and on what device?

October 10, 2011 at 1:04 am #7668

jnorton

On my iPhone 4 I start the pocket sphinx initialization immediately at the top of awakeFromNib in my main ViewController. I don’t get the notification that calibration has begun for another three seconds. Granted, some other things are being initialized in parallel, and I haven’t done profiling to confirm it, but it seems like most of the time is spent in the pocket sphinx initialization.

October 10, 2011 at 9:16 am #7669

Halle

Very surprising, I will revisit how long that takes since I don’t think I’ve ever seen it take a significant amount of time.

October 11, 2011 at 1:10 pm #7679

Halle

Hi James,

I’ve rechecked here and on recent devices I’m seeing <1.3 seconds from starting listening to starting calibration. How large of a vocabulary are you using, what device, what OS, etc?

October 24, 2011 at 4:07 am #7766

slowfoxtrot

So I have tried reducing the calibration time to 1.0 second in the AudioConstants.h file and have experienced nothing but reduced calibration time. My app still works perfectly, has the correct input decibels/environment noise, and starts up super fast. Am I supposed to be seeing something bad happening?

October 24, 2011 at 9:38 am #7767

Halle

Yup, you’re supposed to heavily test apps running with this setup, using single recognition sessions that run over long time periods (10-30 minutes). You should see that a calibration shortened only via that method will almost always crash in the neighborhood of the 15 minute mark, and that it will not always be able to correctly detect silence in a session, depending on what happened during the short calibration in the real world (it’s very easy for a blast of level-pinning noise to last a second). If neither of these possible issues applies to your app, go to town.

Reducing the calibration time without side-effects will mean making some small changes in other classes involved with the driver and then doing a large round of QA since this is a known crash danger.

October 24, 2011 at 12:36 pm #7768

Halle

If you want to make the required changes for doing this in a safe way and test it and tell me about it, going entirely off the top of my head, I suspect these are the changes that would be needed:

The multiplier in CONT_AD_CALIB_FRAMES in ContinuousADModule.mm will be reduced,
numberOfRounds in ContinuousModel.mm will be reduced,
kLeaderLength in AudioConstants.h will be reduced

I personally think that a calibration time of less than 2.5-3 seconds will not give a good UX under real-world noise conditions (i.e., you don’t want a truck going by to spoil the calibration). What you need to test for is accuracy levels under a variety of environments and routes, and crashes over a mix of devices, OSs, environments, and usage patterns (for instance, recognition starting and the device just sitting on the table for a while, which is the crash circumstance that I’ve seen). You might be able to reduce the likelihood of crashes by memcpy’ing some actual wave values to the calibration buffer before starting; it’s something that has crossed my mind as a potential cause of the unexpected behavior. Good luck and let me know how it goes.

Edit: anyone who is interested in reducing the calibration time is welcome to check out the starting recommendations above, or investigate and thoroughly test them on their own, and let me know the results you get — for most if not all new features of this kind, the issue is not in discovering how to do it, but doing enough testing, which is what I’m considering when I say that I’ll take it under advisement for a future release. When I looked into reducing calibration time a little bit in v .911, I discovered that it wasn’t hard to change but that it was extremely time-consuming to test to the point that I could be confident it didn’t cause the previously-seen crashes under different use cases, which is why it wasn’t included. When I have a lot of available time to test, and if it seems like a priority feature, I can check it out thoroughly myself. In that case I would probably only reduce it to ~3 seconds so that it continues to do the job it is needed for, since IMO making it a lot shorter is just going to cause reports of nonrecognition in the wild.

October 27, 2011 at 4:47 pm #7809

Joseph S. Wisniewski

It’s just the VAD being calibrated. Why not do it in parallel with the rest of your startup?

It’s a low computation load. Theoretically, it can run in parallel with models loading, grammars or language models being built or compiled, and even the UI being assembled.

Then make the VAD parameters (there are only 4 of them) persistent from session to session, and set CONT_AD_THRESH_UPDATE to something like 50 (1.6 seconds) instead of the current 100 (3.2 seconds). Then you can pump the calibration for 2 seconds instead of 4 (in the background, while the models load).
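
The arithmetic behind those figures can be sketched in C. This is only a minimal sketch: it assumes the 32 ms per frame implied by the numbers in this post (100 frames taking 3.2 seconds) rather than a value read from the Sphinx source, and `SECONDS_PER_FRAME` and `thresh_update_seconds` are illustrative names, not part of Sphinx or OpenEars.

```c
#include <assert.h>

/* Assumption: frame duration implied by the figures in this post
 * (100 frames taking 3.2 seconds), not read from the Sphinx source. */
#define SECONDS_PER_FRAME 0.032

/* Hypothetical helper: seconds of audio consumed before the VAD
 * thresholds are updated, for a given CONT_AD_THRESH_UPDATE value. */
double thresh_update_seconds(int thresh_update_frames)
{
    assert(thresh_update_frames > 0);
    return thresh_update_frames * SECONDS_PER_FRAME;
}
```

Under that assumption, halving the setting from 100 to 50 halves the wait before the first threshold update.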

October 27, 2011 at 5:20 pm #7814

Halle

Don’t the values need to be found as close as possible to the start of recognition in order to give them the most ideal opportunity to be from the same environment? I like the idea of running parallel with starting the decoder and halving CONT_AD_THRESH_UPDATE though. The rest of the driver behavior is probably OK to leave alone with that approach.

October 27, 2011 at 5:51 pm #7817

Joseph S. Wisniewski

The VAD literally only uses one parameter, “noise_level”, the peak of a power histogram, as its estimate of the background noise. That’s what it’s spending 3.2 seconds trying to figure out. Adaptation always starts from CONT_AD_DEFAULT_NOISE, which is hardcoded to 30.

After the allotted cal time, noise_level typically ramps up to around 34 in a noisy environment, because the adaptation time constants are way too low for anything that doesn’t involve starting in the same environment over and over again. In our environment, it takes another 14 seconds to hit something near 40, which is more typical of driving conditions.
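
To see why the starting point matters so much, here is a minimal sketch in C of a first-order adaptation step. It assumes the estimate moves a fixed fraction of the way toward the observed histogram peak on each update; that form, the function names, and the 0.2 rate used in the note after the code are illustrative assumptions, not taken from the Sphinx source.

```c
#include <assert.h>

/* One adaptation step: move the estimate a fraction `rate` of the way
 * toward the observed histogram peak. Assumed form, for illustration. */
double adapt_noise_level(double current, double observed_peak, double rate)
{
    return current + rate * (observed_peak - current);
}

/* Count the updates needed to get within `tolerance` of the true level. */
int updates_to_reach(double start, double true_level, double rate,
                     double tolerance)
{
    assert(rate > 0.0 && rate <= 1.0 && tolerance > 0.0);
    double level = start;
    int updates = 0;
    while (level < true_level - tolerance || level > true_level + tolerance) {
        level = adapt_noise_level(level, true_level, rate);
        updates++;
    }
    return updates;
}
```

With a rate of 0.2, starting at 30 when the true level is 40 takes 11 updates to get within 1 unit, while starting at 40 takes none.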

Previous experience has taught me that this sort of value is very “domain specific”, an IVR app might settle over a range of 35-40, an automotive app maybe 40-45, but within that domain, using the last values as the starting point for the next session really gets things up and running faster.
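
Carrying the last value over could be as simple as the following C sketch. It assumes the app can read the final noise_level at the end of a session and seed the VAD with it at the start of the next one; the text file format and the `save_noise_level` and `load_noise_level` helpers are hypothetical, and how the value is read from or fed back into the VAD is not shown.

```c
#include <assert.h>
#include <stdio.h>

/* Fallback matching the CONT_AD_DEFAULT_NOISE value mentioned above. */
#define DEFAULT_NOISE_LEVEL 30

/* Hypothetical helper: store the last session's noise_level as text.
 * Returns 0 on success, -1 on failure. */
int save_noise_level(const char *path, int noise_level)
{
    assert(path != NULL);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%d\n", noise_level) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: load the stored value, or fall back to the
 * default when nothing usable has been saved yet. */
int load_noise_level(const char *path)
{
    assert(path != NULL);
    int level = DEFAULT_NOISE_LEVEL;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return DEFAULT_NOISE_LEVEL;
    if (fscanf(f, "%d", &level) != 1)
        level = DEFAULT_NOISE_LEVEL;
    fclose(f);
    return level;
}
```

On the first launch the loader returns the default of 30; after a session that saved 40, the next session starts from 40 instead.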

Heck, for our app, 30 is such an epically bad starting point that simply manually specifying 40 instead and totally skipping adaptation gets better results faster. 30 is way low for an iPhone in a car bracket even when the car is parked.

Of course, the Sphinx VAD is on the wrong side of “current”, and definitely not anything we should be doing in 2011 with a 3 gigaflop DSP available to us. I’ve done better in the past with a cepstral VAD that doesn’t need any cal, at all, from office environment to highway speeds. Time to get into vDSP.

The whole VAD is like that. People have been checking in changes to CONT_AD_THRESH_UPDATE, CONT_AD_ADAPT_RATE, CONT_AD_DEFAULT_NOISE, CONT_AD_DELTA_SPEECH, and CONT_AD_DELTA_SIL for 15 years. They’re not “fixing” the system, they’re “locking” it to one task and environment or another.

I can see a simple way to make delta_speech and delta_sil adapt (they don’t now; they just stay on what they get from CONT_AD_DELTA_SPEECH and CONT_AD_DELTA_SIL forever), and to start them from a table based on a persistent noise_level.

The thing you have to keep in mind about Sphinx (2, 3, 4, or Pocket) is that it isn’t actually designed to be used by anyone, for anything. Seriously, it’s designed to process mounds of precut utterance files and deliver scores in batch mode, so grad students and grant seekers can make changes and say “running this batch proves our change improves performance 14%”.

So, the developers don’t care that the VAD is decades out of date, or that no_search destabilizes the main recognition loop. That’s “user” stuff. You and I, my friend, are “untouchables”. We’re actually doing that filthy “work” stuff and soiling the ivory towers and marble halls.

I’ve been getting into that kind of trouble for decades. ;)
