How to combine wave files generated by SaveThatWave?

  • #1019884
    steve100
    Participant

    Hi,

    It seems that OpenEars+SaveThatWave will record whatever is used for recognition, right? But it generates a new WAV file whenever there is a silence. Can SaveThatWave be configured to keep saving into one big WAV file until [saveThatWaveController stop] is called? Or is there any way to combine those smaller WAV files into one big WAV file?

    Thanks,

    Steve

    #1019887
    Halle Winkler
    Politepix

    Hi Steve,

    The upcoming version of SaveThatWave does this — I’m just waiting to release it along with versions 1.65 of RapidEars and OpenEars which fix an annoying bug of long standing. I really hope to have them all out in a week or so.

    The new SaveThatWave feature just has the caveat that it outputs what PocketsphinxController receives while listening is in progress, so when recognition is suspended there will be a period of silence of that length. Once listening has been stopped, it will produce and notify you of a complete WAV available to you, and if the session stops prematurely it will prepare the saved audio from that session into a WAV during the next session (it’s designed primarily as a tool to let you get a test output when you’re encountering issues).

    #1019966
    tyeh
    Participant

    Will this be a free upgrade for SaveThatWave 1.64 licensees, and when will this update be available? Thanks
    -Thomas

    #1019969
    Halle Winkler
    Politepix

    Yes, this upgrade will be free and with some luck out in the next week.

    #1019980
    tyeh
    Participant

    Thank you Halle. We are looking forward to incorporating its new capability.
    -Thomas

    #1020080
    tyeh
    Participant

    Hi Halle,
    Is the upcoming SaveThatWave available yet? Thanks
    -Thomas

    #1020081
    Halle Winkler
    Politepix

    Hi Thomas,

    Yes, it is out and you can find it on the licensee site as a free update. If you like you can keep track of updates by subscribing to the blog RSS where I post all of the update news.

    #1020086
    steve100
    Participant

    Hi Halle,

    I got the new plugin and found a new method, startSessionDebugRecord. I call it in place of where I previously called start. But when can I get the WAV file? Even after I call stop, I still can’t get the WAV file.

    Thanks,

    Steve

    #1020087
    Halle Winkler
    Politepix

    Hi Steve,

    Since the new feature captures the entire result of a listening session as the speech decoder hears it, the WAV is wrapped up and a notification is sent to you when the message [self.pocketsphinxController stopListening]; is sent.

    #1020101
    steve100
    Participant

    Hi Halle,

    I called [self.saveThatWaveController startSessionDebugRecord] after calling [self.pocketsphinxController startListening] and called [self.pocketsphinxController stopListening] after speaking something. But I never got the notification for the WAV file. It doesn’t work on either the device or the simulator. Here is the message I got from the simulator:

    2014-02-09 22:27:37.036 hear4me[6021:a0b] Flite has finished speaking
    2014-02-09 22:27:37.037 hear4me[6021:a0b] Valid setSecondsOfSilence value of 0.700000 will be used.
    2014-02-09 22:27:37.037 hear4me[6021:a0b] Pocketsphinx has resumed recognition.
    2014-02-09 22:27:38.728 hear4me[6021:a0b] recogStopClicked started.
    2014-02-09 22:27:38.755 hear4me[6021:4f03] .raw files in caches directory are (
    )
    INFO: file_omitted(0): TOTAL fwdtree 0.19 CPU 0.138 xRT
    INFO: file_omitted(0): TOTAL fwdtree 2.73 wall 1.995 xRT
    INFO: file_omitted(0): TOTAL fwdflat 0.02 CPU 0.012 xRT
    INFO: file_omitted(0): TOTAL fwdflat 0.02 wall 0.012 xRT
    INFO: file_omitted(0): TOTAL bestpath 0.01 CPU 0.004 xRT
    INFO: file_omitted(0): TOTAL bestpath 0.00 wall 0.004 xRT
    2014-02-09 22:27:38.757 hear4me[6021:4f03] No longer listening.
    2014-02-09 22:27:38.757 hear4me[6021:a0b] Pocketsphinx has stopped listening.

    #1020102
    Halle Winkler
    Politepix

    Hi Steve,

    Try calling startSessionDebugRecord before startListening and see if that fixes it.
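
    Roughly, the call order would look like this (just a sketch using the method names from this thread; the listening step stands in for the full startListeningWithLanguageModelAtPath:… call your app already makes):

    // 1. Start the session-long debug recording before listening begins:
    [self.saveThatWaveController startSessionDebugRecord];
    // 2. Then start listening as usual (your existing startListeningWithLanguageModelAtPath:… call goes here).
    // 3. When you are done, stop listening; the complete session WAV is wrapped up and you are notified after this call:
    [self.pocketsphinxController stopListening];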

    #1020171
    tyeh
    Participant

    Halle,
    We use all of the 1.65 frameworks (OpenEars, Rejecto and SaveThatWave) in our app. I notice that when I speak more than 10 words in a sentence, it is typically followed by a long period (30 seconds) of high CPU utilization (99%) on the device while it displays “Processing speech, please wait…”.
    This does not happen when I speak just one or two words in a sentence from the dictionary.
    Is this normal? Is the RapidEars framework supposed to reduce the processing delay?
    Thanks
    -Thomas

    ———— console log attached —————-

    2014-02-17 09:53:07.048 hear4me[2715:640f] Speech detected…
    2014-02-17 09:53:07.050 hear4me[2715:60b] Pocketsphinx has detected speech.
    2014-02-17 09:53:09.586 hear4me[2715:640f] There is reason to suspect the VAD of being out of sync with the current background noise levels in the environment so we will recalibrate.
    2014-02-17 09:53:09.587 hear4me[2715:640f] Stopping audio unit.
    2014-02-17 09:53:09.587 hear4me[2715:60b] Pocketsphinx has detected a second of silence, concluding an utterance.
    2014-02-17 09:53:09.684 hear4me[2715:640f] Audio Output Unit stopped, cleaning up variable states.
    2014-02-17 09:53:09.684 hear4me[2715:640f] Processing speech, please wait…
    2014-02-17 09:53:42.047 hear4me[2715:640f] Pocketsphinx heard “INTAKE OF” with a score of (-149766) and an utterance ID of 000000001.
    2014-02-17 09:53:42.048 hear4me[2715:60b] The received hypothesis is INTAKE OF with a score of -149766 and an ID of 000000001

    #1020172
    Halle Winkler
    Politepix

    Hello,

    No, that is not to be expected, so we should troubleshoot it a bit. This part:

    There is reason to suspect the VAD of being out of sync with the current background noise levels in the environment so we will recalibrate.
    

    Should only happen if there is a significant difference in background noise all of a sudden, for instance if you are performing recognition at the same time as music is playing or other ongoing and/or escalating background noise, or if there is something peculiar happening with a lot of suspend/resumes. Can you tell me a bit about the circumstances of the recognition? Does it differ from the circumstances in the sample app, and if so in what sense? Is the app having trouble detecting an end to the statement?

    This score is also of interest:

    -149766

    It’s a notably lower score than I would expect to see for a correct recognition. Is it correct? Is there anything interesting about the audio setup?

    #1020175
    steve100
    Participant

    Hi Halle,

    I work with Thomas.

    In the ContinousAudio.m file, you have code
    memset(ioData->mBuffers[0].mData, 0, ioData->mBuffers[0].mDataByteSize); // write out silence to the buffer for no-playback times

    I just added a flag to decide whether to call this buffer reset. This enables audio output to the earphone. We see this problem in both scenarios, whether the flag is on or off.
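
    Roughly what I did (a sketch; the flag name is just my own hypothetical app-level flag, not part of OpenEars):

    // Hypothetical app-level flag added near the render callback; not part of OpenEars:
    extern BOOL playbackPassthroughEnabled;

    if (!playbackPassthroughEnabled) {
        memset(ioData->mBuffers[0].mData, 0, ioData->mBuffers[0].mDataByteSize); // write out silence to the buffer for no-playback times
    } // when the flag is set, the mic stream stays in the buffer so it can be heard in the earphone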

    Thanks,

    Steve

    #1020177
    Halle Winkler
    Politepix

    Hi Steve,

    Do you mean ContinuousAudioUnit.m? Can you explain more about how changing that line changes the behavior? Headphone routing works out of the box with OpenEars and that line just prevents the mic stream from going into the playback stream and causing a feedback loop.

    #1020178
    steve100
    Participant

    Halle,

    Yes, I mean the ContinuousAudioUnit.m file. Depending on my app’s flag, sometimes I don’t call the memset() so that I can hear whoever is speaking. Will this be a problem? But it seems that even if I always call the memset(), I still get the problem. I’m trying to test it using the original framework, without my changes, to see what the behavior is.

    Thanks,

    Steve

    #1020179
    Halle Winkler
    Politepix

    OK, I get it. I’m not sure if that’s a problem because it’s a pretty low-level change, but the issue you are seeing probably relates to something else. Note to anyone finding this thread via search: it isn’t necessary to make any code-level changes to OpenEars to support headphone output; the change discussed here is for duplexing of the mic stream, which is something else. If you are having an issue with headphone output with OpenEars, that is not expected and you should suspect the headphones’ connection or volume level of the device. Public service announcement over :) .

    So, the recalibration you are seeing that spins up the CPU happens when there is an utterance so long that the buffer containing the search space has to be resized. Typically this only happens for OpenEars developers when the calibration has gotten confused and the end of an utterance is not successfully being detected, so an extra-long utterance is being searched. Starting with 1.65, this will result in a quick recalibration before getting the hypothesis so that the next utterance is performed with the voice activity detection correctly recalibrated. That is the CPU usage. It means that if the user goes into an environment in which the background noise level is much higher, the voice activity detection won’t get stuck in the “nonstop speech” position.

    That it is happening for you suggests that the case in which you are seeing it is a very long uninterrupted utterance with no pauses, something like 30 seconds or so. Is that correct? Is this a test case that you’ve thought of as an extreme condition, or is it a real-world thing that happens under a normal use case?

    #1020189
    tyeh
    Participant

    Detecting conversation (3-5 feet away from the device) as well as close-up recording into the mic are both valid use cases. We don’t anticipate recognizing faint, far-away conversation, but I do want to be able to recognize speech when the conversation is directed toward the listener (i.e. the talker and the device user are different people). In such a use case, the background noise level may vary more in a quick multi-party conversation than in a single-speaker scenario.
    Is this doable without modifying the framework? Thanks
    -Thomas

    #1020190
    Halle Winkler
    Politepix

    Hi Thomas,

    Sure, that is not a problem to switch between distances or audio routes. My question was more about this scenario in which there is a very long recognition of a single utterance where there is no pause from the user for 30 seconds or so. In my experience this is an unusual usage scenario, since most people pause pretty frequently when speaking and it is more usual to get a few utterances which end with pauses, one after another.

    I think it is the long utterance with no pause which is resulting in the recognition with the 99% CPU, because that looks to the engine like it is an utterance where the voice activity detector got stuck as a result of a fast change in background noise level. So I was wondering if that is a real behavior your app users have to do or are doing naturally, or if it is an example of an “edge case” that you don’t encounter naturally but is something that you are testing for in order to cover all the bases.

    #1020191
    tyeh
    Participant

    I see. It is not a normal use case (continuous speech with no pause). I will test more and report my findings.

    However, in my usage scenario, it is possible that more than one speaker (up to three) could speak at the same time. The distance from the mic to each speaker will be different. How does OpenEars decide whether audio is “noise” or a “sentence”? Say a remote speaker, 3 feet away, is talking normally, and then all of a sudden a close-up speaker also speaks. The device will detect a sudden rise in volume; how will this impact recognition behavior after that point?

    Thanks

    #1020193
    tyeh
    Participant

    Halle,
    I retested our app in a controlled environment: a single speaker, a quiet room with no noise, speaking a normal sentence (fewer than 15 words) in 5 seconds. That still caused the framework to enter “Processing speech…” for almost 50 seconds with high CPU utilization. During this period, the framework does not generate any output, nor does it handle any input (the iPhone UI is still functioning, though). This is reproducible.
    =========================
    2014-02-18 09:19:05.443 hear4me[3294:600f] Speech detected…
    2014-02-18 09:19:05.446 hear4me[3294:60b] Pocketsphinx has detected speech.
    2014-02-18 09:19:08.989 hear4me[3294:600f] Stopping audio unit.
    2014-02-18 09:19:08.989 hear4me[3294:60b] Pocketsphinx has detected a second of silence, concluding an utterance.
    2014-02-18 09:19:09.119 hear4me[3294:600f] Audio Output Unit stopped, cleaning up variable states.
    2014-02-18 09:19:09.120 hear4me[3294:600f] Processing speech, please wait…
    2014-02-18 09:19:58.316 hear4me[3294:600f] Pocketsphinx heard “WEIGHT VASCULAR” with a score of (-242625) and an utterance ID of 000000005.

    #1020196
    Halle Winkler
    Politepix

    OK, that’s very strange. Can you tell me how to modify the sample app that ships with the distribution so I can see the same effect?

    #1020198
    tyeh
    Participant

    I will try to reproduce it using the sample app, and send it to you.
    -Thomas

    #1020200
    Halle Winkler
    Politepix

    Thanks! The simpler and clearer the modified part is, the earlier I will be able to run it and get back to you.

    It would also be useful if you could give me one of your debug recordings that you saw trigger the behavior, so I can see what you saw firsthand.

    You can send me a link or files under 12MB at the email address on your order confirmation, which is also shown on the licensee site.

    #1020223
    Halle Winkler
    Politepix

    Hello Thomas and Steve,

    Thanks very much for the clean and clear test case – it was very easy to integrate into my testbed and to replicate your issue. There are a couple of things going on. One is that I have learned that my fix for the VAD issue in 1.65 is a bit insensitive to word searches that for one reason or another are requesting a lot of resources like the one in this case, so I have made a new OpenEars beta version that it would be great if you could test at this link:

    https://politepix.com/wp-content/uploads/OpenEarsDistributionBeta.tar.bz2

    This will not fix the underlying problem but it will reduce the number of VAD recalibrations that will occur in your particular circumstance and it also fixes an issue that was leading to blank entries ending up in your language models, which could be a contributing factor in the weirdness. I appreciate that the quality of the test case made this easy to see and fix.

    The big underlying issue here is that something appears to be wrong with the DMP file that is being used that results in a _really_ strenuous word search, in fact one which doesn’t exit, as you have seen. In your test case you generate a Rejecto language model but when you start listening you actually load a DMP that was already created in a previous session, out of mainBundle. When I use the Rejecto-generated model instead of the pre-created one, I do not encounter the issue. So to learn more about this, I’d ask you to take the following steps:

    1. Upgrade to the beta at the link above, and then confirm that my observation is also the case for you, namely that when you use the generated Rejecto language model from your testcase rather than loading the old pre-generated one, the issue does not appear. If this is not the case and you see the same hang with the generated DMP and with the pre-generated one that your testcase currently loads, let me know here, but take steps to make sure you’re linking to the new version, the build is clean, the two plugins are definitely 1.65, etc, so we can be very confident that we saw different results. Assuming that you have the same observation as me:

    2. Regenerate the problem DMP that your sample case defaults to running, the one that is currently in mainBundle and loaded with startListening:. Observe whether the issue still occurs with the newly-generated DMP or whether the improvement to the language model generator in the beta results in a DMP that doesn’t have the symptom. If the generated DMP still results in the symptom, next please test whether this issue is a result of loading in from a text file or whether it also happens if you create an NSArray in your app instead. Also please check whether the weight setting you use has an effect on the behavior. It is also worth it to take a look at the original text file and just see if there is anything notable or peculiar about it. Let me know your results.

    If nothing you try in these steps results in a positive change or useful information, please send over the original text corpus file that the symptomatic DMP is being generated from, or let me know which file in the test case you sent is the original corpus, and I’ll take a look at the language model which is being generated and see if there is anything wrong there. Email is fine for that.

    #1020268
    tyeh
    Participant

    Halle,
    I downloaded and installed the beta distribution and I no longer see the VAD recalibration problem, so thank you for fixing that one.
    On the other hand, the major problem with Rejecto is still there. I modified the test to use only one language model from the main bundle at a time. The two models were created separately, using the generateRejectingLanguageModelFromArray method as described in the documentation, in two simulator runs. Then I built them into the main bundle and tested them one at a time on a real iPhone 5s device. For both language models, I still see high CPU utilization when I say a medium or long sentence (5+ words) using unrelated (out-of-dictionary) words. I notice that the longer the sentence is, the longer the CPU stays at 100%.

    Another question, I tried using:
    LanguageModelGenerator *languageModelGenerator = [[LanguageModelGenerator alloc] init];
    [languageModelGenerator deliverRejectedSpeechInHypotheses: (BOOL)FALSE];
    to eliminate the __REJ words in the hypothesis, but I still get them whether or not I use this function. The documentation indicates I should not need to use it, because by default the rejected words will not be delivered. However, the hypothesis always contains them, whether this function is used or not, and, when I do use it, whether the parameter is set to TRUE or FALSE.

    -Thomas

    #1020269
    tyeh
    Participant

    Halle,

    I think I found the source of the problem. I used a weight of NSNumber=1.5 in the test. After reading your post/doc again I changed it back to nil. Since then the CPU utilization has become a lot better. I also regenerated both language models, but I suspect that was not the problem. Both corpus files are in the VocabularyFiles folder with a .txt extension.

    I was using weight=1.5 because the framework picks up too many out-of-vocabulary words with weight=nil. I was hoping that increasing the weight would make Rejecto more effective, but it seems to cause the high-CPU problem.

    I am still struggling with __REJ (rejected) words being delivered in the hypothesis. The function:
    [languageModelGenerator deliverRejectedSpeechInHypotheses: (BOOL)FALSE];
    seems to have no effect in my test setup (I tried both TRUE and FALSE).
    Thanks
    -Thomas

    #1020273
    Halle Winkler
    Politepix

    Hi Thomas,

    OK, this was something I was also curious about. I will now start investigating why the high weight is causing searches that are hard to resolve (that could take a bit of time since it isn’t necessarily straightforward).

    I am not seeing the issue with Rejecto phonemes being returned as hypotheses in your sample app when I set it to generate a new Rejecto DMP from your corpus mentioned above. I see them in verbosePocketsphinx output (expected) but not in the hypothesis callback of OpenEarsEventsObserver. Can you run the sample app you sent me but changing it to generate a new DMP from your corpus, and using the beta of OpenEars, and tell me if you are receiving __REJ words in the OpenEarsEventsObserver hypothesis callback?

    #1020280
    Halle Winkler
    Politepix

    Quick question: are you seeing this on 64-bit devices only, or older devices as well?

    #1020281
    Halle Winkler
    Politepix

    (The question above refers to the weight-related Rejecto DMP hang, not the Rejecto phonemes being returned.)

    #1020282
    tyeh
    Participant

    Halle,
    The __REJ problem did not happen in the first test I sent you. After I modified the test case, based on your post, to use the DMP from the main bundle, it started to happen. If this is the proper behavior for default debug output, how do I turn it off?
    I am sending you another email with a link to download the Xcode project containing the sample app with this symptom. The changes are very limited and confined to the viewDidLoad method.
    -Thomas

    #1020283
    Halle Winkler
    Politepix

    Hi Thomas,

    Sorry for the confusion, my advice was to stop using the DMP from mainBundle which is what your test app does (although the first test app you sent me generates a dynamic model, it doesn’t actually use the generated model –– it loads one from mainBundle instead). My advice was to dynamically generate that model and use the dynamically generated model instead of using the old one that is found in mainBundle.

    #1020284
    tyeh
    Participant

    Halle,
    Got it. Will do it now.
    -Thomas

    #1020285
    Halle Winkler
    Politepix

    Super, consider using the new beta of OpenEars I just put up at the same link today. It has a fix that can make accuracy better. If you still see the hangs, let me know if you are seeing them with a 64-bit device only or with any device.

    #1020286
    tyeh
    Participant

    Halle,
    Correct me if I am wrong:
    1. I generate the first LM dynamically following the sample app; I just use the Rejecto class instead. Then I modify the program to generate the second LM using a different “withFilesNamed” parameter.
    2. I then go to the device (or simulator) Library/Caches directory to copy the resulting .dic and .DMP files (4 of them total), then add them to the Vocabulary folder of the sample app project.
    3. Lastly I modify the sample app to use the newly added LM from the Vocabulary folder (i.e. the main bundle), and comment out the code that does dynamic LM generation.
    Is this the right procedure?
    The project download link was emailed to you.
    Thanks
    -Thomas

    #1020287
    Halle Winkler
    Politepix

    Hi Thomas,

    Nope, I just mean that you should generate your Rejecto language model in the app and then use the one you generated. In your original sample app, it doesn’t use the generated language model to start listening – it uses one that was added to mainBundle instead. The idea with LanguageModelGenerator (with or without Rejecto) is to generate the model at runtime and use the freshly-generated model, because the generation process is extremely fast. The only time it would be necessary to save and load a model out of mainBundle would be if it was very large and had a notable generation time, but these models aren’t of that size.
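
    Roughly, the flow looks like this (a sketch only: the exact Rejecto generation call is in the Rejecto documentation and header, so I’m leaving its parameter list out here, and the Caches-directory paths follow the file naming discussed earlier in this thread):

    LanguageModelGenerator *languageModelGenerator = [[LanguageModelGenerator alloc] init];
    NSArray *words = @[@"INTAKE", @"OF", @"WEIGHT", @"VASCULAR"]; // placeholder vocabulary; use your real corpus
    // Generate the Rejecto model at runtime here with languageModelGenerator’s
    // generateRejectingLanguageModelFromArray:… method, exactly as in the Rejecto tutorial,
    // using @"MyRejectoModel" as the files name and nil as the weight.
    // The generated files land in the app’s Caches directory under the requested name,
    // so point listening at those paths instead of at copies in mainBundle:
    NSString *cachesDirectory = [NSSearchPathForDirectoriesInDomains(NSCachesDirectory, NSUserDomainMask, YES) objectAtIndex:0];
    NSString *languageModelPath = [cachesDirectory stringByAppendingPathComponent:@"MyRejectoModel.DMP"];
    NSString *dictionaryPath = [cachesDirectory stringByAppendingPathComponent:@"MyRejectoModel.dic"];
    // Pass languageModelPath and dictionaryPath to your existing startListeningWithLanguageModelAtPath:… call.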

    #1020288
    tyeh
    Participant

    Hi Halle,
    I have mostly good news to report. I modified my app to dynamically generate both sets of grammars with Rejecto as you suggested, setting the weight parameter of both to nil, with the latest beta distribution. So far, the CPU utilization has been under control. Sometimes it goes up high, but it drops very quickly. I am happy to report there is no noticeable performance impact with this solution, and the hypothesis does not contain rejected words either. I am curious as to why rejected words are delivered when using a pre-generated grammar. I do not need to pre-create grammars at this time, but if their number increases, we may need to reconsider this in the future.

    Another observation concerns the weight. Rejecto still delivers too many out-of-vocabulary words. I tried setting the weight to 1.3, 1, 0.8, 0.5 and 0. In every scenario, the CPU spike reappears. It seems only nil allows the framework to function properly.

    If I can’t use the weight, is there another method that allows tuning of Rejecto’s behavior? Thanks

    -Thomas

    #1020289
    Halle Winkler
    Politepix

    Hi Thomas,

    OK, some high CPU is to be expected while a search is underway. Can you let me know if you are seeing the weight-related issue on 64-bit devices (e.g. iPhone 5S, iPad Mini Retina, iPad Air), or on 32-bit devices (every other physical device), or both? If you let me know I might have an idea about when this issue crept in.

    #1020290
    tyeh
    Participant

    Halle,
    I tested on an iPhone 5s and a first-generation iPad Mini (I believe that is a 32-bit device); they both show the symptom if the weight is not nil. They both run iOS 7.0.4.
    -Thomas

    #1020291
    Halle Winkler
    Politepix

    OK, and which weight settings have you tried specifically when you use a weight? I think the issue might be that with the entire assortment of phonemes and a high weight for them alone, there is too much perplexity in terms of what any given long sentence could be, which leads to long searches.

    There might be a tipping-point number where you get a higher weight without the really long searches. I want to get a long-term fix for this, but I am not going to be able to fix it immediately because it involves a fair amount of research. In the meantime, you can try bisecting to see if you can locate a suitable weight value that gives the results you want:

    Default weight is 1.0 and the weight you are using is 1.5. So take 1.25, and if that is fast but not rejecting enough, go up by half (i.e. to 1.375) or down by half if the value is rejecting enough but too slow (i.e. to 1.125) until with some luck you’ve encountered a value that works well. If you have values which are too slow and also insufficiently rejecting, that suggests that the goal isn’t possible right now and we’ll have to figure this out as I can research the underlying causes, but please give it a try so we can find out if we can get your case working well while I look for a lasting fix. If you find a good value please let me know since it might be that my limit for the maximum weight is too high.
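
    A sketch of the bisection idea (the weight is the NSNumber you already pass to the Rejecto generation call; the values below are just the sequence described above):

    // Bisect between the default (1.0) and your current value (1.5):
    // try 1.25 first; if it is fast but not rejecting enough, try 1.375;
    // if it is rejecting enough but too slow, try 1.125, and keep halving from there.
    NSNumber *rejectoWeight = @1.25; // pass this as the weight parameter when generating the Rejecto model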

    #1020292
    Halle Winkler
    Politepix

    BTW, please keep in mind that Rejecto, like any other OOV solution, can’t give perfect rejection results – the goal is to minimize out-of-vocabulary recognition but it isn’t possible to make OOV recognition 100% nonexistent without affecting in-vocabulary recognition (even Siri will catch incidental noises and extraneous speech and recognize it sometimes). If you have very specific OOV words which are frequently mistaken for in-vocabulary words you can also add them to the active model and then ignore them in hypotheses.
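
    For that last approach, a minimal app-level sketch (the decoy words here are hypothetical examples you would deliberately add to your language model, and this assumes the standard OpenEarsEventsObserver hypothesis callback):

    // Ignore known decoy words at the app level after they are recognized:
    - (void) pocketsphinxDidReceiveHypothesis:(NSString *)hypothesis recognitionScore:(NSString *)recognitionScore utteranceID:(NSString *)utteranceID {
        NSSet *decoyWords = [NSSet setWithObjects:@"UMM", @"OKAY", nil]; // hypothetical OOV words added to the model on purpose
        NSMutableArray *keptWords = [NSMutableArray array];
        for(NSString *word in [hypothesis componentsSeparatedByString:@" "]) {
            if(![decoyWords containsObject:word]) [keptWords addObject:word];
        }
        NSString *filteredHypothesis = [keptWords componentsJoinedByString:@" "];
        // Act on filteredHypothesis instead of the raw hypothesis.
    }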

    #1020312
    tyeh
    Participant

    Halle,
    After lots of bisecting tests, I have settled on weight=1.2. This value seems rather stable in terms of CPU utilization. If I go over 1.2, my app locks into a long search quite easily.
    In terms of OOV rejection, I think I will need to do some app-level optimization and not rely on the Rejecto framework 100%.
    Thanks for all your help.
    -Thomas

    #1020321
    Halle Winkler
    Politepix

    Hello Thomas,

    Thanks very much for sharing that info. I think that the underlying issue could be that I set the ceiling for the weight setting too high, and that info will be helpful to understanding where the line is in a general sense.
