Forum Replies Created
This is currently undocumented, and in normal use you should never need to interact with AudioSessionManager directly, but see if it helps to send this message to the audio session singleton from your app before using FliteController:
[AudioSessionManager sharedAudioSessionManager].soundMixing = TRUE;
No problem, just go to https://www.politepix.com/openears and search the page for “verbosePocketSphinx” and you’ll see the property definition. There is also an example of using it in the sample app view controller if you search for that string.
There are several issues here. updateLevelsUI in the sample app is just intended to show how you can read the read-only levels property on a separate thread so it doesn't block, and it is called many times a second. That means your slider/volume code is being hit continuously whether or not the user is interacting with it, since the method only tries to read the read-only Flite level property and display it in the UI, and it also means your volume is being changed from a thread that the AVAudioPlayer it addresses is definitely not on. I would first completely decouple your volume-changing/volume-slider code from the UI-updating example in the sample app. You should be able to change the volume and update the volume slider's UI on the main thread in a normal IBAction method that is only concerned with letting the user change the volume. The updateLevelsUI method can continue to just handle reading the level and displaying it in the UI.
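As a sketch of that separation (assuming a `self.audioPlayer` AVAudioPlayer property and a slider wired up in Interface Builder; both names are hypothetical):

```objc
// Hypothetical example: volume control fully decoupled from updateLevelsUI.
// This runs on the main thread because it is triggered by a UI control, so
// the AVAudioPlayer is addressed from a predictable thread.
- (IBAction)volumeSliderChanged:(UISlider *)sender {
    self.audioPlayer.volume = sender.value; // only concern: user-driven volume
}
```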
If the “Welcome to OpenEars” statement continues to be played at unexpected times after that, that probably means that something in the code is causing an interruption that causes the entire listening loop to be reset, resulting in pocketsphinxDidCompleteCalibration being called, which in the sample app results in “Welcome to OpenEars” being spoken.
I don't think that logging has verbosePocketSphinx enabled. If it does, that means your app has an issue that blocks before [self.pocketsphinxController startListeningWithLanguageModelAtPath:lmPath dictionaryAtPath:dicPath languageModelIsJSGF:NO]; rather than during it, since the logging would show it starting and then stopping somewhere, but this logging shows nothing that happens after the language model is generated. I recommended before that you show the relationship between your picker code and the OpenEars code; without that, or the output with verbosePocketSphinx, there's no way to know what is happening, since the code above is the code that works in the sample app.
What is the relationship between the picker code and the code above? Probably it’s not the case that startListeningWithLanguageModelAtPath: doesn’t trigger but rather that it gets to a certain point in the loop and has trouble. If you turn on verbosePocketSphinx and OpenEarsLogging the output will probably tell you a lot about the reason that startListeningWithLanguageModelAtPath: isn’t getting good results. You can search the log output for the words “error” or “warning” specifically or you can post it here (but please make sure both forms of logging have been turned on first so I can really see everything that is happening).
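Both forms of logging can be turned on with a couple of lines before starting to listen (a sketch assuming a pocketsphinxController property as in the sample app):

```objc
// Turn on both forms of logging before starting the listening loop:
[OpenEarsLogging startOpenEarsLogging];                   // audio session/driver logging
self.pocketsphinxController.verbosePocketSphinx = TRUE;   // engine-level logging
[self.pocketsphinxController startListeningWithLanguageModelAtPath:lmPath
                                                  dictionaryAtPath:dicPath
                                               languageModelIsJSGF:NO];
```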
This question is a little too broad for the forum, sorry. Feel free to ask specific questions about your code.
I’ve fixed the NSLog statement for the next version so the sample app doesn’t create confusion about the framework behavior and updated the online documentation and tutorial.
Take a closer look at the GraphemeGenerator.h source; it is there for getting phonemes out of words. That's really all the help I can provide on this one, sorry.
Not sure what the issue is, but in my experience the phonemes-only speech is harder to understand and more unpleasant to listen to than the basic speech so I don’t recommend bothering. Definitely do your experiments using a better voice than KAL since there’s no way of knowing how much of its comprehensibility comes from the features that are removed when doing phoneme-only speech such as all variance and inflection.
The log always says “a second of silence” because that’s just what an NSLog statement says in the sample app. It isn’t related to the functionality of the property secondsOfSilenceToDetect and the log statement doesn’t come from the framework.
secondsOfSilenceToDetect currently defaults to .7 seconds, and changing it will make the pause shorter or longer, but the difference between .7 seconds and, for instance, .33 isn't going to be a big perceptual difference (although a very short delay can cause issues, since any intermittent noise followed by a pause can trigger recognition). That's because you will still have the following sequence of events, all of which take time: the speech continuing to completion, the silence after the complete speech, and then the time to process the complete speech.
RapidEars doesn’t use a period of silence at all because it recognizes speech while the speech is in-progress rather than performing recognition on a completed statement (for instance, if you say “go right” it will first return the live hypotheses “go” and then “go right” as you are in the process of speaking the phrase — RapidEars doesn’t wait for a silence period to recognize). For your goal of using OpenEars-style speech recognition that only happens after a silence but with a shorter silence period it isn’t necessary for you to use RapidEars. But, since OpenEars defaults to a short period of silence out of the box, the differences from shortening it more than the default aren’t going to be dramatic; expect it to be a smaller change in the user experience.
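Changing the default is just a property set before starting to listen; something like this (0.4 is an arbitrary example value):

```objc
// Shorten the end-of-utterance silence from the default of .7 seconds.
// Very short values can cause premature recognition on intermittent noise.
self.pocketsphinxController.secondsOfSilenceToDetect = 0.4f;
```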
I believe that this would now be possible using RapidEars: https://www.politepix.com/rapidears
Thanks for the heads-up that this isn’t self-evident, I have added it to the documentation.
Correct, that was actually the origin of the crash: when there is a null hyp in nbest, Pocketsphinx doesn't malloc an nbest structure at all, and my function was unaware of that and trying to release the same number of nbests as was requested. Seeing fewer than your maximum is a sign that everything is working.
Thanks :)
If you don’t mind, I’d like to change the title to mention the Nuance issue so it’s Google-able and so that casual readers of the forum don’t get the impression that OpenEars isn’t compatible with iOS6, OK with you?
Yup, that matches my expectation. Here is what I would guess is going on:
There are a few different iOS objects which take over the audio session and do not return it to the state it was in once they are released. Specifically, some of them turn off recording. So if you are responsible for an SDK that does continuous listening (e.g. both OpenEars and Nuance), you will get a lot of complaints about it ceasing to work after the use of AVPlayer or similar, because the iOS object removes the audio input. So you put in a sanity check that makes sure to fix the audio session before your product does its thing.
OpenEars does its sanity check when it re-opens the audio unit so that it is possible to play video while speech recognition is in suspended mode, since this is a very frequently-requested feature.
The call you make to the Nuance object above almost certainly initializes their own listeners for audio session changes, which they deal with in their own way.
I think that what you are seeing is that OpenEars is performing its sanity check before it tries to open its audio unit, discovers that Nuance has changed the audio session, fixes it, and Nuance discovers that OpenEars has changed the audio session, fixes it, comedy ensues.
I can’t speculate about why it’s only happening with iOS6 but it could easily be a race condition that was always there but was resolving differently in iOS5.
Another approach for locating the conflict would be to make a copy of your app and start removing functionality in stages until PocketsphinxController starts working again, at which point you can suspect the last thing you removed as at least a partial cause. I would be very interested in more info about this since it is certainly not something I’m blasé about if it is an interaction with OpenEars.
OK, since that means that it is something relating to code that I don’t have access to, let me explain what is weird in the logging and maybe it will point you towards what to check out.
Early on in the app session your category is set (correctly) the first time and we’d normally expect it to remain that way for the rest of the app session unless it is overridden by a call to audiosession or AVAudioSession or the use of a media object like AVPlayer or MPMoviePlayerController:
2012-09-25 15:38:46.061 test[819:907] audioCategory is now on the correct setting of kAudioSessionCategory_PlayAndRecord.
Then some very normal stuff is done to bluetooth, default to speaker, etc, with no errors returned. At this point the audio session settings look very normal and if an attempt were made here to start the audio unit I would expect it to work.
Then the device has some kind of DNS library issue which might be unrelated:
2012-09-25 15:38:46.194 test[819:907] [NMSP_ERROR] check status Error: 696e6974 init -> line: 318
However, if this is an automatic function of the Nuance SDK it is a sign that its objects might be instantiated at this point.
PocketsphinxController continues starting up and gets as far as it can without any errors before it needs to start the audio unit, which is when things get weird:
2012-09-25 15:38:46.407 test[819:907] Audio route has changed for the following reason:
2012-09-25 15:38:46.408 test[819:907] There has been a change of category
2012-09-25 15:38:46.409 test[819:907] The previous audio route was SpeakerAndMicrophone
2012-09-25 15:38:46.535 test[819:907] This is not a case in which OpenEars performs a route change voluntarily. At the close of this function, the audio route is Speaker
2012-09-25 15:38:46.536 test[819:907] Audio route has changed for the following reason:
2012-09-25 15:38:46.541 test[819:907] There has been a change of category
2012-09-25 15:38:46.542 test[819:6607] Set audio route to Speaker
2012-09-25 15:38:46.542 test[819:907] The previous audio route was MicrophoneBuiltIn
2012-09-25 15:38:46.544 test[819:6607] Checking and resetting all audio session settings.
2012-09-25 15:38:46.545 test[819:907] This is not a case in which OpenEars performs a route change voluntarily. At the close of this function, the audio route is Speaker
2012-09-25 15:38:46.546 test[819:6607] audioCategory is incorrect, we will change it.
2012-09-25 15:38:46.782 test[819:6607] audioCategory is now on the correct setting of kAudioSessionCategory_PlayAndRecord.
2012-09-25 15:38:46.786 test[819:6607] bluetoothInput is incorrect, we will change it.
2012-09-25 15:38:46.791 test[819:5c03] 15:38:46.792 shm_open failed: “AppleAURemoteIO.i.724ba” (23) flags=0x2 errno=2
2012-09-25 15:38:46.795 test[819:5c03] 15:38:46.795 AURemoteIO::ChangeHardwareFormats: error 3
2012-09-25 15:38:46.796 test[819:5c03] 15:38:46.797 shm_open failed: “AppleAURemoteIO.i.724ba” (23) flags=0x2 errno=2
2012-09-25 15:38:46.797 test[819:5c03] 15:38:46.798 AURemoteIO::ChangeHardwareFormats: error 3
2012-09-25 15:38:46.805 test[819:6607] bluetooth input is now on the correct setting of 1.
2012-09-25 15:38:46.808 test[819:6607] categoryDefaultToSpeaker is incorrect, we will change it.
2012-09-25 15:38:46.808 test[819:907] Audio route has changed for the following reason:
2012-09-25 15:38:46.810 test[819:907] There has been a change of category
2012-09-25 15:38:46.811 test[819:907] The previous audio route was Speaker
2012-09-25 15:38:46.893 test[819:6607] CategoryDefaultToSpeaker is now on the correct setting of 1.
2012-09-25 15:38:46.895 test[819:6607] preferredBufferSize is correct, we will leave it as it is.
2012-09-25 15:38:46.897 test[819:6607] preferredSampleRateCheck is correct, we will leave it as it is.
2012-09-25 15:38:46.896 test[819:5c03] 15:38:46.896 shm_open failed: “AppleAURemoteIO.i.724ba” (23) flags=0x2 errno=2
2012-09-25 15:38:46.898 test[819:6607] Setting the variables for the device and starting it.
2012-09-25 15:38:46.900 test[819:6607] Looping through ringbuffer sections and pre-allocating them.
2012-09-25 15:38:46.895 test[819:907] This is not a case in which OpenEars performs a route change voluntarily. At the close of this function, the audio route is SpeakerAndMicrophone
2012-09-25 15:38:46.899 test[819:5c03] 15:38:46.899 AURemoteIO::ChangeHardwareFormats: error 3
2012-09-25 15:38:46.907 test[819:907] Audio route has changed for the following reason:
2012-09-25 15:38:46.908 test[819:907] There has been a change of category
2012-09-25 15:38:46.909 test[819:907] The previous audio route was ReceiverAndMicrophone
2012-09-25 15:38:47.497 test[819:6607] Started audio output unit.
2012-09-25 15:38:47.499 test[819:907] This is not a case in which OpenEars performs a route change voluntarily. At the close of this function, the audio route is SpeakerAndMicrophone
I count 4 category changes and 4 route changes resulting from them, all in approximately one second. That seems like something trying to override the OpenEars settings in an automatic way. You can see that at the moment the audio unit begins, it isn’t set to a category which has a recording input, because the next thing that happens is that OpenEarsLogging announces that there was a route change to a route that contains a recording input (the last line above). The fact that the audio unit tries to start at a moment when it appears to have a mic input, which then disappears while the unit is starting, is probably the origin of the crash.
Are you completely positive that there is nothing going on with your Nuance SDK at the time that PocketsphinxController has started? Because that would make a lot of sense as the source of objects which are listening for audio session changes and automatically reacting to them. Otherwise, do a case-insensitive search of your app for “audiosession” and see if there are any AudioSession or AVAudioSession calls made by the app. You might also want to look for AVPlayer, AVAudioRecorder and/or MPMoviePlayerController objects (or other objects that assert their own audio session settings) that are instantiated at that time. I will keep looking into it in the meantime so please let me know if you find a cause in your app.
Do you see this on the same device and OS with the sample app?
OK, I’ll check it out and get back to you.
Which iPhone?
OK, I haven’t heard of this before but it looks like something is going awry with the audio session. OpenEarsLogging turns on verbosity for the audio session, can you run your test again with [OpenEarsLogging startOpenEarsLogging] running and show me the output? Is there anything special about the app as far as the audio session or other media objects goes?
You can also just change the variance on the SLT voice to a very low value in order to get that zero-inflection phoneme effect.
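If I'm remembering the property name right, that would look something like the following (treat `target_stddev` as an assumption and verify the property name in FliteController.h before relying on it):

```objc
// Assumption: variance is exposed as target_stddev on FliteController.
// Check FliteController.h for the actual property name in your version.
self.fliteController.target_stddev = 0.0; // near-zero inflection variance
```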
Hi, sorry that I didn’t see this. Here is a function I use in the grapheme generator to obtain phones for arbitrary text:
const char * flite_text_to_phones(const char *text,
                                  const char *outtype)
{
    const char *phones;
    cst_utterance *u;
    u = flite_synth_text(text, voice); /* voice is a cst_voice * available in scope */
    phones = print_phones(u);
    return phones;
}
But this of course involves two synthesis passes. I do it with a really fast voice in OpenEars so it isn’t that arduous but it’s probably still noticeable.
If I recall correctly, the phonemes used in Flite are the same ones used in Pocketsphinx, with the exception that Pocketsphinx’s ah needs to be turned into ax.
Check out the FAQ for these answers: https://www.politepix.com/openears/support
I think it’s fine to use the text I mentioned in the FAQ as long as you link to the CMU license somewhere (either to a reprint of it on your site or on one of their sites). Conforming to the CMU license isn’t something I can speak definitively about since it isn’t my license but in my experience the goal isn’t to put long licenses into your apps but to make sure that app users can access the licenses. Thanks for paying attention to crediting and licensing!
Awesome, good to have the confirmation.
OK, good luck!
I’m surprised to hear that it’s possible and in the thread it’s actually the other developer who has a method, so I can’t really help with implementing his approach unfortunately.
This is the only information I have on the subject:
On the engine side of things, only if you use JSGF, but this will significantly slow down your recognition and IMO it also reduces accuracy.
You can also just screen your hypotheses for the results you are looking for, i.e. if you receive something other than the complete phrase in the order you expect it, ignore it.
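Screening could look something like this in the hypothesis callback (the phrase is a stand-in for whatever you expect):

```objc
// Ignore any hypothesis that isn't the exact phrase we're waiting for.
- (void)pocketsphinxDidReceiveHypothesis:(NSString *)hypothesis
                        recognitionScore:(NSString *)recognitionScore
                             utteranceID:(NSString *)utteranceID {
    if ([hypothesis isEqualToString:@"GO RIGHT"]) { // stand-in phrase
        // act on the complete phrase in the expected order
    }
    // anything else is silently ignored
}
```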
OK, but this is a standard complaint about the KAL voices and one I’ve rarely heard about the higher-quality 16-bit voices:
I think the current TTS isn’t always accurate and sometimes hard to understand unless the word is in a sentence
There is no OpenEars function to just say phonemes. If you’re handy with C and want to read up on the Flite public API, you can change FliteController’s implementation of Flite to accept an input of phonemes instead of words and use Flite’s flite_synth_phones function on a returned CST utterance that then needs to be turned into a CST wave, and recompile the framework to give your app access to the changed method. It’s possible but the steps involved are unfortunately outside of the support scope of this forum.
Sorry, this isn’t currently possible, but I’ll take it under advisement as a useful future feature.
Why are you doing your own Audio Session management (serious question, maybe there is a good reason for it despite it being in conflict with the OpenEars instructions)?
Right, but you must be using it during the recognition activity because otherwise the AVPlayer audio session would completely override the OpenEars audio session, so my interest is in how you are using it so it is possible for its playback settings to conflict with those of the OpenEars audio session.
It isn’t actively changing the sample rate for playback, it is using the required recording and playback audio session type with a 16k record rate, which might override the playback rate as an unintended side effect. It’s actually a bit surprising to me that the playback rate of a media object is being affected at all, can you show me your object playback code as a test sample so I can replicate and look into it when there is time?
Just for some background on why it’s like this: for speech perception purposes there isn’t a big improvement from sampling rates higher than 16k (and mono), which means that most speech recognition software will attempt recognition with a maximum of a 16k sample rate, since that means far fewer samples have to be analyzed. For non-speech applications such as music it’s naturally always going to be better to use a higher sample rate and stereo if possible. But even for speech that humans listen to, you generally don’t get much extra “bang for the buck” going from 16k to 44.1k, because the comparison standard is telephone bandwidth, which is generally standardized at 8k and compressed, making 16k PCM already a big step up. The reason that recognition is compromised is that the engine assumes a “chunk” of speech is likely to occur within a certain number of samples in a timeframe; at 44.1k the speech is spread across roughly 3x the expected samples, so it is really not going to map well to the recordings in the acoustic model (which are actually 8k, but the input functions compensate for the doubling of the input rate).
OK, that’s your call. I think that the perceived speech as far as pocketsphinx is concerned will seem quite different but I’ve also had the experience that it does perform the recognition after all, but there is a big loss of accuracy. For a small vocabulary it’s true that you might find it tolerable regardless so I’m glad to hear it works all right for your application. Do me a favor and mention your override in future support questions so that I can distinguish between potential issues that are normal and potential issues which could be a side-effect of your change.
Will this have an influence on OpenEars?
Yup, see my answer that slipped in ahead of your last post.
Why do you need to make a CD-quality recording using the same stream that Pocketsphinx is using?
Ah, I understand now, you’re using a full 44.1k rate and PocketsphinxController requires (really requires) a 16k rate. If you convince it not to sample at 16k you will reduce the recognition accuracy severely. You’re correct that 16k recordings won’t sound as nice as 44.1k (CD quality) but if Pocketsphinx analyzed a 44.1k recording it would take forever.
Yup, it’s the second-worst voice out of eight. I think it would be a good use of time to brush up on the documentation about the different voices and try the better ones first.
Which voice are you using?
variance is the degree to which inflection is given a perceived sense of randomness.
Can you describe the reduction more specifically? It shouldn’t be possible for the bitrate or sample rate to be changed so I’m unclear on what aspect of playback is different. You can’t use PocketsphinxController without the audio session settings it needs.
You could try keeping your sample buffers and writing them out to a WAV file and submitting the WAV file to the runRecognitionOnWavFileAtPath: method. You won’t get voice audio detection/continuous recognition but you can submit the speech at the end of the capture.
I don’t think that is going to be trivial. I’m sure it is in some way possible but I doubt it can be done while enjoying any of the convenience functions of AVCaptureSession or AudioSessionManager/ContinuousAudioUnit. It’s unfortunately outside of the scope of the support I can give here.
There’s only one audio stream, it can’t be streamed into two objects simultaneously.
Ah, gotcha. OK, glad it’s working for you!
You just need to get the verbosePocketSphinx logging turned on and it will tell you what is going wrong.
Are you capturing audio at the same time as you are trying to do speech recognition?
OpenEarsLogging and verbosePocketSphinx aren’t related. OpenEarsLogging logs the basic functionality of the audio driver etc, and verbosePocketSphinx logs what is going on under the surface for pocketsphinx, which is where your issue is. I don’t think it’s possible that you won’t get any new logging output when you turn on verbosePocketSphinx since the crash is occurring after pocketsphinx is starting. Please double-check that it is turned on so you can show your logs.
Acoustic model and language model is generated dynamically, so this shouldn’t be missing.
The language model can be generated dynamically, but the acoustic model is part of the “framework” folder that has to be dragged into an app and cannot be dynamically generated. My guess is that the acoustic model isn’t in your new app.
To find out why PocketsphinxController is crashing, set verbosePocketSphinx to true. It probably can’t find all or part of the acoustic model or the language model in your new app.
I don’t see that as a big performance issue for audio of that length.
What specifically do you think would be an I/O issue?
(I should say: I don’t expect that there is any way, which doesn’t mean that it’s impossible, just that my educated guess is any workaround will lead to more problems down the road than it solves right now).
You can’t do that with secondsOfSilenceToDetect. You can fake the first part by immediately suspending listening when listening begins (using the relevant OpenEarsEventsObserver callbacks) and then unsuspending it when you want to begin your arbitrary interval. But there is no way to force recognition/avoid voice audio detection submitting recognition in its own time.
My first suggestion would probably work very similarly to your wish though — instead of starting up recognition and then starting a timer, start a timer that starts an AVAudioRecorder and when your timer runs out, submit the PCM audio to runRecognitionOnWavFileAtPath. It should be functionally the same as what you want as far as I can tell.
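A sketch of that timer-plus-recorder approach (the recorder settings are chosen to produce the 16k/16-bit/mono PCM that recognition expects; `wavPath`, `lmPath`, and `dicPath` are placeholders):

```objc
// Record 16k/16-bit/mono PCM WAV, then submit the finished file for recognition.
NSDictionary *settings = @{AVFormatIDKey : @(kAudioFormatLinearPCM),
                           AVSampleRateKey : @16000.0f,
                           AVNumberOfChannelsKey : @1,
                           AVLinearPCMBitDepthKey : @16};
self.recorder = [[AVAudioRecorder alloc] initWithURL:[NSURL fileURLWithPath:wavPath]
                                            settings:settings
                                               error:nil];
[self.recorder record];

// ...later, when your arbitrary-interval timer fires:
[self.recorder stop];
[self.pocketsphinxController runRecognitionOnWavFileAtPath:wavPath
                                  usingLanguageModelAtPath:lmPath
                                          dictionaryAtPath:dicPath
                                       languageModelIsJSGF:NO];
```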
It’s only possible to do voice audio detection recognition with OpenEars on recordings using its driver. What you could try is to make a WAV recording of the speech and then submit it at the end to the method runRecognitionOnWavFileAtPath:usingLanguageModelAtPath:dictionaryAtPath:languageModelIsJSGF:
Sorry I don’t have a lot of info on hand about the way that the scoring works — I made a decision to pass the score along via the API because it was available as data and it seemed like overreaching to decide not to pass it back through the callback, but based on my discussions with the CMU Sphinx folks it doesn’t provide a lot of viable info for language models of the size that are appropriate for iPhone apps so I haven’t done a lot of investigation of its intricacies myself. My general advice is that you should base any logic that makes use of the score on data that emerges from well-organized and diverse tests rather than an interpretation of the scoring method.
I would just recommend looking into the source in the framework project to get the exact formula/e.
recognitionScore is equivalent to confidence score, but in pocketsphinx that is extraordinarily dependent on size of language model and environment and speaker, meaning that you have to be very conservative about using it for any program logic. I recommend testing a lot under many circumstances before deciding how to use scoring or whether to use it.
For n-best, are all the scores not being returned?
I would recommend reducing it and doing some user testing to see what the minimum is for your application before you have an issue with utterances being cut off.
This is the authoritative discussion on this, but my impression is that it’s a bug or questionable feature of the audio session: https://www.politepix.com/forums/topic/keep-system-sounds-while-listening/
I unfortunately don’t have any insight into this issue beyond the thread discussion.
Say, would you be so kind as to remove the salty language in your language model when you post your logs? I don’t have a content filter on the forum because it’s almost never needed, but it’s better kept out of Google’s index for the site. Thank you!
There’s no built-in mechanism for changing the audio session category inside the framework (in fact, it’s required by the framework that you let it set its own audio session settings for good results) but you can always make your own calls to the audio session using standard AVAudioSession methods in parts of your app which don’t need to actively use OpenEars classes.
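For instance, in a part of the app where no OpenEars class is actively in use, a standard call like this (the category here is just an example):

```objc
// Standard AVAudioSession call, made only while OpenEars classes are idle:
NSError *error = nil;
[[AVAudioSession sharedInstance] setCategory:AVAudioSessionCategoryPlayback
                                       error:&error];
```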
I don’t really want to give advice here regarding forcing the audio session to reset because in 99% of cases the issue is not due to the audio session and folks will read the steps here and do stuff to the audio session directly and end up with messed-up apps that are very confusing for me to troubleshoot. That said, in this one case it might be worth your while to go investigate how the shared audio session manager is started by the internal classes and give it a go.
OK, the issue is that the video player completely changes the audio session, so if you continuously play a video while PocketsphinxController is suspended, it guarantees that its built-in audio session reset behavior won’t work. I think the only option is to find a way to do what you need to do without always running a video.
It might not be an actual mistake, but possibly some kind of limit to how well OpenEars can override the audio session with respect to the timing of your video if it’s close. What is in your log excerpt isn’t anything bad, but it might be helpful to see the whole thing (minus your own app logging which I don’t need to see). It does sound like the video might be changing the audio session and your recognition loop has quiet input as a result or possibly a wrong sample rate or something similar.
Are you playing the video before starting the recognition loop or during it?
Can you turn on OpenEarsLogging and show the log? Is the same issue there if you use the sample app and make “NO” one of the dynamically-created words in the dynamic model without any of your video code?
You could try RapidEars and see if it helps if you’re open to non-free solutions. If I recall correctly, your implementation isn’t a supported method, so you might have audio session problems.
Sure, check out the float property of PocketsphinxController “secondsOfSilenceToDetect”. I just moved it into the class so you could set it programmatically.
What I would like to know is the order in which the delegates are called.
Good question — this is really pretty dependent on what is happening/what you are doing. It isn’t so much that there is a particular order to expect but that there are particular events which will result in a certain delegate callback. The basic thing that you will see is the start event, then lots and lots of updates of the live speech event (as you mention) followed by a finalized speech event.
However pocketsphinxDidStopListening doesn’t appear to be called; should this not be called at some point before pocketsphinxDidStartListening is called? Or should pocketsphinxDidStartListening not be called except for the very first time?
I think this is an example of flawed naming on my part — pocketsphinxDidStartListening and pocketsphinxDidStopListening are not actually analogs. pocketsphinxDidStartListening is called when entering the listening loop, pocketsphinxDidStopListening is called when turning off the recognition engine finally.
What causes rapidEarsDidDetectFinishedSpeechAsWordArray to be called? Does it still work on the second of silence?
Correct, there are lots of attempts to recognize during the speech, and then once there is a pause there is a finalized, higher-accuracy attempt that is very similar to the default recognition behavior of OpenEars. It can be turned off (and should be turned off to save a few cycles) if you are only interested in the live speech but I left the option in there of using it so you aren’t excluded from the old-style pause-based recognition if you choose RapidEars. You can turn it off by setting this:
Just to confirm, but pocketsphinxDidReceiveHypothesis should no longer be called?
What sort of delay, if any, will be caused when it’s switching between these states? I’m mainly interested in finding out whether any words will be lost if they are said between rapidEarsDidDetectFinishedSpeechAsWordArray and pocketsphinxDidStartListening being called. How is the recognition loop affected? Should I make the user wait before continuing to speak?
Just like with OpenEars, the engine is not taking in new audio while it is performing that pause-based finalized recognition (if you tell it to stop finalizing the expected behavior is that it shouldn’t have gaps in listening — let me know if that isn’t the case). But there shouldn’t be a delay in the time between returning the hypothesis and going back to listening.
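To make the flow above concrete, here is a hedged sketch of the two RapidEars callbacks discussed in this thread. The method names come from this discussion, but the exact signatures below are my assumption; check the RapidEars headers for the real ones:

```objectivec
// Assumed signatures based only on the method names mentioned above — verify
// against the RapidEars category headers before relying on them.

// Called many times per second while speech is still in progress.
- (void) rapidEarsDidDetectLiveSpeechAsWordArray:(NSArray *)words {
    NSLog(@"Live (in-progress) hypothesis: %@", words);
}

// Called once after a pause, with the finalized, higher-accuracy hypothesis.
- (void) rapidEarsDidDetectFinishedSpeechAsWordArray:(NSArray *)words {
    NSLog(@"Finalized hypothesis: %@", words);
}
```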
Excellent :) . Give it a try on a device for the best recognition quality (the simulator recognition is not so great).
I think for clarity I would prefer it, at first I thought I was doing something wrong.
OK, good to know that should be improved. I will need to take a look at your project in-depth a little later when I can reattach the references, but the only thing I noticed right off the bat is that you’ve added the -ObjC flag correctly but only for release and not debug. I guess if you’re then running it in debug it will probably be sad :) . Is it possible this is the issue?
OK, that line is just intended to refer to changing all of the incidents of the old-style recognition start to the new-style recognition start (since it occurs many times in the sample app) in the same way as the example that comes right before it, but if it’s unclear as-is I can revise it to be clearer, thanks for the feedback.
All righty, I’m not sure what the issue is with the project so could you put up your modified sample app somewhere for me to download and take a look at? You can remove the OpenEars framework from the sample app folder to save file size if you want (but don’t remove the RapidEars plugin since I need to see how that is connected to the project). Thanks!
Based on the log, the plugin isn’t being used at all in the project since the error is that a method that is in the plugin isn’t available to the project. Can you elaborate on your comment:
had to adapt the guide slightly; this line is not right:
“Then replace all of the other occurences of startListeningWithLanguageModelAtPath: with startRealtimeListeningWithLanguageModelAtPath:”
Instead I replaced it with
[self.pocketsphinxController startRealtimeListeningWithLanguageModelAtPath:self.pathToGrammarToStartAppWith andDictionaryAtPath:self.pathToDictionaryToStartAppWith]; // Starts the rapid recognition loop.
Since the instructions say to replace all of the references to startListeningWithLanguageModelAtPath: with startRealtimeListeningWithLanguageModelAtPath:, which appears to be what you did? I’m still confused about what part of the instructions you are saying wasn’t right and I think maybe this could be related to the issue you’re experiencing.
My impression is that the plugin isn’t added to the project target, or it isn’t being imported into the class in which you are using it by following these lines from the instructions:
“1. Open up ViewController.m in the editor and up at the top where the header imports are, after the line: […] add the following lines: […]”
Sorry you are having difficulty integrating the plugin. Please turn on logging and show the output so I can assist — the instructions are known to work so there must be a minor implementation issue which the OpenEars logging may assist with. Here is a link to how to turn logging on: https://www.politepix.com/openears/yourapp/#logging
Can you elaborate on what the difference is between your correction and the original instructions? They look like they are the same to me but maybe I am overlooking something.
Just checking since I recall you are using an earlier version of OpenEars, did you follow the instructions to update to OpenEars 1.1 before adding RapidEars?
Sounds a little big for local recognition (I think 500-1000 words is probably better) but the only way to know for sure is to test.
Try setting n-best to return the 1 best hyp with a score and I think you should receive the score per word.
Take a look at the sample app and search for “nbest”.
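If it helps, a sketch of what that looks like. The property and delegate names here are from memory and may not match your version exactly; the sample app’s “nbest” section is authoritative:

```objectivec
// From memory — confirm these names against the sample app's nbest example.
self.pocketsphinxController.returnNbest = TRUE; // turn on n-best delivery
self.pocketsphinxController.nBestNumber = 1;    // ask for just the single best hypothesis

// OpenEarsEventsObserver delegate method (assumed signature) receiving the
// n-best list; with nBestNumber = 1 it should contain one scored hypothesis.
- (void) pocketsphinxDidReceiveNBestHypothesisArray:(NSArray *)hypothesisArray {
    NSLog(@"N-best: %@", hypothesisArray);
}
```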
Sounds like a nice app, always happy to hear about that kind of use of OpenEars. This is more of a general application design question but I don’t mind taking a stab at it.
I think that the most efficient way to deal with this kind of issue is to launch the follow-up method (in this case whatever the “ask next question” method is) _from_ fliteDidFinishSpeaking (as part of an OpenEarsEventsObserver instance that is instantiated in the class where the follow-up method lives). Generally, a good pattern for that kind of approach is to keep a “queue” of questions in an NSMutableArray, adding them at whatever point you know what they are supposed to be, and every time fliteDidFinishSpeaking is called you check the queue to see if there are any questions left to ask. If there is a next question in the queue, you launch your question method with that question from fliteDidFinishSpeaking and remove it from the queue. Eventually you will run out of questions and fliteDidFinishSpeaking will not result in another question being asked. If logic dictates that there are new questions that should be asked, you add them to the queue. Does that make sense?
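A rough sketch of that queue pattern (the say:withVoice: call and the voice name are from the standard FliteController API as I recall it; treat those details as assumptions):

```objectivec
// Sketch of the question-queue pattern described above. self.questionQueue is an
// NSMutableArray property you create; self.fliteController is your FliteController.

- (void) askNextQuestionIfAny {
    if([self.questionQueue count] > 0) {
        NSString *nextQuestion = [self.questionQueue objectAtIndex:0];
        [self.questionQueue removeObjectAtIndex:0]; // take it out of the queue
        [self.fliteController say:nextQuestion withVoice:@"cmu_us_slt"]; // voice name is an example
    } // if the queue is empty, nothing more is asked and the loop ends naturally
}

// OpenEarsEventsObserver delegate callback: fires each time Flite finishes speaking.
- (void) fliteDidFinishSpeaking {
    [self askNextQuestionIfAny];
}
```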
What do you mean by pausing the main application?
It’s very much relative to that particular speaker and session and the size of the language model. It can’t be reliably used to create arbitrary cutoffs in my experience, except perhaps with extremely low values (like -500000 or lower). I wish it were more useful for the task of evaluating accuracy, but I haven’t encountered a case yet where it was possible to rely on the score, so I’ve reversed my old advice that it could be used in this way.

July 3, 2012 at 8:25 am in reply to: Identifying the time when a particular word is spoken #10027
There is no feature that can do this, sorry.
OK, so that means the issue is this: it used to work because your AIR app was overriding the audio session in some way that it requires after OpenEars first established it, and it had the random luck not to break recognition (or if it did have a negative effect on recognition, which is possible, we weren’t directly aware of it). Now that OpenEars does a sanity check of the required settings at every recognition round, the AIR app no longer controls the audio session settings, so it can’t do its audio playback in whatever form it requires.
You have a fix, which is to break AudioSessionManager, but this will probably also have negative effects on OpenEars’ performance. However, the previous AIR overriding that worked with versions before 1.0 most likely had the same effects, so it might not have seemed like a problem. My random guess about the iPad is that of all your 4.x-compatible devices, it may be the only one with only a single mic, and whatever the AIR audio session settings are, they may force recognition to occur on a second mic.
I’m a bit confused by the description “main app”, are there multiple apps in some sense?
What kind of sounds are these, sounds you are actively playing or system sounds?
Hmm, just to rule out any of the early bugs, can you upgrade to 1.1? I don’t really think that is your issue but it would be a good first step to set the level. Also it has nice new features, one of which is easier logging which might show us if there are any errors in the audio session manager.
Have you read the instructions on safely upgrading? https://www.politepix.com/forums/topic/uninstalling-your-pre-1-0-openears-library-install-before-installing-1-x/
This is not a good application of the library, unfortunately.
The new version is up (I haven’t had time to update the detailed documentation so I haven’t announced it yet but you can see it at https://www.politepix.com/openears) so you can give it a try. There is a preprocessor define that turns on n-best in the new sample app so you can uncomment it to experiment with n-best and scoring.
ARPA is probabilistic and increasing the probability of a phrase to 100% would break it as designed. What you are looking for is JSGF, which is a rules-based grammar. If you search for JSGF on this forum you should get a lot of starting info to research it further.
OK, I think that the upcoming n-best feature should handle this for you but there isn’t anything in the current version which I can think of which will help. I’m hoping to release the next version around next Monday.
OK, to clarify my own understanding of the issue, is what is happening that the actual phrase is “call my friend Maxim” and what is being reported is “Molly Glen Maxim” or something along those lines with the non-name words being replaced by similar-sounding names? Or is the issue that the rest of the sentence is being recognized correctly but you want to disregard non-name words which are in your language model?
The end goal is just to be informed that “Maxim” was detected in the sentence without needing to know the specifics about the other words, is that correct? I don’t think there is a way to get per-word scores for a multiple word sentence, but n-best scoring will be coming up in the next version of OpenEars.
Congratulations on your app! Feel free to promote it in the sticky topic at the top of the forum once it’s accepted.
The license is at the root of the distribution, it’s called license.txt. It has some credit boilerplate language.
Nope, no gender switch. The most effective thing I can recommend would be adapting the acoustic model with a large set of speech recordings of female speakers, using only speech related to your language model: http://cmusphinx.sourceforge.net/wiki/tutorialadapt
No problem. There is another potential complication that isn’t immediately obvious but that I’ve been trying to make a point of mentioning more frequently here, which is that a lot of developers specify apps with the idea that the device can be pretty far away from the user, but this actually gives the device speech recognition task an additional disadvantage that a desktop speech recognition application would be unlikely to have: a big mismatch between the design of the available microphone and the use that is being made of it. You can even see this with Siri if you open Notes and do dictation from a distance; return time from the server will get slower and accuracy will decrease because the iPhone mic is designed to be spoken directly into and to reject “background noise” which might be your user if they are far enough away and there are competitive sounds.
This isn’t as big a deal with command and control language models/grammars, but as soon as you’re past 20 words or so you can start to see an impact. So another approach is to see if you can educate your users to not put too much distance between themselves and the device during app use.
Yes, recognizing numbers in isolation seems to be a difficult task for speech recognition engines.
“1) Trying to create a better language model by using a different toolkit such as SRILM, MITLM or IRSTLM”

“3) Using LanguageModelGenerator”
Most language modeling software uses a set or subset of a few existing algorithms, so I don’t think you need to do a lot of experimentation there. The LanguageModelGenerator uses another good package so you could probably just try out whether its output is preferable and then call it a day.
“Build an acoustic model with just the numbers 1-10”
Don’t you need 1-100? But you might want to investigate this approach and/or adapting the existing model with your new data: http://cmusphinx.sourceforge.net/wiki/tutorialadapt
It seems like the task of creating an acoustic model that just recognizes 1-100 with a number of different voice contributors and accents is constrained enough to be feasible for an app project.
“Using JSGF instead of ARPA”
In my opinion, after some recent experimentation, JSGF is too slow for a good UX. Other developers do use it, so as I said this is a matter of opinion. You can use the garbage loop approach for out-of-vocabulary rejection with ARPA just as with JSGF (http://sourceforge.net/p/cmusphinx/discussion/help/thread/cefe4df3), which could improve your results if the issue is too many false positives rather than too many false negatives or transposed recognitions.