Forum Replies Created
OK, just for future reference there is a FAQ here where some similar questions are answered: https://www.politepix.com/openears/support
The intention of the simulator driver is basically to let you debug everything else in your app in the simulator without Pocketsphinx breaking — that is pretty much the full extent of its ambitions :) .
The device driver is tuned for Pocketsphinx and the iOS audio unit so it will be fast and reliable under high load with a small memory footprint (and so it won’t decrease the lifetime of the device flash drive by reading/writing continuously when Pocketsphinx is running). It can’t be translated into a simulator driver that works identically with all the different desktop devices that it might host without that becoming its own project, and I’d rather put the time into the device code. I also suspect that if the simulator driver were really good, some developers might test on it and be really surprised by the real-world performance of their app.
Hang on, when you say simulator performance, are you talking about processing speed or word accuracy rates? You will see terrible accuracy if you only test on the simulator; there are warnings about this everywhere.
OK, so here is the first recognition that I looked at in your log:
2012-05-07 21:14:14.005 2Sign-test[3860:13c07] OPENEARSLOGGING: Processing speech, please wait…
2012-05-07 21:14:14.037 2Sign-test[3860:13c07] OPENEARSLOGGING: Pocketsphinx heard “SALI SALI” with a score of (-12604) and an utterance ID of 000000001.
This is 32 milliseconds. Is it possible that you are looking at the wrong metrics for processing time?
By the way, I would really appreciate your troubleshooting this with me on a device instead of the simulator, since it is not the same audio driver and it’s not useful information about how the library performs in real use. What devices have you used the library with when you’ve seen the performance issue you’ve described?
The big performance difference is 100% to be expected — it’s the wrong expectation to see anything comparable. Practically speaking, you’re comparing the machine on your desk with a desktop from the 1990s, so there’s no discovery to make there other than that there is much less horsepower and RAM available for your iOS task.
I will check out your logs. The alphabetization issue could be important because it is part of the specification for the dictionary and the ARPA model that they are alphabetized. It is something that could be a speed issue on the device, because the desktop performance will be basically realtime which means that you won’t notice anything that negatively impacts performance there. What could be happening is that the search method is supposed to short-circuit once it goes beyond the alphabet range, but it can’t because there is no predictable alphabet range for your model so the whole thing has to be searched for every search path. Again, you might not see any issue with this on the desktop because it is all happening so quickly that de-optimizations don’t show up in real usage.
I again suggest that you check out a similar English model on the device so that you can see what an average speed response is. For the days of the week in English, I would expect a near-realtime response from OpenEars after the obligatory silence detection period.
Can I ask you for some clarification about this question?
“The first thing that comes to my mind is whether the format of recorded speech data in Objective-C affects the result.”
Are you asking about the format of your acoustic model recording or the input format on the iPhone? The input format of speech is always PCM and it is only ever stored in memory in a ringbuffer, so there is no format question there or any reading/writing to or from disk, it’s just the raw samples that are streamed from the mic being read out of a C buffer. It’s about as low-latency as it can be without requiring so many callbacks per second that it would hurt performance.
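As a rough illustration of that design — this is a generic sketch of a sample ring buffer, not OpenEars’ actual driver code, and all names here are invented:

```c
#include <assert.h>

/* Hypothetical fixed-size ring buffer for 16-bit PCM samples. The audio
   callback writes incoming mic samples in at one end; the recognizer reads
   them back out at the other. No disk I/O is involved at any point. */
#define RING_CAPACITY 4096

typedef struct {
    short samples[RING_CAPACITY];
    int head;  /* next write position */
    int tail;  /* next read position */
    int count; /* samples currently buffered */
} PCMRing;

/* Audio-callback side: copy incoming mic samples into the ring. */
int ring_write(PCMRing *r, const short *in, int n) {
    int written = 0;
    while (written < n && r->count < RING_CAPACITY) {
        r->samples[r->head] = in[written++];
        r->head = (r->head + 1) % RING_CAPACITY;
        r->count++;
    }
    return written; /* samples actually stored */
}

/* Recognizer side: drain buffered samples for decoding. */
int ring_read(PCMRing *r, short *out, int n) {
    int read = 0;
    while (read < n && r->count > 0) {
        out[read++] = r->samples[r->tail];
        r->tail = (r->tail + 1) % RING_CAPACITY;
        r->count--;
    }
    return read;
}
```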
Hmm, ARPA mobile performance on my end is usually fast. My test language model of approximately 400 words returns responses for long sentence dictation on an iPhone 4 in about 1-4 seconds, which is fast considering how constrained the CPU and RAM are on the device versus any desktop.
All the audio and processing code is in C whether written by me or as part of one of the dependencies. I’m pretty sure after a year and a half that there is nothing “wrong” per se with the framework performance, but it could be that you are encountering an issue with your application.
I would say that I’d need some timing examples with information about the following items:
ARPA or JSGF?
Is the acoustic model 8-bit or 16-bit?
What is the size of the language model or grammar?
How far away from the phone are you when speaking? (this has a huge effect for dictation applications)
What are you trying to recognize? Words, sentences, phrases?
Please show some representative time results along with the device they are taken from (not simulator). Feel free to show logs that have both OPENEARSLOGGING and VERBOSEPOCKETSPHINX turned on so it is possible to see the exact processing times once complete speech has been detected until the hypothesis is returned. I remember that the last time there was a similar question about ARPA, it turned out that something in the app was blocking recognition (although JSGF performance is unfortunately poor on the device, so if it is actually a JSGF grammar, that will be slow).
My other question is whether you have tested similar size models in English on a device, so you can rule out that it is connected to your acoustic or language models.
I don’t really think device vs. desktop is a fruitful discussion, since it is the expected result to see really dramatic performance differences there (or all the other implementations of speech recognition for the iPhone would not be server-based). But I’d be interested to know what kind of device performance you are seeing and also how it compares to similar-sized models in English, in case there is something specific to the Turkish model or implementation that is slowing things down.
Correct, you need to create some method for parsing your text into the format that you need spoken. Flite doesn’t have a built-in system for detecting what kind of a number it is being given (this is a good thing since it would lead to wrong assumptions). Without knowing much about your requirements, something that is fairly simple might be to just try to detect the numbers 1-9 in your text string and replace them with the equivalent word (e.g. replace “1” with “one”, etc). But if you don’t know much about the text that is coming in, or you know that a “1” could mean more than one thing if it occurs in your text, this approach might be too primitive.
Sure, this is just an implementation detail for you since you are controlling the string that is passed to FliteController. Instead of passing FliteController the word “1200”, pass “1” then “2” then “0” then “0”, or pass “one two zero zero”.
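A minimal sketch of the digit-replacement idea in C (illustrative only — a real app would also decide when “1200” should become “twelve hundred” instead of digit-by-digit, handle punctuation, and so on):

```c
#include <string.h>
#include <assert.h>

/* Spoken words for each digit character. */
static const char *kDigitWords[] = {
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine"
};

/* Expand each digit in `in` into its spoken word, copying everything else
   through unchanged. Caller supplies a sufficiently large output buffer. */
void expand_digits(const char *in, char *out, size_t outsize) {
    out[0] = '\0';
    for (const char *p = in; *p; p++) {
        if (*p >= '0' && *p <= '9') {
            strncat(out, kDigitWords[*p - '0'], outsize - strlen(out) - 1);
            strncat(out, " ", outsize - strlen(out) - 1);
        } else {
            size_t len = strlen(out);
            if (len + 1 < outsize) { out[len] = *p; out[len + 1] = '\0'; }
        }
    }
}
```

The resulting string (e.g. “one two zero zero ” for “1200”) can then be passed to FliteController in place of the original.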
Very cool, thanks for describing this in detail.
None that are part of the basic functionality of the framework as supported here, sorry.
Basically, the situation with this is that there is one microphone (in most devices), so there is one stream of samples, and OpenEars’ RemoteIO driver is consuming them. So they aren’t available for something else (like an AVAudioRecorder) to consume.
Awesome, very glad to hear it.
OpenEars uses the AudioToolbox.framework so it has to be compatible with it. Can you tell me which version of OpenEars you are using? This might have something to do with the new AudioSessionManager behavior in 1.0/1.01.
Weird, the last time I tested it as an audio-type background app it didn’t work. It’s been a few versions, though; I can believe it’s changed.
I guess I would suggest that because the AudioSessionManager has been revised, which probably relates to your question. When you update, be sure to read the docs and be sure to no longer directly invoke AudioSessionManager, since that is now internal to the framework.
If the main intention of the idea is to always have a template-style sentence with variable numbers inserted in a fixed position, you want to investigate how to write JSGF rules and use a JSGF grammar with OpenEars rather than an ARPA language model.
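For instance, a template rule of that kind might look something like this in JSGF (the grammar name, rule names, and vocabulary here are all invented for illustration):

```
#JSGF V1.0;
grammar ordertemplate;

// Fixed sentence frame with a variable number slot.
public <order> = I WANT <number> TICKETS;
<number> = ONE | TWO | THREE | FOUR | FIVE;
```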
Have you tried creating a language model which uses the number-related words you’d like to recognize? e.g. “ONE”, “TWO”, [….] , “TWENTY”, [….], “HUNDRED EIGHTY”, [….]
This is what would be necessary if you wanted to mix in numbers with other speech. If you want to recognize numbers exclusively, you’d want to replace your acoustic model with the tidigits acoustic model (and also create a language model that contains the spoken versions of the numbers you want to be able to recognize).
OpenEars has had N-Best hypotheses and scoring since version 1.1.
Well, it isn’t that they don’t run, it’s that they cut down on notifications and timer processes to save battery. I think it’s probably more about end-user expectations about what happens when they press the lock button than Apple being controlling, since pocket dialing is probably not a big worry due to the touch interface.
I don’t know offhand so I’d just recommend trying it on the sample app, but the answer is most likely no for the reasons explained in this StackOverflow question:
Without being able to react to events, it’s unlikely that a speech recognition engine is going to do much.
Thanks! Always happy to hear that it can do things I don’t expect it to be able to do :) .
To the best of my knowledge, the only consideration is whether there is an open stream from or to a device speaker or mic at the time of backgrounding; audio session is not a factor.
“If so, would there be a way to save the speech to an audio file and link it to a local notification?”
I can imagine that this _might_ be possible, but it’s probably going to be a complex implementation that isn’t going to make use of many of the conveniences of the original code. I can’t give pointers on how to do it, sorry.
I’m pretty doubtful that you can initiate new audio while in the background. I think you’re limited to continuing an audio stream you’ve already started before going into the background.
This is known as out-of-vocabulary rejection and it’s something you’d need to inquire about at the CMU sphinx board since it is primarily a pocketsphinx question.
Nope, sorry, this isn’t possible. I haven’t pursued it because I don’t think it’s background behavior that Apple intends to accept. Thanks for being straightforward about it though, since it’s frustrating when people ask for this feature under the guise of asking for a different feature.
I don’t think it would be helpful to give any advice other than simply testing the language sizes you are speculating about and see firsthand what results you get. There are a few variables even within a total number of recognizable hypotheses so it’s better to verify how it performs.
Only insofar as any code can be modified to do new things, but what you are talking about isn’t a quick drop-in of some code snippets kind of a situation, it’s nontrivial restructuring in more than one class using more than one language. Search for “partial hypothesis” on the CMU Sphinx forum to get some good ideas about where to get started. You might find that it simplifies things to use the old 0.902 version of OpenEars for experiments with this behavior so that you don’t need to worry about testing the ringbuffer in the audio driver as well.
OK, that makes sense. This just isn’t a feature of OpenEars.
OK, I don’t think I quite understand yet why there is recognition needed both for the full sentence and just for a keyword in it while it is underway. Can you give an example? The framework does recognition after the user has paused in speech for the duration in kSecondsOfSilenceToDetect. That’s the only trigger method for beginning recognition that is supported in its API.
Sorry, this isn’t currently a feature of OpenEars. Generally, for long dictation tasks you are probably going to see best results with a server-based service. Even with live recognition, the end recognition result would probably be the same since it would be resolving the overall utterance using the same engine, acoustic model and language model. Based on my experimentation the advantages for long dictation tasks would be in better UI feedback (but doing speech-based correction on a single error in a long dictation is not fun for the user at all) but not notably different recognition results. The advantages of this kind of recognition are much bigger for shorter command-and-control models because reaction time will be faster for whatever you are controlling in addition to the UI advantage.
Glad to hear it!
OK, that isn’t yet implemented.
Is there a reason you can’t suspend listening before playing audio and resume afterwards? Suspend and resume are near-instant.
Since these are questions about the implementation of Pocketsphinx it would be better to ask them at the CMU Sphinx site.
This is the OOV utterance problem — here is the relevant Pocketsphinx FAQ entry:
Joseph, what do you think about 4) tuning an LM by significantly raising the probabilities of desired trigrams or bigrams?
Whoops, I see you already explained where the FSG tool is. I think I was just having an out-of-body experience due to you casting aspersions on the One True Computer (I keep my linux installs safely on my servers where they can’t interfere with real work ;) ).
Joseph, thanks very much for the thorough answer. I agree that this would make a good FAQ entry; I will add it (or a more concise version of it) when I have a moment. I think the advice I should be giving for now is that users output an FSG using the appropriate tool (is it part of cmuclmtk?).
That does not seem typical to me, but maybe one of the JSGF users around here can weigh in.
(What I mean is, your requirements may mean that you have to look at server-based solutions).
OK, you might just find that a grammar with phrases comprising 3000 words just isn’t that performant on the hardware. The limits of in-device processing are far lower than what can be done server-side. But it’s definitely worth trying with the plain implementation just to see if there is a configuration issue.
What I meant was: when the hypothesis is calculated, will it allow for given phrases to be recombined? I figured out that the answer is “only if I want them to be.”
It still isn’t clear to me what your goal is here.
Out of curiosity, how large is your language model/grammar? 8-10x processing sounds really unusual to me, as does reduced accuracy for JSGF. A memory warning at calibration is also very unusual in my experience, especially on a device that recent.
Have you changed anything in your setup to make it not be stock? I.e. are you using Pocketsphinx .7 or have you made changes to the library? Question 2 is if you get the same results when using your grammar with the sample app, without making any other changes.
Would changing the grammar to JSGF help avoid this at all?
If you have some kind of logical pattern that lets you set a rule, and the rule will only allow “eight” instead of “eighty” when it’s [preceded by x || followed by y || whatevered by whatever], it will help. If you don’t have the opportunity to use logic to rule out similar-sounding sentence parts it will help less, but it will still probably be helpful to restrict recognition to a subset of phrases you expect.
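One way to sketch the “restrict to a subset of phrases you expect” part in C — the phrase list here is invented for illustration, and a real app would build it from its current context:

```c
#include <string.h>
#include <assert.h>

/* Accept a hypothesis only if it matches one of the phrases your app's
   logic allows right now. Anything outside the list (e.g. a bare "EIGHT"
   when only "GATE EIGHT" makes sense in context) is rejected. */
static const char *kAllowedPhrases[] = {
    "GATE EIGHT", "GATE EIGHTY", "ROW EIGHT"
};

int hypothesis_is_allowed(const char *hyp) {
    size_t n = sizeof kAllowedPhrases / sizeof *kAllowedPhrases;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(hyp, kAllowedPhrases[i]) == 0) return 1;
    }
    return 0;
}
```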
I don’t really understand this question fully; can you give an example?
“Also, it is my understanding that a JSGF grammar would allow multiple language model phrases in a single hypothesis, which would be necessary for my project.”
BTW, if you search around there is some alpha code for dynamic switching between JSGF grammars posted in another discussion here but I don’t know offhand what the link is.
This is a case for switching over to a JSGF grammar instead of an ARPA language model. There are links to some JSGF resources in the docs and there are some questions already in these forums which touch on this, but the CMU Sphinx forum might be an even better resource.
I deleted a thread about this earlier this week because it turned into a sort of tech support re-enactment of Heart of Darkness, but here is my answer to its initial question of where to get started with JSGF:
You can give the PocketsphinxController recognizer either a .languagemodel file or a .gram file depending on whether you want to use an ARPA model or a JSGF grammar. To see an over-simplified example of a .gram file for OpenEars, you can download previous version 0.902 here and look at the .gram file included with it:
To see a somewhat more complex example of a .gram file look at [OPENEARS]/CMULibraries/sphinxbase-0.6.1/test/unit/test_fsg/polite.gram
You can probably also find more .gram files in the sphinxbase and pocketsphinx folders in CMULibraries.
One limitation is that you can’t use JSGF grammars which import other rules using import statements at the top.
A .gram file still needs the corresponding phonetic dictionary .dic file in order to function. It is obviously necessary to run the startListening: method with JSGF:TRUE at the end. Using JSGF means that you can’t switch dynamically between grammars while the recognizer is running in the current version of OpenEars like you can with ARPA models.
Here is documentation for the JSGF format so you can write your own rules:
Check out the documentation for LanguageModelGenerator here:
The issue with the Voxforge models is that they aren’t license-compatible with App Store distribution.
Raw phonemes is something that I think only Sphinx 3 does, and IIRC with several caveats. I believe that the task you are doing is known to be very difficult to get good results for. There is an acoustic model called tidigits in [OPENEARS]/CMULibraries/pocketsphinx-0.6.1/model/hmm/en/tidigits with an accompanying language model in [OPENEARS]/CMULibraries/pocketsphinx-0.6.1/model/lm/en that I think is specifically oriented towards recognizing numbers, which you might want to try instead of hub4wsj_sc_8k and your custom LM, although I’ve never used it myself so I can’t make any promises.
Correct, it does nothing over the network.
Oh, something else to keep in mind is that the Flite phonemes do not map identically to the US English Pocketsphinx phonemes. I think that spending a little bit of time looking at the implementation of GraphemeGenerator from the library should probably point up any important differences, since I know it was an issue I needed to solve in order to set up the fallback pronunciation technique for creating a dictionary entry when the word isn’t found in the main dictionary file. It might be as minor as AX -> AH, just take a look (keeping in mind that you are doing the opposite process as GraphemeGenerator by taking a Pocketsphinx phoneme and converting it into a Flite one).
I have looked into it briefly in the course of my own attempts to improve the pronunciation in AllEars, but I don’t have functional code to share with you because I ended up choosing to not do this with Flite. I can’t remember what the exact reason was but early on in the troubleshooting process I decided to keep it simpler.
My not-finished idea was to give Flite SSML text:
Which it is meant to support. I can’t remember if I couldn’t get it working, or got it partially working and then found out that it didn’t give fine-grained enough control, or what the issue was.
You can give Flite SSML input as a text file with the function:
float flite_ssml_to_speech(const char *filename,
const char *outtype)
Which you can search the library for. Getting the output over to the FliteController method as a waveform I have no advice on, but I’m sure it’s possible. Hope this is helpful.
There is also a function for giving Flite explicit phones:
float flite_phones_to_speech(const char *text,
const char *outtype)
Which may be helpful if there is some method that I haven’t come across for marking a phone for emphasis, although my recollection is that it results in equal emphasis for all phonemes, i.e. robot voice.
A good place to look for implementation templates is the main() of flite.c that is part of the Flite download.
The simplest way to fake out Flite with an alternate pronunciation, if you have the luxury of knowing the words needed in advance, is to have Flite say a series of single-syllable words with the phonemes you actually want. I did this many times in AllEars because the Flite voice I chose didn’t say email, iPhone or mobile in a way that I expected to be immediately clear to the user. So instead I gave it @”eem ale”, @”eye phone”, @”moe bile” which worked out pretty well. Is that kind of workaround an option for you?
I’d have to recommend that you bring your Pocketsphinx functionality questions to the Pocketsphinx support board since they are the originators and maintainers of the code you have questions about. You’re always welcome to ask me questions about accessing Pocketsphinx functions you are already making use of in C or on the command line via Objective-C, but unfortunately I can’t offer general “how do I handle research task x with Pocketsphinx” support here since it is such a broad subject and one with its own support system.
OK, I’m still up for taking a look and maybe posting sample code for returning it through OpenEars if akaniklaus has some known-working C code for this later. This is unlikely to become a feature of OpenEars because it’s (IMO) esoteric, but it is probably easy to patch in a new OpenEarsEventsObserver delegate method.
OK, so one thing you can do for more speed in the case of trying to do real-time game control is to reduce the silence time in OpenEarsConfig.h. If your commands are just single words, you might be able to get it pretty low without it being too “hair-trigger”. Trial and error should show how low you can reduce it.
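As a sketch, the change would be something like the following in OpenEarsConfig.h — the value shown is only an example, not a recommended default:

```c
// Shorten the pause of silence that triggers recognition.
// 0.3 is an example value only; tune it by trial and error as described above.
#define kSecondsOfSilenceToDetect 0.3
```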
“I just wanted to distinguish between two voices in a game-like fun application. Since OpenEars is able to detect the word spoken, if I can figure out how to extend it (or the capability already exists) to distinguish two voices, my project will be a lot easier; it’s just a word play between two people.”
Gotcha. I think you won’t be able to accurately distinguish between two users because you don’t know in advance how similar their voices will be, so it really would take you into acoustic fingerprinting territory.
You’re welcome, although it occurs to me that maybe I’m not understanding your exact issue — is it accuracy or is it speed? Accuracy would be the percentage of correct word matches without any regard for how long it takes. Speed would be how fast the recognition is returned without any regard for how correct it is.
In my experience, with such a small language model combined with OpenEars .91, recognition on the device is basically instantaneous, maybe with a little bit of lag on an iPhone 3G or first-gen iPod. But before recognition begins, silence has to be detected, and the amount of silence that is detected in order to start recognition is a fixed amount that is set in OpenEarsConfig.h. So that amount of time is non-negotiable because it functions as the signal that the user is done saying their word or phrase. If it feels slow, maybe it is due to this obligatory silence, or maybe there is another issue. You can time how fast the actual recognition is performed by using the new delegate methods of OpenEarsEventsObserver for .91.
Hmm, I don’t think there is a reliable way to distinguish between the voices just using OpenEars. Even if there were, it would break down if you had two users with very similar timbres and accents unless it were very granular. It sounds like you’re interested in acoustic fingerprinting, but from this discussion it looks like a bit of a no-go:
This might be a case for just using some kind of interface/logic solution such as having each user tap a button when they want to speak, or log in at the beginning of a session so you know who they are.
Thanks for moving this over here and giving me a bit more background. So, the way that you restrict the possible recognition is via the use of a small language model, which it sounds like you’ve already done correctly. You could only change the mdef and sendump files by retraining the acoustic model, which I don’t think would give you results you’d be happy with, but if it is something you are very interested in, you can ask (very precise) questions about adapting the acoustic model at the CMU Pocketsphinx forums after reading the introduction here and trying to execute its instructions on your own: http://cmusphinx.sourceforge.net/wiki/tutorialadapt. I don’t recommend it if the plan is to have fast recognition across many different accents and under different recording environments, though.
If you want to let me know what words are being confused for each other, maybe I can make some suggestions. As far as making the app smaller, there are several suggestions here: https://www.politepix.com/openears/support/#13
Can you elaborate on what the purpose for distinguishing between the two voices would be? I don’t think there is any way to do that, but maybe I can make a suggestion if I know what the concept behind it is.
I hope this is helpful,
OK, this is actually pretty far outside of the OpenEars purview since it isn’t really designed as a research tool that propagates every available Pocketsphinx function through to an Objective-C interface, but it should be no problem getting it answered as a Pocketsphinx question over in the Sphinx forums as long as you are very precise:
I answer these ones occasionally when it’s fairly self-evident in the Pocketsphinx codebase, but this one is not to me (and if I’m not mistaken, it also looks like it might be different for LMs versus FSGs) so it’s probably a better call to go to the source. Once you know the C implementation details, I can help you with the details of returning the data through OpenEars if needed.
Can you describe more specifically which aspect of that task you are having difficulty with and in what area of the implementation? If I were doing this I guess I would put the utterances in an NSMutableDictionary as they come in, or possibly write them out to a plist, or some combination of the above depending on the eventual use of the data. I presume that you know the expected recognition, and the utterance ID (a good key for the dictionary entry), the hypothesis, and the confidence score are all delivered by the OpenEarsEventsObserver method.
Just as an update on this, I’ve been gradually learning that individual letters/syllables are a challenging case and expectations for accuracy should probably be lower than for whole word or phrase recognition.
OK, I think the first easy step is to get rid of the pronunciations that are in the dictionary that you definitely don’t want to recognize. I realize this isn’t at all self-evident so I’ll explain briefly. If you look at this from the dictionary:
That means that the language model tool gave you back two possible pronunciations for the word A. The first one is the particular NA pronunciation of the article “a” as in “a dog barked” that rhymes with “huh”. Since you don’t ever want to recognize that pronunciation of “a” because the alphabet character is never pronounced that way, you should erase that pronunciation from your dictionary.
The (2) in parentheses just means that it is the second pronunciation of the word, so the way you would want to replace
is with the line
deleting the first pronunciation, and removing the (2) from the second pronunciation since it is now the only pronunciation you are going to accept.
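For example, using standard Pocketsphinx phonetic-dictionary notation — the phoneme symbols below are my assumption based on cmudict conventions, so check them against your actual file — the edit would look something like:

```
Before (two pronunciations, article-style first):
A	AH
A(2)	EY

After (keep only the letter-name pronunciation):
A	EY
```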
The next thing that you can do is to make the sentence “A B” part of your corpus. The corpus can have individual words, but it can also contain combinations of words. Combinations of words that you have made part of your corpus will have an automatically higher probability of being detected.
So, the corpus would say something like this:
You can do this for all of the possible combinations if you want to, or just the ones where you want to raise their probability of being detected. When you look at the language model that is output, you will see that there is a 2-gram entry for A B and that it has a raised probability.
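To make that concrete, a sketch of the corpus and the resulting model — the probability value shown is a placeholder, not what the tool will actually output:

```
Corpus (one entry per line):
A
B
A B

Excerpt from the generated ARPA language model:
\2-grams:
-0.3010 A B
```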
Can I also see the dictionary? I’m surprised to hear that it is recognizing EIGHT Y for A B; the EIGHT isn’t surprising but the Y is. Is EIGHT Y an accurate transcription of what Pocketsphinx heard? What is the hypothesis (verbatim)?
Would it be possible for you to show me your language model?
Does the app read letters and numbers or recognize them in the user’s speech?