Custom dictionary speech recognition only kind of working.

#1021440
    davestrand
    Participant

    Hello,

I have been sifting through the other forum posts and can’t seem to resolve my issue.

I’m having problems with voice recognition of certain words. My app is designed to update the vocabulary based on the current situation. Usually that means it changes the available vocabulary every few seconds, but occasionally it might fire off two vocabulary changes within a second. At any rate, it appears to be switching vocabularies, but many words don’t seem to be understood (like NO or TAVERN). I generate a random string to name each vocabulary change, and I can verify that the files are being created in the Caches directory; the “Pocketsphinx is now using the following language model:” log line also shows the correct dictionary and path.

    2014-05-30 06:00:29.054 VoiceGame[2522:60b] Pocketsphinx is now using the following language model:
    /var/mobile/Applications/DA1C44EA-09CE-4714-83FA-B72B42E5DD13/Library/Caches/K7do2JmCR9j8BXfuWbzk.DMP and the following dictionary: /var/mobile/Applications/DA1C44EA-09CE-4714-83FA-B72B42E5DD13/Library/Caches/K7do2JmCR9j8BXfuWbzk.dic

I followed your tutorial, as well as code you suggested to me here on the forum, but it’s probably something in my code that is misplaced or called incorrectly. If you could help me locate the error I would be most grateful.

In viewDidLoad I have…

//[OpenEarsLogging startOpenEarsLogging]; // Uncomment me for OpenEarsLogging

[self.openEarsEventsObserver setDelegate:self];

[self reLoadLanguageBasedOnNewVariables]; // This also sets up the paths used by startListening below.

[self startListening];

[self startDisplayingLevels];

// I'll probably get rid of or modify these buttons eventually.
self.startButton.hidden = TRUE;
self.stopButton.hidden = TRUE;
self.suspendListeningButton.hidden = TRUE;
self.resumeListeningButton.hidden = TRUE;

    And I made a method to load the new language when variables change…

- (void) reLoadLanguageBasedOnNewVariables {
    NSLog(@"RELOAD LANGUAGE");
    NSString *randomName = [self genRandStringLength:20];

    LanguageModelGenerator *lmGenerator = [[LanguageModelGenerator alloc] init];

    NSString *actionText = [player shipPlayerWordsAction];
    NSString *actionsLimited = [player shipActionOptionsLimited];
    NSString *exitsText = [player shipPlayerWordsExits];
    NSString *inventoryText = [player shipPlayerWordsInventory];
    NSString *objectsInRoomText = [player shipPlayerWordsObjectsInRoom];
    multipleChoiceWords = [player shipMultipleChoiceWords];

    NSArray *languageArray1;

    if (inConversation) {
        //NSLog(@"set array for conversation vocab");
        languageArray1 = [NSArray arrayWithObjects: // All capital letters.
                          actionsLimited,
                          multipleChoiceWords,
                          nil];
    } else {
        //NSLog(@"set array for room vocab");
        languageArray1 = [NSArray arrayWithObjects: // All capital letters.
                          actionText,
                          exitsText,
                          inventoryText,
                          objectsInRoomText,
                          nil];
    }

    //NSLog(@"language array is now %@.", languageArray1);

    // Bring in a random filename and create new model/dictionary files for each movement.
    NSString *name = randomName;

    NSError *err = [lmGenerator generateLanguageModelFromArray:languageArray1 withFilesNamed:name forAcousticModelAtPath:[AcousticModel pathToModel:@"AcousticModelEnglish"]]; // <<<CONFIRMED // might slow down

    NSDictionary *languageGeneratorResults = nil;

    lmPath = nil;
    dicPath = nil;

    if ([err code] == noErr) {

        languageGeneratorResults = [err userInfo];

        lmPath = [languageGeneratorResults objectForKey:@"LMPath"];
        dicPath = [languageGeneratorResults objectForKey:@"DictionaryPath"];

        //NSLog(@"the dic path here is %@", dicPath); // correct path, but the switch goes to an older path

        self.pathToDynamicallyGeneratedGrammar = lmPath; // We'll set our new .languagemodel file
        self.pathToDynamicallyGeneratedDictionary = dicPath; // We'll set our new dictionary

        [self.pocketsphinxController changeLanguageModelToFile:lmPath withDictionary:dicPath]; // This is the call that actually switches Pocketsphinx over to the newly generated files

        //[self pocketsphinxDidChangeLanguageModelToFile:lmPath andDictionary:dicPath]; // This was called a second time; the first call seems to happen automagically, but this one was no longer pulling the latest path.

    } else {
        NSLog(@"Error: %@", [err localizedDescription]);
    }
}
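
(For reference, genRandStringLength: above is just a helper that returns a random alphanumeric string to use as a filename. A minimal sketch of that kind of helper, in case it’s useful; the exact implementation in my app may differ:)

- (NSString *) genRandStringLength:(int)len {
    // Sketch of a random-filename helper; candidate characters for the name.
    static NSString *letters = @"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    NSMutableString *randomString = [NSMutableString stringWithCapacity:len];
    for (int i = 0; i < len; i++) {
        // Append one randomly chosen character per position.
        [randomString appendFormat:@"%C", [letters characterAtIndex:arc4random_uniform((u_int32_t)[letters length])]];
    }
    return randomString;
}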
    
viewDidLoad also triggers this startListening method…
- (void) startListening {

    [self.pocketsphinxController startListeningWithLanguageModelAtPath:lmPath dictionaryAtPath:dicPath acousticModelAtPath:[AcousticModel pathToModel:@"AcousticModelEnglish"] languageModelIsJSGF:NO]; // Change "AcousticModelEnglish" to "AcousticModelSpanish" to perform Spanish recognition instead of English.

    //NOT SURE IF NEEDED: speechRecognitionBeforePause = TRUE; // not confirmed
}

And then I occasionally call [self reLoadLanguageBasedOnNewVariables]; when there is new vocabulary to understand. Every time the new language model is loaded I do see the “Pocketsphinx is now using the following language model” log line, and I can verify that the correct words are listed inside each .dic file in my cache… but for some reason many of those words aren’t understood.

A sample of my .dic files shows:

    INVENTORY	IH N V AH N T AO R IY
    NO	N OW
    PAUSE	P AO Z
    SAY	S EY
    UNPAUSE	AH N P AO Z 
    YES	Y EH S

#1021456
    Halle Winkler
    Politepix

    This sounds like a bit of a complex issue because of this line:

    Usually that means it changes the available vocabulary every few seconds, but occasionally it might fire off two vocabulary changes within a second.

It sounds like there is some kind of event, external to the progress of the speech recognition, that causes a sudden vocabulary change or repeated changes, and I can imagine that two arbitrary changes in very close proximity could have unexpected results. Model changing is very fast, but it does take a bit of real time and it involves two interdependent files, so I can think of a couple of ways this could fail.

My general advice is to reexamine this design, since it’s a little at odds with my design assumption that most vocabulary changes will occur as a result of a recognition event or a user-driven interaction event, rather than an external event that could lead to multiple changes inside of a second. That doesn’t mean my assumption is correct and your design is incorrect, just that it is the assumption, right or wrong, and a design that works against it may raise more issues than just this one, even if it’s an interesting design.
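
For example, one way to keep rapid external events from triggering overlapping model changes is to coalesce them, so that only the last request inside a short window actually performs the switch. This is just a minimal sketch of the idea (requestVocabularyChange and the 0.5-second window are hypothetical, not part of OpenEars):

// Hypothetical coalescing wrapper: game events call this instead of calling
// reLoadLanguageBasedOnNewVariables directly. Each new request cancels any
// still-pending one, so two changes inside the window collapse into one switch.
- (void) requestVocabularyChange {
    [NSObject cancelPreviousPerformRequestsWithTarget:self
                                             selector:@selector(reLoadLanguageBasedOnNewVariables)
                                               object:nil];
    [self performSelector:@selector(reLoadLanguageBasedOnNewVariables)
               withObject:nil
               afterDelay:0.5]; // window length is an arbitrary illustration value
}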

But we can also try to troubleshoot it to see if it is unrelated to the model switching. My first step in troubleshooting would be to take one of the DMP/dic pairs that is giving weird results and test it alone in the sample app with no switching, so you can verify that it works well in the absence of other issues possibly related to timing. Recognition of individual words can be impaired even if they appear in the respective files, for a couple of reasons: there may be some kind of issue with the phonetic transcription, or there may be a very similar-sounding word elsewhere in the vocabulary.

To look into this further, can you isolate a DMP/.dic pair in which some of the words are understood well and others aren’t, and which manifests the issue when used as the starting language model for the sample app, so we can examine it? If I’m not mistaken, you should also have an .arpa file generated at the same time, which will show you the probability model, so let’s take a look at that too.

    #1021457
    davestrand
    Participant

Sounds like a plan. I suspect we will find that even the slower vocabulary transitions have some kind of error. You are correct, there are three files created: the .arpa, the .dic, and the .DMP. I’m not too sure how to inspect them, so I have uploaded them to a folder for you to scope out.

    In this scenario, I am trying to say the word NO, but it always thinks I am saying YES. No means no, right?

    Here are the files:
    http://secret.strandland.com/halle/

and the corresponding log output:

Users/davelevy/Library/Application Support/iPhone Simulator/7.1-64/Applications/E3667FB5-52D3-4F18-9C54-D9ED81BB304C/Library/Caches/JxCJL6ji2xJYlBdpWhFI.DMP and the following dictionary: /Users/davelevy/Library/Application Support/iPhone Simulator/7.1-64/Applications/E3667FB5-52D3-4F18-9C54-D9ED81BB304C/Library/Caches/JxCJL6ji2xJYlBdpWhFI.dic

    #1021458
    Halle Winkler
    Politepix

    OK, here is what jumps out at me right off the bat:

1. I looked at the ARPA model and it is being calculated correctly given its input, and the phonetic dictionary looks normal to me and doesn’t seem to be missing alternate pronunciations, so I don’t expect anything bad due to that. We can rule out the files not existing or being malformed, since they are mathematically and syntactically correct and, as you’ve shown, they are found by the app and some of your utterances are recognized.

2. The commas being submitted with the words as separators have no role in your language model and can only confuse things, so I would remove them. The ARPA model shows many bigrams and trigrams (sequences of two or three words) with an extraneous comma separating the words in the utterance. They may have no negative effect, but they could be making things weird, and they are definitely not doing anything useful such as providing contextual information to the engine.

3. Unless you are expecting a user to make the utterance “NO YES”, a phrase like “NO, YES” shouldn’t be submitted to LanguageModelGenerator, because it signals that “NO YES” is an expected utterance, and the rest of the model math will weigh this assumption against other possibilities. In other words, if it isn’t a real thing users will say, it is going to make your model less accurate overall. I think this is happening unintentionally due to the text-munging that precedes creating these models with LanguageModelGenerator: it is probably supposed to separate your words into individual NSStrings, but instead it is making one big string with comma separators and giving that string to LanguageModelGenerator, which then treats the comma separators as important and treats all the words as a single sequential expected utterance.

Examples of this phenomenon from your ARPA model, which show that individual strings consisting of multiple words separated by commas are getting submitted:

    \2-grams:
    -0.6021 <s> NO, 0.0000
    -0.6021 <s> PAUSE, 0.0000
    -0.3010 INVENTORY, UNPAUSE 0.0000 
    

    4. Never test recognition using the Simulator – it can lead you into troubleshooting phantoms. Only test recognition and accuracy issues on a real device.

Suggested plan of action: fix #2 and #3 by making sure that your text preprocessing, before submitting your NSArray to LanguageModelGenerator, gets rid of the comma separators and splits the words on either side of a comma into separate NSStrings. This may indirectly improve the quality of the phonetic dictionary as well; we’ll see. Then test on a physical device and see if the situation is improved, and let me know your results.
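
As a sketch of that preprocessing (assuming your word strings arrive comma-separated, as the ARPA model suggests; cleanedWords is a hypothetical name), something along these lines placed before the generateLanguageModelFromArray: call would do it:

// Sketch: split each comma-separated string into individual words, trim
// whitespace, and drop empty entries, so that LanguageModelGenerator sees
// single-word NSStrings instead of "NO, YES"-style sequential utterances.
NSMutableArray *cleanedWords = [NSMutableArray array];
for (NSString *chunk in languageArray1) {
    for (NSString *word in [chunk componentsSeparatedByString:@","]) {
        NSString *trimmed = [word stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
        if ([trimmed length] > 0) {
            [cleanedWords addObject:[trimmed uppercaseString]];
        }
    }
}
// Then submit cleanedWords to generateLanguageModelFromArray: instead of languageArray1.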

    #1021471
    davestrand
    Participant

Removing the commas made the voice recognition work a million times better. Thank you so much! The commas were in there for the visual element of the app; I didn’t know they would impact the speech recognition. I will also look into the “NO YES” stuff. :) Thanks a million!

    #1021472
    Halle Winkler
    Politepix

    You’re welcome!
