
OpenEars 1.7: introducing dynamic grammar generation

10 April
UPDATE: The CMU Sphinx project of Carnegie Mellon University’s Speech department seems to like the new grammar format! Thanks!

Last September I was at iOSDevUK, which is a lovely iOS developer conference on the Welsh coast at the University of Aberystwyth, and a friend asked me what the next feature for OpenEars would be after Spanish support.

I said that before I added anything else, I needed to get rid of technical debt and fix all the known bugs in OpenEars and the plugins, and unbeknownst to me there was also a 64-bit architecture change looming on the horizon. So, in reality it ended up being “get everything 64-bit functional, pay off some technical debt, fix all the known bugs, then fix the bugs I introduced while fixing the bugs.” That last one was OpenEars 1.66, which came out last week.

There was another answer I wanted to give, because something had been nagging at me for about the last year, but it was very clear that after all of the new technology added to Politepix’s line in 2012 and 2013, it was time to do some maintenance and make sure things would be sustainable.

But something really was nagging at me.

Since February 2013 I’ve been extremely lucky to get to give some talks at industry conferences about speech recognition and speech synthesis as a human interface for mobile apps.

In my talk I always mention that for an app with a small vocabulary, one element of good interface design is deciding whether you want to use a probability-based language model such as an ARPA model, or a rules-based grammar such as a JSGF grammar, because choosing the right one for your use case has a large potential for improving the user experience. This is a reasonable enough observation, since Pocketsphinx supports the JSGF format and OpenEars therefore does as well.

However, I always felt a bit shabby pointing this out and then leaving developers to write their own JSGF. Written by hand, JSGF is a relatively complex and unpretty format with imposing documentation, and there are also a few minor differences between the complete format definition and its implementation in OpenEars that can lead to difficult troubleshooting. ARPA probability model generation in OpenEars, by contrast, is as easy as putting NSStrings in an NSArray.

So, since last September, a little bit here and a little bit there in between the important bugfix and architecture releases, I’ve been working on that thing that was nagging at me: dynamic generation of JSGF grammars using clear, human-friendly language and NSObjects. Today I am very happy to announce the first version of this feature in OpenEars 1.7.

You’ll find a new method in LanguageModelGenerator which is a counterpart to this ARPA-generation method:

- (NSError *) generateLanguageModelFromArray:(NSArray *)languageModelArray withFilesNamed:(NSString *)fileName forAcousticModelAtPath:(NSString *)acousticModelPath;

and the new method is called:

- (NSError *) generateGrammarFromDictionary:(NSDictionary *)grammarDictionary withFilesNamed:(NSString *)fileName forAcousticModelAtPath:(NSString *)acousticModelPath;

As you can see, instead of taking an NSArray it takes an NSDictionary. The NSDictionary you submit as the generateGrammarFromDictionary: argument consists of key-value pairs in which the value is an NSArray of NSStrings containing the vocabulary to be listened for, and the key is one of the following human-language NSString constants defined in GrammarDefinitions.h, indicating the rule for the vocabulary in that NSArray:

ThisWillBeSaidOnce
ThisCanBeSaidOnce
ThisWillBeSaidWithOptionalRepetitions
ThisCanBeSaidWithOptionalRepetitions
OneOfTheseWillBeSaidOnce
OneOfTheseCanBeSaidOnce
OneOfTheseWillBeSaidWithOptionalRepetitions
OneOfTheseCanBeSaidWithOptionalRepetitions

Breaking them down one at a time, here is the specific meaning of each in defining a rule:

ThisWillBeSaidOnce // This indicates that the word or words in the array must be said (in sequence, in the case of multiple words), one time.
ThisCanBeSaidOnce // This indicates that the word or words in the array can be said (in sequence, in the case of multiple words), one time, but can also be omitted as a whole from the utterance.
ThisWillBeSaidWithOptionalRepetitions // This indicates that the word or words in the array must be said (in sequence, in the case of multiple words), one time or more.
ThisCanBeSaidWithOptionalRepetitions // This indicates that the word or words in the array can be said (in sequence, in the case of multiple words), one time or more, but can also be omitted as a whole from the utterance.
OneOfTheseWillBeSaidOnce // This indicates that exactly one selection from the words in the array must be said one time.
OneOfTheseCanBeSaidOnce // This indicates that exactly one selection from the words in the array can be said one time, but that all of the words can also be omitted from the utterance.
OneOfTheseWillBeSaidWithOptionalRepetitions // This indicates that exactly one selection from the words in the array must be said, one time or more.
OneOfTheseCanBeSaidWithOptionalRepetitions // This indicates that exactly one selection from the words in the array can be said, one time or more, but that all of the words can also be omitted from the utterance.

Since an NSString in these NSArrays can also be a phrase, references to words above should also be understood to apply to complete phrases when they are contained in a single NSString.
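
For example, here is a minimal, single-rule grammar (the phrases are invented for illustration) which will accept exactly one of two complete phrases and nothing else:

 @{
     OneOfTheseWillBeSaidOnce : @[@"TURN THE LIGHTS ON", @"TURN THE LIGHTS OFF"]
 };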

A key-value pair can also have NSDictionaries in the NSArray instead of NSStrings, or a mix of NSStrings and NSDictionaries, meaning that you can nest rules in other rules.

Here is an example of a complete, complex ruleset which can be submitted as the generateGrammarFromDictionary: argument. It is designed to be easily readable as a collection of English sentences; I have also followed this version with a second one that explains each part step by step:

 @{
     ThisWillBeSaidOnce : @[
         @{ OneOfTheseCanBeSaidOnce : @[@"HELLO COMPUTER", @"GREETINGS ROBOT"]},
         @{ OneOfTheseWillBeSaidOnce : @[@"DO THE FOLLOWING", @"INSTRUCTION"]},
         @{ OneOfTheseWillBeSaidOnce : @[@"GO", @"MOVE"]},
         @{ ThisWillBeSaidWithOptionalRepetitions : @[
             @{ OneOfTheseWillBeSaidOnce : @[@"10", @"20", @"30"]},
             @{ OneOfTheseWillBeSaidOnce : @[@"LEFT", @"RIGHT", @"FORWARD"]}
         ]},
         @{ OneOfTheseWillBeSaidOnce : @[@"EXECUTE", @"DO IT"]},
         @{ ThisCanBeSaidOnce : @[@"THANK YOU"]}
     ]
 };

So that’s the whole thing – that is all it takes to create a complex ruleset in OpenEars 1.7.

Breaking it down step by step to explain exactly what the contents mean:

 @{
     ThisWillBeSaidOnce : @[ // This means that a valid utterance for this ruleset will obey all of the following rules in sequence in a single complete utterance:
         @{ OneOfTheseCanBeSaidOnce : @[@"HELLO COMPUTER", @"GREETINGS ROBOT"]}, // At the beginning of the utterance there is an optional statement. The optional statement can be either "HELLO COMPUTER" or "GREETINGS ROBOT" or it can be omitted.
         @{ OneOfTheseWillBeSaidOnce : @[@"DO THE FOLLOWING", @"INSTRUCTION"]}, // Next, an utterance will have exactly one of the following required statements: "DO THE FOLLOWING" or "INSTRUCTION".
         @{ OneOfTheseWillBeSaidOnce : @[@"GO", @"MOVE"]}, // Next, an utterance will have exactly one of the following required statements: "GO" or "MOVE"
         @{ ThisWillBeSaidWithOptionalRepetitions : @[ // Next, an utterance will have a minimum of one statement of the following nested instructions, but can also accept multiple valid versions of the nested instructions:
             @{ OneOfTheseWillBeSaidOnce : @[@"10", @"20", @"30"]}, // Exactly one utterance of either the number "10", "20" or "30",
             @{ OneOfTheseWillBeSaidOnce : @[@"LEFT", @"RIGHT", @"FORWARD"]} // Followed by exactly one utterance of either the word "LEFT", "RIGHT", or "FORWARD".
         ]},
         @{ OneOfTheseWillBeSaidOnce : @[@"EXECUTE", @"DO IT"]}, // Next, an utterance must contain either the word "EXECUTE" or the phrase "DO IT",
         @{ ThisCanBeSaidOnce : @[@"THANK YOU"]} // And finally, there can be an optional single statement of the phrase "THANK YOU" at the end.
     ]
 };
 

As examples, here are some sentences that this ruleset will report as hypotheses when a user utters them:

"HELLO COMPUTER DO THE FOLLOWING GO 20 LEFT 30 RIGHT 10 FORWARD EXECUTE THANK YOU"
"GREETINGS ROBOT DO THE FOLLOWING MOVE 10 FORWARD DO IT"
"INSTRUCTION 20 LEFT 20 LEFT 20 LEFT 20 LEFT EXECUTE"

But it will not report hypotheses for sentences such as the following, which are not allowed by the rules:

"HELLO COMPUTER HELLO COMPUTER"
"MOVE 10"
"GO RIGHT"

The last two arguments of the new LanguageModelGenerator method work identically to those of the equivalent ARPA language model method. The files created are .gram and .dic rather than the .DMP and .dic created for an ARPA model, and the .gram file is your JSGF grammar. So when you pass your .gram file to the Pocketsphinx method:

- (void) startListeningWithLanguageModelAtPath:(NSString *)languageModelPath dictionaryAtPath:(NSString *)dictionaryPath acousticModelAtPath:(NSString *)acousticModelPath languageModelIsJSGF:(BOOL)languageModelIsJSGF;

you will set the argument languageModelIsJSGF: to TRUE.
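
To see the whole flow in one place, here is a minimal sketch of generating a grammar and then listening with it. The ruleset is a trimmed-down, hypothetical example, and I am assuming that on success the grammar method reports its output paths in the returned NSError's userInfo under the same LMPath and DictionaryPath keys that the ARPA method's sample code uses; verify against the .gram and .dic files your own project generates:

#import <OpenEars/LanguageModelGenerator.h>
#import <OpenEars/PocketsphinxController.h>
#import <OpenEars/AcousticModel.h>
#import <OpenEars/GrammarDefinitions.h> // The rule constants discussed above; header location assumed.

// In a real app, keep strong references to these objects, for instance in properties.
LanguageModelGenerator *generator = [[LanguageModelGenerator alloc] init];

// A small hypothetical ruleset: "GO" or "MOVE", followed by one direction.
NSDictionary *grammar = @{
    ThisWillBeSaidOnce : @[
        @{ OneOfTheseWillBeSaidOnce : @[@"GO", @"MOVE"]},
        @{ OneOfTheseWillBeSaidOnce : @[@"LEFT", @"RIGHT", @"FORWARD"]}
    ]
};

NSError *error = [generator generateGrammarFromDictionary:grammar
                                           withFilesNamed:@"RobotGrammar"
                                   forAcousticModelAtPath:[AcousticModel pathToModel:@"AcousticModelEnglish"]];

if ([error code] == noErr) {
    // Assumption: as with the ARPA method's sample code, the generated
    // file paths are reported in the NSError's userInfo on success.
    NSString *grammarPath = [[error userInfo] objectForKey:@"LMPath"];
    NSString *dictionaryPath = [[error userInfo] objectForKey:@"DictionaryPath"];

    PocketsphinxController *pocketsphinxController = [[PocketsphinxController alloc] init];
    [pocketsphinxController startListeningWithLanguageModelAtPath:grammarPath
                                                 dictionaryAtPath:dictionaryPath
                                              acousticModelAtPath:[AcousticModel pathToModel:@"AcousticModelEnglish"]
                                              languageModelIsJSGF:TRUE]; // TRUE because grammarPath points to a .gram file.
}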

The goal of creating this API was not only to make it much easier to create a grammar (or multiple grammars to switch between) before runtime, but also to make an interface powerful and simple enough that you can build new grammars dynamically at runtime, based on arbitrary input.
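
As a concrete (and entirely hypothetical) illustration of that runtime case, an app could assemble a grammar from vocabulary it only learns while running; the names and file name below are invented for the example, and the imports and generator are the same as in the previous sketch:

LanguageModelGenerator *generator = [[LanguageModelGenerator alloc] init];

// Vocabulary that is only known at runtime, e.g. read from the user's data:
NSArray *namesKnownAtRuntime = @[@"ALICE", @"BOB", @"CARLA"];

// Mixing a plain NSString and a nested rule in one array, as described above:
NSDictionary *dynamicGrammar = @{
    ThisWillBeSaidOnce : @[
        @"CALL",
        @{ OneOfTheseWillBeSaidOnce : namesKnownAtRuntime}
    ]
};

NSError *error = [generator generateGrammarFromDictionary:dynamicGrammar
                                           withFilesNamed:@"CallGrammar"
                                   forAcousticModelAtPath:[AcousticModel pathToModel:@"AcousticModelEnglish"]];

This ruleset will accept utterances such as "CALL ALICE" or "CALL CARLA" and nothing else.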

JSGF isn’t compatible with RapidEars, so I will also be releasing a new product shortly which will allow the same grammar language to be used to output RapidEars-compatible grammars as well, but this is a new type of thing altogether so it will be in testing for a while longer. The new product will also provide faster grammars for stock OpenEars since JSGF searching is a bit too resource-intensive for ideal responsiveness on 32-bit devices.

I am delighted to finally release this. There’s really nothing that makes me happier as a developer than taking something a little gnarly like JSGF and finding a way to make it accessible, and amenable to humanistic interface design, for developers who don’t necessarily have the time or interest to specialize in speech interfaces to the extent it can take to get to grips with a format like JSGF. I hope I’ve achieved that and that we’ll see it in some of your awesome apps.

It’s a 1.0 feature, so there will be some bumps and you should bring ’em right to the forums so I can take a look.

Thanks and enjoy your development,

Halle

UPDATE: I’ve finished the new plugin for generating fast grammars which are also compatible with RapidEars – it’s called RuleORama and you can read about it here and learn how to implement it using the tutorial.