Limit hypothesis to phrases defined in model

This topic has 17 replies, 3 voices, and was last updated 10 years ago by Halle Winkler.

November 17, 2011 at 11:43 pm #8099

culov
Participant

Let’s say my dynamically generated language model is defined by the following array: [“ONE”, “MY SAMPLE SENTENCE”].

Many times, when I speak the phrase “My sample sentence” into my phone, the hypothesis returned will be “ONE SAMPLE SENTENCE”. What I want to be able to do is prevent the library from allowing fragments of inputs to be recombined to form new, acceptable phrases. In the example above, I imagine that if this were possible, when OpenEars is determining the hypothesis and is forced to choose between “MY SAMPLE SENTENCE” and “ONE”, it would make the right decision.

I’ve spent a couple hours looking through the source, but there were many aspects that I had trouble understanding. What I’d like to know is whether there is an easy way of accomplishing what I’d like, and if there isn’t perhaps some input on how difficult it might be to add such a feature myself and how I might want to go about it.

Thanks a lot

November 17, 2011 at 11:45 pm #8101

Halle Winkler
Politepix

This is a case for switching over to a JSGF grammar instead of an ARPA language model. There are links to some JSGF resources in the docs and there are some questions already in these forums which touch on this, but the CMU Sphinx forum might be an even better resource.

I deleted a thread about this earlier this week because it turned into a sort of tech support re-enactment of Heart of Darkness, but here is my answer to its initial question of where to get started with JSGF:

You can give the PocketsphinxController recognizer either a .languagemodel file or a .gram file depending on whether you want to use an ARPA model or a JSGF grammar. To see an over-simplified example of a .gram file for OpenEars, you can download the previous version 0.9.02 here and look at the .gram file included with it:

https://www.politepix.com/wp-content/uploads/0.9.02.zip

To see a somewhat more complex example of a .gram file look at [OPENEARS]/CMULibraries/sphinxbase-0.6.1/test/unit/test_fsg/polite.gram

You can probably also find more .gram files in the sphinxbase and pocketsphinx folders in CMULibraries.

One limitation is that you can’t use JSGFs which import other rules using an import statement at the top.

A .gram file still needs the corresponding phonetic dictionary .dic file in order to function. It is obviously necessary to run the startListening: method with JSGF:TRUE at the end. Using JSGF means that you can’t switch dynamically between grammars while the recognizer is running in the current version of OpenEars like you can with ARPA models.

Here is documentation for the JSGF format so you can write your own rules:

https://en.wikipedia.org/wiki/JSGF
https://www.w3.org/TR/jsgf/
https://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/jsgf/JSGFGrammar.html
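For the two-phrase case in the original question, a grammar that accepts only whole phrases can be generated with a few lines of script. The following is a minimal sketch in Python, not part of OpenEars: the function name `build_jsgf` and the grammar and rule names are made up for illustration. It relies only on the JSGF header, the grammar declaration, and the fact that a sequence binds more tightly than `|`, so no parentheses are needed.

```python
def build_jsgf(grammar_name, rule_name, phrases):
    """Return JSGF text whose single public rule accepts exactly one phrase.

    grammar_name and rule_name are arbitrary identifiers chosen by the caller.
    """
    if not phrases:
        raise ValueError("at least one phrase is required")
    # Normalize whitespace and case; each phrase becomes one alternative.
    alternatives = " | ".join(" ".join(p.upper().split()) for p in phrases)
    return (
        "#JSGF V1.0;\n"
        f"grammar {grammar_name};\n"
        f"public <{rule_name}> = {alternatives};\n"
    )


if __name__ == "__main__":
    # Writes the grammar to stdout; save it as a .gram file to try it.
    print(build_jsgf("commands", "command", ["ONE", "MY SAMPLE SENTENCE"]))
```

With a grammar like this, “ONE SAMPLE SENTENCE” is not a legal path, so it cannot be returned as a hypothesis. The matching .dic file still has to contain every word used.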

November 18, 2011 at 12:21 am #8103

culov
Participant

Thanks for the response, Halle. I’m going to finish reading the resources you provided, and also spend some time playing with sample JSGF grammars when I get on my development machine. I had a hunch that JSGF would be the way to go, but the added delay involved with stopping and starting the listener may rule it out for me, since my current implementation uses 2 distinct ARPA models (about 1200 phrases each) that I frequently switch between. I think I could only make it work with my project if I combined both models into a single JSGF grammar, thus eliminating the need to switch language models. My dilemma then is deciding whether I could produce more successful results by creating a large JSGF grammar or 2 medium-sized ARPA models plus some post-processing of the hypothesis. I’m currently working on the latter option, and I think I might be able to process the hypothesis well enough to produce a valid NSString for my purposes.

One issue that I see coming up constantly is that the library gets tripped up between similar-sounding words; for example, it might interpret EIGHTY as EIGHT or vice versa. Would changing the grammar to JSGF help avoid this at all?

Also, it is my understanding that a JSGF grammar would allow multiple language model phrases in a single hypothesis, which would be necessary for my project. Is this the case, or would every hypothesis necessarily have to be strictly specified in the language model?
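The post-processing option mentioned above can be sketched roughly as follows. This is an illustration in Python rather than the Objective-C the app would actually use, and `segment_hypothesis` is a hypothetical helper, not an OpenEars API. It accepts a hypothesis only if it can be split, left to right, into phrases that were in the model, and it allows several phrases in one hypothesis.

```python
def segment_hypothesis(hypothesis, phrases):
    """Split a hypothesis into known phrases, or return None if impossible.

    Dynamic programming over word positions, so overlapping candidates
    (for example "EIGHT" and "EIGHT HUNDRED") are all considered.
    """
    words = hypothesis.upper().split()
    known = {tuple(p.upper().split()) for p in phrases}
    max_len = max((len(p) for p in known), default=0)
    # best[i] is one valid segmentation of words[:i], or None if there is none.
    best = [None] * (len(words) + 1)
    best[0] = []
    for end in range(1, len(words) + 1):
        for start in range(max(0, end - max_len), end):
            chunk = tuple(words[start:end])
            if best[start] is not None and chunk in known:
                best[end] = best[start] + [" ".join(chunk)]
                break
    return best[len(words)]


if __name__ == "__main__":
    model = ["ONE", "MY SAMPLE SENTENCE"]
    print(segment_hypothesis("ONE SAMPLE SENTENCE", model))     # rejected
    print(segment_hypothesis("ONE MY SAMPLE SENTENCE", model))  # two phrases
```

This only filters what the recognizer has already produced. It cannot recover the phrase that was really spoken, which is why a grammar is the more direct fix.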

Thanks

November 18, 2011 at 12:49 am #8104

Halle Winkler
Politepix

> Would changing the grammar to JSGF help avoid this at all?

If you have some kind of logical pattern that lets you set a rule, and the rule will only allow “eight” instead of “eighty” when it’s [preceded by x || followed by y || whatevered by whatever], it will help. If you don’t have the opportunity to use logic to rule out similar-sounding sentence parts it will help less, but it will still probably be helpful to restrict recognition to a subset of phrases you expect.

I don’t really understand this question fully, can you give an example:

> Also, it is my understanding that a JSGF grammar would allow multiple language model phrases in a single hypothesis, which would be necessary for my project.

BTW, if you search around there is some alpha code for dynamic switching between JSGF grammars posted in another discussion here but I don’t know offhand what the link is.

November 22, 2011 at 10:03 am #8155

culov
Participant

What I meant was, when the hypothesis is calculated, will it allow for given phrases to be recombined? I figured out that the answer is “only if I want them to be”.

I spent the day changing over to JSGF, and I believe that overall, it works significantly worse than the ARPA model. The accuracy is actually worse, and the processing takes 8-10 times as long. Even if the accuracy were perfect, the duration of the processing rules it out for me. Also, when I test it on an iPod Touch 4G, I get a memory warning right about the time Pocketsphinx begins calibrating. I already simplified my grammar as much as possible. So now I am going back to ARPA and I am facing the same problem I had when I started this thread: cleaning up the hypothesis of phrases that weren’t included in the grammar. I don’t know if I’ll be able to complete my project with satisfactory results using the post-processing method, but unless you can offer me additional tips, I believe that it may be my last shot at getting this thing to work.

Thanks so much for your help

November 22, 2011 at 10:51 am #8157

Halle Winkler
Politepix

> What I meant was, when the hypothesis is calculated, will it allow for given phrases to be recombined? I figured out that the answer is “only if I want them to be”.

It still isn’t clear to me what your goal is here.

Out of curiosity, how large is your language model/grammar? 8-10x processing sounds really unusual to me, as does reduced accuracy for JSGF. A memory warning at calibration is also very unusual in my experience, especially on a device that recent.

Have you changed anything in your setup to make it not be stock? I.e., are you using Pocketsphinx 0.7 or have you made changes to the library? The second question is whether you get the same results when using your grammar with the sample app, without making any other changes.

November 22, 2011 at 10:18 pm #8164

culov
Participant

Let me give you an example to better illustrate what I desire. For argument’s sake, let’s say a valid command would be something in the form: “number POUNDS numberTwo OUNCES” or “numberTwo OUNCES number POUNDS.” I want to reject, therefore, any sample input that doesn’t meet these qualifications at the speech detection level.

So my grammar then was simply: <otherKeywords> | <number> POUNDS <numberTwo> OUNCES | <numberTwo> OUNCES <number> POUNDS

otherKeywords contained about 1000 phrases, averaging 3 words per phrase, and number was defined to match any number, up to 1000. numberTwo was 0-15.

I’ve made no modifications to the library except the changes suggested in this thread to keep system sounds. I haven’t tried using the grammar with the sample app yet. I will give this a try in a moment.

November 22, 2011 at 10:22 pm #8166

Halle Winkler
Politepix

OK, you might just find that a grammar with phrases comprising 3000 words just isn’t that performant on the hardware. The limits of in-device processing are far lower than what can be done server-side. But it’s definitely worth trying with the plain implementation just to see if there is a configuration issue.

November 22, 2011 at 10:23 pm #8167

Halle Winkler
Politepix

(What I mean is, your requirements may mean that you have to look at server-based solutions).

November 28, 2011 at 6:17 am #8188

culov
Participant

I tried with the plain implementation and the results were the same. I dropped the grammar down to about 60 words and it’s still much slower and less accurate than my ARPA model with thousands of phrases. Is this typical, or am I likely doing something wrong? Because most of my users will be using the app in an environment without internet access, I won’t be able to do a server-side implementation.

November 28, 2011 at 12:03 pm #8191

Halle Winkler
Politepix

That does not seem typical to me, but maybe one of the JSGF users around here can weigh in.

November 28, 2011 at 4:48 pm #8193

Joseph S. Wisniewski
Participant

> but maybe one of the JSGF users around here can weigh in.

155 pounds, 23.6% body fat.

OK, on a more serious note, the Pocketsphinx JSGF support is rather “broken”. Halle, you can make a FAQ of this or something, because I’m going to give a “FAQ level” answer…

Recognizers, like Sphinx, don’t care about “pretty” human-readable things like JSGF. They’re “machines”, and very simple machines at that, called “finite automata”. They “traverse” a “graph” (a “network”) of “nodes” (places) connected by “arcs” (paths that lead from place to place). It’s like driving following directions: take Main Street to Liberty Drive, to Town Center. The streets are the arcs, the intersections are the nodes.

Writing that graph the way the recognizer needs is hard work. You can go crazy trying to create a list of nodes and arcs. If you can do it, Sphinx can accept it directly. I sometimes write my grammars graphically, drawing the graph in a program called yEd, and using a little program I wrote to turn the yEd graph into a Sphinx .fsg file. But that’s hard work, and beyond what most people want to do.

So, recognizers typically include some sort of a “grammar language” compiler, like the JSGF support in Pocketsphinx, or other grammar notations used in Nuance or Dragon products, to convert a human-readable grammar to an FSG “graph” that the recognizer can use. So, say you had the following grammar (I’m not going to use full proper JSGF notation, because it looks too much like HTML and confuses this site, and I can’t remember where to put the slashes to fix it):

rule = (hi | hello) there halle

Sphinx would turn it into a graph like
0 1 hi
0 1 hello
1 2 there
2 3 halle

You can see how it does this: there’s a tool called sphinx_jsgf2fsg in the Sphinxbase distribution that calls the Sphinx JSGF compiler and lets you see the results. (Be careful, though: it only compiles and runs properly on computers. If you compile it on a fruit-flavored computer substitute, it will act quirky.)

So, having that graph, Sphinx will follow it like a roadmap when recognizing speech, sitting at node 0 until the start of speech, then moving to node 1 if it got “hi” or “hello”, to node 2 when you said “there” and to node 3 when you said “halle”
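To make the roadmap idea concrete, here is a small Python sketch that walks the four-arc graph above. It illustrates graph traversal only and is not how Pocketsphinx is implemented (the real decoder scores audio against many paths at once). The name `accepts` and the choice of node 3 as the final node are assumptions for the example.

```python
# The example graph, one (from_node, to_node, word) tuple per arc.
ARCS = [
    (0, 1, "hi"),
    (0, 1, "hello"),
    (1, 2, "there"),
    (2, 3, "halle"),
]


def accepts(arcs, words, start=0, final=3):
    """Return True if the word sequence traces a path from start to final."""
    states = {start}
    for word in words:
        # Follow every arc that leaves a current node and carries this word.
        states = {dst for (src, dst, label) in arcs
                  if src in states and label == word}
        if not states:
            return False  # no arc matched, so the sequence is not in the grammar
    return final in states


if __name__ == "__main__":
    print(accepts(ARCS, "hello there halle".split()))  # a legal sentence
    print(accepts(ARCS, "hi halle".split()))           # skips "there"
```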

There are two problems with the way Sphinx does this.

First is what I call the “null arc” bug.

And yes, by my standards, it is a bug, even if the Sphinx folks don’t agree. Any time you use anything inclusive (parentheses, brackets, greater than and less than) or a star or plus operator, Sphinx adds unnecessary “null arcs” to the grammar. So, our example would become:

0 1 hi
0 1 hello
1 2
2 3 there
3 4 halle

The “null arc” from 1 to 2 doesn’t sound like much of a problem, does it?

Well, on complex, real world grammars, you might have dozens or hundreds of words, and each word in the graph may have several (or dozens) of arcs that connect it to other words. Add extra null arcs to just about everything, and you much more than double the size of the graph. I’ve had grammars where, when I draw the graph by hand in yEd, there are only 25 nodes. I compile the yEd graph into an actual Sphinx FSG using a tool I wrote, and it runs fine, as fast and efficient as a .lm language model. Better, in fact.

I express the same grammar as a .gram and use the Sphinx gram to fsg tool, and it grows into something with 150 nodes (yes, 6 times as many as it should have) and it runs like tar.
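Null arcs can be removed mechanically. The sketch below, in Python, applies the textbook epsilon-removal procedure to the five-arc example above. It is neither the yEd converter mentioned in this post nor anything shipped with Sphinx, and the function name and the `(src, dst, word)` tuple layout (with `None` marking a null arc) are assumptions made for illustration.

```python
from collections import deque


def remove_null_arcs(arcs, start, finals):
    """Return (arcs, start, finals) for an equivalent graph without null arcs.

    arcs is a list of (src, dst, word) tuples; word is None for a null arc.
    """
    states = {start} | {s for a in arcs for s in a[:2]}

    def closure(state):
        # All nodes reachable from `state` using null arcs only.
        seen, stack = {state}, [state]
        while stack:
            cur = stack.pop()
            for src, dst, word in arcs:
                if src == cur and word is None and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    closures = {s: closure(s) for s in states}
    # A node inherits every word arc that leaves any node in its closure.
    labeled = {(s, dst, word)
               for s in states
               for src, dst, word in arcs
               if word is not None and src in closures[s]}
    new_finals = {s for s in states if closures[s] & set(finals)}

    # Keep only nodes still reachable from the start, renumbered in BFS order.
    order, queue = {start: 0}, deque([start])
    while queue:
        cur = queue.popleft()
        for src, dst, _word in sorted(labeled):
            if src == cur and dst not in order:
                order[dst] = len(order)
                queue.append(dst)
    new_arcs = sorted((order[s], order[d], w)
                      for s, d, w in labeled if s in order)
    return new_arcs, 0, sorted(order[s] for s in new_finals if s in order)


if __name__ == "__main__":
    bloated = [(0, 1, "hi"), (0, 1, "hello"), (1, 2, None),
               (2, 3, "there"), (3, 4, "halle")]
    print(remove_null_arcs(bloated, start=0, finals=[4]))
```

On the example this collapses the five-node graph back to the four-node one. Merging duplicated branches is a separate step that this sketch does not attempt.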

A bigger, more complex FSG not only runs slower, it decreases accuracy. Why?

Sphinx is a three-pass recognizer. If you use a language model file, it actually contains three different language models for the three different recognition passes. When you give Sphinx an FSG, or have it make you an FSG from a JSGF automatically, it generates a quick bigram and trigram language model (list of what word can follow another) from the FSG for the first and second passes, and only uses the actual FSG for the third pass, so the extra null arcs make it waste time evaluating things that can’t actually happen.

A proper “bigram” language model for the first FSG would be something like

start-hi
start-hello
hi-there
hello-there
there-halle
halle-end

Add the null arc and we get this:
start-hello
start-hi
hello-null
hi-null
null-there
there-halle
halle-end

We’ve lost the important linguistic knowledge that “there” always follows “hello” or “hi”.
The recognizer’s first pass has to search these extra arcs, it outputs longer sequences, there are more possibilities, and it gets “lost” more easily.
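The effect on the word-pair list can be reproduced with a short Python sketch. It reads the bigrams straight off the arcs of the two example graphs and treats a null arc as a pseudo-word called "null", which is how the lists in this post are written. Whether Sphinx builds its internal model exactly this way is not something this sketch claims, and the function name `bigrams` is invented for the example.

```python
def bigrams(arcs, start, final):
    """Return the set of (previous, next) word pairs the graph allows."""
    pairs = set()
    for src, dst, word in arcs:
        if src == start:
            pairs.add(("start", word))
        if dst == final:
            pairs.add((word, "end"))
        # Any arc leaving the node this arc enters can follow it.
        for src2, _dst2, word2 in arcs:
            if src2 == dst:
                pairs.add((word, word2))
    return pairs


CLEAN = [(0, 1, "hi"), (0, 1, "hello"), (1, 2, "there"), (2, 3, "halle")]
WITH_NULL = [(0, 1, "hi"), (0, 1, "hello"), (1, 2, "null"),
             (2, 3, "there"), (3, 4, "halle")]

if __name__ == "__main__":
    print(sorted(bigrams(CLEAN, start=0, final=3)))
    print(sorted(bigrams(WITH_NULL, start=0, final=4)))
```

In the second set the pair hello-there is gone and only hello-null and null-there remain, so a bigram model built from it no longer ties the greeting to the word that follows it.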

If it screws up a contrived 4 word example, imagine what it does to a real grammar.

The second problem is that Sphinx doesn’t optimize grammars.

Say I wrote a simple grammar with two rules:

rule1 = (hello | hi) halle
rule2 = (hello | hi) joseph

Then added a rule that merged these two rules into a final grammar

hello = rule1 | rule2

The optimal graph is
0 1 hello
0 1 hi
1 2 halle
1 2 joseph

There are only 4 ways to traverse that grammar, it’s easy…

Sphinx makes
0 1 hello
0 1 hi
1 2 halle
0 3 hello
0 3 hi
3 2 joseph

(Actually, it makes something a lot uglier, with about 12 null arcs, but that’s a different bug).

So, it has 4 things to explore on the first node, then 2 on each of 2 different second nodes. 8 paths, twice the work.

A computer has finite memory, so the recognizer only keeps a finite number of things in the “search space” and “prunes” away “less likely” things. On this trivial example, that’s not going to happen, but imagine it would. Say there was only enough memory to search 6 “hypothesis” at a time. The 4 possibility grammar always gets fully searched. The 8 possibility grammar doesn’t, only 6 of the 8 get searched. The correct answer may get thrown away.

The official Sphinx team stance is that complex tasks should only be done with language models in the recognizer, outputting “lattices” of things you might have said, and using a “natural language processor” to sort things out afterwards. Well, that works OK if you’re Siri, but on a “pocket” system, the FSG is the optimal (fastest responding and most accurate) way of dealing with command and control.

So, there are three paths:
1) fix the Sphinx JSGF support ourselves, so that it generates a compact, optimized FSG.
2) generate your FSG outside Sphinx, with other tools, and plug it in. Pocketsphinx can already accept .fsg files, and it doesn’t take much to make OpenEars use them. You can even mix and match in one app, since the .gram is just a way of making Sphinx make its own .fsg internally.
3) use language models, which, as you’ve discovered, generate a ton of “illegal” responses.

Trilemma, three alternatives, all with downsides.
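For path 2, generating the .fsg text from a list of arcs is mostly string formatting. The Python sketch below writes the keyword layout I believe Sphinx FSG files use (FSG_BEGIN, NUM_STATES, START_STATE, FINAL_STATE, TRANSITION, FSG_END). Treat that layout as an assumption and compare it against the output of sphinx_jsgf2fsg before relying on it. The function name and the equal split of probability among arcs leaving a node are also choices made for this example.

```python
from collections import Counter


def to_fsg_text(name, arcs, start, final):
    """Format (src, dst, word) arcs as FSG text (layout is an assumption)."""
    num_states = 1 + max(max(src, dst) for src, dst, _word in arcs)
    out_degree = Counter(src for src, _dst, _word in arcs)
    lines = [
        f"FSG_BEGIN {name}",
        f"NUM_STATES {num_states}",
        f"START_STATE {start}",
        f"FINAL_STATE {final}",
    ]
    for src, dst, word in arcs:
        # Split probability evenly among the arcs that leave this node.
        prob = 1.0 / out_degree[src]
        lines.append(f"TRANSITION {src} {dst} {prob:.6f} {word}")
    lines.append("FSG_END")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    optimal = [(0, 1, "hello"), (0, 1, "hi"), (1, 2, "halle"), (1, 2, "joseph")]
    print(to_fsg_text("greeting", optimal, start=0, final=2))
```

Because the arcs come from the hand-optimized graph, the result has no null arcs and no duplicated branches.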

November 28, 2011 at 5:21 pm #8194

Halle Winkler
Politepix

Joseph, thanks very much for the thorough answer. I agree that this would make a good FAQ entry, I will add it (or a more concise version of it) when I have a moment. I think the advice I should be giving for now is that users output an FSG using the appropriate tool (is it part of CMULTK?).

November 28, 2011 at 5:31 pm #8195

Halle Winkler
Politepix

Whoops, I see you already explained where the FSG tool is. I think I was just having an out-of-body experience due to you casting aspersions on the One True Computer (I keep my Linux installs safely on my servers where they can’t interfere with real work ;) ).

November 28, 2011 at 5:34 pm #8196

Halle Winkler
Politepix

Joseph, what do you think about 4) tuning an LM by significantly raising the probabilities of desired trigrams or bigrams?

November 29, 2011 at 5:08 am #8197

culov
Participant

Thanks for the great answer, Joseph. My hope was to create a large JSGF grammar at runtime. It sounds like unless I can dynamically generate the FSG, I’m out of luck.

April 21, 2014 at 3:32 pm #1020919

Halle Winkler
Politepix

Please check out the new dynamic generation language for OpenEars added with version 1.7: https://www.politepix.com/2014/04/10/openears-1-7-introducing-dynamic-grammar-generation/

April 24, 2014 at 6:07 pm #1021027

Halle Winkler
Politepix

In addition to the dynamic grammar generation that has been added to stock OpenEars in version 1.7, there is also a new plugin called RuleORama which can use the same API in order to generate grammars which are a bit faster and compatible with RapidEars: https://www.politepix.com/ruleorama