> but maybe one of the JSGF users around here can weigh in.
155 pounds, 23.6% body fat.
OK, on a more serious note: the Pocketsphinx JSGF support is rather “broken”. Halle, you can make a FAQ of this or something, because I’m going to give a “FAQ level” answer…
Recognizers like Sphinx don’t care about “pretty” human-readable things like JSGF. They’re machines, and very simple machines at that, called “finite automata”. They traverse a “graph” (a network) of “nodes” (places) connected by “arcs” (paths that lead from place to place). It’s like driving by following directions: take Main Street to Liberty Drive, to Town Center. The streets are the arcs, the intersections are the nodes.
Writing that graph the way the recognizer needs it is hard work. You can go crazy trying to create a list of nodes and arcs by hand. If you can do it, Sphinx can accept it directly. I sometimes write my grammars graphically, drawing the graph in a program called yEd and using a little program I wrote to turn the yEd graph into a Sphinx .fsg file. But that’s hard work, and beyond what most people want to do. So recognizers typically include some sort of grammar-language compiler, like the JSGF support in Pocketsphinx, or the other grammar notations used in Nuance or Dragon products, to convert a human-readable grammar into an FSG “graph” that the recognizer can use. So, say you had this grammar (I’m not going to use full proper JSGF notation, because it looks too much like HTML and confuses this site, and I can’t remember where to put the slashes to fix it):
rule = (hi | hello) there halle
Sphinx would turn it into a graph like
0 1 hi
0 1 hello
1 2 there
2 3 halle
You can see how it does this: there’s a tool called sphinx_jsgf2fsg in the Sphinxbase distribution that calls the Sphinx JSGF compiler and lets you see the results. (Be careful, though: it only compiles and runs properly on computers; if you build it on a fruit-flavored computer substitute, it will act quirky.)
So, having that graph, Sphinx will follow it like a roadmap when recognizing speech: sitting at node 0 until the start of speech, then moving to node 1 if it got “hi” or “hello”, to node 2 when you said “there”, and to node 3 when you said “halle”.
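That roadmap-following is easy to sketch in code. Here’s a toy model (entirely my own representation, not Sphinx’s internals): the FSG is a list of (from_node, to_node, word) arcs, and recognition is just a walk along them:

```python
# Toy FSG for "(hi | hello) there halle": (from_node, to_node, word) arcs.
ARCS = [
    (0, 1, "hi"),
    (0, 1, "hello"),
    (1, 2, "there"),
    (2, 3, "halle"),
]

def accepts(arcs, words, start=0, final=3):
    """Follow the graph like a roadmap, consuming one word per arc."""
    node = start
    for word in words:
        for frm, to, label in arcs:
            if frm == node and label == word:
                node = to
                break
        else:
            # No arc out of the current node matches this word.
            return False
    return node == final

print(accepts(ARCS, ["hi", "there", "halle"]))  # True
print(accepts(ARCS, ["hello", "halle"]))        # False: no arc skips "there"
```

A real decoder scores acoustics and keeps many partial paths alive at once, but the graph-walking skeleton is the same.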
There are two problems with the way Sphinx does this.
First is what I call the “null arc” bug.
And yes, by my standards, it is a bug, even if the Sphinx folks don’t agree. Any time you use any kind of grouping (parentheses, brackets, angle brackets) or a star or plus operator, Sphinx adds unnecessary “null arcs” to the grammar. So our example would become:
0 1 hi
0 1 hello
1 2 (null)
2 3 there
3 4 halle
The “null arc” from 1 to 2 doesn’t sound like much of a problem, does it?
Well, on complex, real-world grammars, you might have dozens or hundreds of words, and each word in the graph may have several (or dozens of) arcs connecting it to other words. Add extra null arcs to just about everything and you much more than double the size of the graph. I’ve had grammars where, when I draw the graph by hand in yEd, there are only 25 nodes. I compile the yEd graph into an actual Sphinx FSG using a tool I wrote, and it runs fine, as fast and efficient as a .lm language model. Better, in fact.
I express the same grammar as a .gram file and run it through the Sphinx JSGF-to-FSG tool, and it grows into something with 150 nodes (yes, six times as many as it should have) and runs like tar.
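Cleaning those null arcs back out is a standard finite-automaton transformation (epsilon removal), and it’s exactly the kind of optimization the JSGF compiler could be doing itself. A rough sketch, using my arc-tuple toy representation with None as the null label (this is not Sphinx’s actual .fsg handling):

```python
def remove_null_arcs(arcs):
    """Epsilon removal: wherever a null arc runs from a to b, copy b's
    outgoing arcs back onto a, then drop the null arcs themselves."""
    arcs = set(arcs)
    changed = True
    while changed:
        changed = False
        for a, b, w in list(arcs):
            if w is None:
                for b2, c, w2 in list(arcs):
                    if b2 == b and (a, c, w2) not in arcs:
                        arcs.add((a, c, w2))
                        changed = True
    # Keep only real word arcs; nodes left unreachable no longer matter.
    return {(a, b, w) for (a, b, w) in arcs if w is not None}

# The buggy compile of "(hi | hello) there halle", null arc from 1 to 2:
buggy = [(0, 1, "hi"), (0, 1, "hello"), (1, 2, None),
         (2, 3, "there"), (3, 4, "halle")]
clean = remove_null_arcs(buggy)
# "there" is now reachable directly from node 1; no null arcs remain.
```

A production version would also renumber nodes and prune the unreachable ones, but the core idea is just this loop.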
A bigger, more complex FSG not only runs slower, it decreases accuracy. Why?
Sphinx is a three-pass recognizer. If you use a language model file, it actually contains three different language models for the three different recognition passes. When you give Sphinx an FSG, or have it make you an FSG from a JSGF automatically, it generates a quick bigram and trigram language model (list of what word can follow another) from the FSG for the first and second passes, and only uses the actual FSG for the third pass, so the extra null arcs make it waste time evaluating things that can’t actually happen.
A proper “bigram” language model (the list of which word can follow which) for the first graph would be something like

hi there
hello there
there halle

Add the null arc and we get this:

hi (null)
hello (null)
(null) there
there halle

We’ve lost the important linguistic knowledge that “there” always follows “hello” or “hi”.
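You can watch that knowledge disappear by extracting the bigram list mechanically from the arcs. A quick sketch, again in my toy arc-tuple representation, with None standing in for the null arc:

```python
def bigrams(arcs):
    """Which label can follow which: chain arcs head-to-tail."""
    return {(w1, w2)
            for a, b, w1 in arcs
            for b2, c, w2 in arcs
            if b == b2}

good = [(0, 1, "hi"), (0, 1, "hello"), (1, 2, "there"), (2, 3, "halle")]
buggy = [(0, 1, "hi"), (0, 1, "hello"), (1, 2, None),
         (2, 3, "there"), (3, 4, "halle")]

print(bigrams(good))
# {('hi', 'there'), ('hello', 'there'), ('there', 'halle')}: the real constraints
print(bigrams(buggy))
# "hi" and "hello" now only predict the null arc; ('hi', 'there') is gone
```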
The recognizer’s first pass has to search those extra arcs, it outputs longer sequences, there are more possibilities, and it gets “lost” more easily.
If it screws up a contrived 4 word example, imagine what it does to a real grammar.
The second problem is that Sphinx doesn’t optimize grammars.
Say I wrote a simple grammar with two rules:
rule1 = (hello | hi) halle
rule2 = (hello | hi) joseph
Then added a rule that merged these two rules into a final grammar
hello = rule1 | rule2
The optimal graph is
0 1 hello
0 1 hi
1 2 halle
1 2 joseph
There are only 4 ways to traverse that grammar, it’s easy… But Sphinx doesn’t merge the shared parts of the two rules. It compiles each rule separately and glues them together, producing:
0 1 hello
0 1 hi
1 2 halle
0 3 hello
0 3 hi
3 2 joseph
(Actually, it makes something a lot uglier, with about 12 null arcs, but that’s a different bug).
So it has 4 things to explore from the first node, then 2 more on each of 2 different second nodes: 8 paths, twice the work.
A computer has finite memory, so the recognizer only keeps a finite number of things in the “search space” and “prunes” away less likely ones. On this trivial example that’s not going to happen, but imagine it could. Say there was only enough memory to search 6 “hypotheses” at a time. The 4-possibility grammar always gets fully searched. The 8-possibility grammar doesn’t: only 6 of the 8 get searched, and the correct answer may get thrown away.
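The doubled bookkeeping is easy to see in code: after the first word, the unmerged graph keeps twice as many live hypotheses, because “hello” and “hi” each sit on two separate arcs. A toy sketch (one hypothesis per matching arc, no path merging; my own representation again):

```python
def live_hypotheses(arcs, start=0):
    """Partial hypotheses after one word: one per arc leaving the start node."""
    return [(to, word) for frm, to, word in arcs if frm == start]

# Optimal merged graph vs. the unmerged one built from rule1 | rule2.
merged = [(0, 1, "hello"), (0, 1, "hi"), (1, 2, "halle"), (1, 2, "joseph")]
unmerged = [(0, 1, "hello"), (0, 1, "hi"), (1, 2, "halle"),
            (0, 3, "hello"), (0, 3, "hi"), (3, 2, "joseph")]

print(len(live_hypotheses(merged)))    # 2
print(len(live_hypotheses(unmerged)))  # 4: "hello" and "hi" each tracked twice
```

With a beam that can only hold, say, 3 hypotheses, the merged graph still gets searched fully while the unmerged one already has to throw something away.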
The official Sphinx team stance is that complex tasks should only be done with language models in the recognizer, outputting “lattices” of things you might have said, and using a “natural language processor” to sort things out afterwards. Well, that works OK if you’re Siri, but on a “pocket” system, the FSG is the optimal (fastest-responding and most accurate) way of handling command and control.
So, there are three paths:
1) fix the Sphinx JSGF support ourselves, so that it generates a compact, optimized FSG.
2) generate your FSG outside Sphinx, with other tools, and plug it in. Pocketsphinx can already accept .fsg files, and it doesn’t take much to make OpenEars use them. You can even mix and match in one app, since the .gram is just a way of making Sphinx make its own .fsg internally.
3) use language models, which, as you’ve discovered, generate a ton of “illegal” responses.
A trilemma: three alternatives, all with downsides.