Joseph S. Wisniewski

Forum Replies Created

Viewing 7 posts - 1 through 7 (of 7 total)

Author

Posts
November 20, 2018 at 8:25 pm in reply to: Are the RapidEars, Rejecto, and RuleORama demos still available? #1032609

Joseph S. Wisniewski
Participant

You’re welcome, and your fix on the front page works fine.

November 20, 2018 at 5:52 pm in reply to: Are the RapidEars, Rejecto, and RuleORama demos still available? #1032605

Joseph S. Wisniewski
Participant

I found the problem.

The “Try XXX” links on the front page banners are all broken, no matter what browser you use. They do something with “set-cart-qty” that just doesn’t work, and you get the “empty cart” error.

The “Download the XXX Demo” links on the individual product pages use “add-to-cart” and seem to work fine.

November 20, 2018 at 5:43 pm in reply to: Are the RapidEars, Rejecto, and RuleORama demos still available? #1032602

Joseph S. Wisniewski
Participant

I did try from a different site on a different ISP and very different router with Firefox and Chrome last night, and an Android phone on T-Mobile this morning, along with an iPad on a different WiFi and on AT&T.

None of that helped.

November 19, 2018 at 9:17 pm in reply to: Are the RapidEars, Rejecto, and RuleORama demos still available? #1032589

Joseph S. Wisniewski
Participant

Hi Halle,

It’s been a while.

Safari on a MacBook Pro with OS 10.13.6
Firefox 63.0.1 on the same MacBook Pro
Firefox 63.0.1 on a Windows 10 machine.

November 28, 2011 at 4:48 pm in reply to: Limit hypothesis to phrases defined in model #8193

Joseph S. Wisniewski
Participant

> but maybe one of the JSGF users around here can weigh in.

155 pounds, 23.6% body fat.

OK, on a more serious note, the Pocketsphinx JSGF support is rather “broken”. Halle, you can make a FAQ of this or something, because I’m going to give a “FAQ level” answer…

Recognizers, like Sphinx, don’t care about “pretty” human-readable things like JSGF. They’re “machines”, and very simple machines at that, called “finite automata”. They “traverse” a “graph” (a “network”) of “nodes” (places) connected by “arcs” (paths that lead from place to place). It’s like following driving directions: take Main Street to Liberty Drive, to Town Center. The streets are the arcs, the intersections are the nodes.

Writing that graph the way the recognizer needs is hard work. You can go crazy trying to create a list of nodes and arcs. If you can do it, Sphinx can accept it directly. I sometimes write my grammars graphically, drawing the graph in a program called yEd, and using a little program I wrote to turn the yEd graph into a Sphinx .fsg file. But that’s hard work, and beyond what most people want to do. So, recognizers typically include some sort of a “grammar language” compiler, like the JSGF support in Pocketsphinx, or other grammar notations used in Nuance or Dragon products, to convert a human-readable grammar to an FSG “graph” that the recognizer can use. So, say you had this grammar (not going to use full proper JSGF notation, because it looks too much like HTML and confuses this site, and I can’t remember where to put the slashes to fix it):

rule = (hi | hello) there halle

Sphinx would turn it into a graph like
0 1 hi
0 1 hello
1 2 there
2 3 halle

You can see how it does this: there’s a tool called sphinx_jsgf2fsg in the Sphinxbase distribution that calls the Sphinx JSGF compiler and lets you see the results. (Be careful, though: it only compiles and runs properly on computers. If you compile it on a fruit-flavored computer substitute, it will act quirky.)

So, having that graph, Sphinx will follow it like a roadmap when recognizing speech, sitting at node 0 until the start of speech, then moving to node 1 if it got “hi” or “hello”, to node 2 when you said “there” and to node 3 when you said “halle”
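To make the “roadmap” idea concrete, here is a minimal Python sketch that walks the example graph above. It is only an illustration, not how Sphinx is implemented: the `accepts` function and the arc tuples are names made up for this example, and there is no acoustic scoring at all, just a check of whether a word sequence can get from node 0 to node 3.

```python
# Arcs from the example graph: (from_node, to_node, word).
ARCS = [
    (0, 1, "hi"),
    (0, 1, "hello"),
    (1, 2, "there"),
    (2, 3, "halle"),
]
START, FINAL = 0, 3


def accepts(words, arcs=ARCS, start=START, final=FINAL):
    """Return True if the word sequence leads from start to final."""
    nodes = {start}
    for word in words:
        # Follow every arc labelled with this word out of the current nodes.
        nodes = {dst for (src, dst, w) in arcs if src in nodes and w == word}
        if not nodes:
            return False  # no arc matched, so the sequence is off the map
    return final in nodes


print(accepts("hi there halle".split()))
print(accepts("hi halle".split()))
```

The first call follows the graph all the way to node 3; the second gets stuck at node 1 because there is no “halle” arc leaving it.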

There are two problems with the way Sphinx does this.

First is what I call the “null arc” bug.

And yes, by my standards, it is a bug, even if the Sphinx folks don’t agree. Any time you use anything inclusive (parentheses, brackets, greater than and less than) or a star or plus operator, Sphinx adds unnecessary “null arcs” to the grammar. So, our example would become:

0 1 hi
0 1 hello
1 2
2 3 there
3 4 halle

The “null arc” from 1 to 2 doesn’t sound like much of a problem, does it?

Well, on complex, real world grammars, you might have dozens or hundreds of words, and each word in the graph may have several (or dozens) of arcs that connect it to other words. Add extra null arcs to just about everything, and you much more than double the size of the graph. I’ve had grammars where, when I draw the graph by hand in yEd, there are only 25 nodes. I compile the yEd graph into an actual Sphinx FSG using a tool I wrote, and it runs fine, as fast and efficient as a .lm language model. Better, in fact.

I express the same grammar as a .gram and use the Sphinx gram to fsg tool, and it grows into something with 150 nodes (yes, 6 times as many as it should have) and it runs like tar.
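For anyone who wants to post-process the compiler output, null arcs can in principle be squeezed out with the textbook “epsilon removal” technique. The sketch below is an assumption-laden illustration in Python, not the tool mentioned above and not anything Sphinx does itself: `remove_null_arcs` is a made-up name, and a null arc is represented as an arc whose word is `None`.

```python
def remove_null_arcs(arcs, start, finals):
    """Return (arcs, finals) of an equivalent graph with no null arcs.

    arcs is a list of (from_node, to_node, word); word is None for a null arc.
    """
    nodes = {start} | {a[0] for a in arcs} | {a[1] for a in arcs}

    def closure(node):
        # Every node reachable from `node` using only null arcs.
        seen, stack = {node}, [node]
        while stack:
            cur = stack.pop()
            for src, dst, word in arcs:
                if src == cur and word is None and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    new_arcs, new_finals = set(), set()
    for node in nodes:
        reach = closure(node)
        if reach & set(finals):
            new_finals.add(node)
        # Copy each word arc that leaves the closure back onto `node`.
        for src, dst, word in arcs:
            if src in reach and word is not None:
                new_arcs.add((node, dst, word))

    # Drop arcs that start at nodes no longer reachable from the start.
    live, stack = {start}, [start]
    while stack:
        cur = stack.pop()
        for src, dst, _ in new_arcs:
            if src == cur and dst not in live:
                live.add(dst)
                stack.append(dst)
    kept = sorted(a for a in new_arcs if a[0] in live)
    return kept, sorted(new_finals & live)


with_null = [(0, 1, "hi"), (0, 1, "hello"), (1, 2, None),
             (2, 3, "there"), (3, 4, "halle")]
print(remove_null_arcs(with_null, start=0, finals=[4]))
```

On the five-arc example with the null arc from 1 to 2, this leaves four word arcs (node 2 simply drops out), which is the same shape as the original graph apart from the node numbering.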

A bigger, more complex FSG not only runs slower, it decreases accuracy. Why?

Sphinx is a three-pass recognizer. If you use a language model file, it actually contains three different language models for the three different recognition passes. When you give Sphinx an FSG, or have it make you an FSG from a JSGF automatically, it generates a quick bigram and trigram language model (list of what word can follow another) from the FSG for the first and second passes, and only uses the actual FSG for the third pass, so the extra null arcs make it waste time evaluating things that can’t actually happen.

A proper “bigram” language model for the first FSG would be something like

start-hi
start-hello
hi-there
hello-there
there-halle
halle-end
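A list like this can be read straight off the graph: a word can follow another word if its arc leaves the node the other word’s arc arrives at. Here is a small Python sketch of that idea. It is only a guess at the general principle, not the code Sphinx uses internally; the `bigrams` function and the “start”/“end” markers are names chosen for this example.

```python
def bigrams(arcs, start, final):
    """List (previous_word, next_word) pairs allowed by an FSG arc list."""
    # Words that can arrive at each node.
    incoming = {}
    for src, dst, word in arcs:
        incoming.setdefault(dst, set()).add(word)
    incoming.setdefault(start, set()).add("start")

    pairs = set()
    for src, dst, word in arcs:
        # Any word arriving at `src` may be followed by `word`.
        for prev in incoming.get(src, ()):
            pairs.add((prev, word))
    for prev in incoming.get(final, ()):
        pairs.add((prev, "end"))
    return sorted(pairs)


fsg = [(0, 1, "hi"), (0, 1, "hello"), (1, 2, "there"), (2, 3, "halle")]
for prev, nxt in bigrams(fsg, start=0, final=3):
    print(f"{prev}-{nxt}")
```

Run on the four-arc example, it yields the pairs for “hi” and “hello” at the start, both of them followed by “there”, then “halle”, then the end.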

Add the null arc and we get this:
start-hello
start-hi
hello-null
hi-null
null-there
there-halle
halle-end

We’ve lost the important linguistic knowledge that “there” always follows “hello” or “hi”.
The recognizer’s first pass has to search these extra arcs. It outputs longer sequences, there are more possibilities, and it gets “lost” more easily.

If it screws up a contrived 4 word example, imagine what it does to a real grammar.

The second problem is that Sphinx doesn’t optimize grammars.

Say I wrote a simple grammar with two rules:

rule1 = (hello | hi) halle
rule2 = (hello | hi) joseph

Then added a rule that merged these two rules into a final grammar

hello = rule1 | rule2

The optimal graph is
0 1 hello
0 1 hi
1 2 halle
1 2 joseph

There are only 4 ways to traverse that grammar, it’s easy…

Sphinx makes
0 1 hello
0 1 hi
1 2 halle
0 3 hello
0 3 hi
3 2 joseph

(Actually, it makes something a lot uglier, with about 12 null arcs, but that’s a different bug).

So, it has 4 things to explore on the first node, then 2 on each of 2 different second nodes. 8 paths, twice the work.
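Collapsing the duplicated “hello” and “hi” arcs is a job for the standard “subset construction” (determinization) from automata theory. The Python sketch below is an illustration of that textbook technique under the assumption that the null arcs are already gone; it is not something Sphinx provides, and `determinize` is a made-up name.

```python
def determinize(arcs, start):
    """Merge arcs that leave the same node with the same word.

    arcs is a list of (from_node, to_node, word) with no null arcs.
    Returns a new arc list whose nodes are renumbered from 0.
    """
    start_set = frozenset([start])
    ids = {start_set: 0}          # set of old nodes -> new node number
    queue = [start_set]
    out = []
    while queue:
        current = queue.pop(0)
        # Group the destinations of all arcs leaving this set, by word.
        by_word = {}
        for src, dst, word in arcs:
            if src in current:
                by_word.setdefault(word, set()).add(dst)
        for word in sorted(by_word):
            target = frozenset(by_word[word])
            if target not in ids:
                ids[target] = len(ids)
                queue.append(target)
            out.append((ids[current], ids[target], word))
    return out


unoptimized = [(0, 1, "hello"), (0, 1, "hi"), (1, 2, "halle"),
               (0, 3, "hello"), (0, 3, "hi"), (3, 2, "joseph")]
for arc in determinize(unoptimized, start=0):
    print(*arc)
```

Fed the six-arc graph above, it merges nodes 1 and 3 into a single node and gives back the four-arc optimal graph.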

A computer has finite memory, so the recognizer only keeps a finite number of things in the “search space” and “prunes” away “less likely” things. On this trivial example, that’s not going to happen, but imagine it did. Say there was only enough memory to search 6 “hypotheses” at a time. The 4 possibility grammar always gets fully searched. The 8 possibility grammar doesn’t: only 6 of the 8 get searched. The correct answer may get thrown away.

The official Sphinx team stance is that complex tasks should only be done with language models in the recognizer, outputting “lattices” of things you might have said, and using a “natural language processor” to sort things out afterwards. Well, that works OK if you’re Siri, but on a “pocket” system, the FSG is the optimal (fastest responding and most accurate) way of dealing with command and control.

So, there’s three paths:
1) fix the Sphinx JSGF support ourselves, so that it generates a compact, optimized FSG.
2) generate your FSG outside Sphinx, with other tools, and plug it in. Pocketsphinx can already accept .fsg files, and it doesn’t take much to make OpenEars use them. You can even mix and match in one app, since the .gram is just a way of making Sphinx make its own .fsg internally.
3) use language models, which, as you’ve discovered, generate a ton of “illegal” responses.

Trilemma: three alternatives, all with downsides.

September 13, 2011 at 3:18 pm in reply to: Way to see phonemes OpenEars heard #7604

Joseph S. Wisniewski
Participant

I do this as a diagnostic technique.

* Build a dictionary with 40 words, each word being just one of the CMU phonemes
* Build a language model or FSG where each of the words can follow any other word
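
The two steps above can be sketched in Python roughly as follows. Treat it as a starting point under several assumptions: the phone list is the 39-phone set used by the CMU pronouncing dictionary (your acoustic model may use a slightly different set or different case, so check the phone list that ships with it), the FSG text layout follows the FSG_BEGIN / TRANSITION / FSG_END format as I understand it from Sphinxbase, and the function names are made up for this example.

```python
# The 39 phones of the CMU pronouncing dictionary (an assumption; match
# this list to the phone set of the acoustic model you actually use).
PHONES = ("AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
          "OW OY P R S SH T TH UH UW V W Y Z ZH").split()


def phone_dictionary(phones=PHONES):
    """One dictionary line per phone: the 'word' is pronounced as itself."""
    return "\n".join(f"{p} {p}" for p in phones) + "\n"


def phone_loop_fsg(phones=PHONES, name="phone_loop"):
    """A one-node FSG where any phone 'word' can follow any other."""
    prob = 1.0 / len(phones)
    lines = [f"FSG_BEGIN {name}", "NUM_STATES 1",
             "START_STATE 0", "FINAL_STATE 0"]
    # Every phone is a self-loop on node 0, all equally likely.
    lines += [f"TRANSITION 0 0 {prob:.6f} {p}" for p in phones]
    lines.append("FSG_END")
    return "\n".join(lines) + "\n"


print(phone_dictionary()[:20])
print(phone_loop_fsg()[:80])
```

Write the two strings to a .dic and a .fsg file and load them in place of the normal dictionary and grammar.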

Fair warning, the results will be very strange.

tidigits is a very “clean” grammar, it works well with fluent speech. And, like most 8k models, it’s most effective on adult males. Kids work best with 16k models (so do women). I’d suggest switching over to VoxForge 0.4, from the CMU site. To make this work well, though, you really need models built from kid speech.

You could try model adaptation.

August 22, 2011 at 9:39 pm in reply to: Detecting single letters in the alphabet #7496

Joseph S. Wisniewski
Participant

Letter recognition can only be done in conjunction with a spelling application. In other words, if you have a list of street names, spelling

W O O D W A R D

will work, if you use an n-best list and search through the dictionary for the results. As long as your task can be constrained by a dictionary, even a huge dictionary, you’re OK. You’ll have to patch OpenEars for N-best output, though, and build an FSG or LM. The LM will work better if you build it from your dictionary.
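
As a rough illustration of the “search through the dictionary” step, here is a Python sketch. Everything in it is hypothetical: the n-best list is invented sample data, assumed to be ordered best first, and `best_dictionary_match` is a made-up helper, not an OpenEars or Pocketsphinx call.

```python
def best_dictionary_match(nbest, dictionary):
    """Return the first n-best spelling that is a real dictionary entry.

    nbest is a list of letter-sequence hypotheses, best first,
    e.g. "W O O D W A R D". Returns None if nothing matches.
    """
    words = {w.upper() for w in dictionary}
    for hypothesis in nbest:
        # Join the recognized letters into a candidate word.
        candidate = "".join(hypothesis.split()).upper()
        if candidate in words:
            return candidate
    return None


# Invented example data: the top hypothesis has one wrong letter.
nbest = ["W O O D W O R D", "W O O D W A R D", "W O O D Y A R D"]
streets = ["Woodward", "Gratiot", "Michigan"]
print(best_dictionary_match(nbest, streets))
```

The top hypothesis is not a street name, so it is skipped and the second one wins. That is the whole point of constraining the task with a dictionary.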

If your letter sequences truly are random, you’re dealing with something that’s beyond the state of the art. It’s beyond the state of the art for human listeners, too. Give it a try, read some random letter sequences to people and see how many they get wrong.

Is this something you’re still working on?