Adapting OpenEars to detect one-syllable consonant-vowel pairs

This topic has 4 voices, contains 14 replies, and was last updated by  Halle 135 days ago.

November 15, 2011 at 9:51 pm #8071

Pi

The Japanese language works on a 9×5 grid: 9 consonants and 5 vowels, so 9×5 = 45 CV pairs. One CV pair (like ‘Fu’, ‘Ji’, ‘Ka’, ‘Zu’, ‘Mo’, ‘To’, etc.) is called a Kana ( http://en.wikipedia.org/wiki/Kana )

I would like to detect a stream of Kana in real-time.

Looking at the Sphinx documentation, it seems I just need to change the .dic and .languagemodel files, which I did: http://cl.ly/BqZY

now this gives an error in SphinxTrain.c

so I went onto the CMUSphinx channel on IRC, and the resident expert told me:

“to load jsgf grammar you need to use -jsgf option instead of -lm option by default which is used by openears”

however he then went on to say ‘actually from the log it seems openears creates grammar itself’

Looking through the OpenEars Xcode project, I cannot figure out how to make use of this information. I can’t see what I need to do.

Can anyone help?

π

PS I would be willing to pay someone to get this working… please anyone interested e-mail me (sunfish7|gmail|c0m)
PPS but if anyone can answer here, I am tremendously grateful!

November 15, 2011 at 10:01 pm #8075

Pi

somehow I seem to have lost the ability to edit my own post:

I should add more information… from the screenshot you will be able to see how I’m attempting to catch a stream of Kana, using

grammar KanaStream;

public = BAH | BEH | BEE | BOR | BOO;

public = … ( I can’t write this because the HTML highlighting gets confused by it… it is in the screenshot )

I am not at all confident this is correct. I got it from reading http://cmusphinx.sourceforge.net/wiki/tutoriallm

Also, I’m not sure how such a system will output data. Will it actually output ‘BAH BEH BEE’ etc. as I speak these consonant-vowel pairs, or is some modification to the code required?
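Since the forum software ate the angle brackets, here is a minimal sketch of how a grammar of this kind could be generated as text. It is not the grammar from the screenshot: the rule names `<kana>` and `<stream>` and the helper `build_jsgf` are assumptions made for illustration, and only the word list comes from the post. It relies on the JSGF `+` operator (one or more repetitions) to accept a stream of words.

```python
# Hypothetical helper: build a JSGF grammar that accepts one or more CV words.
WORDS = ["BAH", "BEH", "BEE", "BOR", "BOO"]  # word list taken from the post above


def build_jsgf(words, grammar_name="KanaStream"):
    """Return JSGF text whose public rule matches a sequence of the given words."""
    alternatives = " | ".join(words)
    lines = [
        "#JSGF V1.0;",                  # JSGF header line
        f"grammar {grammar_name};",
        f"<kana> = {alternatives};",    # rule name <kana> is an assumption
        "public <stream> = <kana>+;",   # '+' means one or more repetitions
    ]
    return "\n".join(lines) + "\n"


grammar_text = build_jsgf(WORDS)
print(grammar_text)
```

The printed text would be saved as a .gram file. Every word used in the grammar also needs an entry in the .dic file, otherwise the recognizer reports it as missing from the dictionary.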

π

November 15, 2011 at 10:13 pm #8078

Halle

Take a look at the OpenEars documentation to see how to load JSGF in OpenEars rather than a language model. It’s documented on this page:

http://www.politepix.com/openears/yourapp

OpenEars does not create a .gram file itself; you must add it and set the startListening: method so that it knows JSGF is being used.

I’d be very surprised if recognition functioned well with your consonant-vowel pairs since those are only one or two phonemes, but the way it will work (if it works) is that after a detected silence on the part of the speaker, recognition will be attempted. You can see how this works by running the sample app.

November 15, 2011 at 11:04 pm #8079

Halle

A secondary issue, besides the problem of trying to detect non-words using an acoustic model trained with phrases, is that the English-language acoustic model of OpenEars will not be able to detect Japanese phonemes, except in the rare cases where they overlap with English phonemes and also happen to be recognized successfully.

November 15, 2011 at 11:22 pm #8080

Pi

Hi Halle,

Thanks for the replies, and thanks for this incredible work you have done. Looking through this project, I would never have managed to do this myself. It is a Herculean task!

I am so stupid! A few weeks back I installed it and only read the help as far as getting the sample app working. I was unaware of the help page you linked. And it couldn’t have been any easier to find. My bad.

Ok, so it looks on first sight as if I just need to change a NO to a YES on the last parameter of

[self.pocketsphinxController startListeningWithLanguageModelAtPath:self.pathToGrammarToStartAppWith
                                                  dictionaryAtPath:self.pathToDictionaryToStartAppWith
                                               languageModelIsJSGF:YES]; // YES tells OpenEars the file is a JSGF grammar

and it seems I need to do this five times or so… looks like there are several situations that warrant starting the engine up.

however, this is still giving a runtime error.

then I noticed the documentation specifies that a grammar file shouldn’t have the extension .languagemodel but instead should be .gram

so I changed the file name. also I ran a search through the code and found one instance that required changing:

self.pathToGrammarToStartAppWith = [NSString stringWithFormat:@"%@/%@",[[NSBundle mainBundle] resourcePath], @"OpenEars1.gram"];

I also found a few other references but couldn’t figure out whether I should change it anywhere else. So I didn’t.

now it runs. But as soon as I say something it breaks.

http://cl.ly/BqUT

Not sure how to push it forwards now…

PS as regards your comments, firstly I believe there is a good chance it will work. I am choosing a set of phonemes that are all linearly independent of one another, so to speak. So if I’m using k I will not use g; if I use t I will not use d; etc. That cuts the set down from, say, 35 to under 15. Also, by only using a minimal dictionary of CV pairs, I will not be using a lot of combinations. Hopefully the engine can look through the dictionary, pick out all the triphones being used, and only work with this set. If it does that, I’ll be using a very small fraction of the triphones of a full language model, maybe 2%.

Secondly, I’m not actually at all interested in Japanese. Sorry, misleading post title. I’m just interested in getting the computer to recognise which CV pair on the grid was spoken, so that I would be able to speak any combination and it would hit them spot on. It is for a speech keyboard ( which I need for myself as I have chronic RSI ). I am looking at training my own acoustic model if necessary, but I would like to try this experiment first; I don’t really fancy speaking gibberish into the microphone for five days solid, and I am daunted by the task of incorporating all of that into this framework. Even with this task I am out of my depth. I just put it as Japanese because I have recently discovered they use this grid for their language; I am fascinated. I was hoping to find some Kana recognition software, and I’m disappointed I can’t find any. It was still in my head when I was writing the post…
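To make the reduced-phoneme idea above concrete, here is a rough sketch of generating the dictionary lines for a CV grid. The consonant subset, the word spellings, and the mapping from spellings to ARPAbet phone symbols (the symbol set used by the CMU pronouncing dictionary) are all assumptions for illustration, as is the helper name `build_dictionary`; none of them come from the actual files used in this thread.

```python
from itertools import product

# Assumed subset: one consonant from each confusable pair (K but not G, T but not D).
CONSONANTS = {"B": "B", "K": "K", "T": "T", "M": "M"}
# Assumed spelling -> ARPAbet vowel mapping, chosen to match BAH/BEH/BEE/BOR/BOO.
VOWELS = {"AH": "AA", "EH": "EH", "EE": "IY", "OR": "AO", "OO": "UW"}


def build_dictionary(consonants, vowels):
    """Return one 'WORD<tab>PHONE PHONE' line per consonant-vowel pair."""
    lines = []
    for (c_spell, c_phone), (v_spell, v_phone) in product(
        consonants.items(), vowels.items()
    ):
        lines.append(f"{c_spell}{v_spell}\t{c_phone} {v_phone}")
    return lines


dic_lines = build_dictionary(CONSONANTS, VOWELS)
print("\n".join(dic_lines))
```

With 4 consonants and 5 vowels this yields 20 entries; a full 9×5 grid would yield 45. The same word list can then be used for the alternatives in the grammar file.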

November 15, 2011 at 11:35 pm #8081

Halle

Thank you. If you want debugging help, you should read this for info about what logging is necessary to post:

http://www.politepix.com/forums/topic/install-issues-and-their-solutions/

November 16, 2011 at 12:03 am #8082

Pi

here is the log: http://pastebin.com/hzqdYzAU

EDIT: Oops found it

line 205: JSGF parse of /var/mobile/Applications/A97B621D-16D9-4BF8-B00B-85D3FF3635A6/OpenEarsSampleProject.app/OpenEars1.gram failed

so I need to go back to the grammar documentation http://cmusphinx.sourceforge.net/wiki/tutoriallm I guess

Nicolai has just told me the latest version of pocketSphinx reports more detailed information…

November 16, 2011 at 12:35 am #8084

Pi

OK, I don’t know what I was thinking the first time I wrote the grammar file; it was all wrong.

However, I’ve gone through the documentation and I can’t see what is wrong with this:

http://cl.ly/Bq6t

ERROR: "fsg_search.c", line 322: The word 'and' is missing in the dictionary

I am following the syntax for a recursive grammar to the letter. I can’t figure out why it is raising an error. Any ideas?

( I can’t paste the grammar as text because it won’t display properly: it uses the same angle brackets as HTML formatting )

November 16, 2011 at 12:53 am #8085

Pi

I am an idiot, I thought ‘and’ was a reserved keyword.

removing it, it works!

Although it is indeed unusable. Even with only five words ( Bah Beh Bii Boh Buu ) it gets it right most of the time, say 70%, but for example I might say ‘Bee Bee Bah’ and it catches ‘Bee Bee Bah Beh’

which makes me scratch my head. it has completely invented what is to my ears a very unmistakable ‘B’ sound.

it’s not as if it just got the phoneme wrong. it has actually got the wrong number of syllables.

tomorrow I will play around with changing the phonemes, but it is looking like game over :|

November 16, 2011 at 11:17 am #8087

Halle

For what it’s worth, 70% is a lot better than I would have expected for what is basically a phoneme detection application.

November 16, 2011 at 2:17 pm #8088

Pi

I am going to experiment a little further by switching to a Spanish model ( Spanish only has 5 vowel phonemes IIRC ). I may be able to get a significant improvement.

however, I am kind of disturbed by the fact that if I speak k syllables, it is pretty much random how many syllables come back

if I say ‘ba ba boo boo bee’ it may give ‘boo boo bee’

if I say ‘ba boo ba’ it might give ‘ba boo beh bor’

without knowing the intricacies of the algorithm I have no idea whether there is any possibility of getting a decent recognition.
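One way to put numbers on these errors, rather than eyeballing them, is to align what was spoken against what was recognised and count substitutions, deletions, and insertions. The sketch below uses a standard minimum-edit-distance alignment; the function name `align_counts` is hypothetical, and this is a generic scoring technique, not something OpenEars or Pocketsphinx provides.

```python
def align_counts(spoken, heard):
    """Return (substitutions, deletions, insertions) for a minimum-edit alignment."""
    n, m = len(spoken), len(heard)
    # cost[i][j] = (total, subs, dels, ins) needed to turn spoken[:i] into heard[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)      # everything spoken so far was dropped
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)      # everything heard so far was invented
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = cost[i - 1][j - 1]
            if spoken[i - 1] == heard[j - 1]:
                best = (t, s, d, k)            # match, no edit
            else:
                best = (t + 1, s + 1, d, k)    # substitution
            t, s, d, k = cost[i - 1][j]
            best = min(best, (t + 1, s, d + 1, k))   # deletion
            t, s, d, k = cost[i][j - 1]
            best = min(best, (t + 1, s, d, k + 1))   # insertion
            cost[i][j] = best
    return cost[n][m][1:]


# The two examples from this post:
print(align_counts("ba ba boo boo bee".split(), "boo boo bee".split()))  # (0, 2, 0)
print(align_counts("ba boo ba".split(), "ba boo beh bor".split()))       # (1, 0, 1)
```

On the two examples above, the first is two deleted syllables and the second is one substitution plus one inserted syllable. Logging these counts over many test utterances would show whether the dominant problem is inserted and dropped syllables or wrong phonemes.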

PS thanks for correcting the topic!

November 19, 2011 at 1:13 am #8142

Joseph S. Wisniewski

There’s not a Spanish model worth bothering with in Sphinx format, and the vowel coordinates are pretty far off.

You’ll also find that there are a lot of Japanese CV pairs beyond the basic 5×9. Have a look at the phoneme set used for the Julius recognition system. Julius is a robust, free Japanese recognizer. There may be scripts out there to convert the Julius acoustic models to Sphinx format.

Good luck.

November 19, 2011 at 10:40 am #8143

Pi

Hey Joseph,

I’m starting to see the same people in all the speech recognition hangouts now :)

I am working with an HTK engineer to build my own acoustic model, then I am going to test it in HTK and if it is decent I should be able to drop it into OpenEars ( hopefully?! ). I will post back to the thread when I get some result either way.

PS I’m going to check out Julius as well…

January 4, 2012 at 2:41 pm #8365

aerialcombat

I’m sort of having a similar problem. I would be very much interested to know how everything turned out.

January 4, 2012 at 3:50 pm #8366

Halle

This kind of task is unlikely to give really satisfactory results for a commercial app. The context of entire words and phrases, which is what the HMM was built with, is not really dispensable for decent results. I think it’s more something to experiment with or do research on than something to base an app concept on, if you’re expecting to ship an app in the near term that does a great job at recognizing different syllables.

There are a few ideas like this that come up a lot here and in other similar places (although they get asked more here than elsewhere, I’ve noticed, I think because I hear about a large variety of small and practical commercial projects here) which are just very difficult challenges: keyword spotting, recognizing digits, syllable detection, and pronunciation correctness rating. I would say that these are dangerous tasks to base an expensive project on (expensive either in terms of your income-productive development time as an indie dev, or a client’s or employer’s budget) and should probably be rethought in that case, but they are interesting to pursue in lower-stakes situations.
