Adding custom words to the LanguageModelGeneratorList.txt

This topic has 25 replies, 3 voices, and was last updated 8 years ago by Powerkey.

January 22, 2015 at 10:32 am #1024493

kunal.shah
Participant

Hi Halle,
I have a huge list of words that I’d like the recogniser to recognise. To increase the accuracy, I’d dynamically assign shorter lists. But, in this case, every time the app runs, every time it sees new words, I get the following:

Using convertGraphemes for the word or phrase SAMEVAR which doesn’t appear in the dictionary

Once this process is done, the recogniser works fine.
I’d like to know if there’s a way to add these new words to the LMGList.txt from code (not by adding them manually), so that the same words don’t have to be processed again every time the app loads.

Thanks in advance,
Kunal.

January 22, 2015 at 10:49 am #1024494

Halle Winkler
Politepix

Yes, you can add new words to the master phonetic lookup dictionary. This is a very good practice in a case like yours where there are words which are proper names with a very specific pronunciation but they derive from other languages (I almost recommended that you do that yesterday but I didn’t want to overwhelm you with info while you were still doing some initial troubleshooting, so I’m happy to see you figured it out). I wrote a blog post some time back about how to do it – the only big change you need to keep in mind now is that you can just directly edit LanguageModelGeneratorList.txt rather than using the methods described in the post:

https://www.politepix.com/2012/12/04/openears-tips-and-tricks-5-customizing-the-master-phonetic-dictionary-or-using-a-new-one/

So when you read the post, the important part is the required formatting, the ordering, and how to find the phonetic transcription for your new words when you add them to AcousticModelEnglish.bundle/LanguageModelGeneratorList.txt; you don’t have to worry about the programmatic part about how to reference a special custom file. If it’s working, you’ll see the word found without having to fall back to the grapheme generator.

January 22, 2015 at 11:21 am #1024495

kunal.shah
Participant

Hi,
I’ve already read the article, and it was pretty helpful! Writing the code to add entries to the text file shouldn’t be too strenuous. The issue is: where do I get the pronunciation breakdown for each word?
For example: CAT -> K AE T

Is the only option to work out the pronunciation for each word manually, or is there something like a word-to-phonemes function that would do it for me?
That way, once I run the app, it would give me the phonemes for all the unknown words, which I could then add to the master.

January 22, 2015 at 11:27 am #1024496

Halle Winkler
Politepix

If you want, you can just get them out of the .dic file that is generated by LanguageModelGenerator at the moment – you have the path to it, since that is what you are passing to startListening:. The fallback generator is a word to phonemes function so you can just re-use its results.

However, you will get better accuracy results if you follow my advice from the blog post and seek out rhymes in the master lookup dictionary and do it manually.

January 22, 2015 at 11:49 am #1024497

kunal.shah
Participant

Hi,
I found the .dic file by doing the following:

Window -> Organizer -> select your iPhone -> Applications -> select your iPhone project name, and you can see it below in “Data files in Sandbox”

The problem here is, I can’t select or open the .dic file.

Any way around this?

January 22, 2015 at 12:05 pm #1024552

kunal.shah
Participant

Ok! I’ve got the .dic problem solved for now.
Can’t find the fallback method though?

January 22, 2015 at 12:13 pm #1024553

Halle Winkler
Politepix

You don’t call the method directly, you just use the results from it having been called when your .dic file was created.

January 22, 2015 at 12:17 pm #1024554

kunal.shah
Participant

Hi,
I’d just like to clarify a doubt. As far as I understand:

1. Once I run the app, for all the words that are not in the dictionary, the fallback method generates the pronunciations for them and adds them to the .dic file. The next time the app is run, the same words won’t go through the fallback method again.

2. A simpler solution would be to initialise the language array once with my entire finite list of words that I want the recogniser to understand.
Then I run the app and open the .dic file, which would already contain the entire list of words (including those not previously in the dictionary). Then I can just copy this list into the master, so that no new words are added every time the app runs.

The only drawback would be that I’d have to do this every time I want to add new irregular words for the recogniser to identify.

Is this right?

January 22, 2015 at 12:50 pm #1024555

Halle Winkler
Politepix

1. Once I run the app, for all the words that are not in the dictionary, the fallback method generates the pronunciations for them and adds them to the .dic file. The next time the app is run, the same words won’t go through the fallback method again.

No – if you look at the logging, it shows that the fallback method is being run every time you run the app. That is the situation which prompted you to compose this question, I believe:

every time the app runs, every time it sees new words, I get the following:

Using convertGraphemes for the word or phrase SAMEVAR which doesn’t appear in the dictionary

Your question above is whether you can change LanguageModelGeneratorList.txt so that the fallback method isn’t used, which you can – by adding entries to it as described in the blog post.

Then you followed up and asked a new question, which was where to find pronunciations that were automatically generated so you wouldn’t have to create them yourself according to my suggestion for finding rhyming words in the blog post. I explained that you should actually create them manually, but that fallback pronunciations are being automatically created when you create your .dic file using OELanguageModelGenerator, so you can get all of the automatically-created pronunciations out of the .dic file that is being created by OELanguageModelGenerator.

Take the .dic file that is generated in the app session that has all of the logging like this:
```
Using convertGraphemes for the word or phrase SAMEVAR which doesn’t appear in the dictionary
```
And add its lines which are not present in LanguageModelGeneratorList.txt to LanguageModelGeneratorList.txt, using a text editor. This will not give you as good results as making careful manual entries, but it is the way to get all of the automatically-created pronunciations that were made by the fallback method.
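
A minimal sketch of that step, for anyone who would rather script it than compare the two files by eye. It assumes both files use the word, tab, transcription layout discussed in this thread; the function name `missing_entries` is made up for illustration and is not part of OpenEars. It only reports the lines to add, so the placement in the lookup list is still up to you.

```python
def missing_entries(lookup_text, dic_text):
    """Return the .dic lines whose word is not already in the lookup list.

    Assumption: every entry is "WORD<tab>TRANSCRIPTION" on its own line.
    Words are compared case-insensitively.
    """
    known = {
        line.split("\t", 1)[0].upper()
        for line in lookup_text.splitlines()
        if "\t" in line
    }
    missing = []
    for line in dic_text.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        word = line.split("\t", 1)[0].upper()
        if word not in known:
            missing.append(line)
            known.add(word)  # avoid reporting the same word twice
    return missing


# Example with the entries quoted in this thread.
lookup = "NAW\tN AA\nNAWROCKI\tN AA V R OW T S K IY\n"
generated = "NAW\tN AA\nNAWAZI\tN AH W AA Z IY\n"
for entry in missing_entries(lookup, generated):
    print(entry)
```

In practice you would read the two strings from the .dic file pulled off the device and from the lookup list in your project, then paste the reported lines into the lookup list.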

January 22, 2015 at 1:10 pm #1024556

kunal.shah
Participant

The issue I’m facing here is,

2015-01-22 17:35:35.177 [1946:244530] The word NAWAZI was not found in the dictionary /private/var/mobile/Containers/Bundle/Application/B552E17A-CA8E-494B-BA37-99E932F83859/Zomato-Merchant.app/AcousticModelEnglish.bundle/LanguageModelGeneratorLookupList.text/LanguageModelGeneratorLookupList.text.
2015-01-22 17:35:35.178 [1946:244530] Now using the fallback method to look up the word NAWAZI
2015-01-22 17:35:35.180 [1946:244530] If this is happening more frequently than you would expect, the most likely cause for it is since you are using the English phonetic lookup dictionary is that your words are not in English or aren’t dictionary words, or that you are submitting the words in lowercase when they need to be entirely written in uppercase. This can also happen if you submit words with punctuation attached – consider removing punctuation from language models or grammars you create before submitting them.
2015-01-22 17:35:35.181 [1946:244530] Using convertGraphemes for the word or phrase NAWAZI which doesn’t appear in the dictionary

I get the above in the log, though I’ve added the word NAWAZI in the LanguageModelGeneratorLookupList.txt:

NAW N AA
NAWAZI N AH W AA Z IY //Added word
NAWROCKI N AA V R OW T S K IY

January 22, 2015 at 1:18 pm #1024557

Halle Winkler
Politepix

It will either be because the formatting is incorrect (it has to be exactly as described in the blog post, nothing extra, nothing different – if your “Added word” comment above is present in the file, that must be removed) or it is being added to the wrong LanguageModelGeneratorLookupList.txt file that isn’t the one ending up in your app bundle.

If it is formatted correctly and present in your app bundle it will be found and used, just like all the other words in there are being found.

You can verify its presence in your app bundle by getting your app bundle and seeing what is in its LanguageModelGeneratorLookupList.txt. Regarding the formatting, please read the blog post section on formatting very carefully.

This is just a minor oversight of some kind, so take your time and do standard troubleshooting steps to find what is happening.

January 22, 2015 at 1:58 pm #1024558

kunal.shah
Participant

Ok! So, I’ve tried everything you said and followed the formatting too. But, it still doesn’t seem to work and goes into the fallback method every time.

The path of the file here is:

OpenEars Framework->AcousticModelEnglish.bundle->LanguageModelGeneratorLookupList.txt

This is the file to which I add the new words, right?
If so, I’ve done it and it still goes into the fallback function for the new words.

January 22, 2015 at 2:03 pm #1024559

kunal.shah
Participant

Also, I just deleted an existing word from the .txt file to check whether it then goes into the fallback method, and it does.
But, when I just paste the same line back, it goes into the fallback method AGAIN.

January 22, 2015 at 2:04 pm #1024560

Halle Winkler
Politepix

Troubleshooting information from previous post:

You can verify its presence in your app bundle by getting your app bundle and seeing what is in its LanguageModelGeneratorLookupList.txt.

January 22, 2015 at 2:17 pm #1024561

Halle Winkler
Politepix

But, when I just paste the same line back, it goes into the fallback method AGAIN.

Yes, but this is going to be due to a misattribution of some kind. It’s already evident that the words which are present in LanguageModelGeneratorLookupList.txt get looked up by OELanguageModelGenerator successfully as long as they are formatted correctly and are present in your app, since otherwise, 100% of your words would be using the fallback method.

This is what is documented in the blog post:

The right formatting is as follows: the word in all-capital letters, followed by a tab, followed by the transcription of its pronunciation, with nothing more until the end of the line.

This is an advanced application of OpenEars’ capabilities, so it is on you to verify formatting details like making sure that the entry has no space at the beginning, one tab (not a run of spaces) before the phonetic transcription, and a line break immediately after the phonetic transcription, so that it matches all of the other entries. If somehow your copy/paste is converting tabs to runs of spaces, that is important for you to troubleshoot on your end because it is a local setting of some kind.

It is also important for you to take the standard troubleshooting steps for a situation where you aren’t sure whether your file modifications are ending up in your app, such as looking at the contents of your app bundle to verify whether your changes are ending up there.
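
The formatting rules above can be checked mechanically. This is a rough sketch, not an official tool: `check_entry` is a hypothetical helper, and the check on the transcription assumes phoneme symbols are uppercase letter groups separated by single spaces, as in the entries quoted in this thread.

```python
def check_entry(line):
    """Return a list of formatting problems found in one lookup list entry.

    Pass the line without its trailing "\n". An empty list means the entry
    matches the layout: word, exactly one tab, transcription, nothing else.
    """
    problems = []
    if line.endswith("\r"):
        problems.append("carriage return before line break")
        line = line[:-1]
    if line != line.strip(" \t"):
        problems.append("leading or trailing whitespace")
    if line.count("\t") != 1:
        problems.append("expected exactly one tab")
        return problems
    word, phones = line.split("\t")
    if not word or " " in word:
        problems.append("word is empty or contains a space")
    # Assumption: each phoneme symbol is a run of uppercase letters.
    if not all(p.isalpha() and p.isupper() for p in phones.split(" ")):
        problems.append("transcription has unexpected characters or spacing")
    return problems


print(check_entry("NAWAZI\tN AH W AA Z IY"))
print(check_entry("NAWAZI N AH W AA Z IY //Added word"))
```

The first call reports no problems; the second reports the missing tab, which is what happens when a copy/paste turns the tab into a space.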

January 22, 2015 at 2:33 pm #1024562

kunal.shah
Participant

Ok! So, what I’ve just figured is, if I open the generated .dic file’s entries of the unknown words and copy and paste them to the .txt file, it works fine. But, if I create my own entry it doesn’t.
I have a question here: If I define a pronunciation different from the one the fallback method generates, is it still a problem?

January 22, 2015 at 2:40 pm #1024563

Halle Winkler
Politepix

So, what I’ve just figured is, if I open the generated .dic file’s entries of the unknown words and copy and paste them to the .txt file, it works fine. But, if I create my own entry it doesn’t.

That just means that your entry has different formatting. Do you have a text editor that will let you “see invisibles”? It will probably show you that your “tabs” are actually runs of spaces, that you are somehow pasting in a different form of line break, or that your letters aren’t UTF-8.

If I define a pronunciation different from the one the fallback method generates, is it still a problem?

From my post above:

However, you will get better accuracy results if you follow my advice from the blog post and seek out rhymes in the master lookup dictionary and do it manually.
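
If no editor with a “see invisibles” mode is at hand, the same check can be approximated in a few lines. This is only a sketch; `show_invisibles` is a made-up name, and it expects the raw bytes of a line so that an encoding problem shows up as an error instead of being hidden.

```python
def show_invisibles(raw):
    """Return one line with tabs, spaces, and line endings spelled out.

    raw is a bytes object. Decoding raises UnicodeDecodeError if the
    bytes are not valid UTF-8.
    """
    text = raw.decode("utf-8")
    marks = {"\t": "<TAB>", " ": "<SP>", "\r": "<CR>", "\n": "<LF>"}
    return "".join(marks.get(ch, ch) for ch in text)


# A correctly formatted entry, then one whose tab became spaces
# and whose line break is CR LF.
print(show_invisibles(b"NAW\tN AA\n"))
print(show_invisibles(b"NAW    N AA\r\n"))
```

Reading the lookup list with `open(path, "rb")` and passing each line through this function makes a hand-written entry easy to compare against its neighbours.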

January 22, 2015 at 2:53 pm #1024564

kunal.shah
Participant

Ok, I shall figure that out!

Thank you for your help and time.

Best,
Kunal.

January 22, 2015 at 2:57 pm #1024565

Halle Winkler
Politepix

You’re welcome, good luck!

April 20, 2016 at 7:18 pm #1030108

Powerkey
Participant

Hi Halle,

1. Can I create a phrase in the LookupList? e.g.

JOHN HENRY DOE<tab>JH AA N HH EH N R IY D OW

2. Can the phrases contain punctuation like ‘ (apostrophe) and – (dash)? e.g.

JOHN-HENRY DOE<tab>JH AA N HH EH N R IY D OW

3. I recall reading on the forum somewhere that the words do not need to be in all caps any more. Is that correct? e.g.

John Henry Doe<tab>JH AA N HH EH N R IY D OW

April 20, 2016 at 7:29 pm #1030109

Halle Winkler
Politepix

Hello,

Can I create a phrase in the LookupList?

Yes, I recommend it for words or phrases you know you’ll be generating dictionaries for, but you can’t have any spaces in the word entry before the tab. You could also add JOHN and HENRY and DOE separately and a request for the phrase “John-Henry Doe” should then find all of your added words. Take care to put your additions into the alphabetically-correct location in the file, since the English-language lookup uses the order to optimize lookups.

Can the phrases contain punctuation like ‘ (apostrophe) and – (dash)?

Yes, these are the two allowed forms of punctuation.

I recall reading on the forum somewhere that the words do not need to be in all caps any more. Is that correct?

Yes, but this regards the generation of the language model or grammar, not the LanguageModelGeneratorLookupList entries, which should just be added in whatever case the rest of the list is in (uppercase, lowercase, or mixed case). The framework will make sure to normalize requests and the entries to the same case during lookups so you don’t have to worry about case.
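
For the advice above about putting additions into the alphabetically correct location, a binary search finds the position. This sketch assumes the list is sorted by plain character order on the word before the tab, which may not match the real file's ordering in every case, so check the neighbouring entries afterwards. `insert_entry` is a hypothetical helper, not an OpenEars call.

```python
import bisect


def insert_entry(lines, entry):
    """Return a copy of lines with entry inserted at its sorted position.

    lines: existing entries, each "WORD<tab>TRANSCRIPTION", already sorted
    by word. If the word is already present, the list is returned unchanged.
    """
    keys = [line.split("\t", 1)[0] for line in lines]
    word = entry.split("\t", 1)[0]
    pos = bisect.bisect_left(keys, word)
    if pos < len(keys) and keys[pos] == word:
        return list(lines)  # word already has an entry
    return lines[:pos] + [entry] + lines[pos:]


existing = ["NAW\tN AA", "NAWROCKI\tN AA V R OW T S K IY"]
for line in insert_entry(existing, "NAWAZI\tN AH W AA Z IY"):
    print(line)
```

With the entries from earlier in this topic, NAWAZI lands between NAW and NAWROCKI.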

April 20, 2016 at 8:08 pm #1030111

Powerkey
Participant

Okay, great.

…but you can’t have any spaces in the word entry before the tab.

If no spaces are allowed before the tab, then I need to add the phrases as “John-Henry-Doe”. Are the dashes ignored when generating? Do they change inflections or timing or anything like that?

You could also add JOHN and HENRY and DOE separately and a request for the phrase “John-Henry Doe” should then find all of your added words.

I do not want to add the names separately as only specific full names are used. For instance “John Doe” and “John Henry Doe” are used but “John Henry” is not. My thinking is that the recognition will be more accurate with phrases than it will be with simpler words that are used to make the phrase. True?

Yes, but this regards the generation of the language model or grammar, not the LanguageModelGeneratorLookupList entries, which should just be added in whatever case the rest of the list is in (uppercase, lowercase, or mixed case). The framework will make sure to normalize requests and the entries to the same case during lookups so you don’t have to worry about case.

Okay, to clarify the case answer… If my Array contains “John-Henry-Doe”, and the entry in the LookupList is “john-henry-doe” will that match? And what is returned in the string of recognized text?

April 21, 2016 at 10:46 am #1030130

Halle Winkler
Politepix

Are the dashes ignored when generating?

Ignored in what respect?

Do they change inflections or timing or anything like that?

No.

I do not want to add the names separately as only specific full names are used. For instance “John Doe” and “John Henry Doe” are used but “John Henry” is not. My thinking is that the recognition will be more accurate with phrases than it will be with simpler words that are used to make the phrase. True?

True.

Okay, to clarify the case answer… If my Array contains “John-Henry-Doe”, and the entry in the LookupList is “john-henry-doe” will that match?

Yes.

And what is returned in the string of recognized text?

What is returned is what you put in your array submitted to OELanguageModelGenerator. If that is “John-Henry-Doe”, “John-Henry-Doe” gets returned.
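
The behaviour described in these answers can be pictured with a small model. This is purely illustrative and is not how OpenEars is implemented internally: `build_dictionary` is a made-up function that matches requests to lookup entries without regard to case and keeps the request's own spelling as the key.

```python
def build_dictionary(requested, lookup_lines):
    """Pair each requested string with the transcription of its
    case-insensitive match in the lookup list.

    The returned keys are the strings exactly as requested, mirroring
    the statement that what comes back is what was put in the array.
    """
    table = {}
    for line in lookup_lines:
        word, _, phones = line.partition("\t")
        table.setdefault(word.upper(), phones)
    result = {}
    for item in requested:
        phones = table.get(item.upper())
        if phones is not None:
            result[item] = phones
    return result


entries = ["john-henry-doe\tJH AA N HH EH N R IY D OW"]
print(build_dictionary(["John-Henry-Doe"], entries))
```

Here the request “John-Henry-Doe” matches the lowercase entry, and the key that comes back keeps the capitalisation of the request.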

April 21, 2016 at 6:32 pm #1030152

Powerkey
Participant

Awesome, Thanks.

Editing these entries for the last day or so, I think I can now write in full grapheme!

EY K AA N S IY TH R UW T EY M!

April 21, 2016 at 6:42 pm #1030153

Halle Winkler
Politepix

G L AE D T UW HH EH L P

April 22, 2016 at 1:39 am #1030155

Powerkey
Participant

I just noticed that “key_west” in the English LanguageModelGeneratorLookupList.txt has an ‘_’ (underscore) instead of a ‘-‘ (dash).

Not sure if that is intentional or not.