Reply To: OOV words – training ?

March 2, 2014 at 12:05 pm #1020404

Politepix

No, there is really no way to get accurate phoneme transcriptions of words from other languages using English-language tools. These tools work in English because they follow grapheme-to-phoneme (g2p) rules that can be said to apply to English (to the extent that g2p ever works well with English), so by definition they will give mixed results for other languages.

Here is how the two methods in OpenEars work in English. The first method tries to look up the word in the dictionary. If the word isn’t found, the second (fallback) method estimates its phonemes by applying a set of rules to the graphemes in it. These rules only apply to English; other languages have very different rules. Pure phoneme-based recognition doesn’t work well, so IMO there is no way to reverse-engineer a pronunciation from a speech recognition utterance that will be more accurate than the fallback method. The problem will be compounded by the fact that you will be recognizing people whose accents are very different from those of the speakers in the English acoustic model that corresponds to the phonemes you’re looking for.
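To make the lookup-then-fallback idea concrete, here is a minimal Python sketch. It is not the OpenEars implementation: the tiny dictionary, the letter-to-sound tables and the names `LOOKUP`, `DIGRAPHS`, `LETTERS`, `fallback_g2p` and `pronounce` are all made up for illustration, and real English g2p rules are far more involved.

```python
# Toy pronunciation dictionary with ARPAbet-style symbols (illustrative entries only).
LOOKUP = {
    "HELLO": ["HH", "AH", "L", "OW"],
    "WORLD": ["W", "ER", "L", "D"],
}

# Hypothetical, heavily simplified English letter-to-sound rules.
DIGRAPHS = {"SH": ["SH"], "CH": ["CH"], "TH": ["TH"], "EE": ["IY"], "OO": ["UW"]}
LETTERS = {
    "A": ["AE"], "B": ["B"], "C": ["K"], "D": ["D"], "E": ["EH"], "F": ["F"],
    "G": ["G"], "H": ["HH"], "I": ["IH"], "J": ["JH"], "K": ["K"], "L": ["L"],
    "M": ["M"], "N": ["N"], "O": ["AA"], "P": ["P"], "Q": ["K"], "R": ["R"],
    "S": ["S"], "T": ["T"], "U": ["AH"], "V": ["V"], "W": ["W"],
    "X": ["K", "S"], "Y": ["Y"], "Z": ["Z"],
}


def fallback_g2p(word):
    """Estimate phonemes from graphemes, trying two-letter rules first."""
    upper = word.upper()
    phonemes = []
    i = 0
    while i < len(upper):
        pair = upper[i:i + 2]
        if pair in DIGRAPHS:
            phonemes.extend(DIGRAPHS[pair])
            i += 2
        else:
            # Characters without a rule (hyphens, apostrophes) are skipped.
            phonemes.extend(LETTERS.get(upper[i], []))
            i += 1
    return phonemes


def pronounce(word):
    """Return (phonemes, source): dictionary lookup first, rules second."""
    entry = LOOKUP.get(word.upper())
    if entry is not None:
        return list(entry), "dictionary"
    return fallback_g2p(word), "fallback"


if __name__ == "__main__":
    for name in ["hello", "ship", "Jeroen"]:
        print(name, pronounce(name))
```

With these toy rules the Dutch name "Jeroen" falls through to the fallback and comes out as JH EH R AA EH N, an English-style guess that has little to do with how the name is actually pronounced. That is the failure mode described above.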

If this were my task I would try to break it down as follows:

1. Get a big word list of French names and a big word list of Dutch names. Both languages have much more consistent grapheme-to-phoneme rules than English, so it should be possible to get their phonemes in their native language by finding a source for those rules, for instance a g2p software library.
2. Create a map that converts the phonemes found in those two word lists into the closest approximations available in the English phoneme set used by the acoustic model.
3. Add this combined list to the lookup list used by the main method.
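
Steps 2 and 3 could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the SAMPA-like French symbols, the particular English approximations chosen, the sample g2p output, and the tab-separated "WORD, then phonemes" line format. Check the phoneme set of your acoustic model and the format your lookup list actually expects before relying on any of it.

```python
# Hypothetical map from SAMPA-like French phoneme symbols to ARPAbet-style symbols.
# The approximations are rough guesses, not a vetted linguistic mapping.
FRENCH_TO_ENGLISH = {
    "a": ["AA"], "i": ["IY"], "u": ["UW"], "e": ["EY"], "E": ["EH"], "o": ["OW"],
    "y": ["UW"],        # English has no front rounded vowel; rough stand-in
    "a~": ["AA", "N"],  # nasal vowel approximated as vowel + N
    "Z": ["ZH"], "S": ["SH"], "R": ["R"], "s": ["S"],
    "k": ["K"], "l": ["L"], "m": ["M"], "n": ["N"], "p": ["P"], "t": ["T"],
}


def to_english_phonemes(native_phonemes, mapping):
    """Convert native phoneme symbols to their closest English approximations."""
    english = []
    for symbol in native_phonemes:
        if symbol not in mapping:
            # Fail loudly so gaps in the map are noticed instead of silently dropped.
            raise ValueError(f"no English approximation for {symbol!r}")
        english.extend(mapping[symbol])
    return english


def dictionary_lines(native_entries, mapping):
    """Build 'WORD<TAB>PH PH PH' lines that can be appended to a lookup list."""
    lines = []
    for word, native in sorted(native_entries.items()):
        phones = to_english_phonemes(native, mapping)
        lines.append(word.upper() + "\t" + " ".join(phones))
    return lines


# Hypothetical output of a French g2p tool for two names.
french_names = {"Jacques": ["Z", "a", "k"], "Lucie": ["l", "y", "s", "i"]}

if __name__ == "__main__":
    for line in dictionary_lines(french_names, FRENCH_TO_ENGLISH):
        print(line)
```

A second map of the same shape would be needed for Dutch. Raising an error on an unmapped symbol is a deliberate choice here, since a silently dropped phoneme would produce a pronunciation that looks valid but isn't.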

But getting good results from the language model generator in a language other than English or Spanish is really not expected behavior, so you will definitely need some kind of strategy for obtaining more text data, including pronunciation transcriptions.