- This topic has 15 replies, 5 voices, and was last updated 10 years, 10 months ago by ader.
April 25, 2011 at 6:51 pm #3986 jeff-kelley Participant
I’m trying to implement an app that reads letters and numbers. I’ve had some inaccuracies with certain letter combinations (“AB” is consistently heard as “8Y,” for instance) and am wondering whether there are any configuration options that might help. I’ve had some success replacing letters with spelled-out equivalents: ‘b’ with “bee,” etc. I have had much more success using the NATO alphabet (alpha, bravo, etc.), but we can’t expect our users to be able to use it. So… what’s the best way to approach single letters? Thanks in advance.
April 25, 2011 at 9:09 pm #3987 Halle Winkler Politepix
Does the app read letters and numbers or recognize them in the user’s speech?
April 26, 2011 at 2:09 pm #3988 jeff-kelley Participant
The goal is for the user to read letters and numbers to be recognized by the app.
April 26, 2011 at 3:27 pm #3989 Halle Winkler Politepix
Would it be possible for you to show me your language model?
April 26, 2011 at 3:50 pm #3990 jeff-kelley Participant
Sure. We used the lmtool available on CMU’s website with this corpus:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z zero one two three four five six seven eight nine ten
We got this language model:
Language model created by QuickLM on Tue Apr 26 11:48:40 EDT 2011
Copyright (c) 1996-2010 Carnegie Mellon University and Alexander I. Rudnicky
The model is in standard ARPA format, designed by Doug Paul while he was at MITRE.
The code that was used to produce this language model is available in Open Source.
Please visit http://www.speech.cs.cmu.edu/tools/ for more information
The (fixed) discount mass is 0.5. The backoffs are computed using the ratio method.
This model based on a corpus of 37 sentences and 39 words
\data\
ngram 1=39
ngram 2=74
ngram 3=37
\1-grams:
-0.7782 </s> -0.3010
-0.7782 <s> -0.2218
-2.3464 A -0.2218
-2.3464 B -0.2218
-2.3464 C -0.2218
-2.3464 D -0.2218
-2.3464 E -0.2218
-2.3464 EIGHT -0.2218
-2.3464 F -0.2218
-2.3464 FIVE -0.2218
-2.3464 FOUR -0.2218
-2.3464 G -0.2218
-2.3464 H -0.2218
-2.3464 I -0.2218
-2.3464 J -0.2218
-2.3464 K -0.2218
-2.3464 L -0.2218
-2.3464 M -0.2218
-2.3464 N -0.2218
-2.3464 NINE -0.2218
-2.3464 O -0.2218
-2.3464 ONE -0.2218
-2.3464 P -0.2218
-2.3464 Q -0.2218
-2.3464 R -0.2218
-2.3464 S -0.2218
-2.3464 SEVEN -0.2218
-2.3464 SIX -0.2218
-2.3464 T -0.2218
-2.3464 TEN -0.2218
-2.3464 THREE -0.2218
-2.3464 TWO -0.2218
-2.3464 U -0.2218
-2.3464 V -0.2218
-2.3464 W -0.2218
-2.3464 X -0.2218
-2.3464 Y -0.2218
-2.3464 Z -0.2218
-2.3464 ZERO -0.2218
\2-grams:
-1.8692 <s> A 0.0000
-1.8692 <s> B 0.0000
-1.8692 <s> C 0.0000
-1.8692 <s> D 0.0000
-1.8692 <s> E 0.0000
-1.8692 <s> EIGHT 0.0000
-1.8692 <s> F 0.0000
-1.8692 <s> FIVE 0.0000
-1.8692 <s> FOUR 0.0000
-1.8692 <s> G 0.0000
-1.8692 <s> H 0.0000
-1.8692 <s> I 0.0000
-1.8692 <s> J 0.0000
-1.8692 <s> K 0.0000
-1.8692 <s> L 0.0000
-1.8692 <s> M 0.0000
-1.8692 <s> N 0.0000
-1.8692 <s> NINE 0.0000
-1.8692 <s> O 0.0000
-1.8692 <s> ONE 0.0000
-1.8692 <s> P 0.0000
-1.8692 <s> Q 0.0000
-1.8692 <s> R 0.0000
-1.8692 <s> S 0.0000
-1.8692 <s> SEVEN 0.0000
-1.8692 <s> SIX 0.0000
-1.8692 <s> T 0.0000
-1.8692 <s> TEN 0.0000
-1.8692 <s> THREE 0.0000
-1.8692 <s> TWO 0.0000
-1.8692 <s> U 0.0000
-1.8692 <s> V 0.0000
-1.8692 <s> W 0.0000
-1.8692 <s> X 0.0000
-1.8692 <s> Y 0.0000
-1.8692 <s> Z 0.0000
-1.8692 <s> ZERO 0.0000
-0.3010 A </s> -0.3010
-0.3010 B </s> -0.3010
-0.3010 C </s> -0.3010
-0.3010 D </s> -0.3010
-0.3010 E </s> -0.3010
-0.3010 EIGHT </s> -0.3010
-0.3010 F </s> -0.3010
-0.3010 FIVE </s> -0.3010
-0.3010 FOUR </s> -0.3010
-0.3010 G </s> -0.3010
-0.3010 H </s> -0.3010
-0.3010 I </s> -0.3010
-0.3010 J </s> -0.3010
-0.3010 K </s> -0.3010
-0.3010 L </s> -0.3010
-0.3010 M </s> -0.3010
-0.3010 N </s> -0.3010
-0.3010 NINE </s> -0.3010
-0.3010 O </s> -0.3010
-0.3010 ONE </s> -0.3010
-0.3010 P </s> -0.3010
-0.3010 Q </s> -0.3010
-0.3010 R </s> -0.3010
-0.3010 S </s> -0.3010
-0.3010 SEVEN </s> -0.3010
-0.3010 SIX </s> -0.3010
-0.3010 T </s> -0.3010
-0.3010 TEN </s> -0.3010
-0.3010 THREE </s> -0.3010
-0.3010 TWO </s> -0.3010
-0.3010 U </s> -0.3010
-0.3010 V </s> -0.3010
-0.3010 W </s> -0.3010
-0.3010 X </s> -0.3010
-0.3010 Y </s> -0.3010
-0.3010 Z </s> -0.3010
-0.3010 ZERO </s> -0.3010
\3-grams:
-0.3010 <s> A </s>
-0.3010 <s> B </s>
-0.3010 <s> C </s>
-0.3010 <s> D </s>
-0.3010 <s> E </s>
-0.3010 <s> EIGHT </s>
-0.3010 <s> F </s>
-0.3010 <s> FIVE </s>
-0.3010 <s> FOUR </s>
-0.3010 <s> G </s>
-0.3010 <s> H </s>
-0.3010 <s> I </s>
-0.3010 <s> J </s>
-0.3010 <s> K </s>
-0.3010 <s> L </s>
-0.3010 <s> M </s>
-0.3010 <s> N </s>
-0.3010 <s> NINE </s>
-0.3010 <s> O </s>
-0.3010 <s> ONE </s>
-0.3010 <s> P </s>
-0.3010 <s> Q </s>
-0.3010 <s> R </s>
-0.3010 <s> S </s>
-0.3010 <s> SEVEN </s>
-0.3010 <s> SIX </s>
-0.3010 <s> T </s>
-0.3010 <s> TEN </s>
-0.3010 <s> THREE </s>
-0.3010 <s> TWO </s>
-0.3010 <s> U </s>
-0.3010 <s> V </s>
-0.3010 <s> W </s>
-0.3010 <s> X </s>
-0.3010 <s> Y </s>
-0.3010 <s> Z </s>
-0.3010 <s> ZERO </s>
\end\
The trouble is that it’s just not accurate enough at distinguishing letters. I’m very new to OpenEars/PocketSphinx, so I really don’t know how to approach improving accuracy.
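The log-probabilities in the model above can be checked by hand, which may help when reading it. The following is a minimal sketch, not a description of how QuickLM is implemented: it assumes each of the 37 corpus entries is treated as its own sentence "<s> X </s>" (so 111 tokens in total) and that the stated discount mass of 0.5 simply scales each relative frequency. The function name `log10_prob` is made up for the illustration.

```python
import math

# Assumptions (not stated in the thread): every corpus entry is a
# one-word sentence "<s> X </s>", giving 37 * 3 = 111 tokens, and the
# fixed discount mass of 0.5 multiplies each relative frequency.
SENTENCES = 37
TOKENS = SENTENCES * 3
DISCOUNT = 0.5

def log10_prob(count, total):
    """Discounted relative frequency as a base-10 log, rounded to 4 places."""
    return round(math.log10(DISCOUNT * count / total), 4)

unigram_letter = log10_prob(1, TOKENS)          # e.g. "A"
unigram_start = log10_prob(SENTENCES, TOKENS)   # "<s>"
bigram_start_letter = log10_prob(1, SENTENCES)  # e.g. "<s> A"
bigram_letter_end = log10_prob(1, 1)            # e.g. "A </s>"

print(unigram_letter, unigram_start, bigram_start_letter, bigram_letter_end)
```

Under these assumptions the four values agree with the -2.3464, -0.7782, -1.8692 and -0.3010 entries in the model. Since every letter and digit gets the same probability, the language model itself appears to express no preference between, say, A and K, so the choice between them would rest on the acoustic match alone.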
April 26, 2011 at 4:16 pm #3991 Halle Winkler Politepix
Can I also see the dictionary? I’m surprised to hear that it is recognizing EIGHT Y for A B; the EIGHT isn’t surprising but the Y is. Is EIGHT Y an accurate transcription of what Pocketsphinx heard? What is the hypothesis (verbatim)?
April 26, 2011 at 5:33 pm #3992 jeff-kelley Participant
Sure, here’s the dictionary:
A AH
A(2) EY
B B IY
C S IY
D D IY
E IY
EIGHT EY T
F EH F
FIVE F AY V
FOUR F AO R
G JH IY
H EY CH
I AY
J JH EY
K K EY
L EH L
M EH M
N EH N
NINE N AY N
O OW
ONE W AH N
ONE(2) HH W AH N
P P IY
Q K Y UW
R AA R
S EH S
SEVEN S EH V AH N
SIX S IH K S
T T IY
TEN T EH N
THREE TH R IY
TWO T UW
U Y UW
V V IY
W D AH B AH L Y UW
X EH K S
Y W AY
Z Z IY
ZERO Z IH R OW
ZERO(2) Z IY R OW
EIGHT Y is accurate; I don’t have the transcription from before, though. I’ll try to get it to go again and post back here.
April 26, 2011 at 5:37 pm #3993 jeff-kelley Participant
With the dictionary/language model here, it’s giving me KB more frequently than AB (I was speaking “A B” each time):
2011-04-26 13:35:23.315 OpenEarsSampleProject[2328:707] Pocketsphinx calibration has started.
2011-04-26 13:35:23.368 OpenEarsSampleProject[2328:707] Pocketsphinx calibration is complete.
2011-04-26 13:35:23.374 OpenEarsSampleProject[2328:707] Pocketsphinx has stopped listening.
2011-04-26 13:35:23.382 OpenEarsSampleProject[2328:707] Pocketsphinx is starting up.
2011-04-26 13:35:23.835 OpenEarsSampleProject[2328:707] Pocketsphinx calibration has started.
2011-04-26 13:35:27.405 OpenEarsSampleProject[2328:707] Pocketsphinx calibration is complete.
2011-04-26 13:35:27.418 OpenEarsSampleProject[2328:707] Pocketsphinx is now listening.
2011-04-26 13:35:30.269 OpenEarsSampleProject[2328:707] Pocketsphinx has detected speech.
2011-04-26 13:35:32.267 OpenEarsSampleProject[2328:707] The received hypothesis is U THREE with a score of -495 and an ID of 000000000
2011-04-26 13:35:32.328 OpenEarsSampleProject[2328:707] Pocketsphinx is now listening.
2011-04-26 13:35:38.896 OpenEarsSampleProject[2328:707] Pocketsphinx has detected speech.
2011-04-26 13:35:40.254 OpenEarsSampleProject[2328:707] The received hypothesis is A B with a score of -13365 and an ID of 000000001
2011-04-26 13:35:40.344 OpenEarsSampleProject[2328:707] Pocketsphinx is now listening.
2011-04-26 13:35:42.953 OpenEarsSampleProject[2328:707] Pocketsphinx has detected speech.
2011-04-26 13:35:45.211 OpenEarsSampleProject[2328:707] The received hypothesis is J V with a score of -16556 and an ID of 000000002
2011-04-26 13:35:45.284 OpenEarsSampleProject[2328:707] Pocketsphinx is now listening.
2011-04-26 13:35:47.264 OpenEarsSampleProject[2328:707] Pocketsphinx has detected speech.
2011-04-26 13:35:49.059 OpenEarsSampleProject[2328:707] The received hypothesis is K P with a score of -15090 and an ID of 000000003
2011-04-26 13:35:49.115 OpenEarsSampleProject[2328:707] Pocketsphinx is now listening.
2011-04-26 13:35:52.527 OpenEarsSampleProject[2328:707] Pocketsphinx has detected speech.
2011-04-26 13:35:54.067 OpenEarsSampleProject[2328:707] The received hypothesis is K B with a score of -14514 and an ID of 000000004
2011-04-26 13:35:54.136 OpenEarsSampleProject[2328:707] Pocketsphinx is now listening.
2011-04-26 13:35:55.405 OpenEarsSampleProject[2328:707] Pocketsphinx has detected speech.
2011-04-26 13:35:57.607 OpenEarsSampleProject[2328:707] The received hypothesis is K V with a score of -25266 and an ID of 000000005
2011-04-26 13:35:57.661 OpenEarsSampleProject[2328:707] Pocketsphinx is now listening.
2011-04-26 13:35:59.633 OpenEarsSampleProject[2328:707] Pocketsphinx has detected speech.
2011-04-26 13:36:01.581 OpenEarsSampleProject[2328:707] The received hypothesis is A B with a score of -12390 and an ID of 000000006
2011-04-26 13:36:01.654 OpenEarsSampleProject[2328:707] Pocketsphinx is now listening.
2011-04-26 13:36:03.892 OpenEarsSampleProject[2328:707] Pocketsphinx has detected speech.
2011-04-26 13:36:05.588 OpenEarsSampleProject[2328:707] The received hypothesis is K B with a score of -6112 and an ID of 000000007
2011-04-26 13:36:05.978 OpenEarsSampleProject[2328:707] Pocketsphinx is now listening.
2011-04-26 13:36:08.074 OpenEarsSampleProject[2328:707] Pocketsphinx has detected speech.
2011-04-26 13:36:09.750 OpenEarsSampleProject[2328:707] The received hypothesis is K B O with a score of -37412 and an ID of 000000008
2011-04-26 13:36:09.828 OpenEarsSampleProject[2328:707] Pocketsphinx is now listening.
2011-04-26 13:36:12.141 OpenEarsSampleProject[2328:707] Pocketsphinx has detected speech.
2011-04-26 13:36:13.764 OpenEarsSampleProject[2328:707] The received hypothesis is A B with a score of -16145 and an ID of 000000009
2011-04-26 13:36:13.821 OpenEarsSampleProject[2328:707] Pocketsphinx is now listening.
2011-04-26 13:36:16.829 OpenEarsSampleProject[2328:707] Pocketsphinx has detected speech.
2011-04-26 13:36:18.718 OpenEarsSampleProject[2328:707] The received hypothesis is K B with a score of -18317 and an ID of 000000010
2011-04-26 13:36:18.790 OpenEarsSampleProject[2328:707] Pocketsphinx is now listening.
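When comparing runs like the one above, it can help to tally the hypotheses rather than read the log by eye. The following is a small sketch that assumes the log keeps the exact wording "The received hypothesis is ... with a score of ..." seen in this thread; the function name `tally_hypotheses` is made up, and the sample text is three lines excerpted from the log with the timestamps elided.

```python
import re
from collections import Counter

# Matches the wording used in the log lines quoted in this thread.
HYPOTHESIS = re.compile(r"The received hypothesis is (.+?) with a score of (-?\d+)")

def tally_hypotheses(log_text):
    """Count how often each hypothesis string appears in the log text."""
    return Counter(match.group(1) for match in HYPOTHESIS.finditer(log_text))

# Three hypothesis lines excerpted from the log above (timestamps elided).
sample = (
    "... The received hypothesis is A B with a score of -13365 and an ID of 000000001 "
    "... The received hypothesis is K B with a score of -14514 and an ID of 000000004 "
    "... The received hypothesis is K B with a score of -6112 and an ID of 000000007"
)

counts = tally_hypotheses(sample)
for hypothesis, count in counts.most_common():
    print(hypothesis, count)
```

Because the pattern searches the whole text, it works whether the log arrives as one line per entry or as a single flattened line.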
April 27, 2011 at 12:48 pm #3994 Halle Winkler Politepix
OK, I think the first easy step is to get rid of the pronunciations in the dictionary that you definitely don’t want to recognize. I realize this isn’t at all self-evident so I’ll explain briefly. If you look at this from the dictionary:
A AH
A(2) EY
That means that the language model tool gave you back two possible pronunciations for the word A. The first one is the North American pronunciation of the article “a” as in “a dog barked”, which rhymes with “huh”. Since the alphabet character is never pronounced that way, you never want to recognize that pronunciation of “a”, and you should erase it from your dictionary.
The (2) in parentheses just means that it is the second pronunciation of the word, so the way you would want to replace
A AH
A(2) EY
is with the line
A EY
deleting the first pronunciation, and removing the (2) from the second pronunciation since it is now the only pronunciation you are going to accept.
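The pruning step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the dictionary is taken to be plain text with one "WORD PHONES" entry per line and alternates marked as "(2)", "(3)", and so on, as in the dictionary posted earlier; `prune_pronunciations` and its `unwanted` argument are hypothetical names, and which pronunciations to drop is still a manual decision.

```python
import re

def prune_pronunciations(lines, unwanted):
    """Drop unwanted (word, phones) pairs and renumber the remaining variants.

    `lines` are dictionary entries such as "A(2) EY"; `unwanted` is a set of
    (word, phones) tuples chosen by hand.
    """
    kept = {}  # word -> list of phone strings, in original order
    for line in lines:
        if not line.strip():
            continue
        head, phones = line.split(None, 1)
        word = re.sub(r"\(\d+\)$", "", head)  # strip a trailing "(2)", "(3)", ...
        phones = " ".join(phones.split())      # normalize whitespace
        if (word, phones) in unwanted:
            continue
        kept.setdefault(word, []).append(phones)

    result = []
    for word, variants in kept.items():
        for index, phones in enumerate(variants, start=1):
            # The first surviving variant loses its number; later ones are renumbered.
            label = word if index == 1 else f"{word}({index})"
            result.append(f"{label} {phones}")
    return result

dictionary = ["A AH", "A(2) EY", "B B IY"]
pruned = prune_pronunciations(dictionary, {("A", "AH")})
print("\n".join(pruned))
```

For the three-line example, the AH entry is dropped and the EY entry is promoted to the plain "A" form, matching the manual edit described above.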
The next thing that you can do is to make the sentence “A B” part of your corpus. The corpus can have individual words, but it can also contain combinations of words. Combinations of words that you have made part of your corpus will have an automatically higher probability of being detected.
So, the corpus would say something like this:
A
B
A B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
ONE
TWO
THREE
FOUR
FIVE
SIX
SEVEN
EIGHT
NINE
ZERO
You can do this for all of the possible combinations if you want to, or just the ones where you want to raise their probability of being detected. When you look at the language model that is output, you will see that there is a 2-gram entry for A B and that it has a raised probability.
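A corpus like the one above can be generated rather than typed by hand. The sketch below is an illustration only: `build_corpus` and `boosted_pairs` are made-up names, the symbol list follows the corpus shown above, and the output would still need to be run through the language model tool to see the effect on the 2-gram entries.

```python
from itertools import product

LETTERS = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
DIGITS = ["ZERO", "ONE", "TWO", "THREE", "FOUR",
          "FIVE", "SIX", "SEVEN", "EIGHT", "NINE"]

def build_corpus(symbols, boosted_pairs=None):
    """Return corpus lines: one per symbol, then one per two-word combination.

    If `boosted_pairs` is None, every ordered pair of symbols is included.
    """
    lines = list(symbols)
    pairs = product(symbols, repeat=2) if boosted_pairs is None else boosted_pairs
    lines.extend(f"{first} {second}" for first, second in pairs)
    return lines

# Only the combination that is being misrecognized.
targeted = build_corpus(LETTERS + DIGITS, boosted_pairs=[("A", "B")])

# Every ordered pair of the 36 symbols.
exhaustive = build_corpus(LETTERS + DIGITS)

print(len(targeted), len(exhaustive))
```

One caveat on the design choice: adding every pair presumably raises all of them equally, so it would not favor any one combination over another. Boosting seems most useful when some combinations are genuinely more likely than others.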
April 27, 2011 at 2:29 pm #3995 jeff-kelley Participant
Interesting. I’ll try pruning the dictionary of ambiguous pronunciations where possible. I’d like to do more with combinations like “A B”, but the characters being read to this application are random, so there won’t really be a pattern to which ones get combined more.
Thanks for this help; I’ll report back if it’s more successful.
April 27, 2011 at 2:34 pm #3996 jeff-kelley Participant
I’m still getting a lot of Ks where I should be getting As, but I think I’m going down the right path. Thanks again.
June 1, 2011 at 8:27 pm #3997 Halle Winkler Politepix
Just as an update on this, I’ve been gradually learning that individual letters/syllables are a challenging case and expectations for accuracy should probably be lower than for whole word or phrase recognition.
August 22, 2011 at 9:39 pm #7496 Joseph S. Wisniewski Participant
Letter recognition can only be done in conjunction with a spelling application. In other words, if you have a list of street names, spelling
W O O D W A R D
will work, if you use an n-best list and search through the dictionary for the results. As long as your task can be constrained by a dictionary, even a huge dictionary, you’re OK. You’ll have to patch OpenEars for N-best output, though, and build an FSG or LM. The LM will work better if you build it from your dictionary.
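The dictionary search described above can be sketched as follows. This is a minimal illustration, not OpenEars code: it assumes the n-best hypotheses have already been obtained somehow (which, as noted, requires patching) as strings ordered best-first, and the function name, the street list, and the sample hypotheses are all made up.

```python
def first_dictionary_match(nbest, vocabulary):
    """Return the first n-best hypothesis whose letters spell a known word.

    `nbest` is a list of hypothesis strings ordered best-first, e.g. "W O O D";
    `vocabulary` is a set of upper-case words. Returns None if nothing matches.
    """
    for hypothesis in nbest:
        candidate = "".join(hypothesis.split())  # "W O O D" -> "WOOD"
        if candidate in vocabulary:
            return candidate
    return None

# Hypothetical street list and made-up recognizer output.
streets = {"WOODWARD", "WOODLAND", "WARREN"}
nbest = ["W O O D W A R T", "W O O B W A R D", "W O O D W A R D"]

match = first_dictionary_match(nbest, streets)
print(match)
```

The point of the design is that the top hypothesis may contain a confusable letter, yet a lower-ranked hypothesis can still be accepted because it is the first one that spells an entry in the list.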
If your letter sequences truly are random, you’re dealing with something that’s beyond the state of the art. It’s beyond the state of the art for human listeners, too. Give it a try, read some random letter sequences to people and see how many they get wrong.
Is this something you’re still working on?
June 20, 2012 at 5:05 pm #9916 alexl Participant
I am working on a similar project and having the same issues. As a matter of fact, recognition results are very poor. The biggest problems I am having are with the letters E, D, P, and A. Below is the dictionary file that I compiled. In many cases it recognizes E as P, D, or C. The letter A is recognized as 8 or H.
What techniques would you recommend for improving accuracy? Wouldn’t listing more pronunciations for the same letter help?
0 Z IY R OW
1 W AH N
2 T UW
3 TH R IY
4 F OW R
5 F AY V
6 S IH K S
7 S EH V AH N
8 EY T
9 N AY N
A AH
A(2) EY
B B IY
C S IY
D D IY
E IY
E(2) EH
E(3) IH
F EH F
G JH IY
H EY CH
H(2) EY JH
I AY
J JH EY
K K EY
L EH L
M EH M
M(2) AE M
N EH N
N(2) AE N
O OW
P P IY
Q K Y UW
R AA R
S EH S
S(2) AE S
T T IY
U Y UW
V V IY
W D AH B AH L Y UW
X EH K S
X(2) AE K S
Y W AY
Z Z IY
June 22, 2012 at 12:35 pm #9960 Halle Winkler Politepix
This is not a good application of the library, unfortunately.
July 19, 2012 at 10:16 am #10584 ader Participant
“As long as your task can be constrained by a dictionary”
Joseph, is there a way we can do this? e.g. only recognise “sentences” e.g. “w o o d w a r d” and not individual words e.g. “w”
I’m attempting to create such a spelling app