The issue here is a general one with speech recognition. If your model contains the words “twenty” and “one”, then from the engine’s perspective there is no difference in phonemes between the user utterances “twenty-one” and “twenty, one”. (We can hear some elision in the compound version, but a general-purpose acoustic model probably doesn’t account for it much, and the phonetic dictionary isn’t going to write it out any differently.) So it’s a toss-up which version gets recognized, and it sounds like there is a bias toward the non-compound version.
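
To make that concrete, here is a tiny sketch. The pronunciations are illustrative ARPAbet-style entries that I am assuming for the example, not something copied from your dictionary, and `phoneme_string` is just a made-up helper. It shows that a compound entry and the two separate words expand to the same phoneme sequence, so the acoustics alone can't tell them apart:

```python
# Illustrative ARPAbet-style pronunciations (assumed, not taken from a real dictionary file).
PRONUNCIATIONS = {
    "twenty": "T W EH N T IY",
    "one": "W AH N",
    "twenty-one": "T W EH N T IY W AH N",
}


def phoneme_string(words):
    """Join the dictionary pronunciations of a word sequence into one phoneme string."""
    return " ".join(PRONUNCIATIONS[word] for word in words)


compound = phoneme_string(["twenty-one"])
separate = phoneme_string(["twenty", "one"])

print(compound)
print(separate)
print(compound == separate)  # True
```

With identical phoneme strings, the choice between the two hypotheses is left to the language model.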

Something that could help is manually increasing the probability of the compound number in your language model. If you post the contents of your .arpa file here, I can give some hints about what kind of alterations to make. This is kind of hacky IMO, but I’ve had good results with this approach on similar problems in the past.
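
Until I see your file, here is a rough sketch of the kind of edit I mean. ARPA files store a log10 probability at the start of each n-gram line, so "increasing the probability" means moving that number toward 0. Everything specific here is an assumption for illustration: the `boost_ngram` helper, the sample model, and the +1.0 boost are made up, and your compound may be stored as a bigram ("twenty one") rather than a single token.

```python
def boost_ngram(arpa_text, ngram, delta):
    """Return arpa_text with the log10 probability of `ngram` raised by `delta`.

    `ngram` is a space-separated word sequence, e.g. "twenty one".
    The result is capped at 0.0, since a log10 probability cannot be positive.
    """
    target = ngram.split()
    n = len(target)
    section = "\\%d-grams:" % n  # e.g. "\2-grams:"
    in_section = False
    out = []
    for line in arpa_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("\\"):
            # Section markers such as \data\, \1-grams:, \end\
            in_section = stripped == section
        elif in_section and stripped:
            fields = stripped.split()
            if fields[1:1 + n] == target:
                logp = min(0.0, float(fields[0]) + delta)
                parts = ["%.4f" % logp, " ".join(target)]
                parts.extend(fields[1 + n:])  # keep the backoff weight, if any
                line = "\t".join(parts)
        out.append(line)
    return "\n".join(out)


# Hypothetical toy model, only to show the edit; real files are much larger.
SAMPLE = r"""\data\
ngram 1=3
ngram 2=1

\1-grams:
-0.7000 twenty -0.3000
-0.7000 one -0.3000
-1.5000 twenty-one

\2-grams:
-0.9000 twenty one

\end\
"""

boosted = boost_ngram(SAMPLE, "twenty-one", 1.0)
print(boosted)
```

Note that this leaves the model unnormalized (the probabilities no longer sum to 1), which is part of why I call it hacky. In my experience decoders tend to tolerate that, but treat the size of the boost as something to tune by trial.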