You can’t use synthesized speech for testing recognition; it has to be a real speaker.
I’m confused about the idea of single-word phrases: are they single words, or phrases?
Unfortunately, you are always going to see reduced accuracy when the speaker has an accent, unfair as it is. What is the accuracy rate you are seeing?
Take a look at the .dic file that is output and check the entries for any words that aren’t found in the CMU dictionary, because if the fallback method gets a pronunciation wrong, that word won’t be recognized correctly.
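If it helps, here is a rough Python sketch of one way to find those words: it lists every entry in the generated .dic that has no match in a copy of the CMU dictionary, so you can read through the fallback pronunciations by hand. The function names are made up for illustration, and I am assuming both files use the usual one-entry-per-line layout (word, whitespace, phonemes), with `;;;` comment lines and `WORD(2)` alternates in the CMU file.

```python
import re


def parse_dic(text):
    """Map each word to the list of phoneme strings given for it."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and CMU-style comment lines (assumed format).
        if not line or line.startswith(";;;"):
            continue
        parts = line.split()
        # Fold alternate pronunciations such as WORD(2) into WORD.
        word = re.sub(r"\(\d+\)$", "", parts[0]).upper()
        entries.setdefault(word, []).append(" ".join(parts[1:]))
    return entries


def fallback_words(generated, reference):
    """Return the generated entries whose word is missing from the reference."""
    return {w: p for w, p in generated.items() if w not in reference}


# Tiny made-up samples; in practice, read the real files from disk.
generated_text = "HELLO HH AH L OW\nZYXWORD Z IH K S W ER D\n"
cmu_text = ";;; sample comment\nHELLO  HH AH0 L OW1\nHELLO(2)  HH EH0 L OW1\n"

missing = fallback_words(parse_dic(generated_text), parse_dic(cmu_text))
for word, pronunciations in missing.items():
    print(word + ": " + " | ".join(pronunciations))
```

Anything this prints got its pronunciation from the fallback method, so those are the entries worth sounding out and comparing against how the word is really said.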