There are a number of issues with attempting this. The first is that in Pocketsphinx you can’t get scoring for each word (although this is an option with RapidEars). But the bigger issue is that the impact of pronunciation on scoring is much smaller than the impact of other factors. One big factor is environmental: background noise, which mic is used, mic distance, mic differences across devices, and room characteristics. Another is whether the acoustic characteristics of the speaker are well represented in the acoustic model. For instance, I’ve received reports that very high female voices aren’t recognized as accurately as other male and female voices, and this will show up in scores to a subtler extent with other underrepresented vocal characteristics. Another factor is that scoring accuracy is heavily affected by the size of the vocabulary. The problem is trying to separate these large factors from the relatively small factor of pronunciation quality.
When developers ask about using the scores, I always have the same advice: they can only be used to compare utterances of the same statement within a single session. That means you can gather scores for a number of utterances of the same statement from a single speaker, in a single environment, in a single app session, and say that some scores are better than others, and give that feedback to the user if you want. But you can never compare scores across multiple speakers, multiple devices, and multiple usage sessions in order to derive meaningful information, such as saying that a word utterance with a score of -2000 should always be understood to be a correct pronunciation.
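To make the within-session-only rule concrete, here is a minimal sketch of how that relative comparison might be structured. It assumes you already obtain a raw hypothesis score per utterance from the recognizer (for example, the hypothesis score Pocketsphinx reports); the `SessionScoreTracker` class and its method names are hypothetical illustrations, not part of any Pocketsphinx or OpenEars API.

```python
class SessionScoreTracker:
    """Collects scores for repeated utterances of one phrase from one
    speaker, on one device, in one environment, in one app session.
    Scores are only ever compared to each other, never to an absolute
    threshold, for the reasons described above."""

    def __init__(self, phrase):
        self.phrase = phrase
        self.scores = []  # earlier scores from this session only

    def add_score(self, score):
        self.scores.append(score)

    def feedback(self, score):
        """Return relative feedback for a new score, judged only
        against earlier scores from this same session."""
        if not self.scores:
            return "first attempt -- no baseline yet"
        # Recognizer scores here are negative values; a less negative
        # score is taken as the better one within this session.
        best = max(self.scores)
        if score >= best:
            return "best attempt so far"
        return "earlier attempts scored better"
```

The key design point is that the tracker is scoped to a single session and phrase, so its feedback is always relative ("better or worse than your other attempts just now"), never an absolute verdict like "a score of -2000 means correct pronunciation".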