The score is a negative number; the farther it is from zero, the lower the probability of the hypothesis. It has very limited application in an app — you can use it to compare against another score within the same session, speaker, and environment only, but scores should never be compared with each other, or with a constant threshold, across multiple sessions or speakers. The reason for this is that the score is very heavily influenced by the speaker, their accent and other speech characteristics, the mic used, the distance from the mic, and the background noise. So if you try to pick an arbitrary number that means “accurate score” for everyone, you will end up excluding all recognition for speakers who don’t match the profile of your test speaker and their test environment. It’s better to ignore it in nearly all cases.
I regret having included the scores in the callbacks since the first versions of OpenEars, because now they would be hard to remove, but they don’t bring much to the table. They are often treated as a way to refine accuracy confidence across multiple users, which they aren’t particularly good for, since the engine itself has already used the scoring for that purpose to the extent it is useful. You can use them to see whether accuracy for a particular utterance has increased or decreased within the same session.
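To illustrate the one legitimate use described above, here is a minimal sketch of within-session score comparison. It assumes only what the text states: scores are negative, a score closer to zero indicates a relatively more probable hypothesis, and comparisons are meaningful only between utterances from the same session, speaker, mic, and environment. The function name and values are hypothetical, not part of the OpenEars API.

```python
def better_hypothesis(score_a, score_b):
    """Return True if score_a indicates a relatively more probable
    hypothesis than score_b. Only meaningful when both scores come
    from the same session, speaker, mic, and environment."""
    # Scores are negative; less negative (closer to zero) is better.
    return score_a > score_b

# Two recognitions of the same phrase within one session:
print(better_hypothesis(-1200, -3400))  # True: -1200 is closer to zero
print(better_hypothesis(-5000, -900))   # False: -5000 is farther from zero
```

The same numbers would tell you nothing if they came from two different speakers or recording environments, which is why a fixed “good score” cutoff doesn’t work.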
The task you are doing is sort of on the line between keyword spotting and command-and-control, which means that out-of-vocabulary recognition (the engine hearing a word that isn’t in the vocabulary and matching it to a word that is in the vocabulary) is a big issue, as you’ve discovered. This is the use case for which Rejecto was developed.