Reply To: Change in scoring with 2.04

Halle Winkler


That’s correct, it is mentioned in the changelog:

The change came about as a side effect of a very positive development. For a couple of years, OpenEars developers have been reporting variations of an issue in which certain searches go on endlessly, ultimately freezing listening and potentially taking up too much memory. The cause was often quiet noises that should have been treated as fillers (random noises) or as silences, but which the engine attempted to evaluate as speech in its third pass, getting into an enormous search since these little sounds could be almost anything. The Sphinx project was kind enough to investigate this OpenEars-originating issue as a bug and they fixed it this year, which has a bunch of positive results for us: no more endless searches, a bit more accuracy across the board, better resource use, and, for Rejecto users, Rejecto can now use the third pass without issue, so OpenEars with Rejecto integrated is notably more accurate.

However, it changed scoring because the fix relates to how fillers and silences are scored by the engine internally. I noticed the scoring change when the Sphinx version with the fix was first released and the two projects have been discussing it since, but a) it may not be a bug but rather the original scoring being inaccurately high due to wrong estimation of fillers and silences, and b) it is (legitimately) not necessarily a very high priority for them, since to the best of my knowledge scoring has never been recommended for API confidence estimation by either project. More on that in a moment.

My original position on this was to wait to release these changes until scoring was similar to the old version, but recently a very bad and intractable bug affecting many developers turned out to again be related to third-pass searches that were overrating fillers, so I decided to bite the bullet and release 2.04 with all of these improvements but different scoring. I sincerely apologize for the unexpected changes. Just to give you a heads-up, the scoring could also still change again, either back to the old scale or to yet a different one, if Sphinx does more work on this area or if I get feedback from them that a couple of different ways I’ve investigated to restore the previous scoring scale are harmless.

The reason I’m making a rare exception to the need for API consistency in the case of scoring is that I’ve been saying for years that developers shouldn’t compare scores against any kind of arbitrary numbers. It is unfortunately not possible to use a fixed range to decide high-confidence scoring with Sphinx, because speaker, accent, gender, environment, speaking speed, length of input, and device will result in enormous variances across multiple app sessions, much greater than your mentioned range of -800 to 0. As an example, an accurate score for an utterance from me in a quiet room used to be about -10000, so a -800 to 0 range would never have marked any of my utterances as confident using OpenEars, although as you can guess, OpenEars does a good job of recognizing my speech accurately.
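To make that example concrete, here is a minimal, language-neutral sketch in Python. It is not OpenEars API: the function name and the `FIXED_RANGE` constant are hypothetical, and the only numbers used are the ones already mentioned in this thread.

```python
# Hypothetical hard-coded window taken from the thread; not an OpenEars constant.
FIXED_RANGE = (-800, 0)


def passes_fixed_range(score, low=FIXED_RANGE[0], high=FIXED_RANGE[1]):
    """Return True when a score falls inside the hard-coded window."""
    return low <= score <= high


# Approximate score from the post for a correctly recognized
# utterance spoken in a quiet room.
quiet_room_score = -10000

# The recognition was accurate, but the fixed window still rejects it.
print(passes_fixed_range(quiet_room_score))
```

The check itself works as written; the problem is that the window encodes one speaker, device, and environment, so a perfectly good recognition from a different one lands far outside it.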

Scoring has only ever been advised for use within a single session, to compare that session’s scores relative to each other. If you want to continue with scoring against arbitrary values, the best way to do it is to take some fixed audio input that previously arrived in your -800 to 0 range and check how it is scored now. It isn’t a simple linear or logarithmic scale change: I’ve spent a few days investigating it and there is no single transformation that gets you from the old score scale to the new one, so that would have to be evaluated on a case-by-case basis.
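As a rough illustration of comparing scores only within a session, here is a language-neutral sketch in Python. Everything in it is an assumption made for illustration: the `SessionScores` class, the use of the session median as the reference point, and the warm-up size are hypothetical and not part of OpenEars or Sphinx. It also assumes, as the -800 to 0 range in this thread implies, that scores closer to 0 are better.

```python
from statistics import median


class SessionScores:
    """Holds scores from the current app session only; nothing is persisted."""

    def __init__(self, min_samples=5):
        # Assumed warm-up size before any comparison is attempted.
        self.min_samples = min_samples
        self.scores = []

    def add(self, score):
        """Record the score of a hypothesis received in this session."""
        self.scores.append(score)

    def is_relatively_strong(self, score):
        """Compare a score with this session's median.

        Returns None until enough scores have been collected, so the
        caller never falls back to a hard-coded constant.
        """
        if len(self.scores) < self.min_samples:
            return None
        return score >= median(self.scores)


session = SessionScores(min_samples=3)
for s in (-10400, -9800, -10100):
    session.add(s)

print(session.is_relatively_strong(-9900))   # above this session's median
print(session.is_relatively_strong(-10300))  # below this session's median
```

Because the reference point is rebuilt from each session's own scores, the comparison does not depend on the absolute scale, so it is unaffected by a scale change like the one in 2.04.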

However, I strongly recommend never evaluating scores against constant values, since you’re very likely to just end up with an app that reacts to men and not women, or to people with midwestern accents and not coastal ones, or to certain devices and not others, etc. Back when I also offered app integration services, the most common type of project I got was fixing apps that used scoring in this way and then had classes of users whose speech was ignored.

I hope this explains a bit more about why the scoring scale hasn’t been protected as an API feature, and also why scores are assumed to still be usable for the recommended purpose of comparison to each other inside of a single app session. Again, I do apologize for this being in flux. Although I’m dubious about the use of scoring, it’s my very strong preference not to throw you any curveballs, and if this weren’t being weighed against bad recognition and app freezes I wouldn’t do it.