Change in scoring with 2.04


  • #1026249
    tonyknight
    Participant

    Hi,

    We have been using both OpenEars and Rejecto since last year, and we have noticed the change in scoring that was introduced by updating Sphinx.

    Can you explain the change? We were used to seeing most of our high-confidence recognitions in the -800 to 0 range. What would be the equivalent now?

    Thanks,
    Tony

    #1026251
    Halle Winkler
    Politepix

    Hello,

    That’s correct; it is mentioned in the changelog:

    https://politepix.com/openears/changelog/

    The change came about as a side effect of a very positive development. For a couple of years, OpenEars developers have been reporting variations of an issue in which certain searches can go on endlessly, ultimately freezing listening and potentially taking up too much memory. This was often due to quiet noises that should be treated as fillers (random noises) or as silences, but which the engine attempted to evaluate as speech in its third pass, leading to an enormous search since those little sounds could be almost anything. The Sphinx project was kind enough to investigate this OpenEars-originating issue as a bug and fixed it this year, which has a number of positive results for us: no more endless searches, a bit more accuracy across the board, better resource use, and, for Rejecto users, Rejecto can now use the third pass without issue, so OpenEars with Rejecto integrated is notably more accurate.

    However, it changed scoring, because the fix relates to how fillers and silences are scored by the engine internally. I noticed the scoring change when the Sphinx version with the fix was first released, and the two projects have been discussing it since, but a) the new behavior may not be a bug; the original scoring may simply have been inaccurately high due to a wrong estimation of fillers and silences, and b) it is (legitimately) not necessarily a very high priority for them, since to the best of my knowledge scoring has never been recommended by either project for confidence estimation; more on that in a moment.

    My original position on this was to wait to release these changes until scoring was similar to the old version, but recently a very bad and intractable bug affecting many developers turned out to again be related to third-pass searches that were overrating fillers, so I decided to bite the bullet and release 2.04 with all of these improvements but different scoring. I sincerely apologize for the unexpected changes, and just to give you a heads-up, the scoring could still change back to the old scale, or to yet a different one, if Sphinx does more work in this area, or if I get feedback from them that one of the approaches I’ve investigated for restoring the previous scoring scale is harmless.

    The reason I’m making a rare exception to the need for API consistency in the case of scoring is that I’ve been saying for years that developers shouldn’t compare scores against arbitrary fixed numbers. It is unfortunately not possible to use a fixed range to decide high confidence with Sphinx, because speaker, accent, gender, environment, speaking speed, length of input, and device will result in enormous variances across multiple app sessions, much greater than your mentioned range of -800 to 0. As an example, an accurate score for an utterance from me in a quiet room used to be about -10000, so a -800 to 0 range would never have marked any of my utterances as confident using OpenEars, although, as you can guess, OpenEars does a good job of recognizing my speech accurately.

    Scoring has only ever been advised for comparing relative scores to each other within a single session. If you want to continue scoring against arbitrary values, the best way to do it is to take some fixed audio input that previously arrived in your -800 to 0 range and check how it is scored now. It isn’t a simple linear or logarithmic scale change; I’ve spent a few days investigating it and there is no single transformation that gets you from the old score scale to the new one, so that has to be evaluated on a case-by-case basis.
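    To make the within-session comparison concrete, here is a rough Swift sketch. The class, the score-string parsing, and the 10% margin are illustrative assumptions rather than anything in OpenEars itself, and the score would come from whichever OEEventsObserver callback your app already receives it in.

        // Minimal sketch of within-session relative scoring; not an OpenEars API.
        // The 10% margin and the String-to-Double parsing are illustrative assumptions.
        final class SessionScoreTracker {
            private var bestScoreThisSession: Double?

            /// Returns true when this score is competitive with the best score seen
            /// so far in the current session, instead of comparing it to a constant.
            func isRelativelyConfident(scoreString: String) -> Bool {
                guard let score = Double(scoreString) else { return false }

                // The first utterance of a session just establishes the baseline.
                guard let previousBest = bestScoreThisSession else {
                    bestScoreThisSession = score
                    return true
                }

                let best = max(previousBest, score)
                bestScoreThisSession = best

                // Scores are negative, so multiplying the best score by 1.10 gives a
                // value 10% further from zero; anything at or above it counts as close.
                return score >= best * 1.10
            }
        }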

    However, I strongly recommend never evaluating scores against constant values, since you’re very likely to end up with an app that reacts to men and not women, to people with Midwestern accents and not coastal ones, or to certain devices and not others, etc. Back when I also offered app integration services, the most common type of project I got was fixing apps which used scoring in this way and then had classes of user whose speech was ignored.

    I hope this explains a bit more about why the scoring scale hasn’t been protected as an API feature, and also why scores are assumed to still be usable for the recommended purpose of comparison to each other inside of a single app session. Again, I do apologize for this being in flux. Although I’m dubious about the use of scoring, it’s my very strong preference not to throw you any curveballs, and if this weren’t being weighed against bad recognition and app freezes I wouldn’t do it.

    #1026286
    tonyknight
    Participant

    Hi,

    Thank you for the thoughtful answer.

    We had started using recognition scores about a year ago when we noticed that some noises were being recognized as words in our relatively small vocabulary (about 15 words). A door would slam, and it would be recognized as a very short word in the vocabulary. We noticed that when a person actually said that word, the score would be somewhere between -800 and 0, so we ignored commands scoring lower than -2000. This ended up working quite well with a large group of testers.

    Later, we added an ambiguous-detection algorithm for lower-confidence recognitions whose scores fell below -800. If a phrase was detected in this range (say -800 to -2000), the user would need to repeat it on the next detection for it to be accepted. This cut down on a lot of false positives.
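    Roughly, the idea looks like this in Swift; the thresholds are our pre-2.04 values, and the enum and pending-phrase bookkeeping are a simplified sketch rather than our actual implementation.

        // Simplified sketch of the two-tier cutoff described above; the thresholds
        // are the pre-2.04 values and the bookkeeping is illustrative only.
        enum CommandDecision { case accepted, needsRepeat, rejected }

        final class TwoTierScoreFilter {
            private let confidentFloor = -800.0   // at or above: accept immediately
            private let ambiguousFloor = -2000.0  // between the floors: require a repeat
            private var pendingAmbiguousPhrase: String?

            func evaluate(hypothesis: String, score: Double) -> CommandDecision {
                if score >= confidentFloor {
                    pendingAmbiguousPhrase = nil
                    return .accepted
                }
                if score >= ambiguousFloor {
                    // Accept an ambiguous phrase only if the very next detection repeats it.
                    if pendingAmbiguousPhrase == hypothesis {
                        pendingAmbiguousPhrase = nil
                        return .accepted
                    }
                    pendingAmbiguousPhrase = hypothesis
                    return .needsRepeat
                }
                pendingAmbiguousPhrase = nil
                return .rejected
            }
        }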

    Another use for the ambiguous-detection routine is in detecting first and last names. Our app allows users to tag photos using contacts from their phone or from family tree software. One thing we don’t want users to do is combine the first name from one contact with the last name from another contact to form the name of the person they are tagging. We enforce this by going back to our data model and comparing the recognized name against the database. We reject the name if they don’t match, but if the score is within the ambiguous range we will accept it if it is repeated on the very next detection.
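    Conceptually, that check is something like the following; ContactStore and its lookup are hypothetical stand-ins for our data model, not a real API.

        // Conceptual sketch of the name validation described above; ContactStore and
        // its lookup are hypothetical stand-ins for the app's own data model.
        protocol ContactStore {
            /// True only if a single contact carries exactly this first/last name pair.
            func hasContact(firstName: String, lastName: String) -> Bool
        }

        final class NameTagValidator {
            private let contacts: ContactStore
            private var pendingAmbiguousName: String?

            init(contacts: ContactStore) { self.contacts = contacts }

            func shouldAccept(firstName: String, lastName: String, scoreIsAmbiguous: Bool) -> Bool {
                // Reject name pairs stitched together from two different contacts.
                guard contacts.hasContact(firstName: firstName, lastName: lastName) else {
                    pendingAmbiguousName = nil
                    return false
                }

                // A confident score with a matching contact is accepted immediately.
                guard scoreIsAmbiguous else {
                    pendingAmbiguousName = nil
                    return true
                }

                // In the ambiguous band, require the same name on the very next detection.
                let fullName = "\(firstName) \(lastName)"
                if pendingAmbiguousName == fullName {
                    pendingAmbiguousName = nil
                    return true
                }
                pendingAmbiguousName = fullName
                return false
            }
        }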

    Since we started doing this, OpenEars has definitely gotten better, and perhaps our rationale for doing it no longer applies. We will do some testing with a larger group of people on an altered scale to see if it is still useful.

    Thanks,
    Tony

    #1026288
    Halle Winkler
    Politepix

    Hi Tony,

    That is interesting, and it sounds like you’ve tested carefully and approached the scoring in a considered way. My main recommendation for excluding false positives is adjusting vadThreshold, combined with Rejecto and specifically Rejecto’s weight argument, rather than scores, but it certainly doesn’t sound like your approach was arrived at carelessly.
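    For reference, the vadThreshold adjustment is just a property set before listening starts. This is a minimal Swift sketch assuming the OpenEars 2.x OEPocketsphinxController property shown here; the value is an arbitrary example rather than a recommendation, and Rejecto’s weight is supplied to its language model generation call, for which you should check the Rejecto documentation.

        import OpenEars

        // Minimal sketch, assuming OpenEars 2.x exposes vadThreshold on
        // OEPocketsphinxController as shown; the value is an arbitrary example.
        func tuneOutNonSpeechNoises() {
            // A higher vadThreshold makes quiet non-speech sounds (door slams,
            // rustling) less likely to be treated as the start of an utterance.
            OEPocketsphinxController.sharedInstance().vadThreshold = 3.2

            // Rejecto's weight argument is supplied when generating the rejecting
            // language model (see the Rejecto docs for the exact call); a higher
            // weight rejects more out-of-vocabulary speech.
        }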

    I would recommend retesting a bit and seeing if you still need scores, especially if you use Rejecto, because there have been improvements in several updates starting with 2.0, and Rejecto should be notably improved in 2.041. Based on all of my experience over the last 5 years of OpenEars, I do think it is likely that there are classes of users, devices, mics, and operating distances for which a -2000 to -800 cutoff leads to frustration. As I mentioned, that would put 100% of my speech squarely in the ambiguous category when I am testing under ideal conditions, and I’m a clear speaker with a Northeast US accent, but being female gives a large score reduction even when recognition remains excellent, simply because the training corpus is biased towards male speakers. It is challenging to construct tests which take all of the factors that affect scoring into account.

    If you retest and decide to keep your scoring but adjust the scale, I recommend keeping your old score cutoffs around just in case I fix this scoring change in an update soon. If a lot of time goes by without an obvious fix I am going to let it go, but if I can find, in the next 12 weeks or so, a good way to adjust it back to the previous scale without affecting the positive improvements, I will release an update restoring the old scoring scale.
