Are you confident that your test of calibrating at the beginning versus raising the noise level later is an apples-to-apples comparison, i.e. one that uses the exact same music, the same segment of that music, and the same volumes? I ask because the voice activity detection is already reset under the circumstances you’re describing: it immediately takes in new noise floor information whenever it returns right after a big jump in noise.
It’s also surprising that loud music playing back at or above speech levels would not be categorized as non-silence, whether it started at the beginning of a new session or in the middle of one. Calling that non-silence would normally be the correct categorization, because the sound is at or above the volume level of speech and the voice activity detector has no way of distinguishing between music-like sound and speech-like sound. If you are confident that you get a different result with an initial calibration against the exact same noise level, I’m up for testing it and trying to improve that.
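To illustrate what I mean about music versus speech, here is a rough sketch of a detector that decides purely on level. This is not the library's actual code: the function names, the -30 dB threshold, and the test signals are all made up for the example, and a real detector also tracks a noise floor over time. It only shows that when level is the sole feature, a music-like tone and a noisier speech-like signal at the same level get the same answer.

```python
import math
import random

SAMPLE_RATE = 16000  # assumed sample rate, for illustration only


def rms_db(frame):
    """Return the RMS level of a frame of float samples, in dB relative to full scale."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, 1e-12))


def scale_to_db(frame, target_db):
    """Return a copy of the frame scaled so that its RMS level equals target_db."""
    gain = 10.0 ** ((target_db - rms_db(frame)) / 20.0)
    return [s * gain for s in frame]


def is_non_silence(frame, threshold_db):
    """Hypothetical level-only decision: only the frame's loudness is considered."""
    return rms_db(frame) >= threshold_db


# Two very different waveforms: a steady 440 Hz tone and seeded random noise.
rng = random.Random(0)
tone = [math.sin(2.0 * math.pi * 440.0 * n / SAMPLE_RATE) for n in range(1600)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(1600)]

music_like = scale_to_db(tone, -20.0)    # loud, at an assumed speech level
speech_like = scale_to_db(noise, -20.0)  # same level, different character
quiet_room = scale_to_db(noise, -60.0)   # well below the assumed speech level

THRESHOLD_DB = -30.0  # assumed threshold, not a value from any real detector

for name, frame in [("music_like", music_like),
                    ("speech_like", speech_like),
                    ("quiet_room", quiet_room)]:
    print(name, is_non_silence(frame, THRESHOLD_DB))
```

The two loud signals come out the same because nothing in the decision looks at what kind of sound it is, which is why I would expect your loud music to count as non-silence in both of your test cases.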