Halle Winkler

Just for some background on why it’s like this: for speech perception purposes there isn’t a big improvement from sampling rates higher than 16k (and mono), so most speech recognition software will attempt recognition at a maximum of a 16k sample rate, because that means far fewer samples have to be analyzed. For non-speech applications such as music it’s naturally always going to be better to use a higher sample rate and stereo if possible. But generally, even for speech that humans listen to, you don’t get a lot of extra “bang for the buck” going from 16k to 44.1k, because the comparison standard is telephone bandwidth, which is generally standardized at 8k and compressed, making 16k PCM already a big step up.

The reason that the recognition is compromised is that it assumes that a “chunk” of speech is likely to occur within a certain number of samples in a timeframe, and instead the speech is occurring over more like 3x that many samples, so it is really not going to map well to the recordings in the acoustic model (which are actually 8k, but the input functions compensate for the doubling of the input rate).
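
To put rough numbers on the “more like 3x” point, here is a minimal sketch of the arithmetic in Python. The 20 ms window and the `samples_in_window` helper are assumptions made up for illustration; they are not the recognizer’s actual frame size or API. It only shows how many samples the same stretch of time covers at each rate.

```python
# Illustrative arithmetic only: the 20 ms window is an assumed value,
# not the frame size any particular recognizer uses.
WINDOW_MS = 20


def samples_in_window(sample_rate_hz: int, window_ms: int = WINDOW_MS) -> int:
    """Number of samples that cover `window_ms` milliseconds at the given rate."""
    return sample_rate_hz * window_ms // 1000


# Rates mentioned in the post: telephone (8k), recognizer input (16k), CD audio (44.1k).
rates_hz = [8000, 16000, 44100]
counts = {rate: samples_in_window(rate) for rate in rates_hz}

# How many times more samples a 44.1k recording has than a 16k one
# for the same duration of speech.
ratio_44k_to_16k = counts[44100] / counts[16000]

if __name__ == "__main__":
    for rate in rates_hz:
        print(f"{rate} Hz: {counts[rate]} samples per {WINDOW_MS} ms")
    print(f"44.1k vs 16k: {ratio_44k_to_16k:.2f}x the samples")
```

So the same bit of speech that the recognizer expects to find in a given number of samples at 16k is spread over roughly 2.76 times as many at 44.1k, which is why the audio needs to be 16k (or resampled to it) before recognition.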