Andy - you're on it. My study over the decades has taught me that listening is far more active and synthetic than we would assume. Auditory input is pretty sketchy, in that the sound pressure moving the auditory cilia must be fundamentally interpreted for meaning. That interpretation occurs in many parts of the brain and is associated with many different functions of memory, emotion and cognitive processing. We're making up most of what we hear.
Sound can be broken into two processing categories, much as light acts as both particles (photons) and waves. The wave aspect of sound relates to the frequency domain of pitch and timbre. The particle aspect is the time domain. The gating mechanisms you reference have more to do with time than frequency. Temporal propagation occurs in real, interpretable space. Any sound, such as a finger snap, arrives at the listening pair of ears with time information that allows us to know what it is as well as where it comes from, including its reflective and absorptive environment. As you allude to, the processing power of the auditory brain would be overloaded without organizing mechanisms. One such mechanism is the time threshold, generally considered to be about 5 milliseconds. Components arriving within that 5 ms envelope are conflated into the original sound, while those arriving afterward are treated as reflections/echoes. Of note is that those sub-5 ms components are perceived as slurred or de-focused when the various frequencies of the arrival transient cannot be combined into a sensible single event. A real acoustic sound source (the finger snap) arrives with all (frequencies of) transients intact and its reflections off the nearby boundaries also intact. The analytical forebrain figures out / decides the nature of the source (the snap) and the particulars of the walls of the room. And we are very good at it, since it is necessary for survival.
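To make the time threshold concrete, here is a minimal sketch of that sorting rule. It assumes the roughly 5 ms figure given above and treats it as a hard cutoff, which is a simplification; the function name and the example arrival times are hypothetical, chosen only for illustration.

```python
# Approximate fusion window from the discussion above (an assumed hard cutoff).
FUSION_WINDOW_MS = 5.0


def classify_arrivals(arrival_times_ms, window_ms=FUSION_WINDOW_MS):
    """Label each arrival relative to the earliest (direct) one.

    Arrivals within `window_ms` of the first arrival are labeled "fused"
    (heard as part of the original sound); later ones are labeled
    "reflection".
    """
    if not arrival_times_ms:
        return []
    first = min(arrival_times_ms)
    labels = []
    for t in arrival_times_ms:
        delay = t - first
        labels.append((t, "fused" if delay <= window_ms else "reflection"))
    return labels


# Hypothetical finger snap: direct sound at 10 ms, two later arrivals.
print(classify_arrivals([10.0, 12.5, 18.0]))
```

In this example the 12.5 ms arrival falls 2.5 ms after the direct sound and is fused with it, while the 18 ms arrival is 8 ms late and is treated as a reflection.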
Trouble comes when aspects of the transient event have been compromised by the reproduction process. Though there are many opportunities to compromise this transient information, the most pervasive is the scrambling introduced by loudspeakers that are not time/phase coherent, where various frequency bands arrive at the ear at different times than they would in a real-life, intact signal. In that case, the auditory brain must analyze the sonic elements and synthesize an opinion of the sound's nature (finger snap). It must also repeat that analysis for each reflection. Those additional layers of decoding are processor-intensive and serve to distance the whole listener from the heard experience. One fine twist is that the more sophisticated the listener, the more he tolerates / succeeds at the cognitive process of figuring out what is being heard. Therefore I trust the aural impressions of non-sophisticated listeners, simply because they are in closer contact with the whole, natural auditory experience, whereas the sophisticated listener can "overlook" the deficiencies of a temporally inaccurate sound because his skill enables him to "hear" it despite its shortcomings. Teenage girls are my first choice for test listeners.
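As a rough illustration of how frequency bands can arrive at different times, the sketch below computes the arrival-time skew between drivers from path length alone. It is only a sketch under stated assumptions: it models geometric offset between drivers and ignores the frequency-dependent delay that crossover filters also add, it takes the speed of sound as about 343 m/s (dry air near 20 °C), and the driver names and distances are hypothetical.

```python
# Assumed speed of sound in dry air at about 20 degrees C.
SPEED_OF_SOUND_M_PER_S = 343.0


def arrival_time_ms(distance_m):
    """Travel time in milliseconds over a straight path of `distance_m`."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0


def driver_skew_ms(distances_m):
    """Return each driver's arrival delay relative to the earliest driver.

    `distances_m` maps a driver name to its distance from the ear in meters.
    Only geometric path length is modeled here (an assumption).
    """
    times = {name: arrival_time_ms(d) for name, d in distances_m.items()}
    earliest = min(times.values())
    return {name: t - earliest for name, t in times.items()}


# Hypothetical two-way speaker: woofer sits 34.3 mm farther from the ear.
skew = driver_skew_ms({"tweeter": 3.0, "woofer": 3.0343})
for name, delay in skew.items():
    print(f"{name}: {delay:.3f} ms late")
```

With these made-up distances the woofer's output arrives about 0.1 ms after the tweeter's, so both bands land inside the roughly 5 ms window described above but no longer line up as a single intact transient.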