The short answer is, the ear doesn’t hear small differences in arrival times at low frequencies.
Here’s a much longer answer:
It makes intuitive sense to line everything up so that the arrival time for the sub(s) is the same as for the mains, but the real world is more complicated. What we overlook is the effect of the phase response.
Let me give a fairly simple example: Suppose we have a 4th order crossover at 80 Hz (maybe a 4th order lowpass filter on the subs, and maybe a 2nd order acoustic rolloff + 2nd order highpass filter on the mains). With a 4th order crossover the lowpass and highpass sections are theoretically "in phase" at the crossover frequency, but the lowpass section (the subwoofer) is lagging the highpass section (the mains) by 360 degrees... one full cycle. In order to align their arrival times, the subwoofers would need to be one wavelength at 80 Hz closer to the ears. That’s about fourteen feet! (This same principle holds for shallower-slope crossovers and for asymmetrical crossovers; the fraction of a wavelength is less but more math is involved, which is why I picked 4th order for this example.)
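For anyone who wants to check the fourteen-foot figure, here is a minimal sketch of the arithmetic. It assumes a speed of sound of about 1130 ft/s (a round number for room-temperature air, not a figure from this post), and the function names are just illustrative. It treats the phase lag at a single frequency as an equivalent distance, which is a simplification.

```python
# Assumed round figure for the speed of sound in room-temperature air.
SPEED_OF_SOUND_FT_PER_S = 1130.0


def wavelength_ft(freq_hz, speed_ft_per_s=SPEED_OF_SOUND_FT_PER_S):
    """Wavelength in feet: speed of sound divided by frequency."""
    return speed_ft_per_s / freq_hz


def phase_lag_as_distance_ft(freq_hz, lag_deg, speed_ft_per_s=SPEED_OF_SOUND_FT_PER_S):
    """Distance equivalent to a phase lag: the lag's fraction of a cycle times the wavelength."""
    return wavelength_ft(freq_hz, speed_ft_per_s) * lag_deg / 360.0


def phase_lag_as_time_ms(freq_hz, lag_deg):
    """Time equivalent to a phase lag, in milliseconds."""
    return 1000.0 * (lag_deg / 360.0) / freq_hz


if __name__ == "__main__":
    # 360 degrees is the 4th order case from the example above;
    # 180 degrees is just an illustrative smaller lag.
    for lag_deg in (360.0, 180.0):
        dist = phase_lag_as_distance_ft(80.0, lag_deg)
        time_ms = phase_lag_as_time_ms(80.0, lag_deg)
        print(f"{lag_deg:.0f} deg at 80 Hz: {dist:.1f} ft, {time_ms:.2f} ms")
```

With these assumptions, a 360 degree lag at 80 Hz works out to a bit over fourteen feet, or 12.5 milliseconds.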
The ear’s poor time-domain resolution at low frequencies comes to our rescue. We don’t notice that the output from the subs is arriving one wavelength (fourteen feet) later than it should. I’m not saying there would be no subjective improvement from correcting that, but it’s not "what matters most"... which is a topic I’ll come back to later.
It also makes intuitive sense for the output of all of the subs to arrive at exactly the same instant. This is inherently accomplished if you only have one sub, and can be accomplished at one listening position if you have more than one sub. However, if arrival time were what mattered most in the bass region, then one sub would be what sounds the best, especially outside of the sweet spot.
The reason arrival time isn’t what matters most is, the ear has very poor time-domain resolution at low frequencies. This is why we are so poor at hearing the direction of very low frequency sine waves in a room - we cannot separate the first arrival from the reflections. But this makes the ear very forgiving of small timing errors at low frequencies.
A very illuminating study was conducted in which short-duration low frequency signals - including mere fractions of a cycle - were digitally created and played over headphones (to avoid room effects). Listeners were UNABLE to even DETECT the presence of bass energy from less than one full cycle. Consider how long wavelengths are at low frequencies and you’ll see that, unless your room is very large, by the time you BEGIN to hear the deep lows, that energy has already reflected off multiple room surfaces. In this context, a difference in subwoofer arrival times which amounts to a tiny fraction of a wavelength is inconsequential.
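To put rough numbers on that, here is a small sketch that works out how far sound travels during one full cycle, and what fraction of a cycle a given path-length difference between two subs amounts to. The 1130 ft/s speed of sound and the 2 ft path difference are assumptions chosen for illustration, not measurements.

```python
# Assumed round figure for the speed of sound in room-temperature air.
SPEED_OF_SOUND_FT_PER_S = 1130.0


def distance_per_cycle_ft(freq_hz, speed_ft_per_s=SPEED_OF_SOUND_FT_PER_S):
    """How far sound travels during one full cycle (i.e. one wavelength), in feet."""
    return speed_ft_per_s / freq_hz


def cycle_time_ms(freq_hz):
    """Duration of one full cycle, in milliseconds."""
    return 1000.0 / freq_hz


def path_difference_in_cycles(path_difference_ft, freq_hz):
    """Express a path-length difference between two sources as a fraction of a cycle."""
    return path_difference_ft / distance_per_cycle_ft(freq_hz)


if __name__ == "__main__":
    hypothetical_path_difference_ft = 2.0  # illustrative only
    for freq_hz in (20.0, 40.0, 80.0):
        travelled = distance_per_cycle_ft(freq_hz)
        fraction = path_difference_in_cycles(hypothetical_path_difference_ft, freq_hz)
        print(
            f"{freq_hz:.0f} Hz: one cycle = {cycle_time_ms(freq_hz):.1f} ms, "
            f"sound travels {travelled:.1f} ft; "
            f"a {hypothetical_path_difference_ft:.0f} ft offset = {fraction:.3f} cycle"
        )
```

At 40 Hz, for example, one cycle lasts 25 ms, during which sound covers a bit over 28 feet, which is more than the longest dimension of most home listening rooms.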
(Something which to the best of my knowledge has not been studied is what the time-domain resolution is for the tactile perception of bass energy, "felt" with the body rather than heard with the ears. I would guess that the time window for "perceptually simultaneous" impact is related to receptor nerve and/or neuron firing rates, which I have not studied.)
Of far greater perceptual consequence is what’s happening to the trailing edge of the bass tones... how smoothly do they fade away? (The first designer I encountered who gave precedence to the trailing edge over the leading edge was Jon Dahlquist. Jon wrote that, in the course of designing the legendary DQ-10, he had to choose between aligning the leading edges of waveforms and aligning the trailing edges of waveforms. Listening tests led him to the counter-intuitive conclusion that the trailing edge of the notes mattered more. So this concept is applicable elsewhere in the spectrum, but it is especially applicable at low frequencies in listening rooms of the size we have in our homes.)
Well, it turns out that speakers + room = a minimum phase system at low frequencies (this according to multiple researchers, including Floyd Toole and Earl Geddes), which means that the time-domain response tracks the frequency response. So where there is a frequency response peak, that’s where the energy takes longer to decay into inaudibility (not that it necessarily decays more slowly; but because it starts out louder, it takes longer to finish decaying).
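Here is a rough sketch of the "starts out louder, so takes longer to finish decaying" point. It makes a big simplifying assumption: that the level falls at the same constant rate in dB per second at every frequency, set by a hypothetical RT60 value (the time to decay by 60 dB). Real rooms won't behave this neatly, so treat it as back-of-the-envelope arithmetic only.

```python
def decay_rate_db_per_s(rt60_s):
    """Constant decay rate implied by an RT60 figure (time to fall by 60 dB)."""
    return 60.0 / rt60_s


def extra_decay_time_ms(peak_db, rt60_s):
    """Extra time a peak of peak_db needs to fall to the same level as its neighbours.

    Assumes the peak decays at the same rate as everything else,
    which is a simplification for illustration.
    """
    return 1000.0 * peak_db / decay_rate_db_per_s(rt60_s)


if __name__ == "__main__":
    hypothetical_rt60_s = 0.5  # illustrative value, not a measurement
    for peak_db in (3.0, 6.0, 10.0):
        extra = extra_decay_time_ms(peak_db, hypothetical_rt60_s)
        print(f"+{peak_db:.0f} dB peak: about {extra:.0f} ms longer to decay")
```

Under those assumptions, a 6 dB peak in a room with a 0.5 second RT60 hangs on roughly 50 ms longer than the surrounding frequencies.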
I said I’d come back to "what matters most" in the bass region, and imo it’s the in-room frequency response. This is predicted by equal-loudness curves, which bunch up south of 100 Hz. They tell us that a 5 dB difference at 40 Hz or so is perceptually comparable to a 10 dB difference at 1 kHz! No wonder big in-room peaks in the bass region are so detrimental to sound quality.
So to recap, the ear has very poor time-domain resolution in the bass region, but has exaggerated sensitivity to frequency response errors in the bass region.
The good news is, the ear really appreciates any improvements we can make in the bass region, whether they be by way of EQ, bass trapping, a distributed multi-sub system, or just working with positioning... or any combination thereof. The information I’ve seen leads me to believe that a distributed multi-sub system usually makes a bigger improvement than these other techniques in an already competent system, but I’m hardly a disinterested observer.
Sorry, this is probably a much longer answer than anyone was looking for.
Duke