Why Do So Many Audiophiles Reject Blind Testing Of Audio Components?


Because it was scientifically proven to be useless more than 60 years ago.

A speech scientist by the name of Irwin Pollack conducted an experiment in the early 1950s. In a blind ABX listening test, he asked people to distinguish minimal pairs of consonants (like “r” and “l”, or “t” and “p”).

He found out that listeners had no problem telling these consonants apart when they were played back immediately one after the other. But as he increased the pause between the playbacks, the listener’s ability to distinguish between them diminished. Once the time separating the sounds exceeded 10-15 milliseconds (approximately 1/100th of a second), people had a really hard time telling obviously different sounds apart. Their answers became statistically no better than a random guess.

If you are interested in the science of these things, here’s a nice summary:

Categorical and noncategorical modes of speech perception along the voicing continuum

Since then, the experiment has been repeated many times (last major update in 2000, Reliability of a dichotic consonant-vowel pairs task using an ABX procedure.)

So reliably recognizing the difference between similar sounds in an ABX environment is impossible. With a playback gap of 15ms, the listener’s guess becomes no better than random. This happens because humans don't have any meaningful waveform memory. We cannot exactly recall the sound itself, and rely on various mental models for comparison. It takes time and effort to develop these models, which makes us really bad at playing the "spot the sonic difference right here and now" game.

Also, please note that the experimenters were using the sounds of speech. Human ears have significantly better resolution and discrimination in the speech spectrum. If a comparison method is not working well with speech, it would not work at all with music.

So the “double blind testing” crowd is worshiping an ABX protocol that was scientifically proven more than 60 years ago to be completely unsuitable for telling similar sounds apart. And they insist all the other methods are “unscientific.”

The irony seems to be lost on them.

Why do so many audiophiles reject blind testing of audio components? - Quora
artemus_5
Thank you, dletch2, for your post. There is much subtlety buried in these questions of perception. For instance, what is the difference between preference and liking? Liking (the "good" pole of the bipolar Semantic dimension of Valence) is always a "good" thing (pun intended). But, what if two products are equally well liked (i.e., equally "good") AND different in other apparent semantic qualities such as strength, arousal, or novelty? Knowing which one is preferred helps pick one. But, unless you measure these other semantic dimensions you don't know why it's preferred. I come at this indirectly by having my subjects rate their imagined "ideal" product, which gives me target values on each semantic dimension to shoot for. But, I have to assume that my subjects are familiar enough with the products to know what "ideal" looks, sounds, feels, tastes, or smells like for them. In practice, this does not seem to have been a problem.

I will also add that my interest is not in knowing whether people can detect a difference between products, but where in the multidimensional perceptual space any given product lies. Products that are close to one another in perceptual space are more similar to one another than products that are farther apart. Here, the words of my graduate advisor, Dr. David Lane, come to mind: statistics can tell you whether two stimuli are reliably different, but they can't tell you whether that difference makes a difference to your target audience. There, you have to know something about preference.

As for fatigue, my subjects evaluate products one at a time. So, they give each product their full attention. But, I evaluate many products in the same study (typically 15 to more than 30), so there is a possible element of fatigue. I handle that with counterbalancing. Across subjects, each product appears equally often at the beginning, in the middle, and at the end of the presentation sequence.
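For anyone who wants to see what that kind of counterbalancing can look like, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not the actual procedure used in these studies: the function name `rotated_orders` and the product labels are hypothetical, and it uses a simple rotation so that every product lands in every serial position exactly once across the set of orders.

```python
def rotated_orders(products):
    """Return one presentation order per product, each a rotation of the list.

    Across the returned orders, every product occupies every serial
    position exactly once, so position effects (e.g. fatigue late in
    the session) are spread evenly over the products.
    """
    n = len(products)
    return [products[i:] + products[:i] for i in range(n)]


# Hypothetical product labels, for illustration only.
orders = rotated_orders(["A", "B", "C", "D"])
for subject, order in enumerate(orders, start=1):
    print(f"subject {subject}: {order}")
```

A plain rotation like this balances position only. It does not balance which product comes right before which, so carry-over effects would need a more elaborate design.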

Again, thanks for your comments, dletch2! They make me think.
@ OP - I have not read the whole thread and apologize for any repetitive content.
Terms such as "ABX testing" and "blind testing" are generalizations that do not allow us to assess the appropriateness of specific methods for verifying claims or the existence of phenomena. As usual, the devil is in the details, and we have to look at the specific design of a study and the underlying hypothesis before we can judge the quality and usefulness of a given study design and method.
This is particularly true when we want to establish that a subjective preference is real or the result of bias.

If, for example, a person claims that a cable or a fuse makes a clear, audible difference in the sound of a system, an appropriate simple study design would be along the lines of (1) this one person (2) on 12 consecutive days listens to (3) the same program on the same system in the same room, comparing the claimed superior component with the same standard component on each day. The test subject is 'blind' to the active component and is asked to identify which one is active. From this we would learn whether the audible difference indeed 'exists' for the person making that very claim. We would also learn something semi-quantifiable, i.e. whether the difference is marginal at best or "crystal clear". For a clear effect (often touted as dramatic or transforming), this study would have adequate statistical power with 12 data points. It would be objective.
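To put rough numbers on the 12-day design, here is a minimal sketch in Python of the chance calculation. It assumes each day is an independent two-way choice, so a pure guesser is right with probability 0.5; that value, the function name `p_value_at_least`, and the one-sided binomial test are my assumptions for illustration, not a prescribed analysis.

```python
from math import comb


def p_value_at_least(correct, trials=12, chance=0.5):
    """Probability of getting at least `correct` answers right by guessing.

    One-sided binomial tail: sums P(X = k) for k from `correct` to `trials`,
    where X counts correct answers and each trial succeeds with `chance`.
    """
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# How surprising is each score if the listener is only guessing?
for correct in (9, 10, 11, 12):
    print(correct, "of 12:", round(p_value_at_least(correct), 4))
```

Under these assumptions, a listener who hears a truly "crystal clear" difference should score 12 of 12, which a guesser manages about once in 4096 attempts. Even 10 of 12 happens by guessing less than 2% of the time, while 9 of 12 happens about 7% of the time and would not clear the usual 5% threshold.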

If, on the other hand, we don't want to show the mere existence of a phenomenon ("I can hear a difference when measurements don't detect a difference.") but want to determine a preference that holds true for many people, we need more test subjects.

Yet, to articulate a preference when the existence of an audible difference between two components cannot be established in the first place (specifically, when the very person claiming the existence of a preference cannot reliably differentiate between two components) seems unreasonable or arbitrary.

Having said that, describing components as "synergistic" without being able to both establish their discrete effects experimentally and quantify them seems baseless. In order to establish synergy, one needs to be able to detect AND quantify. And when a company uses the word "Research" in its name, claims there are no methods to measure or test critical product performance parameters, and also does not have any data from appropriate blind listening tests (see above), I am extremely skeptical. In fact, I am not interested.