How can a 40 watt amp outshine a 140 watt amp?
So, I've resisted responding here for quite a while, but temptation has overcome me. My first thought is that the challenge is pretty silly, in that it relies on crippling the better amp, or jacking around with the lesser amp, to make the two sounds as hard as possible to tell apart. What's left is exactly what most listeners don't care about: whether the sound fits some arbitrary standard, rather than whether it sounds lifelike.
Several comments above have made similar points, so I'll add a slightly more technical observation. The test requires all 24 judgements to be correct. If we are willing to assume that each judgement is statistically independent (arguable, but not terribly germane), then the probability of passing the test when you can detect exactly no difference between the amps is roughly .00000006, a pretty stringent test.
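That chance-level figure is easy to verify; a quick sketch in plain Python, taking the 24-trial count from the challenge as described above:

```python
# Probability of getting all 24 independent judgements right by pure
# guessing (p = .5 on each trial): just .5 raised to the 24th power.
p_chance = 0.5 ** 24
print(f"P = {p_chance:.8f}")  # prints P = 0.00000006
```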
That is, if the probability of choosing the better amp is exactly .5 (we are just flipping a coin on each trial), then the probability of passing the test by chance is less than .0000001. Let's call the probability of detecting the difference on any given trial "p", and the probability of passing the test "P". In our example, p = .5 and P < .0000001, when there is exactly no difference between the amps.

Now, suppose there is a small but hard-to-detect difference between the amps. Since we have to use a variable source signal (music), we cannot just compare one sine wave to another, so we cannot compare the amps to each other with 100% accuracy. The music thus introduces uncertainty into the comparison. If this uncertainty is large, or the difference between the amps is small, p will stay near .5 (might as well flip a coin). If the uncertainty is small, and the difference between the two amps is large, then p will approach 1.0.

Note that passing the challenge effectively requires p to be near 1.0. In other words, the challenge is based on the assumption that ANY difference in amps should make it possible to detect a difference in EVERY case. Looking at it from this point of view, the fact that the challenge has never been overcome is just a statistical artifact of the design of the challenge. For those who have had a stat course: the design has almost no statistical power when the signals from the amps are pretty close, or when the uncertainty introduced into the signal by the music is large. The design is strongly (!) biased in favor of the null hypothesis of no difference.
Suppose we allow some difference between the amps, but not enough to be detected every time: say p = .6, meaning we can only detect the difference about 60% of the time. Now P, the probability of winning the challenge, is less than .00001, still very unlikely. But notice that there is a real difference between the amps. It's obscured by our jacking around with the signals, by the confusion induced by the variability of the music, and by the requirement of perfect performance on every trial, but the difference between the amps is still very real.
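The relationship between per-trial detectability and the odds of winning can be tabulated the same way; a sketch assuming the 24 independent trials described above:

```python
# P = p ** 24: the probability of passing all 24 trials as a function of
# the per-trial detection probability p. Even fairly detectable
# differences leave the challenge nearly unwinnable.
for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(f"p = {p:.2f}  ->  P = {p ** 24:.10f}")
```

Even at p = .9 (detecting the difference nine times out of ten) P is still under .08.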
What situation would lead us to be able to pass the test more often than not? We would have to detect the difference on more than 97% of trials (p greater than .97), an extraordinary level of performance for an ambiguous stimulus.
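That 97% figure is just the 24th root of one half; a check under the same independence assumption:

```python
# Solve p ** 24 > 0.5 for p: the per-trial accuracy needed to win the
# challenge more often than not.
threshold = 0.5 ** (1 / 24)
print(f"p must exceed {threshold:.4f}")  # about .9715
```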
The bottom line is that the challenge is primarily a statistical artifact based on the fallacy of accepting the null hypothesis. We cannot conclude that there is exactly no difference between the amps, because we can never prove that p is exactly .5. All we have proven is that we can set up an experiment with enough ambiguity, and so little statistical power, that the result is a foregone conclusion. The prize money is safe for quite some time.