I think Hearhere summed up the issue well in his last post, but I would come at it from a slightly different angle. Simply put, DBT is not, in and of itself, "controversial." However, there is a great deal of misunderstanding and disagreement regarding its use and applicability. More particularly, DBT is simply a tool whose results are interpreted through statistical analysis, and it must be understood in that context. While DBT does have some applicability in the audio context, it is not the be-all and end-all that some make it out to be.
There are two main problems with how DBTs are used/viewed by certain audiophiles. First and foremost, what many do not understand (but what anyone with experience in statistics can tell you) is that if the result is not statistically significant, the DBT has not “proven” there are no differences between conditions! Rather, all that can be concluded is that the DBT failed to reject the null hypothesis in favor of the alternative hypothesis.
Second, small-trial (aka "small-N") listening tests analyzed at commonly used statistical significance levels (e.g., .05) lead to large Type 2 error risks, thereby masking the very differences the tests are supposed to reveal.
Now breaking that down into English is a pain, but I'll give it a shot (I’m an engineer, as opposed to a statistician - thus any stats guys feel free to correct me). In a simple DBT, one attempts to determine if there are audible differences between two conditions (such as by inserting a new interconnect in a given system). This is more commonly called a hypothesis test - the goal is to determine whether you can reject a "null hypothesis" (there are in fact no differences between the two conditions) in favor of an "alternative hypothesis" (there are in fact differences between the two conditions).
In a DBT, there are four possible results: 1) there are differences and the listener correctly identifies that there are differences; 2) there are no differences and the listener correctly identifies there are no differences; 3) there are no differences, but the listener believes there are differences; and 4) there are differences, but the listener believes there are no differences. Obviously, 1 and 2 are correct results. Circumstance 3 (concluding that differences exist when in reality they don’t) is commonly referred to as "Type 1 error". Circumstance 4 (missing a true difference) is commonly referred to as "Type 2 error". Put in terms of the hypothesis test stated above, Type 1 error occurs when the null hypothesis is true but is wrongly rejected, and Type 2 error occurs when the null hypothesis is false but we fail to reject it.
Now, things get a little complicated. First we need to introduce a variable, p_u, which is the probability of success of the underlying process. In the listening context, this is the probability that a listener can identify a difference between conditions, which is based on the acuity of the listener, the magnitude of the differences, and the conditions of the trial (e.g. the quality of the components, recording, ambient noise, etc). Unfortunately, we can never “know” p_u, but can only make reasonable guesses at it.
We also need to introduce the variable "alpha". Alpha, or the significance level, is the cutoff that the probability value (defined next) must fall at or below before we reject the null hypothesis in favor of the alternative hypothesis. By selecting a suitable significance level during the data analysis, you can select a risk of Type 1 error that you are willing to tolerate. A common significance level used in DBT testing is .05.
Finally, we need to look at the probability value. In hypothesis testing, the probability value is the probability of obtaining data as extreme or more extreme than the results achieved by the experiment assuming the null hypothesis is true (put another way, it is the likelihood of an observed statistic occurring on the basis of the sampling distribution).
Once the DBT is performed, one compares the probability value to alpha to determine whether the result of the test is statistically significant, such that we can reject the null hypothesis. In our example, if the null hypothesis is rejected, we can conclude there are in fact audible differences between ICs.
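To make that comparison concrete, here is a minimal sketch of the calculation in Python. It assumes a one-sided exact binomial test with a 0.5 chance of guessing right on each trial; the function name p_value and the 9-of-10 result are made up purely for illustration, not taken from any actual test.

```python
from math import comb

def p_value(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value: the probability of getting at least
    `correct` right answers in `trials` attempts if the listener is only guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

ALPHA = 0.05  # significance level we are willing to tolerate (assumed here)

# Hypothetical result: 9 correct identifications out of 10 trials.
p = p_value(9, 10)
print(f"p = {p:.4f}")  # p = 0.0107
print("reject the null" if p <= ALPHA else "fail to reject the null")
```

Note that the "fail to reject" branch says nothing more than that; it is not a finding that no difference exists.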
Now, here comes the fun part. It might seem that you want to set the smallest possible significance level to test the data, thereby producing the smallest possible risk of Type 1 error (i.e., set alpha to .01 as opposed to .05). However, this doesn’t work, because, as you reduce the risk of Type 1 error (lower alpha), the risk of Type 2 error necessarily increases.
Further, and this is a greater impediment to practical DBT testing, the risk of Type 2 error increases not only as you reduce Type 1 error risk, but also with reductions in the number of trials (N) and in the listener's true ability to hear the differences under test (p_u). Since you really never know p_u, and can only speculate on how to increase it (e.g., by selecting only high quality recordings of unamplified music using a high quality system to test the ICs), the best ways to reduce the risk of Type 2 error in a practical listening test are to increase either N or the risk of Type 1 error.
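For anyone who wants to see those three levers move, the sketch below tabulates Type 2 risk for a few combinations of alpha and N. It is only an illustration under assumptions of my own: an exact one-sided binomial test, a 0.5 guessing rate, and an assumed p_u of 0.7. The helper names tail, critical_count and type2_risk are hypothetical.

```python
from math import comb

def tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def critical_count(n: int, alpha: float, chance: float = 0.5) -> int:
    """Smallest number of correct answers whose p-value is at or below alpha."""
    for k in range(n + 2):  # k = n + 1 gives an empty tail (0), so this always returns
        if tail(k, n, chance) <= alpha:
            return k

def type2_risk(n: int, alpha: float, p_u: float, chance: float = 0.5) -> float:
    """Probability the test misses significance when the true hit rate is p_u."""
    return 1.0 - tail(critical_count(n, alpha, chance), n, p_u)

P_U = 0.7  # assumed true probability of a correct identification
for alpha in (0.05, 0.01):
    for n in (16, 50, 100):
        print(f"alpha={alpha:<5} N={n:<4} Type 2 risk={type2_risk(n, alpha, P_U):.4f}")
```

Reading down the output, tightening alpha should push the Type 2 risk up at any given N, and adding trials should pull it back down.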
Now for some examples. Let's assume we run 16 trials on the IC in question. For purposes of the example, further assume that the probability of randomly guessing correctly whether the new IC was inserted is 0.5. Finally, we must make a guess at “p_u”, which we could say is 0.7. In this instance, the minimum number of correct results for the probability value to fall below .05 is 12 (our Type 1 error in this case is 0.0384). However, our Type 2 error in this case goes through the roof - in this example, it is .5501, which is huge! Thus, this test suffers from a high level of Type 2 error, and is therefore unlikely to resolve differences that actually exist between the interconnects.
What happens if there were only 11 correct results? Our p value is then .1051, which exceeds alpha. Thus, we are not able to reject the null hypothesis in favor of the alternative hypothesis. However, this does not allow us to conclude that there are in fact no audible differences between ICs. In other words, data that are not sufficient to show convincingly that a difference between conditions is nonzero do not prove that the difference is zero.
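The 16-trial figures can be checked with a few lines of Python. This is just a sketch assuming an exact one-sided binomial test; the helper name tail is my own.

```python
from math import comb

def tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

N, CHANCE, P_U = 16, 0.5, 0.7

# p value under pure guessing for 12 and for 11 correct answers
for correct in (12, 11):
    print(correct, "correct -> p =", round(tail(correct, N, CHANCE), 4))
# 12 correct -> p = 0.0384
# 11 correct -> p = 0.1051

# Type 2 risk: chance of scoring below 12 even though the true hit rate is 0.7
print("Type 2 risk =", round(1 - tail(12, N, P_U), 4))  # Type 2 risk = 0.5501
```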
So now let's increase the number of trials to 50. Now, the number of correct results needed to yield statistically significant results is 32 (p value = .0325). Assuming again p_u is 70%, our Type 2 error drops to ~ 0.14, which is more acceptable, and thus differences between conditions are more likely to be revealed by the test.
OK, one last variation. Let’s assume that the differences are really minor, or we are using a boom box to test the interconnects, such that p_u is only 60%. What happens to Type 2 error? It goes up - in the 50 trial example above, it goes from .1406 to .6644 - again, the test likely masks any true difference between ICs.
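The same kind of check works for the 50-trial case, this time letting the code search for the cutoff rather than plugging in 32. Again this is a sketch under my own assumptions (exact one-sided binomial test, alpha of .05, hypothetical helper names), and the printed values should land close to the figures quoted above.

```python
from math import comb

def tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def critical_count(n: int, alpha: float = 0.05, chance: float = 0.5) -> int:
    """Smallest number of correct answers whose p-value is at or below alpha."""
    return next(k for k in range(n + 2) if tail(k, n, chance) <= alpha)

N = 50
k_crit = critical_count(N)
print("correct answers needed:", k_crit)
print("Type 1 risk:", round(tail(k_crit, N, 0.5), 4))

# Type 2 risk for a clearly audible difference versus a marginal one
for p_u in (0.7, 0.6):
    print(f"p_u = {p_u}: Type 2 risk = {1 - tail(k_crit, N, p_u):.4f}")
```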
To sum up, DBT is a tool that can be very useful in the audio context if used and understood correctly. Indeed, this is where I take issue with Bomarc, when he says "I don't want to get into statistics, except to say that's usually not the weak link in a DBT". Rather, the (mis)understanding of statistics is precisely the weak link in the applicability of DBTs.