Amir, can you share what constitutes a "properly run listening test" from your perspective? What characteristics are you listening for, specifically?
It varies widely depending on the class of product. On, say, a power tweak, I listen for any difference, regardless of what it is. If I can distinguish it from not using the tweak, then that is major news by itself.
For testing of distortion, it is best to hear it exaggerated first, and then dial it back. So if you have a low power/high distortion amplifier, first crank it way up and hear the distortion clearly. Then back down the volume control and see at what point that same artifact is no longer there.
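To make the "exaggerate it, then dial it back" idea concrete, here is a rough numerical sketch. It is only a model, not a listening test: the amplifier is a hypothetical hard clipper (`clip_amp`), the signal is a single sine, and the 0.1% THD threshold in `sweep_down` is an arbitrary stand-in for "the artifact is no longer there", not a claim about audibility. All function names are made up for illustration.

```python
import math

def clip_amp(x, gain, rail=1.0):
    """Hypothetical amplifier: linear gain followed by hard clipping at +/- rail."""
    return max(-rail, min(rail, gain * x))

def harmonic_amplitude(samples, k):
    """Amplitude of the k-th harmonic, assuming samples span exactly one fundamental period."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n

def thd(gain, n=1024, max_harmonic=9):
    """Ratio of harmonics 2..max_harmonic to the fundamental for a unit sine at this gain."""
    tone = [math.sin(2 * math.pi * i / n) for i in range(n)]
    out = [clip_amp(x, gain) for x in tone]
    fundamental = harmonic_amplitude(out, 1)
    harmonics = math.sqrt(sum(harmonic_amplitude(out, k) ** 2
                              for k in range(2, max_harmonic + 1)))
    return harmonics / fundamental

def sweep_down(start_gain=4.0, step_db=1.0, threshold=0.001):
    """Start heavily overdriven, then back the 'volume' off in dB steps
    until the measured THD falls to the (assumed) threshold or below."""
    gain = start_gain
    while thd(gain) > threshold:
        gain *= 10 ** (-step_db / 20)
    return gain
```

Calling `thd(4.0)` shows the gross distortion you would start with, and `sweep_down()` returns the first gain setting on the way down where the model's distortion drops under the chosen threshold. In a real test your ears replace `thd`, but the procedure is the same.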
For things like speakers, single-speaker testing doesn't make sense. Ultimately we don't know how a recording is supposed to sound. Research relies on putting at least 4 speakers together and comparing them. That way, the bad speaker will stand out as an exception to the rest. Such tests are outside of the means of most audiophiles, but a few have tried, as I linked to yesterday.
In all cases, deep knowledge of what you are testing, including measurements, is a great help to focus your listening tests. This is very important in hearing lossy compression artifacts for example.
Back to speaker (and headphone) listening, selection of content is paramount. You want broad-spectrum content that is mostly invariant. That is, it doesn't keep changing. That way you can do comparisons without the content itself changing on you. This is incredibly helpful when I am developing EQ filters to correct response errors. I want to be able to turn the filter on and off and hear the effect. But if the content changes from drums to vocals and then the piano, I can't do this.
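For anyone who wants to see what "turning the filter on and off" looks like in code, below is a minimal sketch of a single peaking EQ band with a bypass switch. The coefficient formulas are the widely used ones from the RBJ Audio EQ Cookbook; that is one common choice and not necessarily what any particular EQ tool uses. The function names and the example numbers (a 6 dB cut at 3 kHz, Q of 2, 48 kHz sample rate) are hypothetical.

```python
import cmath
import math

def peaking_eq(f0, gain_db, q, fs):
    """Peaking-EQ biquad coefficients (RBJ cookbook form), normalized so a[0] == 1."""
    amp = 10 ** (gain_db / 40)
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    b = [1 + alpha * amp, -2 * math.cos(w0), 1 - alpha * amp]
    a = [1 + alpha / amp, -2 * math.cos(w0), 1 - alpha / amp]
    return [c / a[0] for c in b], [c / a[0] for c in a]

def apply_eq(samples, b, a, enabled=True):
    """Run samples through the biquad, or pass them through untouched when bypassed."""
    if not enabled:
        return list(samples)
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def response_db(b, a, freq, fs):
    """Magnitude response of the biquad at freq, in dB."""
    z = cmath.exp(-1j * 2 * math.pi * freq / fs)  # z^-1 on the unit circle
    h = (b[0] + b[1] * z + b[2] * z * z) / (a[0] + a[1] * z + a[2] * z * z)
    return 20 * math.log10(abs(h))
```

With `b, a = peaking_eq(3000, -6.0, 2.0, 48000)`, flipping `enabled` in `apply_eq` on the same steady passage is the A/B comparison described above. `response_db` is only there to sanity-check that the filter does what you intended before you listen.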
Something very useful in testing lower powered amplifiers and speakers/headphones is to have a mix of bass and high frequencies. This way, when the bass notes come and demand power, you can listen to not only how they get distorted but also the impact on the rest of the spectrum (e.g. brightness as a result of too much harmonic distortion).
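Here is a small sketch of why the bass/treble mix is revealing, again using a hypothetical hard-clipping amplifier as the model. A strong 50 Hz tone is mixed with a weaker 5 kHz tone; once the bass pushes the amp into clipping, the treble gets modulated and sidebands show up at 5 kHz plus and minus twice the bass frequency. The frequencies, levels, sample rate, and function names are all assumptions chosen to keep the math simple, and a real amplifier will misbehave in more complicated ways.

```python
import math

FS = 48000  # sample rate in Hz (assumed)
N = 4800    # 0.1 s of audio, so every multiple of 10 Hz lands on an exact analysis bin

def two_tone(bass_hz=50, treble_hz=5000, bass_amp=0.8, treble_amp=0.2):
    """Strong bass tone plus a weaker treble tone, peaking at about 1.0 combined."""
    return [bass_amp * math.sin(2 * math.pi * bass_hz * i / FS)
            + treble_amp * math.sin(2 * math.pi * treble_hz * i / FS)
            for i in range(N)]

def hard_clip(samples, gain, rail=1.0):
    """Hypothetical amplifier: linear gain followed by clipping at +/- rail."""
    return [max(-rail, min(rail, gain * s)) for s in samples]

def amplitude_at(samples, freq_hz):
    """Amplitude of the component at freq_hz (single-frequency DFT)."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / FS) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / FS) for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / len(samples)

def sideband_ratio(gain, bass_hz=50, treble_hz=5000):
    """Level of the treble sidebands at treble +/- 2*bass relative to the treble tone itself."""
    out = hard_clip(two_tone(bass_hz, treble_hz), gain)
    carrier = amplitude_at(out, treble_hz)
    sidebands = math.hypot(amplitude_at(out, treble_hz - 2 * bass_hz),
                           amplitude_at(out, treble_hz + 2 * bass_hz))
    return sidebands / carrier
```

Comparing `sideband_ratio(1.0)` (no clipping) with `sideband_ratio(3.0)` (bass well into clipping) shows the treble staying clean in the first case and picking up strong bass-driven modulation in the second. With treble-only content you would never provoke this.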
Another key is to stick to the same set of tracks and only use them, no matter how tired you get of listening to them! You learn what parts of them are revealing, saving you time and effort. Throwing a new random piece of music at every new piece of audio gear you are testing, as some reviewers do, is just wrong.
Hopefully this at least partially answers your question. :)