Lrsky, we're way past your suggestion. MANY manufacturers of audio components devise simplified controlled test methodologies to evaluate product tweaks and, of course, for ongoing manufacturing QC. You don't have to reinvent the wheel every time you want to look at a performance criterion. I spent a former professional life devising test methodology for laboratory equipment evaluation and manufacturing processes, culminating in chairing ASTM/ANSI and ISO subcommittees in the late 80s. The subsets centered around piston-operated volumetric-ware: not unlike tiny "tweeters" for liquids, currently used in all labs to sample, measure and move around small (1 uL - 1 mL) aliquots of liquids hither and yon. You see these little pipettors (the "Pipetman" is the one I co-invented) on TV, as reporters think they're camera-friendly for the general public....
The point is that there is NO DIRECT WAY to measure a uL of volume in a short amount of time! You have to rely on an indirect means, such as radiometrics, spectrophotometry, or, in my developed expertise, gravimetry of water. By knowing a LOT about what happens when you move tiny drops of water volumetrically and then weigh them, you can very noiselessly measure true volume. That's how nearly ALL the volumetric lab equipment in every lab in the world was calibrated. And I wrote the friggin methodology. Now, doing ALL the reference measurements for imprecision and bias (barometric pressure, temperature, evaporative blanking, time-clock matching, microgram balance calibrating, operator bias calibrating (these are often hand-held devices), and a few I've plumb forgotten) for EACH string of measurements would be preposterously inefficient. Especially for $100-500 hand-tools. So manufacturers have devised highly-controlled procedures (or at least that's what my publications were supposed to teach them to do!) to shorten these "controls" to only several minutes per device. It was this kind of atmosphere in the late 80s in Geneva that spawned ISO9000/1/2 etc. Unfortunately that's become more of a paper-cover set of machinations rather than necessarily a raising of quality level. But I digress.
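For anyone who wants to see what "gravimetry of water" boils down to, here is a rough sketch, not the published methodology. It assumes pure water near 20 C, typical room air, and the conventional 8.0 g/mL density for balance calibration weights. The function names, the evaporation allowance, and the weighings are all made up for illustration.

```python
from statistics import mean, stdev

# Assumed conditions (illustrative, not taken from any specific standard table)
WATER_DENSITY_G_PER_ML = 0.9982   # pure water near 20 C
AIR_DENSITY_G_PER_ML = 0.0012     # typical room air
WEIGHT_DENSITY_G_PER_ML = 8.0     # conventional density of balance weights


def volume_ul(mass_mg, evaporation_mg=0.0):
    """Convert a balance reading (mg of water) to volume (uL).

    Adds back an assumed evaporation loss, then applies the usual
    air-buoyancy correction for weighing water against steel weights.
    """
    corrected_mass = mass_mg + evaporation_mg
    buoyancy = 1.0 - AIR_DENSITY_G_PER_ML / WEIGHT_DENSITY_G_PER_ML
    return corrected_mass * buoyancy / (WATER_DENSITY_G_PER_ML - AIR_DENSITY_G_PER_ML)


def bias_and_imprecision(volumes_ul, nominal_ul):
    """Return (bias %, coefficient of variation %) for replicate deliveries."""
    avg = mean(volumes_ul)
    bias_pct = 100.0 * (avg - nominal_ul) / nominal_ul
    cv_pct = 100.0 * stdev(volumes_ul) / avg
    return bias_pct, cv_pct


# Hypothetical replicate weighings from a pipettor set to 100 uL
weighings_mg = [99.6, 99.8, 99.7, 99.9, 99.5]
volumes = [volume_ul(m, evaporation_mg=0.02) for m in weighings_mg]
bias_pct, cv_pct = bias_and_imprecision(volumes, nominal_ul=100.0)
print(f"bias {bias_pct:+.3f} %, imprecision (CV) {cv_pct:.3f} %")
```

The shortcut procedures amount to holding the temperature, pressure and evaporation terms steady enough that they can be treated as constants like the ones above, instead of re-measuring them for every string of weighings.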
It only takes a cursory reading of a few back issues of Speaker Builder et al to uncover manufacturers who've used simple SPL meters in real-world acoustic setups to uncover non-linearities in switched-component analyses via differential testing. I casually mentioned my testing to a chief designer (ex-KEF, now Boston Audio), as well as an ex-BBN master acoustician, who implicitly trusted the soundness of matched-reference technique analysis for uncovering non-linearity of a suspected "unorthodox behavior of a gain device". Indeed, when I mentioned my results to VAC's folks, after a cursory description of technique, their concern was NOT my procedure, or its validity, but the degree of nonlinearity of the results, which they said again seemed surprising, even for the admittedly unruly output stage of the AVATAR.
So Lrsky, I kindly suggest that you step back a bit here, as my technical prowess at organizing a valid scientific inquiry is not truly in question...I've got the backing to pass ANY scrutiny of methodology despite not having conventional DIRECT high-performance traditional electrical instrumentation. And please don't be coy about summoning the industry gods to pass judgement on what anyone of reasonable facility with the scientific method can discern is a well-run set of differential tests. As well, the ratio of the non-linearity difference amplitude to the noise is VERY high (although not calculated), so my experience tells me I'm on solid ground with the stat calcs too.
But again, rather than pee back at you, I'd rather educate.
What you simply do NOT understand is that calibrated speakers, room, mic, cable, etc., are NOT needed to run this type of test, as these items are held STABLE through all test runs. Repeat runs (controls) proved that system imprecision was extremely small (which I think you may not have a problem with); indeed the raw data is NOT flat in dB across the 27 or so test frequency points for any situation, because of ANY and ALL of the effects of the system. But they are STABLE as a rock (well within 0.5dB).
Again, the biggest sources of imprecision are human head movements and parallax error in eyesight.
This test can be, and has been, used validly by INNUMERABLE MANUFACTURERS to assess changes in one variable at a time at ANY component position in the system chain. I think that's part of what you don't get.... There is NO VALIDITY in quoting actual raw data in dB vs frequency, because there are no pure reference scale calibrations (that and efficiency are the reasons why I didn't post it). But because of the stable RELATIVE REFERENCE, there can be great validity in concluding statistical significance from differential calculation of this "humpy data".
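To illustrate the differential idea, here is a minimal sketch with entirely MADE-UP readings (not my data, and only 5 frequency points rather than the 27 or so I ran). The variable names and the simple peak-to-noise ratio are assumptions for illustration, not the stat calcs I actually used. The speaker, room, mic and cable put the same "humps" into every run, so subtracting one run from another cancels them and leaves only what the swapped component changed.

```python
from statistics import stdev


def differential(run_a, run_b):
    """Per-frequency difference in dB between two runs on the same setup."""
    return [b - a for a, b in zip(run_a, run_b)]


# Hypothetical SPL meter readings (dB) at the same test frequencies.
# None of these curves is flat: the uncalibrated system colors all of them.
baseline = [78.0, 81.5, 84.0, 80.5, 76.0]   # reference component
repeat   = [78.2, 81.4, 84.1, 80.3, 76.1]   # same component, run again (control)
modified = [78.1, 82.9, 86.4, 82.6, 76.2]   # one component swapped

# Control: repeat minus baseline shows the imprecision of the whole setup.
noise = differential(baseline, repeat)
# Test: modified minus baseline shows the effect of the single change.
effect = differential(baseline, modified)

noise_sd = stdev(noise)
effect_peak = max(abs(d) for d in effect)
ratio = effect_peak / noise_sd

print("control differences (dB):", [round(d, 2) for d in noise])
print("test differences (dB):   ", [round(d, 2) for d in effect])
print(f"peak effect is about {ratio:.0f}x the control scatter")
```

If the control differences stay well inside 0.5 dB while the test differences run to a couple of dB, the change is real regardless of how lumpy the raw curves look.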