I think the test you gave yourself is too easy 😀 Your point is well taken with regard to trained listeners and MP3. I think a better test is to serve up 10 different tracks, each of which may be MP3 or may be WAV, and then test how accurately listeners can tell whether a track is compressed. I wonder if even the trained listeners will be challenged in that case without a reference.
I didn't give that test to myself. I was challenged on a major forum by an objectivist to tell MP3 from the original, with him claiming that no one could. At the same time, there had been a challenge on that forum to tell 16-bit content from 24-bit. Content for that was produced by AIX Records, which is well known for the quality of its productions. So to remove any appearance of bias in the selection of material, I grabbed the clips from that test, compressed them to MP3, and posted those results. The clip was not at all a "codec killer," the kind of material where such differences are easier to hear.
On the type of test you mention, I am not a fan of them for the reason you mention. It is harder to identify the original vs compressed that way because you now have to know what the algorithm does to create or hide sounds. In other words, without a reference you have to judge whether something you hear is part of the original content or was added by the codec, or whether something was removed.
Our goal with listening tests should always be to try to find differences, not to make it hard for people to find what is there. Once we know an artifact exists, we can fix it. Making the test harder to pass runs counter to that.
That said, I and many others were challenged to such a test on the same major site above. We were given a handful of clips and asked to find which was which. Results were privately shared with the test conductor. When I shared my outcome, he told me I did not do all that well! I was surprised, as I was sure two of the clips were identical and thought they had been put in there as a control.
Fast forward to when the results were published and wouldn't you know it, I was "wrong." We had a regular member with a huge reputation for mixing soundtracks for major films, and he got it "right." Puzzled, I performed a binary comparison and showed that the two files were identical! The test conductor was shocked. He went and checked and found out that he had uploaded the same file twice! He declared the test faulty and that was that.
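A binary comparison of this kind can be sketched in a few lines of Python. This is only an illustration, not necessarily the tool used above; the function name `files_identical` and the clip file names in the usage comment are made up. It checks the raw bytes, so two files with the same audio but different tags or headers would still be reported as different.

```python
from pathlib import Path


def files_identical(path_a, path_b, chunk_size=1 << 20):
    """Return True if the two files contain exactly the same bytes."""
    a, b = Path(path_a), Path(path_b)

    # Files of different sizes cannot be byte-identical.
    if a.stat().st_size != b.stat().st_size:
        return False

    # Compare the contents chunk by chunk so large audio files
    # never have to be loaded into memory in one piece.
    with a.open("rb") as fa, b.open("rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files ended at the same point
                return True


# Hypothetical usage (file names are placeholders):
# print(files_identical("clip_2.wav", "clip_4.wav"))
```

Python's standard library offers the same check as `filecmp.cmp(path_a, path_b, shallow=False)`.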
Despite that, as you saw, and I will repeat it again, I don't want to make blind tests too hard on purpose. We need to be as interested in positive outcomes as in negative ones.