Super real will do, I suppose. In the case of the system I mentioned, it is very pretty, but not realistic. Female voices and violins are not sibilant in person. Drummers do not set up their kit so the cymbals are 10 feet in front of the snare.
Obviously, this takes a proper live recording. I find it amusing that engineers of yore did a better job of getting this right.
This system is not beyond help at all. Just a steady roll off from 1 kHz at 1 dB/oct would result in less super-realism, but more accurate sound. Because this system is a point source, images will always be smaller, as if you are seated in the back of the venue. The system has very accurate bass down to about 50 Hz, where it starts to lose power. He really needs two 15" subwoofers, another rabbit hole.
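To put numbers on that roll off, here is a rough sketch of the target curve: flat up to 1 kHz, then down 1 dB for every octave above it. The function name `tilt_db` and its parameters are just my own labels for illustration, and this only computes the target gain at a given frequency. It is not a filter design, and how you would actually apply the curve (EQ, DSP, crossover tweak) is a separate question.

```python
import math

def tilt_db(freq_hz, corner_hz=1000.0, slope_db_per_oct=1.0):
    """Target gain in dB for a steady roll off above corner_hz.

    Assumed shape: flat (0 dB) at and below the corner, then falling by
    slope_db_per_oct for each octave above it.
    """
    if freq_hz <= corner_hz:
        return 0.0
    octaves_above = math.log2(freq_hz / corner_hz)
    return -slope_db_per_oct * octaves_above

# Print the target at octave steps above the corner.
for f in (1000, 2000, 4000, 8000, 16000):
    print(f"{f} Hz: {tilt_db(f):+.1f} dB")
```

So by 16 kHz, four octaves up, the target is only 4 dB down. It is a gentle tilt, not a treble cut.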
Sounding correct in terms of timbre is relatively easy. It is just a matter of correct amplitude response, given a loudspeaker with a well-designed crossover and phase response. Casting an image is the hard part. You can't know what you are missing until you experience it. It took about 10 years as an audiophile before I heard a system image correctly, and another 10 before I could reliably replicate it.
In short, IMHO, it does not have to be perfectly accurate, it just has to be convincing.