The file size of an iTunes song is only a fraction of its CD counterpart because it has been compressed, discarding data in the process. Is the audio quality degraded? More than a year ago I devised a test to see whether we can hear differences between the original and the compressed version. Here are the conclusions after analyzing data from more than 500 test takers.
This article is about empirical results. I will present them in a moment, but first allow me to rebuild the context in which they were obtained in order to interpret them correctly.
Let’s make sure we all understand the words ‘lossy’ and ‘lossless’ in the context of audio in the first place. Assume a CD is a faithful copy of the music recorded at the studio. If we want to transfer the music it contains to a portable device like a phone or mp3 player, we ‘rip’ the CD to our computer using a software application such as iTunes. By default, when the CD content is transferred to the music library, iTunes will compress the audio files to a fraction of their original size to save storage space, discarding part of the original data in the process. We call this a lossy compression method of codification, or lossy codec for short. If you want to preserve all the data contained in the CD, you may choose a lossless codec. FLAC and Apple Lossless are good examples of the latter. No data will be discarded in the process, but music files will take up much more space.
Why is a lossy codec the default option in iTunes? Aren’t we sacrificing fidelity for convenience? Well, the truth is lossy codecs are really smart at deciding what data to discard. They are based on knowledge about how our ears and brain perceive what we hear—the science of psychoacoustics. When using a good lossy codec like iTunes plus, experts are confident that we will not notice any sort of degradation in the music. But will we?
The moment we read ‘lossy compression discards data to reduce file size’ we can’t help but think something bad is happening. However, most of the music we have access to through streaming and the Internet is compressed using a lossy codec—iTunes and other music services compress files to about 1/6th of their original CD size. That is quite a significant amount of data thrown away in the process. Shouldn’t we be able to detect the loss? It seems reasonable to think that we should, given the ratio.
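To put a rough number on that ratio, here is a minimal sketch. It assumes the standard CD audio parameters (44.1 kHz, 16 bits per sample, two channels), which the article does not state, and it treats 256 kbps as a nominal figure, since a VBR encoder varies the actual bitrate from moment to moment.

```python
# Assumed standard CD audio parameters (not stated in the article).
CD_SAMPLE_RATE_HZ = 44_100
CD_BITS_PER_SAMPLE = 16
CD_CHANNELS = 2

# Nominal bitrate of the lossy codec discussed here; VBR files only average around it.
AAC_NOMINAL_KBPS = 256

# Uncompressed CD bitrate in kilobits per second.
cd_kbps = CD_SAMPLE_RATE_HZ * CD_BITS_PER_SAMPLE * CD_CHANNELS / 1000

# How many times smaller the lossy stream is than the CD stream.
ratio = cd_kbps / AAC_NOMINAL_KBPS

print(f"CD audio: {cd_kbps:.1f} kbps")
print(f"Nominal size reduction: about {ratio:.1f} to 1")
```

Under these assumptions the reduction comes out between 5 and 6 to 1, in line with the rough 1/6th figure quoted above.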
To address the issue of compression vs. quality, I set up an online test that compared CD audio vs. AAC 256k VBR, the high-quality lossy codec used by iTunes. I wanted to offer people the chance to decide for themselves whether discarded data necessarily meant degraded sound quality, and I could also gather valuable information that might reveal some interesting facts. Part of the motivation for this project came from what I considered unjustified (and not always unbiased) criticism of the supposed low quality of lossy codecs, used to justify the need for solutions like High Resolution Audio (HRA) or initiatives like Pono. Many of their enthusiasts boast that lossy and lossless compare as ‘night and day’, and some even go as far as considering CD audio ill-qualified for audiophile ears.
The test was available online for almost a year, and I was able to collect 580 result submissions from a varied population of test takers, to whom I am deeply grateful. They started by answering a brief survey about their age, gender, whether they had any musical training, the quality of the audio equipment they would use to take the test, and their location. Then they went on to the blind test. They had to choose a musical clip and listen to a series of 16 trials, each composed of two sections, A and B, one being CD quality and the other AAC 256k VBR. In each trial they had to decide whether A or B was CD quality. They were offered a way out at the 8th trial if they got tired, but most participants went through all sixteen trials, providing more accurate statistics. There is an offline version of the blind test available on this site if you are interested in taking it yourself.
Meet the test takers
It was nice to see that people of all ages were engaged with the challenge, showing that we do care about how to get the best experience possible out of the pleasure of listening to the music we love.
Our ability to perceive high frequencies decreases with age. I thought it would be interesting to see if the young performed better in the blind test. We are about to find out.
Looks like the issue is mostly a male thing.
Could any sort of musical training lead to better performance?
If the differences in audio quality between the formats were as subtle as I believed, not only the listener’s ears but also the audio equipment used would be a key factor in the results.
As I said, I offered a choice of music styles to make the blind test more engaging: rock, jazz, blues, soundtrack, classical… I was glad to see that all six clips ended up being tested, although it was fairly easy to predict that #4 would be the most popular…
One of the nice outcomes of this project has been reaching so many corners of the world. I have generated a word cloud with most of the locations where the test was taken.
O.K., so, what does this large and varied sample of people say about our ability to tell CD and iTunes plus apart? As I said earlier, the audible differences between the two formats are so subtle that we may state the following:
NULL HYPOTHESIS: The people who took the test performed no better than they would if they had chosen their answers at random.
Do the data obtained provide enough evidence to reject this hypothesis?
We proceed as follows: we gather all the scores and compare their frequencies (the number of people who obtained each score in the corresponding category) with the frequencies we would expect if people were just picking their answers at random. Then we analyze the deviations from the expected results using a very common statistical tool: the chi-squared test. This test gives us the probability of observing deviations at least as large as ours if the null hypothesis is true (the p-value). A p-value of less than 0.05 is commonly required to reject the null hypothesis. In our case, a sufficiently low p-value would provide statistical evidence that people could tell which format they were listening to.
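The procedure just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the exact analysis behind the charts: the function names are my own, the observed counts in the usage example are invented and are not the survey data, every score is kept as its own category, and the p-value uses a closed form that holds only for an even number of degrees of freedom (which is the case for 9 or 17 score categories).

```python
from math import comb, exp

def expected_counts(n_trials, n_people):
    """Expected number of people at each score 0..n_trials if everyone guesses (p = 1/2)."""
    return [n_people * comb(n_trials, k) / 2 ** n_trials for k in range(n_trials + 1)]

def chi_squared_statistic(observed, expected):
    """Sum of squared deviations, each scaled by the expected count."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_squared_p_value(statistic, dof):
    """Upper-tail probability of a chi-squared variable; closed form for even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form needs a positive, even number of degrees of freedom")
    half = statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i          # builds (statistic/2)**i / i!
        total += term
    return exp(-half) * total

# Hypothetical observed counts for the 8-trial version (NOT the survey data).
observed = [2, 7, 30, 52, 73, 55, 27, 9, 1]
expected = expected_counts(8, sum(observed))
statistic = chi_squared_statistic(observed, expected)
p_value = chi_squared_p_value(statistic, dof=len(observed) - 1)
print(f"chi-squared = {statistic:.3f}, p-value = {p_value:.3f}")
```

One caveat worth keeping in mind: the chi-squared approximation becomes unreliable when expected counts are small (a common rule of thumb is below about 5), as happens at the extreme scores, so in practice those categories are often pooled before testing.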
In the following charts, green bars indicate the frequencies obtained and brown bars, the frequencies expected, for each value of the score. There are two charts in each category, one for each version of the test: full (16 trials) & shortened (8 trials). The p-value of the chi-squared test is indicated in each case.
Notice that, despite deviations, both distributions have similar bell shapes. Furthermore, none of the reliable p-values is low enough to reject the null hypothesis, and some of them are very high. So, based on the data obtained, the most reasonable conclusion is that we can’t hear the difference between CD audio and iTunes plus. And this is true in all the cases considered: being young, with our sense of hearing at its peak, having musical training or using excellent audio gear doesn’t seem to help.
Do these results mean there are no exceptions to the rule? Of course not. In fact, one early participant from Limerick, PA took the test twice and got a score of 15/16 in each try. Such a result is highly improbable, and suggests he was indeed able to spot the differences in the samples he tested (the chance of getting 30 or more successes in 32 trials by guessing is roughly 1 in 8,000,000!). So we may have found one single pair of ‘golden ears’, after all, among the test takers! This is what he wrote as a comment in his second try:
This one was much harder to differentiate than sample 6. Honestly all the other samples I do not think I could tell them apart. They are too narrow and lack the sharp percussive elements that are required in the sampled portion. I’m also not as confident on my ability to discriminate on this sample vs the previous one. There are very short sections which I know are different, in each version, but I’ve been at this for almost 2 hours. So I am just going to stop here.
Here we have somebody with trained ears who knew where to look for differences and (presumably) used outstanding audio equipment, and still the challenge demanded from him a great deal of concentration. To me, his comment makes quite a case in favor of the high quality of the lossy codec used.
There were some other examples of good performance. Scores of 12/16 or greater (and to a lesser extent 7/8 or greater) are a good start in favor of your ability to discriminate (p < 0.05). But to prove you can, you must get high scores repeatedly for consistency. A participant from Thunder Bay, Canada, the only one who took the test three times, got scores of 12/16, 9/16 and 7/8. The probability of getting this result or better by chance is as low as 8 in 1000, but results like this can be expected in a sample of 580. A few more participants who also got high scores either failed to perform sufficiently well in a second try, or simply didn’t take the test more than once.
Finally, low scores are also worth considering as a sign of discrimination, because they are as rare as high ones. Anyone repeatedly getting scores equal to or less than 4/16 or 1/8 would be proving that he or she can discriminate, but with the odd outcome of considering the lossy codec higher quality than CD. In any case, no consistent low scores were observed in the sample.
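The chance figures quoted in this section can be checked with a short calculation. The sketch below assumes every trial is an independent 50/50 guess and, for repeated attempts, simply pools the trials (for example 12/16, 9/16 and 7/8 become 28 successes in 40 trials). The function names are mine, chosen for illustration.

```python
from math import comb

def chance_of_at_least(successes, trials):
    """Probability of scoring `successes` or more out of `trials` by pure guessing."""
    favorable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favorable / 2 ** trials

def chance_of_at_most(successes, trials):
    """Probability of scoring `successes` or fewer out of `trials` by pure guessing."""
    favorable = sum(comb(trials, k) for k in range(0, successes + 1))
    return favorable / 2 ** trials

print(chance_of_at_least(12, 16))  # 2517/65536, about 0.038
print(chance_of_at_least(7, 8))    # 9/256, about 0.035
print(chance_of_at_least(28, 40))  # about 0.0083, i.e. roughly 8 in 1000
print(chance_of_at_least(30, 32))  # about 1.2e-07, on the order of one in ten million
print(chance_of_at_most(4, 16))    # same as 12 or more out of 16, by symmetry
print(chance_of_at_most(1, 8))     # same as 7 or more out of 8, by symmetry
```

The symmetry in the last two lines is why very low scores are exactly as rare as very high ones when the guessing probability is one half.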
The bottom line
The AAC 256k VBR codec delivers excellent compression quality that will suffice for most of us. If you think you are a rare bird with golden ears, test yourself blind to find out whether you truly are before worrying about the quality of lossy compression codecs. Odds are that you’ll be disappointed.
Questions and comments welcome!