Techniques for reducing digital audio data rates were first applied to early digital telephone systems in the 1960s.
Originally, these systems afforded only small savings in data rates, around 2:1 at best. Recent, more powerful algorithms permit greater data reduction: the recovered data stream differs from the original bit for bit, yet sounds essentially the same as the original to most listeners. This is the approach adopted for Web audio. The resulting algorithms are not transparent, but they efficiently preserve reasonable audio quality while greatly reducing data rates.
You might wonder why a general file-compression method such as PKZIP could not be used for audio. These lossless systems analyze the data from a purely statistical perspective, reducing data rate "by the numbers." They must be "lossless," preserving a bit-for-bit input-to-output fidelity. Today's lossy digital audio algorithms, or "perceptual coders," are designed only for audio files, reducing data rate based on how the signal will be heard. They exploit the human hearing sense's inability to detect certain kinds of signal loss, thereby substantially reducing the data rate of digital audio signals.
The process is normally performed in two distinct steps. First, analog audio is converted to 16-bit linear data, typically using one of the standard sampling rates of 32, 44.1, or 48 kHz. Stereo signals are usually summed to mono, producing a digital signal with a data rate in the 500 to 750 kilobits per second range. The Web-audio data-compression algorithm is then applied in the digital domain, reducing the data rate by a ratio of 50:1 or more. This produces data rates in the tens of kilobits per second.
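The mono summing in the first step can be sketched in a few lines of Python. This is only an illustration, not the method of any particular encoder: the function name `downmix_to_mono` is hypothetical, and averaging the two channels (rather than adding them outright) is an assumption made here so that the result stays inside the 16-bit range.

```python
def downmix_to_mono(left, right):
    """Combine paired 16-bit samples into one mono channel.

    Assumption: the pair is averaged (sum divided by two) so the
    result cannot exceed the 16-bit range of -32768..32767.
    """
    return [(l + r) // 2 for l, r in zip(left, right)]


# Three illustrative sample pairs (made-up values).
left = [1000, -2000, 32767]
right = [3000, -1000, 32767]
print(downmix_to_mono(left, right))  # [2000, -1500, 32767]
```

Because one channel now carries what two did before, the data rate is halved before the compression algorithm does any work.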
These algorithms draw on psychoacoustics—the study of human aural perception. A fundamental tenet of psychoacoustics is "spectral masking," in which the presence of one audio signal overshadows (or "masks") perception of other, lower-level signals at nearby frequencies.
Using digital signal processing in the frequency domain, perceptual coders eliminate "unnecessary" parts of the audio signal that are masked by other, louder parts, thereby reducing data-rate requirements. The perceptual coder can also selectively reduce the resolution used to encode the remaining (unmasked) audio signals, further reducing data rates. Whenever digital audio resolution is reduced, noise and distortion will rise. But as long as the algorithm keeps these artifacts below the masking threshold, they remain inaudible.
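The link between resolution and noise can be checked numerically. The sketch below is not a perceptual coder and does not model masking at all; it only quantizes a full-scale sine wave to a given number of bits and measures the resulting signal-to-noise ratio. The function names, the 997-Hz test tone, and the 48-kHz rate are choices made for this illustration. For a full-scale sine, the textbook rule of thumb is roughly 6.02 dB per bit plus 1.76 dB.

```python
import math


def quantize(x, bits):
    """Round a sample in [-1, 1] to a uniform grid of 2**(bits - 1) steps per unit."""
    scale = 2 ** (bits - 1)
    return round(x * scale) / scale


def measured_snr_db(bits, n=48_000, freq=997.0, rate=48_000.0):
    """Quantize one second of a full-scale sine and return the SNR in dB."""
    signal_power = 0.0
    noise_power = 0.0
    for i in range(n):
        x = math.sin(2 * math.pi * freq * i / rate)
        err = quantize(x, bits) - x  # error introduced by the coarser grid
        signal_power += x * x
        noise_power += err * err
    return 10 * math.log10(signal_power / noise_power)


for bits in (16, 12, 8):
    rule_of_thumb = 6.02 * bits + 1.76
    print(bits, "bits:", round(measured_snr_db(bits), 1), "dB measured,",
          round(rule_of_thumb, 1), "dB by rule of thumb")
```

Each bit removed raises the noise floor by about 6 dB. A perceptual coder spends that loss only where its masking model predicts the added noise will not be heard.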
The data rate of a digital audio signal is the product of the sampling rate, the sampling resolution, and the number of audio channels. For example, the CD format uses a 44.1-kHz sampling rate at 16-bit resolution for stereo (two channels), which produces a data rate of about 1.4 megabits per second (roughly 176 KB/second). To reduce this rate, at least one of the product's multipliers must itself be reduced.
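The product is easy to verify. The helper below is a minimal sketch; the name `pcm_data_rate` is made up for this example and the result is in bits per second.

```python
def pcm_data_rate(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed data rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


cd_rate = pcm_data_rate(44_100, 16, 2)
print(cd_rate)       # 1411200 bits per second
print(cd_rate // 8)  # 176400 bytes per second

# 16-bit mono at the three standard sampling rates.
for rate_hz in (32_000, 44_100, 48_000):
    print(rate_hz, pcm_data_rate(rate_hz, 16, 1))
```

The mono figures come to 512,000, 705,600, and 768,000 bits per second, which is where the 500-to-750 range quoted earlier comes from.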
Lowering the sampling rate will lose high frequencies of the sound, so not much change can be made there without affecting fidelity. Stereo can be dropped to mono for an immediate halving of the data rate, however. Then the resolution factor, which affects dynamic range (noise and distortion), can be tackled. Smart coders using perceptual algorithms squeeze the most bits out of a digital audio signal in this area. The most popular methods employed by today's Internet audio systems are MPEG-2 Layer 3 and Dolby's AC-3.
Table 2 shows the relationship between the bit rate of the compressed audio, the ratio of original-to-compressed audio, and the disk-space requirements.
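The relationship among these three quantities is simple arithmetic, sketched below. The bit rates in the loop are illustrative values chosen for this example, not the entries of Table 2, and the reference rate is assumed here to be CD-quality stereo (1,411,200 bits per second); the function name `compression_summary` is likewise hypothetical.

```python
def compression_summary(compressed_bps, source_bps=1_411_200):
    """Return (compression ratio, bytes of storage per minute of audio).

    Assumption: the uncompressed reference is 16-bit, 44.1-kHz stereo.
    """
    ratio = source_bps / compressed_bps
    bytes_per_minute = compressed_bps * 60 / 8
    return ratio, bytes_per_minute


# Illustrative bit rates only.
for kbps in (128, 64, 32, 16):
    ratio, per_minute = compression_summary(kbps * 1000)
    print(f"{kbps} kbit/s  {ratio:.1f}:1  {per_minute / 1000:.0f} KB per minute")
```

Halving the bit rate doubles the compression ratio and halves the disk space needed per minute of audio.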