It has been said that "the purpose of the ears is to point the eyes." While the ability of the auditory system to localize sound sources is just one component of our perceptual systems, it has high survival value, and living organisms have found many ways to extract directional information from sound. Although perceptual mysteries remain, the major cues have been known for a long time, and careful psychological studies have established how accurately we can make localization judgments. Anyone who wants to generate spatial sound for HCI has to know what influences the human auditory system. This section summarizes the major factors that influence spatial hearing.
One of the pioneers in spatial hearing research was John Strutt, who is better known as Lord Rayleigh. About 100 years ago, he developed his so-called Duplex Theory. According to this theory, there are two primary cues for azimuth -- Interaural Time Difference (ITD) and Interaural Level Difference (ILD).
Lord Rayleigh had a simple explanation for the ITD. Sound travels at a speed $c$ of about 343 m/s. Consider a sound wave from a distant source that strikes a spherical head of radius $a$ from a direction specified by the azimuth angle $\theta$, measured in radians from straight ahead. For a source to the listener's right, the sound arrives at the right ear before the left, since it has to travel the extra distance $a\theta + a\sin\theta$ to reach the left ear. Dividing that by the speed of sound, we obtain the following simple (and surprisingly accurate) formula for the interaural time difference:
$$\mathrm{ITD} = \frac{a}{c}\,(\theta + \sin\theta), \qquad -\frac{\pi}{2} \le \theta \le \frac{\pi}{2}$$
Thus, the ITD is zero when the source is directly ahead, and is a maximum of $(a/c)(\pi/2 + 1)$ when the source is off to one side. This represents a difference of arrival time of about 0.7 ms for a typical size human head, and is easily perceived.
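To make the numbers concrete, here is a minimal sketch that evaluates the spherical-head formula ITD = (a/c)(θ + sin θ) for a few azimuths. The head radius of 8.75 cm is an assumed typical value, not a figure from this section, and the function name is only illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, the value quoted in the text
HEAD_RADIUS = 0.0875    # m, ASSUMED typical head radius (not from the text)

def itd_seconds(azimuth_deg, radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Spherical-head ITD, (a/c)(theta + sin theta), for |azimuth| <= 90 degrees."""
    if abs(azimuth_deg) > 90:
        raise ValueError("formula is only used here for -90..90 degrees")
    theta = math.radians(azimuth_deg)
    return (radius / c) * (theta + math.sin(theta))

if __name__ == "__main__":
    for az in (0, 30, 60, 90):
        print(f"azimuth {az:3d} deg -> ITD {itd_seconds(az) * 1e3:.3f} ms")
```

With this assumed radius the maximum at 90 degrees comes out a little under 0.7 ms, in line with the figure quoted above. A negative azimuth gives a negative ITD, which simply means that the left ear leads.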
Lord Rayleigh also observed that the incident sound waves are diffracted by the head. He actually solved the wave equation to show how a plane wave is diffracted by a rigid sphere. His solution showed that in addition to the time difference there was also a significant difference between the signal levels at the two ears -- the ILD.
As you might expect, the ILD is highly frequency dependent. At low frequencies, where the wavelength of the sound is long relative to the head diameter, there is hardly any difference in sound pressure at the two ears. However, at high frequencies, where the wavelength is short, there may well be a 20-dB or greater difference. This is called the head-shadow effect, where the far ear is in the sound shadow of the head.
The Duplex Theory asserts that the ILD and the ITD are complementary. At low frequencies (below about 1.5 kHz), there is little ILD information, but the ITD shifts the waveform a fraction of a cycle, which is easily detected. At high frequencies (above about 1.5 kHz), there is ambiguity in the ITD, since there are several cycles of shift, but the ILD resolves this directional ambiguity. Rayleigh's Duplex Theory says that the ILD and ITD taken together provide localization information throughout the audible frequency range.
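One way to see where the crossover near 1.5 kHz comes from is to express the largest ITD as a number of cycles of a pure tone. The sketch below does this under the same spherical-head assumption, with an assumed head radius of 8.75 cm. It illustrates the argument only and is not a model of how the auditory system resolves the ambiguity.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s
HEAD_RADIUS = 0.0875    # m, ASSUMED typical head radius

def max_itd(radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Largest spherical-head ITD, reached for a source directly to one side."""
    return (radius / c) * (math.pi / 2 + 1)

def interaural_shift_cycles(freq_hz, itd_s):
    """Delay between the ears expressed in cycles of a tone at freq_hz."""
    return freq_hz * itd_s

if __name__ == "__main__":
    itd = max_itd()
    for f in (200, 500, 1000, 1500, 3000, 6000):
        cycles = interaural_shift_cycles(f, itd)
        print(f"{f:5d} Hz -> {cycles:.2f} cycles of shift")
    print(f"shift reaches one full cycle near {1.0 / itd:.0f} Hz")
```

Under these assumptions the shift for a source at the side stays below one cycle up to roughly 1.5 kHz and grows to several cycles at higher frequencies, which is where the waveform alone no longer tells the listener which ear leads.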
While the primary cues for azimuth are binaural, the primary cues for elevation are often said to be monaural. They stem from the fact that our outer ear or pinna acts like an acoustic antenna. Its resonant cavities amplify some frequencies, and its geometry leads to interference effects that attenuate other frequencies. Moreover, its frequency response is directionally dependent.
The figure above shows measured frequency responses for two different directions of arrival. In each case we see that there are two paths from the source to the ear canal -- a direct path and a longer path following a reflection from the pinna. At moderately low frequencies, the pinna essentially collects additional sound energy, and the signals from the two paths arrive in phase. However, at high frequencies, the delayed signal is out of phase with the direct signal, and destructive interference occurs. The greatest interference occurs when the difference in path length $d$ is a half wavelength, i.e., when $f = c/(2d)$. In the example shown, this produces a "pinna notch" around 10 kHz. With typical values for $d$, the notch frequency is usually in the 6-kHz to 16-kHz range.
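The half-wavelength condition is easy to evaluate. The following sketch converts between the path length difference d and the notch frequency f = c/(2d). The function names are illustrative, and the single-reflection model is the simplification described above rather than a full account of pinna acoustics.

```python
SPEED_OF_SOUND = 343.0  # m/s

def notch_frequency_hz(path_difference_m, c=SPEED_OF_SOUND):
    """Frequency at which the reflected path is half a wavelength longer."""
    if path_difference_m <= 0:
        raise ValueError("path difference must be positive")
    return c / (2.0 * path_difference_m)

def path_difference_m(notch_hz, c=SPEED_OF_SOUND):
    """Path length difference that puts the first notch at notch_hz."""
    if notch_hz <= 0:
        raise ValueError("frequency must be positive")
    return c / (2.0 * notch_hz)

if __name__ == "__main__":
    for f in (6000.0, 10000.0, 16000.0):
        d_cm = path_difference_m(f) * 100.0
        print(f"notch at {f / 1000:.0f} kHz <-> path difference {d_cm:.2f} cm")
```

In this model the 6-kHz to 16-kHz range corresponds to path differences of roughly one to three centimeters, which is plausible for reflections within the outer ear.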
Since the pinna is a more effective reflector for sounds coming from the front than for sounds from above, the resulting notch is much more pronounced for sources in front than for sources above. In addition, the path length difference changes with elevation angle, so the frequency of the notch moves with elevation. Although there are still disputes about what features are perceptually most important (for example, see Han), it is well established that the pinna provides the primary cues for elevation.
When it comes to localizing a source, we are best at estimating azimuth, next best at estimating elevation, and worst at estimating range. In a similar fashion, the cues for azimuth are quite well understood, the cues for elevation are less well understood, and the cues for range are least well understood. The following cues for range are frequently mentioned:
Loudness
Motion parallax
Excess interaural level difference (ILD)
Ratio of direct to reverberant sound
The physical basis for the loudness cue obviously stems from the fact that the captured sound energy coming directly from the source falls off inversely with the square of range. Thus, as a constant-energy source approaches a listener, the loudness will increase. It is equally obvious that the received energy is proportional to the energy emitted by the source, and that there cannot be a one-to-one relation between loudness and range. Just playing a sound at a low volume level will not, in itself, make it seem to be far away. To use loudness as a cue to range, we must also know something about the characteristics of the source. In the case of human speech, each of us knows from experience the different quality of sound associated with whispering, normal talking, and shouting, no matter what the sound level. The combination of loudness and knowledge of the source provides useful information for range judgments.
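The inverse-square relation can be put in decibel terms with a short sketch. It assumes an idealized point source in free space and a reference range of 1 m chosen only for illustration. It says nothing about perceived loudness, which also depends on the listener's knowledge of the source, as noted above.

```python
import math

def direct_level_change_db(range_m, ref_range_m=1.0):
    """Change in direct-sound level, in dB, relative to the level at ref_range_m.

    Assumes received energy falls off as 1 / range**2 (idealized point source).
    """
    if range_m <= 0 or ref_range_m <= 0:
        raise ValueError("ranges must be positive")
    energy_ratio = (ref_range_m / range_m) ** 2
    return 10.0 * math.log10(energy_ratio)

if __name__ == "__main__":
    for r in (0.5, 1.0, 2.0, 4.0, 8.0):
        print(f"range {r:4.1f} m -> {direct_level_change_db(r):+.1f} dB re 1 m")
```

Each doubling of range lowers the direct level by about 6 dB in this idealized case.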
Motion parallax refers to the fact that if a listener translates his or her head, the change in azimuth will be range dependent. For sources that are very close, a small shift causes a large change in azimuth, while for sources that are distant there is essentially no azimuth change.
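The range dependence of motion parallax follows from simple geometry. The sketch below assumes a stationary source, a head that translates sideways without rotating, and azimuth measured from straight ahead with positive values to the right. The function names and the 10-cm shift are illustrative choices.

```python
import math

def azimuth_after_shift_deg(range_m, azimuth0_deg, head_shift_m):
    """Azimuth of a fixed source after the head moves head_shift_m to the right."""
    theta0 = math.radians(azimuth0_deg)
    x = range_m * math.sin(theta0)  # source offset to the right of the head
    y = range_m * math.cos(theta0)  # source offset in front of the head
    return math.degrees(math.atan2(x - head_shift_m, y))

def azimuth_change_deg(range_m, azimuth0_deg, head_shift_m):
    """Change in azimuth caused by the sideways head translation."""
    return azimuth_after_shift_deg(range_m, azimuth0_deg, head_shift_m) - azimuth0_deg

if __name__ == "__main__":
    shift = 0.10  # m, ASSUMED sideways head movement
    for r in (0.25, 0.5, 1.0, 5.0, 20.0):
        change = azimuth_change_deg(r, 0.0, shift)
        print(f"range {r:5.2f} m -> azimuth change {change:+.2f} deg")
```

For a source that starts straight ahead, the same 10-cm shift changes the azimuth by more than ten degrees at half a meter but by well under one degree at 20 m.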
In addition, as a sound source gets very close to the head, the ILD will increase. This increase becomes noticeable for ranges under about one meter. An extreme case is when there is an insect buzzing in one ear, or when someone is whispering in one ear. In general, sounds that are heard in only one ear are threatening and are uncomfortable to listen to. It is particularly important to keep this in mind when designing HCI systems for headphone listening. As we will see, to get the listener to think that the sound is on one side, it is not at all necessary to have all of the sound in that ear and nothing in the other ear.
The final cue listed is the ratio of direct to reverberant sound. As we mentioned above, the energy received directly from a sound source drops off inversely with the square of the range. However, in ordinary rooms, the sound is reflected and scattered many times from environmental surfaces, and the reverberant energy reaching the ears does not change much with the distance from the source to the listener. Thus, the ratio of direct to reverberant energy is a major cue for range. At close ranges, the ratio is very large, while at long ranges it is quite small. Fortunately, this is a relatively easy and effective cue to manipulate for HCI applications.
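A minimal sketch of this cue, assuming direct energy that falls off as the inverse square of range and reverberant energy that stays constant throughout the room. The `critical_distance_m` parameter, the range at which the two energies are equal, is an assumption introduced here for illustration. Its value depends on the room and is not given in this section.

```python
import math

def direct_to_reverberant_db(range_m, critical_distance_m=1.0):
    """Direct-to-reverberant energy ratio in dB under a simple room model.

    Direct energy is taken as 1 / range**2. Reverberant energy is taken as
    constant and equal to the direct energy at critical_distance_m (ASSUMED).
    """
    if range_m <= 0 or critical_distance_m <= 0:
        raise ValueError("distances must be positive")
    direct_energy = 1.0 / range_m ** 2
    reverberant_energy = 1.0 / critical_distance_m ** 2
    return 10.0 * math.log10(direct_energy / reverberant_energy)

if __name__ == "__main__":
    for r in (0.25, 0.5, 1.0, 2.0, 4.0, 8.0):
        print(f"range {r:4.2f} m -> D/R ratio {direct_to_reverberant_db(r):+.1f} dB")
```

In a renderer, one plausible use is to scale the direct signal with range while leaving the reverberant signal at a fixed level, so that the ratio falls by about 6 dB per doubling of range as in this model.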
Primary Source: Department of Electrical and Computer Engineering
University of California - Davis, One Shields Avenue, Davis, CA