Sound and music

Overview

Sound and music

See for yourself

Figure 1 View »

Figure 2 View »

Figure 3 View »

Figure 4 View »

Figure 5 View »

Figure 6 View »

Figure 7 View »

An Empirical Explanation: An Explanation of Pitch and Music Based on the Statistical Structure of Natural Tonal Sounds

Since human speech sounds are the principal naturally occurring sources of tone-evoking stimuli, we examined whether observed consonance ordering is predicted by the statistical distribution of sound energy in the periodic stimuli that human listeners normally encounter.
Just as stimuli in vision conflate the physical factors that give rise to the in the retinal image, so sound stimuli at the ear conflate their generative sources (Figure 1). Just as there is an "inverse optics problem", so the uncertain provenance of sound stimuli presents an "inverse auditory problem" (i.e., how can a listener know the source of a sound when the contributing physical factors are intertwined in the stimulus?). To address this problem we have explored the relationship between the periodicity of sound stimuli and the sensation this iteration gives rise to, which is called pitch. The conceptual framework for this exploration is generally the same as vision, asking whether the phenomenology of perception can be explained by the empirical provenance of the relevant stimuli. Thus the following account concerns pitch perception and music, perceptual phenomenology that has been described in detail.
Figure 1

Figure 1 / Diagram of the "inverse auditory problem", showing the conflation of the physical factors at the ear that the listener nevertheless needs to parse.

With respect to music, it is well known music worldwide is typically composed and performed using a subset of 12 specific frequency intervals (called chromatic scale tones) from any octave within the middle range of human hearing. In an effort to understand the biological basis for these otherwise puzzling tonal preferences in music, we showed that most of the 12 chromatic scale intervals correspond to peaks of relative power in the normalized spectrum of human vocalizations (Schwartz et al., 2003; see also Schwartz and Purves, 2004). Moreover, we found that the relative consonance of chromatic scale tone dyads (i.e., the degree to which a stimulus comprising two chromatic scale tones sounded together is perceived as consonant) varies with the power at the relevant frequency ratio, the intervals corresponding to ratios with more average power being those that are judged more consonant in psychophysical studies. The following paragraphs summarize the theoretical framework and the analytical approach for those interested in getting a better sense of what was done, focusing on consonance in music (review the Demonstration, which will give a clearer idea of what consonance means).
The term consonance refers to the harmoniousness or pleasantness of the sound produced by playing two musical tones simultaneously. Consonance ordering refers to the relative consonance of different chromatic scale tone combinations (Figure 2). For example, the combination of the notes C4 and F4, called a 'fourth' in musical terminology, is typically judged more consonant than the combination of notes C4 and bG4, called a 'tritone'. Figure 3 shows the average consonance rank listeners assigned to each of the 12 possible chromatic scale tone combinations in research conducted during the late 19th and early 20th centuries. The studies all agree that the musical intervals of the octave, the fifth, and the fourth are the most consonant, followed closely by the major sixth, major third; the tritone, minor seventh, major second, major seventh, and minor second are the least consonant. Remarkably, listeners of all ages, places, and historical periods tend to produce this same general consonance ordering, suggesting that it has a basis in some fundamental property of audition. Despite scientific interest in this phenomenon dating back at least 2500 years, this aspect of hearing has no generally accepted explanation in either physiological or psychological terms.
Figure 2

Figure 2 / The intervals of the chromatic scale. A one-octave portion of a piano keyboard indicating the chromatic scale tones, the name of the musical interval corresponding to the combination of each tone with the lowest note in the scale (C in this example) and the fundamental frequency ratio that defines each interval in the Just Intonation tuning scheme.

Figure 3

Figure 3 / The consonance ordering of musical intervals. Graph shows the consonance rankings assigned to each of the 12 chromatic scale tone pairings in the seven studies reported by Malmberg (1918). The median consonance values are indicated by the open circles connected by a dashed line. The octave (e.g., C4 and C5) is judged the most consonant and the minor second (e.g., C and bD) is judged the least consonant.

Most previous efforts to explain why human listeners experience different tone combinations as relatively pleasant or unpleasant have focused on the physical properties of sound stimuli, their frequency characteristics in particular. Helmholtz, for example, equated consonance with the 'smoothness' of a sound, and suggested that a loss of smoothness occurs when the overtones (i.e., harmonics) of two different notes physically interfere with each other. Terhardt has argued more recently that smoothness is only part of the story, and that listeners acquire a sense of musical intervals by learning the abstract pitch relationships in speech sounds. While such theories have been successful in explaining some aspects of consonance, neither can fully account for all the phenomenology of tone perception.
Instead of focusing exclusively on the properties of the sound wave at the ear or on the frequency ratios that define musical tone combinations, we have approached the problem of consonance from the perspective of the auditory system's need to contend with the inherent ambiguity of all acoustical stimuli. A given variation in sound pressure can arise from many different combinations of initiating mechanical force, physical properties of the resonant body or bodies acted upon, and qualities of the medium and structural environment intervening between the source and the listener. This uncertainty presents a quandary, since listeners must direct appropriate behavior toward objects and events that are not specified by the sound pressure variations at the receptor surface. We therefore reasoned that the brain must generate the sounds we hear based upon the relative likelihood of the different possible sources of a given acoustical stimulus.
Since human speech sounds are the principal naturally occurring sources of tone-evoking stimuli, we examined a database of recorded speech sounds in ten languages to ask whether the consonance ordering in Figure 3 is predicted by the statistical distribution of sound energy in the periodic stimuli that human listeners normally encounter. Figure 4 illustrates the method we used to analyze a corpus of recorded American English speech sampled from the DARPA-TIMIT Acoustic-Phonetic Continuous Speech Corpus (Garofolo et al., 1990); Figure 5 presents the 'statistical spectrum' resulting from this analysis, showing the average relative amplitude that occurs in speech sounds at different frequency ratios relative to the maximum amplitude in a spectrum.
Figure 4

Figure 4 / Analysis of speech segments. A) Variation of sound pressure level over time for a representative utterance from the TIMIT corpus (the sentence in this example is "She had your dark suit in greasy wash water all year"). B) Blowup of a 0.1 second segment extracted from the utterance (in this example the vowel sound in "dark"). C) The spectrum of the extracted segment in (B), generated by application of a fast Fourier transform. All amplitude and frequency values in a given spectrum were normalized according to. Fn = F/Fm and An = A/Am, where Am and Fm are the maximum amplitude and its associated frequency, A and F are any given amplitude and frequency values in the spectrum, and An and Fn are the normalized values. This method of normalization avoids any assumptions about the structure of human speech sounds, e.g., that such sounds should be conceptualized in terms of ideal harmonic series.

Figure 5

Figure 5 / Statistical characteristics of American English speech sounds based on an analysis of the spectra extracted from the >100,000 segments (200 per speaker) in the TIMIT corpus. Mean normalized amplitude is plotted as a function of normalized frequency, the maxima indicating the normalized frequencies at which power tends to be concentrated. The plot shows the statistical spectrum for the octave interval bounded by the frequency ratios 1 and 2. Error bars show the 95% confidence interval of the mean at each local maximum.

The frequency ratios at which sound energy is concentrated in speech sounds accord with the fundamental frequency ratios that define the chromatic scale intervals (Figure 6). Moreover, the difference in the amount of sound energy concentrated at each frequency ratio predicts the consonance ordering of chromatic scale tone combinations (Figure 7).
Figure 6

Figure 6 / Comparison of the normalized spectrum of human speech sounds and the intervals of the chromatic scale. The majority of the musical intervals of the chromatic scale (arrows) correspond to the mean amplitude peaks in the normalized spectrum of human speech sounds, shown here over a single octave. The names of the musical intervals and the frequency ratios corresponding to each peak are indicated. The frequency ratios at the local maxima closely match the frequency ratios that define the chromatic scale intervals (see Figure 2).

Figure 7

Figure 7 / Consonance rankings predicted from the normalized spectrum of speech sounds. Median consonance rank of musical intervals (from Figure 3) plotted against the residual mean normalized amplitude at different frequency ratios. Consonance rank decreases progressively as the relative concentration of power at the corresponding maxima in the normalized speech sound spectrum decreases.

These observations support the hypothesis that the perceptual response to periodic sound stimuli is determined by the statistical relationship between ambiguous acoustical stimuli and their various possible natural sources and suggests that other puzzling aspects of tone perception may be explainable in similar terms.

Following up on these observations, we have recently reported that the basis of these phenomena in speech sounds is the ratio of the frequency of the first two formants (Ross et al., 2007).

References

Gaithersburg, MD: US Department of Commerce.Malmberg CF (1918) The perception of consonance and dissonance. Psychol Monogr 25: 93-133.

Garofolo JS, Lamel LF, Fisher WM, Fiscus JG, Pallett DS Dahlgren NL (1990) DARPA-TIMIT Acoustic-phonetic continuous speech corpus [CD-ROM].

Schwartz D, Purves D (2004) Pitch is determined by naturally occurring periodic sounds. Hearing Research 194: 31-46.

Schwartz DA, Howe CQ, Purves D (2003) The statistical structure of human speech sounds predicts musical universals. J Neurosci (23)18: 7160-7168.

Ross D, Choi J, Purves D (2007) Musical intervals in speech. PNAS 104(23): 9852-9857.