Consonant and Dissonant Sounds

T has gotten really into music in the last month or so. Lately, when he hears music he likes, he’ll start dancing. We went out to eat a few weeks ago, and the restaurant had music playing in the background, and T was dancing up a storm in his high chair – it was adorable! And, I especially love when T dances when I play the piano!

T’s evident love of music made me start thinking about babies and qualities of music that they may innately appreciate. A little reading led me to a few studies (for example, this one by Trainor and Heinmiller) that have shown that even infants that are just a few months old prefer pairs of musical notes played together (a musical interval) that are consonant (pleasant sounding) rather than dissonant (harsh or unpleasant). What’s interesting to me about these studies is that babies seem to recognize and prefer musical intervals widely recognized by adults (both musicians and non-musicians) as being consonant, even without much music-listening experience. Here’s a YouTube video that gives examples of consonant (for example, an octave, a perfect fourth, a perfect fifth, etc.) and dissonant intervals (a minor second, a major second, etc.) – the difference between consonant and dissonant intervals is really striking, even if you don’t know the names of the intervals!

The study I linked to above showed that infants prefer listening to consonant intervals rather than dissonant intervals. And, this study (by Sugimoto et al.) showed that even an infant chimpanzee preferred listening to consonant intervals rather than dissonant intervals. So this seems to suggest that there’s something hard-wired in our brains that makes us prefer consonant musical intervals, even if we haven’t heard much music or had any musical training.

So, I decided to test T to see if he has a preference for consonant musical intervals over dissonant intervals! I played different examples of consonant and dissonant sounds for T, both on an iPad and on the piano, to see if he had different reactions (this was not the protocol used in any research study, but it was the best I could do at home :)). When I played consonant and dissonant intervals on the piano, I got no noticeably different reaction from T, although this may have been because he was preoccupied with trying to lick the fan. I then played consonant and dissonant sounds three different times on the iPad. Two of the times, he started smiling and reaching for the iPad when he heard the consonant sounds, and when he heard the dissonant sounds, he turned away from the iPad and even seemed a little distressed (although this may have been because my experiment was running into snack time; I’m beginning to understand why research studies with babies often have a high attrition rate). The third time, I got no difference in reaction. So, it seems possible that T has a preference for consonant sounds over dissonant ones, but I can’t really be sure based on this.

One thing that piqued my curiosity is that these studies were done in normal hearing babies (and a normal hearing chimpanzee), so I wondered whether people with hearing loss hear consonant and dissonant intervals differently than normally-hearing people. I found this study (Tufts, et al.) which showed that people with hearing loss do hear consonant and dissonant intervals differently, and their explanation of why was so interesting to me!

Before I talk about what Tufts et al. found, here’s a quick explanation of why different intervals are heard as consonant or dissonant. The big difference between different intervals is how far apart the two notes are – for example, a minor third is 3 semitones apart (an example is C and E-flat) and an octave is 12 semitones apart. I think that the accepted theory for why an interval sounds consonant is that the component notes of the interval are far enough apart to be easily resolved by the cochlea – one technical way to say this is that the two notes fall into different auditory filters. Here’s a picture I made to go along with an analogy:


Imagine you’re rolling balls down a hill into different buckets (I have no idea why you’d be doing this, but just go along with it!). How far apart two balls are at the top of the hill represents a musical interval, and each bucket at the bottom of the hill represents an auditory filter. If two balls fall into the same bucket, the interval composed of those two “balls” (notes) will sound dissonant, whereas if they fall into different buckets, they’ll sound consonant. People with hearing loss are known to have broader auditory filters – that is, the buckets at the bottom of the hill are a lot bigger, so more balls would fall into them. So, based on this theory, you’d predict that people with hearing loss would find closely spaced intervals dissonant that people with normal hearing would find more consonant.
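To make the bucket analogy a bit more concrete, here is a toy Python sketch. It is only an illustration under stated assumptions, not the model used by Tufts et al.: the bucket width comes from the Glasberg and Moore equivalent rectangular bandwidth (ERB) approximation for normal hearing, the `broadening` factor standing in for hearing loss is a made-up number, and the all-or-nothing "same bucket" rule is simply the analogy above turned into code.

```python
def erb_hz(center_hz, broadening=1.0):
    """Approximate auditory filter width (ERB) in Hz at a given center frequency.

    Uses the Glasberg & Moore approximation for normal hearing.
    `broadening` is a hypothetical multiplier standing in for hearing loss.
    """
    return broadening * 24.7 * (4.37 * center_hz / 1000.0 + 1.0)


def note_above(base_hz, semitones):
    """Frequency of the note `semitones` above `base_hz` (equal temperament)."""
    return base_hz * 2.0 ** (semitones / 12.0)


def same_bucket(f1_hz, f2_hz, broadening=1.0):
    """Return True if two notes are closer together than one filter width."""
    center_hz = (f1_hz + f2_hz) / 2.0
    return abs(f2_hz - f1_hz) < erb_hz(center_hz, broadening)


MIDDLE_C_HZ = 261.63  # approximate frequency of middle C

if __name__ == "__main__":
    for name, semitones in [("minor second", 1), ("major third", 4), ("octave", 12)]:
        upper_hz = note_above(MIDDLE_C_HZ, semitones)
        normal = same_bucket(MIDDLE_C_HZ, upper_hz)
        broader = same_bucket(MIDDLE_C_HZ, upper_hz, broadening=2.0)
        print(f"{name}: same bucket (normal)={normal}, same bucket (broader)={broader}")
```

In this toy version, a minor second above middle C lands in the same bucket either way, an octave never does, and a major third shares a bucket only once the buckets are doubled in width. That is the flavor of the prediction: with broader filters, intervals that used to be safely "resolved" start to collide.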

And, to some extent, this is what Tufts et al. found – here are two cool plots (FIGS. 3 and 4):


From Tufts, et al. – Top – FIG. 3 of Tufts, et al – Consonance/Dissonance scores for Normal Hearing adults. Bottom – FIG. 4 of Tufts, et al. – Consonance/Dissonance scores for Hearing Impaired adults.

FIGS. 3 and 4 of Tufts show people’s consonance/dissonance ratings for different musical intervals (indicated by the ratio of the frequencies). As you can see from FIG. 3, people with normal hearing say that two notes with a ratio of 1 (this is the “unison interval,” or the same note!) are very consonant, and then there’s a steep drop where very closely spaced notes (a frequency ratio between 1.0 and 1.1) sound VERY dissonant (from the analogy above, this is where balls are landing in the same bucket), and then as the notes get farther apart, the intervals become more and more consonant (analogy – balls more likely to land in different buckets) as the interval approaches an octave (a frequency ratio of 2.0). While the pattern is similar for people with hearing loss (the bottom figure), the curves look slightly different – if you look at the solid line on the bottom figure, the curve is flatter and has a negative peak later, which shows that people with hearing loss generally rated all intervals as sounding less consonant (because the curve is flatter overall), and that the most dissonant intervals to people with hearing loss were ones that were already “recovering” in consonance for people with normal hearing.
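As a rough guide for connecting the frequency ratios in these plots to interval names, here is a small Python sketch. It assumes equal temperament (each semitone multiplies the frequency by the twelfth root of 2), which is how a modern piano is tuned. Whether the stimuli in Tufts et al. used exactly this tuning is not something the plots alone establish, so treat the numbers as approximate.

```python
def frequency_ratio(semitones):
    """Frequency ratio of an interval spanning `semitones` in equal temperament."""
    return 2.0 ** (semitones / 12.0)


# Interval names and their sizes in semitones
INTERVALS = {
    "unison": 0,
    "minor second": 1,
    "major second": 2,
    "minor third": 3,
    "perfect fourth": 5,
    "perfect fifth": 7,
    "octave": 12,
}

if __name__ == "__main__":
    for name, semitones in INTERVALS.items():
        print(f"{name:>14}: {frequency_ratio(semitones):.3f}")
```

The minor second comes out at about 1.06, squarely inside the 1.0 to 1.1 region where the ratings drop steeply, the major second sits just past it at about 1.12, and the octave is exactly 2.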

This may explain why people with hearing loss often say that music doesn’t “sound right” to them, even if they wear hearing aids or cochlear implants – the musical intervals that people with normal hearing find pleasant and that are therefore used heavily in music may sound harsh and unpleasant to people with hearing loss.

(By the way, the Tufts, et al. study was done with adults with mild-moderate hearing loss; I have no idea what you would find for babies with hearing loss, which I think is a really interesting question!)


Singing to Babies

(Note: a lot of the research with infants I’ve been writing about has been done with normally-hearing infants. Although there’s a lot of great research on children with Cochlear Implants, I’m finding that there’s less research on children with mild hearing losses, especially for infants and in interesting areas like music. So, I end up writing about studies that have been done with normally-hearing infants, and I’m really not sure how they translate!)

It’s been clear since T was just a few weeks old that he loved hearing me sing (this was a surprise to me, since pretty much no one else enjoys hearing me sing :)). Since then, I’ve sung to him A LOT – I sing when I play with him, when he’s cranky in the stroller or carrier, lullabies at bedtime, etc. I started wondering what research has been done on singing to babies, so I did a little searching.

One interesting question is whether/how we change how we sing when we sing to a baby compared to singing to an adult or to no one. It’s pretty obvious that people talk to babies differently than they talk to adults – the typical “baby-speak” is called “infant-directed speech” and it has a lot of potential benefits for babies to acquire language. Infant-directed speech is usually characterized by slower speech, repetitiveness (e.g., “look at the doggie! the doggie says woof! hi, doggie!”), higher pitch, and more pitch variation (i.e., MUCH less monotone than when talking to adults). This probably helps babies learn new words and understand the structure of language (e.g., the concept of phrases and sentences) by helping them focus their attention on particular words or groups of words by repeating and emphasizing them.

But do adults sing differently to babies than they do in their absence (even when singing the same song)? This is an especially interesting question, because many of the ways we change our speaking when it’s directed to a baby aren’t easily done in singing – for example, a particular song has constraints on pitch (based on the tune of the song) and rhythm, so it’s harder to vary pitch and rhythm when we sing and still maintain the song. But, it turns out that, similar to infant-directed speech, adults sing differently when the singing is directed to a baby! Studies ([1] and [2]) have shown that when we sing to babies (rather than singing the same song in their absence), we sing with a higher pitch (just like in infant-directed speech) and with a slower tempo (also like in infant-directed speech). And, even though mothers tend to sing more to their babies than fathers, fathers show the same pattern of singing in a higher pitch and with a slower tempo, so there may be something intrinsic about the characteristics of infant-directed singing. (See [2])

And, it turns out that the way we change our singing when it’s directed to a real, live baby and not just an empty room is pretty robust – adult listeners are really good at identifying instances in which another adult was singing to a baby rather than to an empty room – that is, which songs were “infant-directed.” (See [1]). The adult listeners tended to say that the “infant-directed songs” were sung with a “more loving tone of voice.” ([1]).

I tried to think of whether I sing differently when I’m singing to T than when I’m singing by myself, and it’s hard to say, mostly because I rarely sing if it’s not to T 🙂 But, I wouldn’t be surprised if I sound more loving when I’m singing to T than when I’m singing by myself!


On a totally unrelated topic, another interesting study that I came across looked at the effects of moms singing on their babies’ arousal levels as measured with cortisol in saliva ([3]).  They found that babies (averaging 6 months in age) who had relatively low cortisol levels initially (for example, if they weren’t paying attention to anything in particular and were sort of just chilling) had an increase in cortisol after their mom sang to them for 10 minutes – that is, they were more aroused after hearing their mom sing and more in a “playtime” state. Conversely, babies who had higher initial cortisol levels had a decrease in cortisol after their mom sang to them for 10 minutes – that is, they went from a more aroused state to a more chilled out state after their mom sang to them.

This was interesting to read, because I’ve definitely found that my singing to T can have totally different effects on him, even singing the same song! Sometimes, he’ll get really excited and wound up and ready to play, and other times, he’ll totally relax and often, will get kind of drowsy.


[1] Trainor, L.J., “Infant Preferences for Infant-Directed Versus Noninfant-Directed Playsongs and Lullabies.” Infant Behavior and Development. (19) 83-92 (1996). (full text here)

[2] Trehub, S.E. et al. “Mothers’ and Fathers’ Singing to Infants.” Developmental Psychology. Vol. 33, No. 3. 500-507 (1997). (full text here)

[3] Shenfield, T. et al. “Maternal Singing Modulates Infant Arousal.” Psychology of Music. Vol. 31, No. 4. 365-375. (2003). (full text here)

Article Review – “Musician Advantage for Speech-on-Speech Perception”

Today, I want to talk about a recently published article (full text here) that isn’t directly related to babies or hearing loss, but that I found really interesting and wanted to share! The article is “Musician Advantage for Speech-on-Speech Perception.” (Baskent, D. and Gaudrain, E. “Musician Advantage for Speech-on-Speech Perception.” J. Acoust. Soc. Am. 139, EL51. 2016).

Also, this paper got some great publicity in Scientific American!


Anyone who’s tried to have a conversation in a crowded bar or restaurant knows that understanding what one person is saying when there’s background noise of other people talking is one of the hardest listening tasks (and one that people with hearing loss struggle the most with!). One of the challenges of understanding speech in the presence of other, competing speech is segregating the different people talking to be able to focus on the one person you want to hear (I talked a bit about differences between babies and adults in this type of task here). This problem is often called the “cocktail party problem” – that is, the problem of understanding what the one person you’re having a conversation with is saying when you’re in a noisy, crowded environment with other people talking.

The authors of this study hypothesized that musicians would be better than non-musicians at understanding speech in the presence of other, competing speech. If musicians ARE better at understanding speech-on-speech, this might be for a few different reasons. First, musicians are better at identifying subtle changes in pitch (something they do all the time to know if they are playing something correctly and in tune!), and this might be really helpful for separating multiple speech streams. For example, they might be able to use pitch differences to group words that they hear as belonging to different voices. Secondly, over decades of practice, musicians hone their “listening skills” – so it might be that they are just better at shifting their auditory focus to what they want to hear than non-musicians.

So, the researchers first wanted to see if the musicians had an advantage at all. They also wanted to know, if the musicians did have an advantage, if the advantage seemed to be related to their better ability at detecting pitch changes, or if it seemed to be more generally related to an increased ability to shift focus to different speech streams.

The Study

The researchers tested 18 musicians and 20 non-musicians on their ability to understand a sentence (the target) in the presence of one competing talker (the masker) – so the subjects had to understand one person talking who was competing with a second person talking. In order to qualify as a musician for this study, participants had to have had 10+ years of training, to have begun musical training before they were 7 years old, and to have received musical training within the past 3 years.

To probe whether musicians were more able to take advantage of subtle pitch changes than non-musicians, the researchers manipulated how different the target sentence was from the masker sentence in 2 ways:

  1. The fundamental frequency (F0) – the fundamental frequency (F0) indicates the voice pitch of a person’s speech. So, men generally have lower F0s than women, children have higher F0s than adults, etc.
  2. An estimated Vocal Tract Length (VTL) – The vocal tract is a cavity that filters sounds that you produce – in a very simplified view, it’s kind of like a tube that goes from the vibrating vocal folds at one end to your mouth at the other end, and it helps shape the different sounds that you produce to make them sound like different vowels or consonants. The length of the vocal tract varies across people – children have shorter vocal tracts than adults, and men generally have longer vocal tracts than women. VTL doesn’t directly affect voice pitch (like F0), but it changes other frequencies in speech sounds (the formants – definitely getting a bit technical, but really interesting!). If you have two recordings of people talking and they have the same F0 but different VTLs, the pitch (how high or low their voice is) will be the same, but the quality and characteristics of their voice will sound different – that’s the VTL at work!
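
To see why F0 and VTL can vary independently, it may help to sketch the "tube" picture in code. The Python below uses the textbook idealization of the vocal tract as a uniform tube, closed at the vocal folds and open at the lips, whose resonances (the formants) fall at odd multiples of c / 4L. This is an assumption for illustration only, not the software or method the researchers used, and the speed of sound and the two example tract lengths are rough, hypothetical values.

```python
SPEED_OF_SOUND_M_PER_S = 350.0  # rough value for warm, humid air (assumption)


def tube_formants_hz(vtl_m, count=3):
    """Resonances of an idealized uniform tube, closed at one end and open at the other.

    The n-th resonance is (2n - 1) * c / (4 * L), where L is the tube length in meters.
    """
    return [
        (2 * n - 1) * SPEED_OF_SOUND_M_PER_S / (4.0 * vtl_m)
        for n in range(1, count + 1)
    ]


if __name__ == "__main__":
    f0_hz = 200.0  # the same voice pitch for both hypothetical talkers
    for label, vtl_m in [("longer tract (17.5 cm)", 0.175), ("shorter tract (14 cm)", 0.14)]:
        formants = [round(f) for f in tube_formants_hz(vtl_m)]
        print(f"{label}: F0 = {f0_hz:.0f} Hz, formants = {formants} Hz")
```

Both hypothetical talkers keep the same F0, but the shorter tube pushes every formant higher (roughly 625 Hz versus 500 Hz for the first one), which is the "same pitch, different voice quality" effect described above.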

The researchers used some fancy software to manipulate the F0 and VTL of the target sentences and the masker sentences so that, in each trial the subjects listened to, the target and masker sentences were more alike or less alike. They measured how well musicians and non-musicians were able to understand the target sentences based on how similar the target sentence was to the masker sentence in terms of these two parameters.

And here are the results!

FIG. 1A (reproduced below) shows the average percent of the sentence the subjects correctly repeated back with various differences in VTL and F0 between the target and masker sentence. The leftmost panel shows the smallest difference in VTL between the target and masker sentences (in the leftmost panel, there was no difference in VTL), and the rightmost panel shows the largest difference in VTL between the target and masker. Within a panel, going left to right increases the F0 difference between the target and masker sentences (so, within a panel, the leftmost points are where the target and masker sentences had the same average voice pitch as each other).

The data from the musicians is shown in purple and the data from the non-musicians is shown in green.


FIG. 1A from Baskent and Gaudrain


As you can see, both musicians and non-musicians were better able to understand the target sentence when the target sentence was “more different” from the masker sentence – if you look at the leftmost points in the leftmost panel (the hardest condition, where there was no difference in F0 or VTL between the target and masker sentences), musicians had about 70% intelligibility and non-musicians had about 55% intelligibility. However, looking at the rightmost points in the rightmost panel (the easiest condition, where there was the largest difference in both F0 and VTL between the target and masker sentences), both musicians and non-musicians did really well – better than 90% intelligibility. This makes a lot of sense – it’s easier to understand what a (high-pitched) child is saying when their speech is competing with a deep-voiced man compared to trying to understand what one child is saying when their speech is competing with another child.

And, regardless of how different the target and masker sentences were, musicians performed better than non-musicians, and by a fairly substantial margin – you can see that the purple points are generally ~15-20 percentage points higher than the green points.

Recall that the researchers wanted to know if a musician advantage was due to the musicians’ ability to detect very subtle pitch differences. Based on this data, it seems like the musician advantage might not primarily be due to musicians’ better pitch perception – in FIG. 1A above, the purple (musician) and green (non-musician) lines are parallel to each other, indicating that both groups were deriving equal benefit from larger pitch differences (larger differences in F0). So, it might be that the musicians are better than the non-musicians at focusing their auditory attention – after all, musicians do this all the time when they practice; for example, a musician in an orchestra has to both listen to what their section is playing as well as what the other sections are playing.

My Reflections

I couldn’t help relating the results of this study to my personal experiences! I started playing the violin and the piano when I was little (~6 years old), and played through college, although I haven’t played regularly since I finished college (many years ago).

I’ve long suspected that I’m much better at understanding speech in noise compared to my husband, G. (This is just a gut feeling, we haven’t thoroughly confirmed this). For example, when G and I go out to eat, I’m usually much better at simultaneously listening to him while eavesdropping on conversations next to us. If G wants to eavesdrop, he’ll have to stop talking to me and stop eating to focus his attention on what the people next to us are saying (while trying hard to look like he’s NOT paying attention to what they’re saying!). So, maybe it’s my childhood musical training that’s given me an edge here!








Article Review – “Infants’ listening in multitalker environments: Effect of the number of background talkers”

This week, I’m going to talk about this study (full text available!) looking at infants’ ability to listen in noise.  (Newman, R.S. “Infants’ listening in multitalker environments: Effect of the number of background talkers.” Attention, Perception, & Psychophysics. 71(4), 822-836, 2009).


As anyone who has tried to have a conversation in a noisy bar or restaurant can tell you, understanding speech in noisy environments, particularly when the noise is other people talking, is REALLY difficult.

Adults tend to do better at listening to a target talker when the competing noise is just one other talker compared to when the competing noise is several talkers all at once (like the din of a crowded restaurant). This difference could be for a couple of reasons. First, when the competing noise is just a single talker, adults may be able to recognize words or a topic of the competing talker, and use that context to selectively switch their attention away from the competing talker and toward the target talker. Secondly, speech naturally has pauses (like between syllables, phrases, or sentences), and adults may use pauses in a competing talker’s stream of speech to home in on what a target talker is saying – with multiple talkers, the pauses tend to all average out so that there aren’t really any pauses (just a steady “roar”), which might make listening in the presence of multiple talkers more challenging for adults.

In this study, the researchers wanted to see if this is true for infants, as well. Note that the infants in this study were normally-hearing, and I’m not sure how the results would translate to infants with hearing loss.

The Study

The researchers had infants (an average age of about 5 months old) listen to a target stream of speech in the presence of competing speech. The target stream of speech consisted of a person saying a name, which could either be the infant’s name, a name other than the infant’s name that was similar (a “stress-matched foil”), or a name other than the infant’s name that wasn’t particularly similar (a “non-stress-matched foil”). The competing speech was either a single voice, or a composite of 9 voices all talking at the same time.

The researchers measured how long the infants listened to each name in the presence of the competing speech, the idea being that infants would listen for a longer duration of time to someone saying their own name if they recognized it. So, the researchers wanted to see if the infants listened longer during trials in which their name was said in the single-voice noise condition compared to the multi-voice noise condition to see whether infants were better able to recognize their own name in one condition versus the other.

And now, on to the results! FIG. 1 shows how long infants listened to their name compared to the other names in both a multi-voice competing speech condition (left-most panel) and a single-voice competing speech condition (middle panel).


Interestingly, the infants listened significantly longer to their own name compared to other names in the nine-voice noise condition, but there was no difference in the single-voice noise condition. This suggests that infants had more trouble understanding speech (in this case, recognizing their name) in the single-voice noise condition, which is the opposite of what’s seen in adults!

The researchers hypothesized that the infants might have had more trouble in the single-voice noise condition because they might have recognized the single-voice as speech and found it interesting, or possibly, because they recognized some of the words in the single-voice competing speech and therefore, focused on it. This is different than what an adult might do in the same situation – if an adult is trying to focus on one talker, but there is a single competing talker nearby, they might recognize words from each conversation and realize that the topics of each conversation are different. For example, the first talker might be saying words like “breakfast,” “pancakes,” and “eggs,” and the second talker might be saying words like “rain,” “umbrella,” and “soaked” – an adult listener might be able to use these words to identify topics of each conversation and they could then target their attention on the conversation they’re interested in (this all happens subconsciously, of course!). On the other hand, a baby might recognize a few words in each conversation, but might not have the vocabulary to group recognized words into topics, making the two conversations harder to disentangle. In the case of a multi-talker competing background noise, neither the adult nor the baby would recognize individual words in the background noise – this might be detrimental to the adult (who can’t segregate the noise from the target speech based on conversation topic or gaps in the noise), but might be helpful to the baby (who isn’t distracted by a competing talker that seems like they might be saying something interesting).

To try to address the issue of why the single-talker competing speech condition was so difficult for the infants, the researchers repeated this task, but using single-talker speech played BACKWARDS! In this case, the competing speech would have some acoustic properties similar to single-talker speech played forwards (e.g., gaps in the speech, changes in loudness, changes in pitch, etc.), but would be different in that the infants wouldn’t be able to recognize any words.

The results of this experiment are shown in FIG. 1 (above) in the right-most panel – as you can see, there was no difference in how long the infants listened to their own names versus other names in the single-talker speech played backwards condition. This indicates that the infants had a hard time recognizing speech in the presence of the single-talker backwards noise. This in turn suggests that the infants’ difficulty with understanding speech in the presence of a single competing talker is not due to recognizing some words in the competing speech and finding that distracting, but rather due to other characteristics of competing single-talker speech.

My Reflections

I thought it was so interesting that adults find a multi-talker background noise (like a restaurant) to be more difficult than a single competing talker but that infants are the opposite. I often extrapolate my experiences to T – if we are in a crowded restaurant, I assume he must have a harder time understanding what we’re saying than if there’s just one or two people talking nearby, because *I* find the crowded restaurant more difficult to listen in. It never occurred to me that it might be exactly the opposite for T!

This article also highlighted to me how much cognitive development is required for babies to mature to the point where they can listen to speech in noisy environments the way adults do. For example, they need to learn enough vocabulary to be able to group words in a conversation into topics, learn how to listen in the gaps of competing speech (like between sentences or phrases) to focus in on the target speech, and all sorts of other things – and all of this takes time and experience! I think this is especially important to remember because infants often spend a lot of their waking hours in environments that are very noisy – like daycare!

Additionally, this is yet another study that made me think about the importance of hearing aids for children with hearing loss – this study was done with normally-hearing infants, and they had a hard time understanding speech in noise – this difficulty must be so much worse for infants with hearing loss!


Article Review – “Vocalizations of Infants with Hearing Loss Compared with Infants with Normal Hearing: Part II – Transition to Words”

Last week, I talked about Part 1 of this study, which compared the initial, babbling stage of infant language development for infants with hearing loss and normally-hearing infants. This week, I want to talk about Part 2 of the study, which looked at how babies, as they got older, transitioned from babbling to producing words. Here’s a link to a full PDF of the study.


Part 1 of this study found that infants with hearing loss (HL) generally are delayed relative to normally hearing (NH) infants in the babbling stage of language development. HL infants took longer to begin babbling, and, once they began babbling, were slower to acquire particular types of consonants, such as fricatives (“sss,” “shhh,” “f,” etc.). The researchers wanted to then look at older babies to see whether HL infants were also delayed in transitioning from babbling to producing words relative to NH infants.

The Study

The infants included in this study were the same as those in Part 1 – to recap, there were 21 NH infants and 12 HL infants. The HL infants varied a lot in degree of hearing loss, and three received cochlear implants (CIs) during the course of the study. For all infants, language productions were monitored during play sessions with a caregiver (typically the infant’s mother), and these sessions were generally conducted every 6 weeks. In Part 2 of the study, data from sessions when the infants were between 10 and 36 months old were used.

Let’s get to the results!

The researchers analyzed the infants’ language productions during the sessions in 2 broad categories: the proportion of different utterance types at different ages and the structural characteristics of words produced at 24 months.

To look at the proportion of different utterance types at different ages, the researchers coded each utterance produced by an infant during a session as belonging to one of 3 utterance types:

  1. Non-communicative – these were speechlike sounds but were more vocal play than attempts to communicate. Examples include babbling that wasn’t directed to an adult.
  2. Unintelligible communicative attempts – these were vocalizations that were a) directed to an adult and b) served a communicative purpose, such as getting the adult to do something, seeking attention, etc. Some of these might have been attempts by the infant to say a particular word, but weren’t recognized by the caregiver or the researchers as a word.
  3. Words – the researchers pointed out that it’s tricky to decide what constitutes a word. For this study, utterances were classified as words if: 1) at least one vowel and consonant in the word attempted by the infant matched the “real” word (e.g., “baba” for “bottle”), 2) the utterance was a communicative attempt (see #2 above), and 3) it was clear that the child was attempting to say a word, for example, that the infant was imitating the parent or that the parent recognized the word and repeated it.
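
The three-way coding scheme above can be summarized as a small decision rule. The Python sketch below is only a paraphrase of the criteria as described here: in the study, trained human coders made these judgments from the recorded sessions, and the boolean flags and function name are hypothetical stand-ins for those judgments.

```python
def classify_utterance(directed_to_adult, communicative_purpose,
                       matches_vowel_and_consonant, clear_word_attempt):
    """Assign one speechlike utterance to one of the three utterance types.

    Each argument is a True/False judgment that a human coder would make;
    none of this is computed from audio.
    """
    # A communicative attempt must be directed to an adult AND serve a purpose.
    communicative = directed_to_adult and communicative_purpose
    if not communicative:
        return "non-communicative"
    # A word also needs a partial sound match and a clear attempt at a word.
    if matches_vowel_and_consonant and clear_word_attempt:
        return "word"
    return "unintelligible communicative attempt"


if __name__ == "__main__":
    # "baba" said to mom while reaching for a bottle, and mom repeats "bottle"
    print(classify_utterance(True, True, True, True))
    # babbling to no one in particular during play
    print(classify_utterance(False, False, False, False))
```

Written out this way, it is easier to see that an utterance only counts as a word when all three conditions hold, so an infant who is clearly trying to communicate but isn't understood ends up in the middle category.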

FIG. 1 of Moeller, et al. (reproduced below) shows the results of the analysis of utterance type for NH and HL infants at 16 months old and 24 months old.



FIG. 1 of Moeller, et al. – Proportions of different utterance types of NH and HL infants at 16 months and 24 months.

As you can see in FIG. 1, at a given age, the pattern of the proportion of different response types was different for NH infants compared to HL infants. For example, at 16 months, the NH infants were producing more unintelligible communicative attempts as well as more words compared to the HL infants. As another example, at 24 months, a greater fraction of utterances for the NH infants were words compared to the HL infants. Additionally, while both the NH infants and the HL infants produced more words at 24 months compared to 16 months, the researchers found that the magnitude of the increase was larger for NH infants. Interestingly, the researchers found that the pattern of utterance types for the HL infants at 24 months was similar to that of the NH infants at 16 months (I highlighted these in the red boxes in FIG. 1 above), indicating that the HL infants might have a similar pattern of improvement over time, but delayed.

To look at the structure of word attempts by the infants at 24 months, the researchers randomly selected 25 words from each child’s transcripts during the experimental session and compared the word attempt with the actual, target word to assess both the complexity of the word attempt and how accurate the attempt was. They computed 7 different metrics:

  1. Mean syllable structure level (MSSL) – this metric was used in Part 1, as well, and I described this in more detail here.  As a quick recap, words with only vowels were scored with 1 point, words with a single consonant type were scored with 2 points (e.g., “ba” or “baba”) and words with two or more consonant types were scored with 3 points (e.g., “bada” or “dago”).
  2. Percentage of vowels correct – this indicates the percentage of the time that the infant’s vowel productions in their word productions matched the “correct” vowels in the corresponding word. For example, if the target word was “mama,” the child would get 100% for saying “gaga” or “baba” but 0% for saying “momo.”
  3. Percentage of consonants correct (PCC) – this is similar to the above, but with consonants. As an example, if the target word was “shoe,” the child would get 100% for “shoo,” “shee,” “shaw,” etc., but 0% for “too.”
  4. Phonological mean length of utterance (PMLU) – This measure is intended to identify children who attempt longer, more complicated words but produce them less accurately as compared to children who attempt shorter, simpler words but produce them more accurately. To calculate this, the child received 1 point for each vowel and consonant produced, and an additional point for each correctly produced consonant. For example, if the target word was “cat,” and the child produced “cat,” they would receive 3 points for producing “c,” “a,” and “t,” and an additional 2 points for correctly producing the “c” and “t,” for a total of 5 points. However, if the child had instead produced “da” for “cat,” they’d receive only 2 points – one each for the “d” and “a,” but no points for accuracy. In this way, the PMLU reflects both accuracy of the word production as well as the length of the word.
  5. Proportion of whole word proximity (PWWP) – This measure is intended to give an overall reflection of how accurately the child produced a particular word. It is calculated by dividing the PMLU of the word attempt by the PMLU of the target word. As described above, “cat” produced correctly would have a PMLU of 5, and “cat” produced as “da” would receive a PMLU of 2. Therefore, if a child produced “da” for “cat,” the corresponding PWWP would be 2/5, or 0.4.
  6. Word shape match – This measure indicates how accurate a child’s production of a word was in terms of shape/number of syllables. For example, if the target word was “cookie,” the target shape would be “consonant-vowel-consonant-vowel” (CVCV). If, instead of producing a word that had a CVCV shape, the child produced one with just a CV shape (e.g., “di,” “koo,” “da,” etc.), this would not be a match.
  7. Words with final consonants – Word productions were given points for this metric if a target word ended with a consonant, and the child’s production of the word also ended with a consonant, even if the consonant wasn’t totally accurate. So, for example, if the target word was “goat,” the child would get points for producing “goat,” “got,” “god,” “goad,” etc.
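
For anyone who likes to see the arithmetic spelled out, here’s a minimal Python sketch of the PMLU and PWWP calculations using the “cat” example from the list above. This is my own simplification and not the researchers’ scoring procedure: it treats each letter as one sound and only counts a consonant as correct if it matches the target at the same position, whereas the study worked from phonetic transcriptions. The function names are just ones I made up for illustration.

```python
VOWELS = set("aeiou")


def pmlu(attempt, target):
    """1 point per sound produced, plus 1 per consonant that matches the target.

    Simplifying assumption: one letter = one sound, and a consonant is
    "correct" only if the same letter sits at the same position in the target.
    """
    points = len(attempt)
    for produced, expected in zip(attempt, target):
        if produced not in VOWELS and produced == expected:
            points += 1
    return points


def pwwp(attempt, target):
    """PMLU of the attempt divided by the PMLU of a perfect production."""
    return pmlu(attempt, target) / pmlu(target, target)


print(pmlu("cat", "cat"))  # 5
print(pmlu("da", "cat"))   # 2
print(pwwp("da", "cat"))   # 0.4
```

A real scorer would need to handle sounds spelled with more than one letter (like “sh”) and consonants that are correct but shifted in position, which this sketch ignores.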

The results of the structural analysis of the children’s word productions are shown in Table 1 of Moeller, et al. (reproduced below).


Table 1 of Moeller, et al. – comparison of word structure for NH and HL children at 24 months old.

This table shows that, for every measure of word structure (the rows in the table), the NH children performed better than the HL children. The difference between the NH and HL children was statistically significant (this is indicated by the crosses next to the score for the NH children in each row).

One of the things that I really like about this table is that they indicate the Effect Size for each metric of word structure (this is indicated in the right-most column of the table). The effect size tells you the strength of the finding. For the measure of effect size used in this paper (called “Cohen’s d”), an effect size of around 0.2-0.3 is considered a small effect, an effect size of around 0.5 is considered a medium effect, and an effect size of more than 0.8 is considered a large effect. So, as you can see from this table, for every metric of word structure, the researchers found that not only was there a statistically significant difference in performance between the NH children and the HL children, but that the size of this difference was large.
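
If you’re curious what goes into an effect size, here’s a small Python sketch of Cohen’s d using the pooled standard deviation, which is one common way to compute it (I’m assuming, not asserting, that this matches the exact calculation in the paper). The scores below are numbers I made up for illustration and are not data from Table 1.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Difference in group means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


# Hypothetical scores on some word-structure measure (NOT from the study)
nh_scores = [2.6, 2.4, 2.8, 2.5, 2.7]
hl_scores = [2.0, 2.2, 1.9, 2.3, 2.1]

d = cohens_d(nh_scores, hl_scores)
print(f"Cohen's d = {d:.2f}; large effect: {d > 0.8}")
```

The idea is that d expresses the gap between the two groups in units of how spread out the scores are, so the same raw difference counts as a bigger effect when the scores within each group are tightly clustered.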

So, overall, the data in Table 1 indicates that compared to age-matched NH children, HL children were producing words that were less complex (contained fewer different types of consonants, were less likely to end in a consonant, and were shorter) and that tended to be less accurate representations of the target word (an incorrect number of syllables or producing an incorrect vowel or consonant).

The researchers also looked at the number of words each child could produce as a function of age. FIG. 4 of Moeller, et al. (reproduced below) shows this data (the top two panels of FIG. 4 show data from this study; the bottom two panels show data from two other studies for comparison). The number of words was determined by asking the child’s caregiver to fill out an evaluation at home at each time point.


FIG. 4 of Moeller, et al. – The number of words produced by HL children (top left panel) and NH children (top right panel) as a function of age.

In FIG. 4, you can see that the curves for the NH children (top right panel) are both steeper and shifted to the left compared to the curves for the HL children (top left panel). This indicates that the NH children began producing words at a younger age relative to the HL children, and that, once they began producing words, their vocabularies expanded at a faster rate. The researchers noted that there was considerable variability in the data (for example, you can see that some of the NH children had much shallower curves than others, indicating that they were acquiring words more slowly than their peers), but that the individual data collected in this study “suggest a much slower rate of early vocabulary development compared with NH children.”  (Moeller, et al. p. 636).

One cool thing – in the panel for the HL children (upper left), the curves with unfilled symbols indicate children with CIs – one of the best performing children in this group had a CI! I thought this was pretty remarkable!

Since there was so much variability within the group of HL children regarding degree of hearing loss, the researchers weren’t really able to say much about how degree of hearing loss affected language production in this study.

My Reflections

T has been babbling up a storm for a few months now, but this paper made me think about the different contexts of his babbling (e.g., non-communicative, unintelligible communicative, and words/word attempts). Of course, at this age, T’s babbling is essentially entirely non-communicative or unintelligible communicative (and no words/word attempts). Reflecting on these distinctions, I think that T tends to babble in the non-communicative category primarily when he’s relaxing – like riding in the stroller or in his crib at night (or the wee hours of the morning) – at these times, he’ll go on a long, uninterrupted soliloquy, complete with big variations in vocal inflection. T’s babbles that fall in the unintelligible communicative category seem to happen when we’re playing interactively with him (to tell us to do something again), when he wants something (usually food), or when he’s excited about something (he’ll often shout “DAY-DA!” while looking at us when he’s excited – usually when we open the refrigerator door). I think the distinction in types of communication based on activity/mood makes sense – if non-communicative babbling is a form of vocal play, (that is, allowing T to play with making different sounds), it makes sense that this would come most naturally to him when he’s just chilling.

At 10 months, T is on the young side compared to the ages of the children studied here. However, I think he’s allllllmost on the cusp of his first word. At least a couple times, it seems like he was fairly consistently saying “a-ga” for “again” (to ask us to do something again) and saying “bah-bol” for “bubble” (to ask us to blow more bubbles). I’m not sure these are consistent enough to count as his first word (for example, he’ll say “a-ga” at other times too), but it seems like he might be close. We try to really reinforce when we think he’s saying something that might have meaning – for example, if he says “dada” and it seems plausible that he’s saying something to or about his dad, we’ll make a big production of saying the word “dad.” We do the same thing for “again” and “bubble,” and I think this repetition is helping him connect the sound of the word to the concept/object.

One thing this study made me excited about – I didn’t realize how rapidly vocabulary grows once children start talking! I get the feeling that T is thinking some pretty fun thoughts (like when he starts grinning when he sees the trash can and races over to look inside), and I can’t wait to hear what he’s thinking once he starts talking.


Article Review – “Vocalizations of Infants With Hearing Loss Compared with Infants with Normal Hearing – Part 1: Phonetic Development”

As T’s (9 months) babbling has taken off, I’ve started to become interested in the order in which infants tend to acquire different speech sounds as well as how this might differ for infants with hearing loss vs. normally-hearing infants. I started doing a little Googling, and found this study (link to Abstract only) that compares vocalizations of infants with hearing loss to infants with normal hearing. (Moeller, M.P., Hoover, B., Putman, C., Arbataitis, K., Bohnenkam, G., Peterson, B., Wood, S., Lewis, D., Pittman, A., and Stelmachowicz, P. “Vocalizations of Infants with Hearing Loss Compared with Infants with Normal Hearing: Part 1- Phonetic Development.” Ear & Hearing, Vol. 28 No. 5, 605-627. 2007).

This study actually has two parts, the first looking at babbling with younger infants (up to 2 years old), and the second looking at older children and how they transition from babbling to acquiring words. This week, I’ll talk about part 1, and will hopefully write about part 2 next week.


It’s well-known that infants with hearing loss develop spoken vocabulary later than normally-hearing children. However, a lot of language development happens before children start speaking words. For example, infants typically start off making vowel sounds, and then progress to babbling (like “bababa,” “dadada,” etc.). Less is known about how hearing loss affects this earlier stage of language development.

The Study

The researchers enrolled a group of normally-hearing (NH) infants and a group of infants identified as having hearing loss (HL). This was a longitudinal study, so each infant was followed over time – their spoken language was measured in an experimental session conducted every 1.5-2 months, from when the study began (generally when infants were 4 months old) until they were 36 months old. There were 21 NH infants, and 12 HL infants. All of the infants with HL had assistive technology, typically hearing aids, although 3 received cochlear implants (CIs) during the course of the study. The degree of hearing loss varied a lot for the HL group; on average, across the group of HL infants, they had a 67 dB HL Better Ear Pure Tone Average (BEPTA – meaning that the average audiogram for the infant’s better ear measured at 500, 1000, and 2000 Hz was 67 dB HL). All of the HL infants were involved in some form of early intervention.
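
To make the BEPTA number a little more concrete, here’s a minimal Python sketch of how a better ear pure tone average can be worked out from an audiogram. The thresholds are hypothetical values I picked for illustration (not any infant from the study), and the function names are my own.

```python
PTA_FREQUENCIES_HZ = (500, 1000, 2000)


def pure_tone_average(thresholds_db_hl):
    """Average threshold (dB HL) across 500, 1000, and 2000 Hz for one ear."""
    return sum(thresholds_db_hl[f] for f in PTA_FREQUENCIES_HZ) / len(PTA_FREQUENCIES_HZ)


def better_ear_pta(left_ear, right_ear):
    """The better ear is the one with the lower (quieter) average threshold."""
    return min(pure_tone_average(left_ear), pure_tone_average(right_ear))


# Hypothetical audiogram: frequency in Hz -> threshold in dB HL
left = {500: 60, 1000: 65, 2000: 70}
right = {500: 70, 1000: 75, 2000: 80}

print(better_ear_pta(left, right))  # 65.0
```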

To collect the data, at each session, each infant played with a caregiver while their interaction was taped and then transcribed. The infants would play with a parent or guardian, and the researchers transcribed each vocalization by the infant – for example, identifying a particular vowel or consonant, whether a sound was a grunt, cry, or squeal, etc.

There were 3 main categories of metrics the researchers looked at:

  1. Volubility – this indicates how much the infants vocalized over a session – were they pretty chatty during the session, or fairly quiet?
  2. Age at which the infant began babbling
  3. Speech complexity – here, the researchers looked at what types of consonants the infants were producing at a particular age, as well as whether they were able to string different types of sounds together to make more complex sounds.

Let’s get to the results!


To measure volubility, for each experimental session, the researchers calculated the infant’s vocalizations per minute. Vocalizations could be any sounds other than stuff like grunts, screams, cries, etc. So, an infant with a higher volubility score would have vocalized more during the session compared to an infant with a lower volubility score.

FIG. 1 of the article (shown below) shows the volubility results for both NH infants (left) and HL infants (right). In the figure, volubility scores are shown for infants at 3 different ages – 8.5 months, 10 months, and 12 months. As you can see in FIG. 1, the volubility scores for HL infants were really similar to those of NH infants, and the researchers found no significant difference between the two groups. I thought it was pretty interesting that, at each age, the HL infants seemed to be vocalizing as much as the NH infants!


FIG. 1 of Moeller, et al. – Volubility of NH and HL infants as a function of age

Age of Babbling Onset

The researchers then quantified the age at which the infants began babbling. Although we (or at least, I!) tend to think of babbling as any infant pre-word “talking,” babbling technically requires a consonant-vowel (CV) pairing – examples include “ba,” “da,” “ga,” etc. CV pairs can also be chained together, either the same consonant and vowel (“baba”) or different consonants and/or vowels (“babo,” “bada,” etc.)

In order to set a criterion to define the age of babbling onset, the researchers identified the age at which the proportion of babbles out of the total vocal utterances exceeded 0.2 – so this was the age at which, during an experimental session, more than 20% of the infant’s vocalizations were consonant-vowel pairings.
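
Here’s a minimal Python sketch of that 0.2 rule, assuming each session is summarized as an age, a count of babbles (CV utterances), and a count of total vocalizations. The session numbers are invented for illustration, and the function name is my own.

```python
BABBLE_THRESHOLD = 0.2


def babbling_onset_age(sessions):
    """Return the age (months) of the first session where babbles / total > 0.2.

    `sessions` is a list of (age_months, n_babbles, n_vocalizations) tuples,
    ordered from youngest to oldest. Returns None if the threshold is never met.
    """
    for age_months, n_babbles, n_vocalizations in sessions:
        if n_vocalizations > 0 and n_babbles / n_vocalizations > BABBLE_THRESHOLD:
            return age_months
    return None


# Hypothetical sessions for one infant (NOT data from the study)
sessions = [(6, 2, 40), (8, 6, 50), (10, 15, 60), (12, 30, 70)]
print(babbling_onset_age(sessions))  # 10
```

In this made-up example the proportions are 0.05, 0.12, 0.25, and about 0.43, so the 10-month session is the first one to cross the threshold.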

FIG. 2 of the article (shown below) shows, at each age, the proportion of infants in the NH group (black bars) and HL group (white bars) who had started babbling (defined as having more than 20% of their vocalizations during the session include a CV-pairing). As you can see, NH infants tended to begin babbling much earlier than HL infants – it took roughly 6ish additional months for the HL group to reach the milestone of having 50% of the infants in the group babbling compared to the NH group. The researchers also stated that, for the HL group, there was a correlation between the age at which the infants first received hearing aids and the age at which they began babbling, although this wasn’t statistically significant (possibly because there were only 12 infants in the group, and they varied a lot in degree of hearing loss).


FIG. 2 of Moeller et al. – Proportion of infants who had begun babbling by age

Babble Complexity

The researchers quantified the complexity of the sounds the infants were producing by scoring each utterance as follows:

  1. 1 point for utterances that were vowels or primarily vowels – (like “ahhh,” “eeee,” “waaa,” etc.) – this was labeled SSSL1
  2. 2 points for utterances that had 1 type of consonant – (like “ba,” “da,” “baba,” etc.) – this was labeled SSSL2
  3. 3 points for utterances that had 2 or more types of consonants – (like “bada,” “gaba,” “gabo,” etc.) – this was labeled SSSL3
  4. 4 points for utterances with consonant blends, like “spun.” – this was labeled SSSL4
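
The scoring scheme above can be sketched in a few lines of Python. This is my own rough approximation working from simple spellings rather than phonetic transcriptions like the researchers used. It also assumes that “h,” “w,” and “y” don’t count as consonant types, so that something like “waaa” lands in the vowel category as in the list above. That rule and the function names are my assumptions.

```python
VOWELS = set("aeiou")
GLIDES = set("hwy")  # assumption: these do not count as true consonants


def is_true_consonant(ch):
    return ch.isalpha() and ch not in VOWELS and ch not in GLIDES


def syllable_structure_level(utterance):
    """Score an utterance from 1 to 4 using the four categories above."""
    sounds = utterance.lower()
    consonant_types = {ch for ch in sounds if is_true_consonant(ch)}
    # A blend is approximated as two different consonants in a row
    has_blend = any(
        is_true_consonant(a) and is_true_consonant(b) and a != b
        for a, b in zip(sounds, sounds[1:])
    )
    if has_blend:
        return 4
    if len(consonant_types) >= 2:
        return 3
    if len(consonant_types) == 1:
        return 2
    return 1


def mean_level(utterances):
    """Average score across a set of utterances."""
    return sum(syllable_structure_level(u) for u in utterances) / len(utterances)


for example in ["ahhh", "baba", "bada", "spun"]:
    print(example, syllable_structure_level(example))
```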

FIG. 4 of Moeller et al. shows the proportion of utterances that belonged to each point category for both NH infants (top) and HL infants (bottom).


Adapted from FIG. 4 of Moeller et al. – proportion of utterances in each babble complexity category as a function of age

As you can see, for both NH infants and HL infants, vocalizations by the youngest babies (10-12 months) were dominated by the simplest type of vocalization – primarily vowels. Both groups tended to increase the proportion of more complex vocalizations – those containing consonants and multiple types of consonants – with age. One really interesting thing you can see in the above figure is that HL infants at 18-20 months had a babble complexity pattern that was similar to the NH infants at 10-12 months (I highlighted these in the red boxes above) – this is a pretty substantial delay. However, by the time the HL infants were 22-24 months old, the pattern resembles that of the NH infants at 18-20 months (highlighted in the green boxes above), indicating that the HL infants were closing the gap! This could be the result of amplification for the HL infants, early intervention services, as well as the fact that three of the HL infants received cochlear implants during this time period.

Phonetic Inventory

The researchers then looked at whether NH infants and HL infants differed in the rates at which they started saying vowels and different types of consonants. FIG. 5 of Moeller et al. (reproduced below) shows the infants’ progression in acquiring both vowels and consonants broken into different classes based on place of articulation. A consonant’s place of articulation indicates what part of the mouth is involved in obstructing the vocal tract – I wrote more about it here. Here’s a quick overview of the different classes of consonants shown in FIG. 5 below:

  1. bilabials – these are consonants produced with the lips pressed together (e.g., p, b, m, and w).
  2. labiodentals & interdentals – labiodentals are produced with the lower lip against the upper teeth (e.g., f and v). Interdentals are produced with the tongue between the teeth (e.g., th).
  3. alveolars – these are produced with the tip of the tongue behind the top teeth (e.g., d and t).
  4. palatals – these are produced with the body of the tongue raised against the hard palate (e.g., j).
  5. velars – these are produced with the back part of the tongue against the soft palate (e.g., k and g).

Each panel in FIG. 5 shows the percent of sounds within a given category that the infants produced at a particular age. So, for example, there are 4 bilabial consonants (p, b, m, and w), and infants who could produce 2 out of the 4 at a particular age would receive a score of 50% for that age.
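
As a quick sketch of that percentage, here’s how the bilabial example works out in Python. The set of sounds the infant “produced” is hypothetical, and the function name is my own.

```python
BILABIALS = {"p", "b", "m", "w"}


def inventory_percent(produced_sounds, sound_class):
    """Percent of the sounds in a class that appear in the infant's productions."""
    return 100 * len(set(produced_sounds) & sound_class) / len(sound_class)


# Hypothetical infant who has produced "b", "m", and "d" at a given age
print(inventory_percent({"b", "m", "d"}, BILABIALS))  # 50.0
```

Note that “d” doesn’t affect the bilabial score at all, since it belongs to a different class (alveolars).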


Adapted from FIG. 5 of Moeller, et al. – % of sounds produced in different phonetic categories as a function of age.

One thing that was interesting to me is that bilabial consonants seemed to be one of the “easier” sounds to produce in general (look at the top row, middle panel) – for both NH and HL infants, scores were fairly high at every age range, and the gap between NH and HL infants was fairly small as well. The researchers said that this might be because bilabial consonants tend to be very visually salient compared to other places of articulation – it’s pretty easy to see lips pressed together compared to where your tongue is inside your mouth! This might make it easier for infants to acquire bilabial consonants, since they can more easily see how they are formed.

Another interesting thing about FIG. 5 – the researchers found that acquisition of these different classes of sounds generally fell into 3 different categories, which I’ve highlighted by color in the above figure. For vowels and alveolar consonants, the HL infants were generally delayed relative to the NH infants, but their rate of acquisition was parallel (this is highlighted in blue above). For bilabial consonants and velar consonants, the HL infants seemed to be closing an initial gap relative to the NH infants – that is, their acquisition of these classes of consonants was converging with the NH infants (this is highlighted in green above). Conversely, for palatal consonants and labiodentals/interdentals, the HL infants seemed to be acquiring consonants in these classes at a slower rate than the NH infants – that is, over time, the gap between the HL infants and the NH infants widened. One thing to note is that, for both NH and HL infants, palatal and labiodental/interdental consonants (highlighted in red above) occurred less often in general compared to other consonants – regardless of hearing, children tend to take longer to produce these types of sounds, perhaps because they tend to be less common in English.

The researchers then broke the consonants up in a different way – into fricatives and non-fricatives. Fricatives are consonants that are produced by forming a small opening with the mouth and forcing air through – they include sounds like “ssss,” “shhhh,” “f,” and “zzz” – fricatives are the ones that sound kind of “hissy”! This hissyness also makes fricatives generally hard for people with hearing loss to hear – fricatives tend to have a lot of high frequency components and are often low in intensity. FIG. 6 of Moeller, et al. (reproduced below) shows the rate of acquisition of non-fricatives (left) and fricatives (right) for both NH and HL infants.


FIG. 6 of Moeller, et al. – Acquisition of non-fricative and fricative consonants.

As you can see, acquisition of the non-fricative consonants was parallel for both the HL and NH infants – both groups had a steady increase in production of non-fricative sounds. However, for fricatives, while the NH infants steadily increased their production of these sounds, the HL infants didn’t – they seemed sort of stuck from 10 months to 24 months and, in general, didn’t really add many consonants from this group into their repertoire. As I mentioned above, this might be because fricatives tend to be really hard to hear for people with hearing loss, so the HL infants might have not had enough exposure to these types of sounds to begin producing them.

My Reflections

I was particularly interested to read this study since T’s consonant inventory seems to have grown a lot just in the past 2 weeks. Although he’s been saying “da” for a while (EVERYTHING is “dada”!), he’s started more consistently saying “ba” and “ma” (both are bilabial) and, just in the past few days, has started saying “la” (I think this is alveolar). From the data presented in this study, it seems like bilabials tend to be one of the “easiest” categories of consonants – babies tend to produce the highest proportion of consonants in this class at earlier ages relative to other categories, and this might be because of how easy it is to see the lips pressed together when producing bilabial consonants. Although T’s preferred consonants (the ones we hear more often) are “da” (alveolar) and “ga” (velar), I think we’ve heard him produce most of the bilabial consonants at least a few times now. And, lately, if we really emphasize the position of our lips while saying “pa,” “ba,” or “mmm,” he’ll try to imitate us!

One of the things I think I gained from reading this study was an appreciation for the activities we do at speech therapy and a deeper understanding of how those activities will help T acquire different speech sounds. One thing we really focus on is drawing T’s attention to different sounds by pairing the sound with something interesting and visually salient – this gets him to really listen to the sound rather than just have it be background noise that he might not pay attention to. We’ll do this in different ways, for example, pointing at our mouths, waving toys or ribbons around as we make the sound, etc. I think that, especially for children with hearing loss, merely passively hearing different sounds isn’t quite enough, and having their attention drawn to the sound and the way your mouth looks when you make the sound can help tie everything together.

Once again, this study highlighted the importance of T wearing his hearing aids! I think it’s really important for him to get as much good, high-quality exposure to all these different speech sounds so that he can start to produce them, and this is especially important for fricatives (like, “sss,” “shhh,” “f,” etc.). The “s” sound in particular is really important for English grammar – it’s what turns a singular noun into a plural – and the study that I wrote about here found that children with hearing loss tend to have more trouble with this grammar rule than normally-hearing children.

Finally, on a happy (for me) note – there are a few bad words I’ve been known to accidentally say in front of T that start with fricatives (I’ll let you figure out what they are) – I’ve been thinking I need to clean up my language, since I’ve been worried that once T really starts talking, he’ll out me by repeating something he’s heard me say totally out of the blue. But, from the results of this study, it looks like children, whether normally-hearing or with hearing loss, don’t tend to really start producing fricatives until they are quite a bit older than T is now – so it looks like I have a little while before I have to be worried about T surprising me by dropping a fricative-bomb!

Article Review – “Voice Emotion Recognition by Cochlear-Implanted Children and Their Normally-Hearing Peers”

This week, I’m going to talk about a new study (PDF available for free through the link) by Chatterjee et al. (2015) that looked at how well adults and children with cochlear implants can identify vocal emotion and how each group compares to their normally-hearing peers. (Chatterjee, M., Zion, D.J., Deroche, M.L., Burianek, B.A., Limb, C.J., Goren, A.P., Kulkarni, A.M., and Christensen, J.A. “Voice Emotion Recognition by Cochlear-Implanted Children and Their Normally-Hearing Peers.” Hearing Research (322), 2015, 151-162).


Detecting and identifying emotions in speech is really important for communication and social interaction. For example, if you’re talking with someone, and they mention that they just bought new pants, it’s important to be able to identify any subtext underlying their statement. Are they excited that they finally had time to go shopping? Are they angry that they spilled coffee all over their old pants? Are they sad to admit a favorite pair will no longer button? Identifying the emotion behind the statement is crucial to knowing how to respond appropriately! And, identifying the emotion isn’t just important for following-up; one study has even found that the ability of children to identify vocal emotion is correlated with their assessment of quality of life [1].

In a face-to-face conversation, facial expressions can aid in identifying vocal emotions. However, it’s harder in non-face-to-face conversation, such as on the phone. In those situations, we rely entirely on acoustic cues to distinguish different emotions from each other. These acoustic cues can include stuff like how fast we talk, pitch, how our pitch changes over the course of a sentence, and loudness.

Cochlear Implants (CIs) convey some of these cues better than other cues. For example, CIs tend to convey speaking rate very well but they are pretty bad at conveying pitch and changes in pitch accurately. (This is a fairly complex topic, and I don’t want to get too into the weeds here, so for now I’ll leave it at that).

Since identifying vocal emotion could potentially rely on many different acoustic cues, some of which are not accurately conveyed by CIs, Chatterjee et al. wanted to measure how well CI users could identify vocal emotion in speech. They looked at both children (who were pre-lingually deafened), and adults (who were, for the most part, post-lingually deafened, and therefore acquired language as children prior to receiving a CI).

The Study

The researchers studied 4 groups of people: normally-hearing children, children with CIs, normally-hearing adults, and adults with CIs. All of the participants were asked to listen to several sentences, and, for each sentence, identify whether the emotion underlying the sentence was happy, sad, scared, angry, or neutral. Although the sentences were neutral in content (an example is “her coat is on the chair”), the sentences were spoken by one of two talkers who were instructed to speak the sentence using one of the five emotions, and to really exaggerate the emotion. Sentences were recorded by one male talker and one female talker.

This article has a mountain of interesting results, but I’m going to focus on a few results that I found particularly interesting – I definitely encourage you to check out the article and look at the rest of the results yourself!

CI users (children and adults) had more trouble identifying vocal emotions than their normally-hearing peers


FIG. 5 of Chatterjee, et al – vocal emotion recognition scores for all test subject groups

The above figure (FIG. 5 from the article) shows the performance of each group (adults with normal hearing [aNH]; adults with cochlear implants [aCI]; children with normal hearing [cNH]; and children with cochlear implants [cCI]). Since there were 5 choices of emotion for each sentence, if a participant had guessed randomly, they would have scored 20% correct (this is marked in the figure by the black horizontal line). As you can see, on average, all of the groups did well above chance. However, while the normally-hearing participants, both adults and children, got almost 100% correct, the CI users had more trouble. The researchers found that the children with cochlear implants performed worse than both adults and children with normal hearing and, in general, similarly to adults with cochlear implants.

Another interesting thing you can see in the figure is the effect of the gender of the talker – in particular, CI users did worse identifying emotion for the male talker compared to the female talker. This is especially true for the adult CI users. One note of caution on this result though – the study only used sentences spoken by 1 male and 1 female, so this data isn’t enough to extrapolate CI users’ ability to recognize emotion for male talkers vs. female talkers in general.

Emotions that were easily confused & corresponding acoustic cues

The graph above (FIG. 5 from the article) shows that CI users did worse at identifying emotions than the normal hearing participants, but that’s for all emotions lumped together. The researchers also looked at what emotions the participants were likely to confuse for each other – for example, is happy often mistaken for scared?

One way to look at which emotions are confused for each other is by constructing a confusion matrix from the responses. Here’s an example of the confusion matrices for the male talker for adults (top matrix) and children (bottom matrix) with CIs (adapted from FIG. 10 of Chatterjee et al.)


Adapted from FIG. 10 of Chatterjee, et al. – confusion matrices for adult (top) and children (bottom) CI users for the male talker.

Each block in the confusion matrix indicates the number of times the emotion indicated in the column header was identified as the emotion indicated in the row header (averaged over all participants in each group). There were 12 sentences spoken with each emotion, so if a particular group (for example, adults with CIs) were to get a perfect score, the diagonal entries would all say “12.” Instead, in the two confusion matrices shown above, you can see that the diagonal values are higher than the off-diagonal values, but none of the entries are 12, indicating that none of the emotions were correctly identified by CI users 100% of the time.
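
If you’ve never built a confusion matrix before, here’s a minimal Python sketch of the bookkeeping, using the same layout as described above (columns are the emotion the talker intended, rows are the emotion the listener picked). The trials are invented for illustration and are not responses from the study.

```python
EMOTIONS = ["happy", "sad", "scared", "angry", "neutral"]


def confusion_matrix(trials):
    """Count responses: matrix[row][col], row = response, col = intended emotion."""
    index = {emotion: i for i, emotion in enumerate(EMOTIONS)}
    matrix = [[0] * len(EMOTIONS) for _ in EMOTIONS]
    for intended, response in trials:
        matrix[index[response]][index[intended]] += 1
    return matrix


def percent_correct(matrix):
    """Correct answers sit on the diagonal (response == intended)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100 * correct / total


# Hypothetical (intended, response) pairs for one listener
trials = [
    ("happy", "happy"), ("happy", "scared"),
    ("scared", "happy"), ("scared", "scared"),
    ("sad", "sad"), ("angry", "neutral"),
]

matrix = confusion_matrix(trials)
for emotion, row in zip(EMOTIONS, matrix):
    print(f"{emotion:>8}: {row}")
print(percent_correct(matrix))  # 50.0
```

With 5 possible answers, a listener guessing at random would be expected to land on the diagonal about 1 time in 5, or 20% of the time.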

If we look at off-diagonal entries with relatively high values, we can see which emotions were often confused with one another. I highlighted one example in red – “happy” and “scared.” (“Angry” and “neutral” is another pair that tended to be confused by CI users for the male talker). Note that these are only the responses for the male talker – FIG. 10 in the article shows confusion matrices for both male and female talkers and for both CI users and normally-hearing participants.

After looking at which emotions tended to be confused with each other, I think it’s interesting to see which acoustic cues tend to differentiate the easily confused emotions to see if it makes sense that CI users would confuse them. In this study, the authors looked at how 5 different acoustic cues vary for different emotions. Before I talk about those results, I’ll quickly explain the cues that the study analyzed:

  1. Mean F0 Height – F0 stands for “fundamental frequency.” Mean F0 height basically means the average pitch of the talker’s voice. So, a bass’s mean F0 height is lower than a soprano’s, and male mean F0 height tends to be lower than female mean F0 height.
  2. F0 Range – This indicates how much the pitch of a talker’s voice varies over a sentence. If, over the course of the sentence, the speaker’s voice goes up and down a lot, they’d have a relatively high F0 range. Conversely, if they speak in a monotone, they’d have a lower F0 range.
  3. Duration – This is pretty simple – more quickly spoken sentences will have a shorter duration.
  4. Intensity Range – This indicates how much the speaker’s voice varies in loudness over the sentence.
  5. Mean dB SPL – This indicates the average loudness over the course of the sentence.
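If it helps to see the five cues as numbers, here is a rough Python sketch that computes them from a pitch track and a loudness track for one sentence. Everything about the setup is an assumption on my part, not the method used in the article: the function name `acoustic_cues`, the 10 ms frame step, the convention that an F0 of 0 marks an unvoiced frame, and the choice to measure both ranges as simple max-minus-min. The example contours are invented.

```python
from statistics import mean

def acoustic_cues(f0_hz, intensity_db, frame_step_s=0.01):
    """Summarize one sentence from per-frame pitch (Hz) and loudness (dB).

    Assumptions (not from the article): F0 == 0 marks an unvoiced frame,
    and "range" is simply max minus min.
    """
    voiced = [f for f in f0_hz if f > 0]
    return {
        "mean_f0_hz": mean(voiced),                        # 1. mean F0 height
        "f0_range_hz": max(voiced) - min(voiced),          # 2. F0 range
        "duration_s": len(intensity_db) * frame_step_s,    # 3. duration
        "intensity_range_db": max(intensity_db) - min(intensity_db),  # 4.
        "mean_db": mean(intensity_db),                     # 5. mean level
    }

# HYPOTHETICAL contours: same loudness track, very different pitch movement.
intensity = [60, 65, 55, 66, 62]
monotone = acoustic_cues([110, 112, 0, 111, 109], intensity)
lively = acoustic_cues([100, 180, 0, 220, 140], intensity)

print(monotone)
print(lively)
```

The two invented sentences have identical duration and loudness cues, but the "lively" one has a much larger F0 range than the "monotone" one, which is exactly the kind of difference a listener would need good pitch information to hear.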

And here are graphs (adapted from FIG. 1 of Chatterjee, et al.) showing how the acoustic cues vary for the different emotions. Although there’s a lot of interesting information in here, I’m just going to focus on the male talker’s duration and F0 range for the “happy” and “scared” sentences, since those two tended to be confused, as discussed above.


Acoustic cues for different emotions – adapted from FIG. 1 of Chatterjee, et al.

As you can see from the red boxes in the figure above, the male talker tended to speak “happy” and “scared” sentences with similar durations (look at the red boxes in the panel in the middle row, left column). However, he tended to vary pitch a lot more for “happy” sentences than for “scared” sentences (look at the red boxes in the top right panel labeled “F0 range”). Recall that duration tends to be conveyed well through the CI. However, variations in a speaker’s pitch (how much their voice goes up and down) tend to not be conveyed well through the CI. So, for the male talker, “happy” and “scared” were very similar to each other in a cue that is easy for CI users to use (duration), but they varied a lot in a cue that is hard for CI users to use (F0 range).

This suggests that CI users tend to confuse emotions that vary primarily in acoustic cues that are not well-conveyed by the CI. (I want to be careful not to overstate this: I’m only looking at one pair of emotions that were easily confused for one of the two talkers. Also, the data in the article were produced based on just one male talker and just one female talker, so it’s possible that other talkers vary acoustic cues differently for different emotions – the authors have since collected data from many more talkers, so hopefully we will know more about acoustic cues underlying different emotions soon!)

Comparison of CI users to their peers using a CI-simulator

Chatterjee et al. tested normally-hearing adults and children using a CI simulator to compare the performance in the CI simulation to the actual performance by the CI users. This might sound sort of strange – why simulate the CI users when they collected actual data from the CI users?! One reason is that this particular type of CI simulation, the vocoder, lets us look at a particular type of deficit faced by CI users: poor spectral resolution. Here’s one way to think about spectral resolution – imagine banging on a piano with a ball – using a smaller ball corresponds to having better spectral resolution (because the smaller ball hits fewer keys), and using a larger ball corresponds to having worse spectral resolution (because the larger ball hits more keys). Using the vocoder, we can see how having better or worse spectral resolution affects performance on a particular task, in this case, identifying vocal emotion. This lets us see whether spectral resolution is important at all for performing the task, as well as how improving spectral resolution might improve performance.

One of the main parameters we can vary in the vocoder is the “number of channels.” Let’s go back to the ball example – 4 channels in the vocoder might correspond to banging on the piano with a basketball (worse spectral resolution), whereas 16 channels might correspond to using a golf ball (better spectral resolution). Although neither ball sounds great, you can imagine that the golf ball is better. This link has examples of what vocoded speech sounds like for different numbers of channels (scroll down to section 2) – if you listen to the sentences there, you’ll notice that it’s pretty easy to understand the sentence with 15 channels, but it’s really hard with 1 or 5 channels.
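Here is a small Python sketch of the "ball size" idea: it splits a frequency range into 4, 8, or 16 channels and prints how wide the widest channel is in each case. This only illustrates the channel-splitting step, not a full vocoder, and the details are my own assumptions rather than anything from the article: the function name `channel_edges`, the 200 to 7000 Hz range, and the choice to space the channels evenly on a logarithmic scale (real CI simulations may pick their bands differently).

```python
def channel_edges(n_channels, low_hz=200.0, high_hz=7000.0):
    """Band edges (Hz) for n_channels spaced evenly on a log-frequency scale.

    The frequency range and log spacing are illustrative assumptions.
    """
    ratio = (high_hz / low_hz) ** (1.0 / n_channels)
    return [low_hz * ratio ** i for i in range(n_channels + 1)]

def widest_band_hz(edges):
    """Width of the widest channel: a rough stand-in for the 'ball size'."""
    return max(hi - lo for lo, hi in zip(edges, edges[1:]))

for n in (4, 8, 16):
    edges = channel_edges(n)
    print(f"{n:>2} channels: widest band is about {widest_band_hz(edges):.0f} Hz wide")
```

Within each channel, a vocoder throws away the fine frequency detail and keeps only how the loudness in that band changes over time. So the wider the bands (fewer channels), the more "keys" get smeared together, which is the basketball end of the analogy.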

Ok, so back to the study – Chatterjee et al. tested normally-hearing adults and children using the vocoder with different numbers of channels – adults listened to 4 (worst spectral resolution), 8, and 16 (best spectral resolution) channels, and children only listened to 8 channels. Here’s a figure (adapted from FIG. 6 of Chatterjee, et al.) showing the results:


Performance with a CI simulation – adapted from FIG. 6 of Chatterjee, et al.

If you look at the red and blue boxes in the figure above, you can see that both adults and children with CIs performed similarly to normally-hearing adults listening to a simulator with 8 channels (a medium amount of spectral resolution), and that a simulator with 16 channels (making the spectral resolution better) would have improved performance for at least the female talker.

I think the most interesting thing about this figure is how poorly normally-hearing children listening to the CI-simulator did! Notice that their scores (highlighted by the green box) are much worse than those of the adults listening to the 8-channel simulator, AND, interestingly, much worse than those of the children with CIs! This indicates the huge benefit that children with CIs are receiving – they are performing, at least with respect to vocal emotion identification, like adults with CIs, and much better than normally-hearing children listening to a CI-simulator (probably because the children with CIs hear everything in daily life through the CI, whereas it probably takes time for children listening to a simulator to adapt to the sound of the simulations).

My Takeaways

If you’ve read this far – thank you! (Or maybe you’re my husband reading this under duress? Hi, G!)

I think this study has interesting implications for speech therapy for children with CIs – it’s clear from this data that at least some children have trouble identifying different vocal emotions, and focusing on this in some way might go a long way towards overcoming this deficit.

This study only looked at children with CIs, so it’s not clear from this whether children with milder hearing loss who wear hearing aids face the same problems. From interacting with T (9 months, with a mild hearing loss), I think he definitely notices different vocal emotions – for example, he will look up very attentively if I start talking in an angry or frustrated way (umm, not that that happens a lot!), and he’ll stare at me with huge eyes. Also, if my husband and I start talking in an excited way, he’ll sometimes “join in” by smiling and squealing. Although he of course can’t yet label different emotions, I think he’s definitely picking up on some of the acoustic cues underlying them (although, in all of these examples, he’s also certainly picking up on our facial expressions and body language as well).


[1] Schorr, JA. Roth, FP. Fox, NA. “Quality of Life for Children with Cochlear Implants: Perceived Benefits and Problems and the Perception of Single Words and Emotional Sounds.” Journal of Speech, Language, and Hearing Research. Vol. 52, 141-152. 2009.