Article Review – “Musician Advantage for Speech-on-Speech Perception”

Today, I want to talk about a recently published article (full text here) that isn’t directly related to babies or hearing loss, but that I found really interesting and wanted to share! The article is “Musician Advantage for Speech-on-Speech Perception.” (Baskent, D. and Gaudrain, E. “Musician Advantage for Speech-on-Speech Perception.” J. Acoust. Soc. Am. 139, EL51. 2016).

Also, this paper got some great publicity in Scientific American!


Anyone who’s tried to have a conversation in a crowded bar or restaurant knows that understanding what one person is saying over the background noise of other people talking is one of the hardest listening tasks (and one that people with hearing loss struggle with the most!). One of the challenges of understanding speech in the presence of other, competing speech is segregating the different talkers so you can focus on the one person you want to hear (I talked a bit about differences between babies and adults in this type of task here). This problem is often called the “cocktail party problem” – that is, the problem of picking out and understanding the one person you’re having a conversation with in a noisy, crowded environment full of other people talking.

The authors of this study hypothesized that musicians would be better able to understand speech in the presence of other, competing speech than non-musicians. If musicians ARE better at understanding speech-on-speech, this might be for a few different reasons. First, musicians are better at identifying subtle changes in pitch (something they do all the time to know if they are playing something correctly and in tune!), and this might be really helpful for separating multiple speech streams. For example, they might be able to use pitch differences to group words that they hear as belonging to different voices. Secondly, over decades of practice, musicians hone their “listening skills” – so it might be that they are just better at shifting their auditory focus to what they want to hear than non-musicians.

So, the researchers first wanted to see if the musicians had an advantage at all. They also wanted to know, if the musicians did have an advantage, if the advantage seemed to be related to their better ability at detecting pitch changes, or if it seemed to be more generally related to an increased ability to shift focus to different speech streams.

The Study

The researchers tested 18 musicians and 20 non-musicians on their ability to understand a sentence (the target) in the presence of one competing talker (the masker) – so the subjects had to understand one person talking who was competing with a second person talking. In order to qualify as a musician for this study, participants had to have 10+ years of training, to have begun musical training before they were 7 years old, and to have received musical training within the past 3 years.

To probe whether musicians were more able to take advantage of subtle pitch changes than non-musicians, the researchers manipulated how different the target sentence was from the masker sentence in 2 ways:

  1. The fundamental frequency (F0) – the fundamental frequency (F0) indicates the voice pitch of a person’s speech. So, men generally have lower F0s than women, adults have lower F0s than children, etc.
  2. An estimated Vocal Tract Length (VTL) – The vocal tract is a cavity that filters sounds that you produce – in a very simplified view, it’s kind of like a tube that goes from the vibrating vocal folds at one end to your mouth at the other end, and it helps shape different sounds that you produce to make them sound like different vowels or consonants. The length of the vocal tract varies across people – children have shorter vocal tracts than adults, and men generally have longer vocal tracts than women. VTL doesn’t directly affect voice pitch (like F0), but it changes other frequencies in speech sounds (the formants – definitely getting a bit technical, but really interesting!). If you have two recordings of people talking and they have the same F0 but different VTLs, the pitch (how high or low their voice is) will be the same, but the quality and characteristics of their voice will sound different – that’s the VTL at work!
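To make the F0/VTL distinction concrete, here’s a toy sketch (my own illustration, not from the paper) using the classic closed-open tube approximation of the vocal tract: shortening the tube raises every resonance (formant), while F0 is set independently by the vocal folds. The speed-of-sound value and tract lengths are just rough, typical numbers.

```python
# Quarter-wavelength resonator model: a tube closed at the glottis and open
# at the lips resonates at F_n = (2n - 1) * c / (4 * L). Shortening the
# tract (smaller VTL) pushes every formant up; F0 is unaffected.

SPEED_OF_SOUND = 35000  # cm/s, roughly the speed of sound in warm air

def formants(vtl_cm, n=3):
    """First n resonant frequencies (Hz) of a closed-open tube of length vtl_cm."""
    return [round((2 * k - 1) * SPEED_OF_SOUND / (4 * vtl_cm)) for k in range(1, n + 1)]

print(formants(17.5))  # a typical adult male vocal tract length
print(formants(14.0))  # a shorter tract: same F0, but every formant is higher
```

Real formant patterns also depend on how the tongue and lips shape the tube, so this uniform-tube model only captures the overall VTL effect – but it shows why two voices with identical pitch can still sound like different people.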

The researchers used some fancy software to manipulate the F0 and VTL of the target sentences and the masker sentences so that, in each trial the subjects listened to, the target and masker sentences were more alike or less alike. They measured how well musicians and non-musicians were able to understand the target sentences based on how similar the target sentence was to the masker sentence in terms of these two parameters.

And here are the results!

FIG. 1A (reproduced below) shows the average percent of the sentence the subjects correctly repeated back with various differences in VTL and F0 between the target and masker sentence. The leftmost panel shows the smallest difference in VTL between the target and masker sentences (in the leftmost panel, there was no difference in VTL), and the rightmost panel shows the largest difference in VTL between the target and masker. Within a panel, going left to right increases the F0 difference between the target and masker sentences (so, within a panel, the leftmost points are where the target and masker sentences had the same average voice pitch as each other).

The data from the musicians is shown in purple and the data from the non-musicians is shown in green.


FIG. 1A from Baskent and Gaudrain


As you can see, both musicians and non-musicians were better able to understand the target sentence when the target sentence was “more different” from the masker sentence – if you look at the leftmost points in the leftmost panel (the hardest condition, where there was no difference in F0 or VTL between the target and masker sentences), musicians had about 70% intelligibility and non-musicians had about 55% intelligibility. However, looking at the rightmost points in the rightmost panel (the easiest condition, where there was the largest difference in both F0 and VTL between the target and masker sentences), both musicians and non-musicians did really well – better than 90% intelligibility. This makes a lot of sense – it’s easier to understand what a (high-pitched) child is saying when their speech is competing with a deep-voiced man compared to trying to understand what one child is saying when their speech is competing with another child.

And, regardless of how different the target and masker sentences were, musicians performed better than non-musicians – and by a fairly substantial margin – you can see that the purple points are generally ~15-20 percentage points higher than the green points.

Recall that the researchers wanted to know if a musician advantage was due to the musicians’ ability to detect very subtle pitch differences. Based on this data, it seems like the musician advantage might not primarily be due to musicians’ better pitch perception – in FIG. 1A above, the purple (musician) and green (non-musician) lines are parallel to each other, indicating that both groups were deriving equal benefit from larger pitch differences (larger differences in F0). So, it might be that the musicians are better than the non-musicians at focusing their auditory attention – after all, musicians do this all the time when they practice; for example, a musician in an orchestra has to both listen to what their section is playing as well as what the other sections are playing.

My Reflections

I couldn’t help relating the results of this study to my personal experiences! I started playing the violin and the piano when I was little (~6 years old), and played through college, although I haven’t played regularly since I finished college (many years ago).

I’ve long suspected that I’m much better at understanding speech in noise compared to my husband, G. (This is just a gut feeling, we haven’t thoroughly confirmed this). For example, when G and I go out to eat, I’m usually much better at simultaneously listening to him while eavesdropping on conversations next to us. If G wants to eavesdrop, he’ll have to stop talking to me and stop eating to focus his attention on what the people next to us are saying (while trying hard to look like he’s NOT paying attention to what they’re saying!). So, maybe it’s my childhood musical training that’s given me an edge here!









Article Review – “Infants’ listening in multitalker environments: Effect of the number of background talkers”

This week, I’m going to talk about this study (full text available!) looking at infants’ ability to listen in noise.  (Newman, R.S. “Infants’ listening in multitalker environments: Effect of the number of background talkers.” Attention, Perception, & Psychophysics. 71(4), 822-836, 2009).


As anyone who has tried to have a conversation in a noisy bar or restaurant can tell you, understanding speech in noisy environments, particularly when the noise is other people talking, is REALLY difficult.

Adults tend to do better at listening to a target talker when the competing noise is just one other talker compared to when the competing noise is several talkers all at once (like the din of a crowded restaurant). This difference could be for a couple of reasons. First, when the competing noise is just a single talker, adults may be able to recognize words or a topic of the competing talker, and use that context to selectively switch their attention away from the competing talker and toward the target talker. Secondly, speech naturally has pauses (like between syllables, phrases, or sentences), and adults may use pauses in a competing talker’s stream of speech to home in on what a target talker is saying – with multiple talkers, the pauses tend to all average out so that there aren’t really any pauses (just a steady “roar”), which might make listening in the presence of multiple talkers more challenging for adults.
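The “pauses average out” idea can be made quantitative with a toy calculation (my own illustration, not from the paper): if each background talker is independently silent some fraction of the time, a gap in the combined babble requires ALL of them to pause at once, which becomes vanishingly rare as talkers are added.

```python
# Assuming each background talker is silent a fraction p_silent of the time,
# and the talkers pause independently of one another, the combined background
# is silent only when every talker pauses simultaneously: p_silent ** n.

def gap_fraction(p_silent, n_talkers):
    """Fraction of time the combined background is fully silent."""
    return p_silent ** n_talkers

# If each talker pauses 30% of the time, gaps in the combined noise
# shrink rapidly as talkers are added:
for n in (1, 2, 4, 9):
    print(n, gap_fraction(0.3, n))
```

With nine talkers (as in the study's multi-voice condition), under these assumptions the background would be fully silent far less than 0.1% of the time – effectively the steady “roar” described above.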

In this study, the researchers wanted to see if this is true for infants, as well. Note that the infants in this study were normally-hearing, and I’m not sure how the results would translate to infants with hearing loss.

The Study

The researchers had infants (an average age of about 5 months old) listen to a target stream of speech in the presence of competing speech. The target stream of speech consisted of a person saying a name, which could either be the infant’s name, a name other than the infant’s name that was similar (a “stress-matched foil”), or a name other than the infant’s name that wasn’t particularly similar (a “non-stress-matched foil”). The competing speech was either a single voice, or a composite of 9 voices all talking at the same time.

The researchers measured how long the infants listened to each name in the presence of the competing speech, the idea being that infants would listen for a longer duration of time to someone saying their own name if they recognized it. So, the researchers wanted to see if the infants listened longer during trials in which their name was said in the single-voice noise condition compared to the multi-voice noise condition to see whether infants were better able to recognize their own name in one condition versus the other.

And now, on to the results! FIG. 1 shows how long infants listened to their name compared to the other names in both a multi-voice competing speech condition (left-most panel) and a single-voice competing speech condition (middle panel).


Interestingly, the infants listened significantly longer to their own name compared to other names in the nine-voice noise condition, but there was no difference in the single-voice noise condition. This suggests that infants had more trouble understanding speech (in this case, recognizing their name) in the single-voice noise condition – the opposite of the pattern seen in adults!

The researchers hypothesized that the infants might have had more trouble in the single-voice noise condition because they might have recognized the single voice as speech and found it interesting, or possibly because they recognized some of the words in the single-voice competing speech and therefore focused on it. This is different from what an adult might do in the same situation – if an adult is trying to focus on one talker, but there is a single competing talker nearby, they might recognize words from each conversation and realize that the topics of the conversations are different. For example, the first talker might be saying words like “breakfast,” “pancakes,” and “eggs,” and the second talker might be saying words like “rain,” “umbrella,” and “soaked” – an adult listener could use these words to identify the topic of each conversation and then target their attention on the conversation they’re interested in (this all happens subconsciously, of course!). A baby, on the other hand, might recognize a few words in each conversation, but might not have the vocabulary to group the recognized words into topics, making the two conversations harder to disentangle.

In the case of a multi-talker competing background noise, neither the adult nor the baby would recognize individual words in the background noise – this might be detrimental to the adult (who can’t segregate the noise from the target speech based on conversation topic or gaps in the noise), but might be helpful to the baby (who isn’t distracted by a competing talker that seems like they might be saying something interesting).

To try to address the question of why the single-talker competing speech condition was so difficult for the infants, the researchers repeated this task, but using single-talker speech played BACKWARDS! In this case, the competing speech would have some acoustic properties similar to single-talker speech played forwards (e.g., gaps in the speech, changes in loudness, changes in pitch, etc.), but would be different in that the infants wouldn’t be able to recognize any words.

The results of this experiment are shown in FIG. 1 (above) in the right-most panel – as you can see, there was no difference in how long the infants listened to their own names versus other names in the single-talker speech played backwards condition. This indicates that the infants had a hard time recognizing speech in the presence of the single-talker backwards noise. This in turn suggests that the infants’ difficulty with understanding speech in the presence of a single competing talker is not due to recognizing some words in the competing speech and finding that distracting, but rather due to other characteristics of competing single-talker speech.

My Reflections

I thought it was so interesting that adults find a multi-talker background noise (like a restaurant) to be more difficult than a single competing talker but that infants are the opposite. I often extrapolate my experiences to T – if we are in a crowded restaurant, I assume he must have a harder time understanding what we’re saying than if there’s just one or two people talking nearby, because *I* find the crowded restaurant more difficult to listen in. It never occurred to me that it might be exactly the opposite for T!

This article also highlighted to me how much cognitive development is required for babies to mature to the point where they can listen to speech in noisy environments the way adults do. For example, they need to learn enough vocabulary to be able to group words in a conversation into topics, learn how to listen in the gaps of competing speech (like between sentences or phrases) to focus in on the target speech, and all sorts of other things – and all of this takes time and experience! I think this is especially important to remember because infants often spend a lot of their waking hours in environments that are very noisy – like daycare!

Additionally, this is yet another study that made me think about the importance of hearing aids for children with hearing loss – this study was done with normally-hearing infants, and they had a hard time understanding speech in noise – this difficulty must be so much worse for infants with hearing loss!


Article Review – “Vocalizations of Infants with Hearing Loss Compared with Infants with Normal Hearing: Part II – Transition to Words”

Last week, I talked about Part 1 of this study, which compared the initial, babbling stage of infant language development for infants with hearing loss and normally-hearing infants. This week, I want to talk about Part 2 of the study, which looked at how babies, as they got older, transitioned from babbling to producing words. Here’s a link to a full PDF of the study.


Part 1 of this study found that infants with hearing loss (HL) generally are delayed relative to normally hearing (NH) infants in the babbling stage of language development. HL infants took longer to begin babbling, and, once they began babbling, were slower to acquire particular types of consonants, such as fricatives (“sss,” “shhh,” “f,” etc.). The researchers wanted to then look at older babies to see whether HL infants were also delayed in transitioning from babbling to producing words relative to NH infants.

The Study

The infants included in this study were the same as those in Part 1 – to recap, there were 21 NH infants and 12 HL infants. The HL infants varied a lot in degree of hearing loss, and three received cochlear implants (CIs) during the course of the study. For all infants, language productions were monitored during play sessions with a caregiver (typically the infant’s mother), and these sessions were generally conducted every 6 weeks. In Part 2 of the study, data from sessions when the infants were between 10 and 36 months old were used.

Let’s get to the results!

The researchers analyzed the infants’ language productions during the sessions in 2 broad categories: the proportion of different utterance types at different ages and the structural characteristics of words produced at 24 months.

To look at the proportion of different utterance types at different ages, the researchers coded each utterance produced by an infant during a session as belonging to one of 3 utterance types:

  1. Non-communicative – these were speechlike sounds but were more vocal play than attempts to communicate. Examples include babbling that wasn’t directed to an adult.
  2. Unintelligible communicative attempts – these were vocalizations that were a) directed to an adult and b) served a communicative purpose, such as getting the adult to do something, seeking attention, etc. Some of these might have been attempts by the infant to say a particular word, but weren’t recognized by the caregiver or the researchers as a word.
  3. Words – the researchers pointed out that it’s tricky to decide what constitutes a word. For this study, utterances were classified as words if: 1) at least one vowel and consonant in the word attempted by the infant matched the “real” word (e.g., “baba” for “bottle”), 2) the utterance was a communicative attempt (see #2 above), and 3) it was clear that the child was attempting to say a word, for example, that the infant was imitating the parent or that the parent recognized the word and repeated it.

FIG. 1 of Moeller, et al. (reproduced below) shows the results of the analysis of utterance type for NH and HL infants at 16 months old and 24 months old.



FIG. 1 of Moeller, et al. – Proportions of different utterance types of NH and HL infants at 16 months and 24 months.

As you can see in FIG. 1, at a given age, the pattern of the proportion of different utterance types was different for NH infants compared to HL infants. For example, at 16 months, the NH infants were producing more unintelligible communicative attempts as well as more words compared to the HL infants. As another example, at 24 months, a greater fraction of utterances for the NH infants were words compared to the HL infants. Additionally, while both the NH infants and the HL infants produced more words at 24 months compared to 16 months, the researchers found that the magnitude of the increase was larger for NH infants. Interestingly, the researchers found that the pattern of utterance types for the HL infants at 24 months was similar to that of the NH infants at 16 months (I highlighted these in the red boxes in FIG. 1 above), indicating that the HL infants might follow a similar pattern of improvement over time, but delayed.

To look at the structure of word attempts by the infants at 24 months, the researchers randomly selected 25 words from each child’s transcripts during the experimental session and compared the word attempt with the actual, target word to assess both the complexity of the word attempt and how accurate the attempt was. They computed 7 different metrics:

  1. Mean syllable structure level (MSSL) – this metric was used in Part 1, as well, and I described this in more detail here.  As a quick recap, words with only vowels were scored with 1 point, words with a single consonant type were scored with 2 points (e.g., “ba” or “baba”) and words with two or more consonant types were scored with 3 points (e.g., “bada” or “dago”).
  2. Percentage of vowels correct – this indicates the percentage of the time that the infant’s vowel productions in their word productions matched the “correct” vowels in the corresponding word. For example, if the target word was “mama,” the child would get 100% for saying “gaga” or “baba” but 0% for saying “momo.”
  3. Percentage of consonants correct (PCC) – this is similar to as above, but with consonants. As an example, if the target word was “shoe,” the child would get 100% for “shoo,” “shee,” “shaw,” etc., but 0% for “too.”
  4. Phonological mean length of utterance (PMLU) – This measure is intended to identify children who attempt longer, more complicated words but produce them less accurately as compared to children who attempt shorter, simpler words but produce them more accurately. To calculate this, the child received 1 point for each vowel and consonant produced, and an additional point for each correctly produced consonant. For example, if the target word was “cat,” at the child produced “cat,” they would receive 3 points for producing “c,” “a,” and “t,” and an additional 2 points for correctly producing the “c” and “t,” for a total of 5 points. However, if the child had instead produced “da” for “cat,” they’d receive only 2 points – one each for the “d” and “a,” but no points for accuracy. In this way, the PMLU reflects both accuracy of the word production as well as the length of the word.
  5. Proportion of whole word proximity (PWWP) – This measure is intended to give an overall reflection of how accurately the child produced a particular word. It is calculated by dividing the PMLU of the word attempt by the PMLU of the target word. As described above, “cat” produced correctly would have a PMLU of 5, and “cat” produced as “da” would receive a PMLU of 2. Therefore, if a child produced “da” for “cat,” the corresponding PWWP would be 2/5, or 0.4.
  6. Word shape match – This measure indicates how accurate a child’s production of a word was in terms of shape/number of syllables. For example, if the target word was “cookie,” the target shape would be consonant-vowel-consonant-vowel (CVCV). If, instead of producing a word with a CVCV shape, the child produced one with just a CV shape (e.g., “di,” “koo,” “da,” etc.), this would not be a match.
  7. Words with final consonants – Word productions were given points for this metric if a target word ended with a consonant, and the child’s production of the word also ended with a consonant, even if the consonant wasn’t totally accurate. So, for example, if the target word was “goat,” the child would get points for producing “goat,” “got,” “god,” “goad,” etc.
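To make the PMLU and PWWP scoring above concrete, here’s a minimal sketch of the two calculations as described. One simplifying assumption: the attempt is aligned to the target phoneme by phoneme, whereas real scoring has to handle insertions and deletions.

```python
# PMLU: 1 point per segment the child produced, plus 1 bonus point for each
# consonant that matches the target. PWWP: attempt PMLU / target PMLU.
# Words are given as phoneme lists; a crude letter-based consonant set
# stands in for a real phoneme inventory.

CONSONANTS = set("bcdfghjklmnpqrstvwxyz")

def pmlu(target, attempt):
    """Phonological mean length of utterance for one word attempt."""
    score = len(attempt)  # 1 point per produced segment
    for t, a in zip(target, attempt):
        if t in CONSONANTS and a == t:
            score += 1  # bonus point for a correctly produced consonant
    return score

def pwwp(target, attempt):
    """Proportion of whole-word proximity: attempt PMLU over target PMLU."""
    return pmlu(target, attempt) / pmlu(target, target)

print(pmlu(list("cat"), list("cat")))  # 5: three segments + two correct consonants
print(pmlu(list("cat"), list("da")))   # 2: two segments, no consonant bonus
print(pwwp(list("cat"), list("da")))   # 0.4
```

This reproduces the “cat”/“da” example from the paper’s description: attempting a longer, harder word earns length points even when accuracy is low, which is exactly the trade-off PMLU is designed to capture.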

The results of the structural analysis of the children’s word productions are shown in Table 1 of Moeller, et al. (reproduced below).


Table 1 of Moeller, et al. – comparison of word structure for NH and HL children at 24 months old.

This table shows that, for every measure of word structure (the rows in the table), the NH children performed better than the HL children. The difference between the NH and HL children was statistically significant for every measure (this is indicated by the crosses next to the scores for the NH children in each row).

One of the things that I really like about this table is that it indicates the effect size for each metric of word structure (the right-most column of the table). The effect size tells you the strength of the finding. For the measure of effect size used in this paper (called “Cohen’s d”), an effect size of around 0.2-0.3 is considered a small effect, around 0.5 a medium effect, and more than 0.8 a large effect. So, as you can see from this table, for every metric of word structure, not only was there a statistically significant difference in performance between the NH children and the HL children, but the size of this difference was large.
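The Cohen’s d calculation itself is simple: the difference between the two group means divided by their pooled standard deviation, so the group difference is expressed in standard-deviation units. Here’s a quick sketch using the standard pooled-SD formula (the numbers plugged in below are hypothetical, not taken from the paper):

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Hypothetical group scores for illustration only (21 vs 12 children,
# matching the study's group sizes but with made-up means/SDs):
print(cohens_d(80, 10, 21, 70, 10, 12))  # 1.0 -> a "large" effect
```

A d of 1.0 means the two group means sit a full standard deviation apart, which is why effect sizes above 0.8 are conventionally labeled “large.”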

So, overall, the data in Table 1 indicates that compared to age-matched NH children, HL children were producing words that were less complex (contained fewer different types of consonants, were less likely to end in a consonant, and were shorter) and that tended to be less accurate representations of the target word (an incorrect number of syllables or producing an incorrect vowel or consonant).

The researchers also looked at the number of words each child could produce as a function of age. FIG. 4 of Moeller, et al. (reproduced below) shows this data (the top two panels of FIG. 4 show data from this study; the bottom two panels show data from two other studies for comparison). The number of words was determined by asking the child’s caregiver to fill out an evaluation at home at each time point.


FIG. 4 of Moeller, et al. – The number of words produced by HL children (top left panel) and NH children (top right panel) as a function of age.

In FIG. 4, you can see that the curves for the NH children (top right panel) are both steeper and shifted to the left compared to the curves for the HL children (top left panel). This indicates that the NH children began producing words at a younger age relative to the HL children, and that, once they began producing words, their vocabularies expanded at a faster rate. The researchers noted that there was considerable variability in the data (for example, you can see that some of the NH children had much shallower curves than others, indicating that they were acquiring words more slowly than their peers), but that the individual data collected in this study “suggest a much slower rate of early vocabulary development compared with NH children.”  (Moeller, et al. p. 636).

One cool thing – in the panel for the HL children (upper left), the curves with unfilled symbols indicate children with CIs – one of the best performing children in this group had a CI! I thought this was pretty remarkable!

Since there was so much variability within the group of HL children regarding degree of hearing loss, the researchers weren’t really able to say much about how degree of hearing loss affected language production in this study.

My Reflections

T has been babbling up a storm for a few months now, but this paper made me think about the different contexts of his babbling (e.g., non-communicative, unintelligible communicative, and words/word attempts). Of course, at this age, T’s babbling is essentially entirely non-communicative or unintelligible communicative (and no words/word attempts). Reflecting on these distinctions, I think that T tends to babble in the non-communicative category primarily when he’s relaxing – like riding in the stroller or in his crib at night (or the wee hours of the morning) – at these times, he’ll go on a long, uninterrupted soliloquy, complete with big variations in vocal inflection. T’s babbles that fall in the unintelligible communicative category seem to happen when we’re playing interactively with him (to tell us to do something again), when he wants something (usually food), or when he’s excited about something (he’ll often shout “DAY-DA!” while looking at us when he’s excited – usually when we open the refrigerator door). I think the distinction in types of communication based on activity/mood makes sense – if non-communicative babbling is a form of vocal play, (that is, allowing T to play with making different sounds), it makes sense that this would come most naturally to him when he’s just chilling.

At 10 months, T is on the young side compared to the ages of the children studied here. However, I think he’s allllllmost on the cusp of his first word. At least a couple times, it seems like he was fairly consistently saying “a-ga” for “again” (to ask us to do something again) and saying “bah-bol” for “bubble” (to ask us to blow more bubbles). I’m not sure these are consistent enough to count as his first word (for example, he’ll say “a-ga” at other times too), but it seems like he might be close. We try to really reinforce when we think he’s saying something that might have meaning – for example, if he says “dada” and it seems plausible that he’s saying something to or about his dad, we’ll make a big production of saying the word “dad.” We do the same thing for “again” and “bubble,” and I think this repetition is helping him connect the sound of the word to the concept/object.

One thing this study made me excited about – I didn’t realize how rapidly vocabulary grows once children start talking! I get the feeling that T is thinking some pretty fun thoughts (like when he starts grinning when he sees the trash can and races over to look inside), and I can’t wait to hear what he’s thinking once he starts talking.


Article Review – “Vocalizations of Infants With Hearing Loss Compared with Infants with Normal Hearing – Part 1: Phonetic Development”

As T’s (9 months) babbling has taken off, I’ve started to become interested in the order in which infants tend to acquire different speech sounds as well as how this might differ for infants with hearing loss vs. normally-hearing infants. I started doing a little Googling, and found this study (link to Abstract only) that compares vocalizations of infants with hearing loss to infants with normal hearing. (Moeller, M.P., Hoover, B., Putman, C., Arbataitis, K., Bohnenkamp, G., Peterson, B., Wood, S., Lewis, D., Pittman, A., and Stelmachowicz, P. “Vocalizations of Infants with Hearing Loss Compared with Infants with Normal Hearing: Part 1 – Phonetic Development.” Ear & Hearing, Vol. 28 No. 5, 605-627. 2007).

This study actually has two parts, the first looking at babbling with younger infants (up to 2 years old), and the second looking at older children and how they transition from babbling to acquiring words. This week, I’ll talk about part 1, and will hopefully write about part 2 next week.


It’s well-known that infants with hearing loss develop spoken vocabulary later than normally-hearing children. However, a lot of language development happens before children start speaking words. For example, infants typically start off making vowel sounds, and then progress to babbling (like “bababa,” “dadada,” etc.). Less is known about how hearing loss affects this earlier stage of language development.

The Study

The researchers enrolled a group of normally-hearing (NH) infants and a group of infants identified as having hearing loss (HL). This was a longitudinal study, so each infant was followed over time – their spoken language was measured in an experimental session conducted every 1.5-2 months, from when the study began (generally when infants were 4 months old) until they were 36 months old. There were 21 NH infants, and 12 HL infants. All of the infants with HL had assistive technology, typically hearing aids, although 3 received cochlear implants (CIs) during the course of the study. The degree of hearing loss varied a lot for the HL group; on average, across the group of HL infants, they had a 67 dB HL Better Ear Pure Tone Average (BEPTA – meaning that the average audiogram for the infant’s better ear measured at 500, 1000, and 2000 Hz was 67 dB HL). All of the HL infants were involved in some form of early intervention.
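The Better Ear Pure Tone Average (BEPTA) mentioned above is straightforward to compute; here’s a small sketch of the calculation as described (the audiogram threshold values below are made up for illustration):

```python
# BEPTA: average the audiogram thresholds at 500, 1000, and 2000 Hz for each
# ear, then take the better ear. Lower dB HL means better hearing, so the
# better ear is the one with the LOWER average threshold.

PTA_FREQS = (500, 1000, 2000)  # Hz

def pure_tone_average(thresholds):
    """Average threshold (dB HL) across the standard PTA frequencies."""
    return sum(thresholds[f] for f in PTA_FREQS) / len(PTA_FREQS)

def bepta(left_ear, right_ear):
    """Better-ear pure tone average across two audiograms (dB HL)."""
    return min(pure_tone_average(left_ear), pure_tone_average(right_ear))

left = {500: 70, 1000: 75, 2000: 80}   # hypothetical left-ear audiogram
right = {500: 60, 1000: 65, 2000: 70}  # hypothetical right-ear audiogram
print(bepta(left, right))  # right ear is better: (60 + 65 + 70) / 3 = 65.0
```

For reference, the study's HL group averaged a 67 dB HL BEPTA, which would fall in the severe range on most clinical scales.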

To collect the data, at each session, each infant played with a caregiver while their interaction was taped and then transcribed. The infants would play with a parent or guardian, and the researchers transcribed each vocalization by the infant – for example, identifying a particular vowel or consonant, whether a sound was a grunt, cry, or squeal, etc.

There were 3 main categories of metrics the researchers looked at:

  1. Volubility – this indicates how much the infants vocalized over a session – were they pretty chatty during the session, or fairly quiet?
  2. Age at which the infant began babbling
  3. Speech complexity – here, the researchers looked at what types of consonants the infants were producing at a particular age, as well as whether they were able to string different types of sounds together to make more complex sounds.

Let’s get to the results!


To measure volubility, for each experimental session, the researchers calculated the infant’s vocalizations per minute. Vocalizations could be any sounds other than stuff like grunts, screams, cries, etc. So, an infant with a higher volubility score would have vocalized more during the session compared to an infant with a lower volubility score.

FIG. 1 of the article (shown below) shows the volubility results for both NH infants (left) and HL infants (right). In the figure, volubility scores are shown for infants at 3 different ages – 8.5 months, 10 months, and 12 months. As you can see in FIG. 1, the volubility scores for HL infants were really similar to those of NH infants, and the researchers found no significant difference between the two groups. I thought it was pretty interesting that, at each age, the HL infants seemed to be vocalizing as much as the NH infants!


FIG. 1 of Moeller, et al. – Volubility of NH and HL infants as a function of age

Age of Babbling Onset

The researchers then quantified the age at which the infants began babbling. Although we (or at least, I!) tend to think of babbling as any infant pre-word “talking,” babbling technically requires a consonant-vowel (CV) pairing – examples include “ba,” “da,” “ga,” etc. CV pairs can also be chained together, either the same consonant and vowel (“baba”) or different consonants and/or vowels (“babo,” “bada,” etc.)

In order to set a criterion to define the age of babbling onset, the researchers identified the age at which the proportion of babbles out of the total vocal utterances exceeded 0.2 – so this was the age at which, during an experimental session, more than 20% of the infant’s vocalizations were consonant-vowel pairings.
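As a toy illustration, here’s how that onset criterion might be computed from session counts. This is just a sketch with made-up numbers – the actual study worked from transcribed play sessions, and the function name and data layout here are my own invention:

```python
def babbling_onset_age(sessions, threshold=0.2):
    """Return the first session age (in months) at which the proportion of
    canonical babbles (CV pairings) among all vocal utterances exceeds the
    threshold, or None if it never does."""
    for age, n_babbles, n_utterances in sessions:
        if n_utterances > 0 and n_babbles / n_utterances > threshold:
            return age
    return None

# Made-up longitudinal record: (age in months, CV babbles, total utterances)
sessions = [(6, 2, 40), (8, 5, 38), (10, 11, 45), (12, 20, 50)]
print(babbling_onset_age(sessions))  # 10 (11/45 is the first proportion over 0.2)
```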

FIG. 2 of the article (shown below) shows, at each age, the proportion of infants in the NH group (black bars) and HL group (white bars) who had started babbling (defined as more than 20% of their vocalizations during the session being CV pairings). As you can see, NH infants tended to begin babbling much earlier than HL infants – it took roughly 6ish additional months for the HL group to reach the milestone of having 50% of the infants in the group babbling compared to the NH group. The researchers also stated that, for the HL group, there was a correlation between the age at which the infants first received hearing aids and the age at which they began babbling, although this wasn’t statistically significant (possibly because there were only 12 infants in the group, and they varied a lot in degree of hearing loss).


FIG. 2 of Moeller et al. – Proportion of infants who had begun babbling by age

Babble Complexity

The researchers quantified the complexity of the sounds the infants were producing by scoring each utterance as follows:

  1. 1 point for utterances that were vowels or primarily vowels (like “ahhh,” “eeee,” “waaa,” etc.) – this was labeled SSSL1
  2. 2 points for utterances that had 1 type of consonant (like “ba,” “da,” “baba,” etc.) – this was labeled SSSL2
  3. 3 points for utterances that had 2 or more types of consonants (like “bada,” “gaba,” “gabo,” etc.) – this was labeled SSSL3
  4. 4 points for utterances with consonant blends, like “spun” – this was labeled SSSL4

FIG. 4 of Moeller et al. shows the proportion of utterances that belonged to each point category for both NH infants (top) and HL infants (bottom).


Adapted from FIG. 4 of Moeller et al. – proportion of utterances in each babble complexity category as a function of age

As you can see, for both NH infants and HL infants, vocalizations by the youngest babies (10-12 months) were dominated by the simplest type of vocalization – primarily vowels. Both groups tended to increase the proportion of more complex vocalizations – those containing consonants and multiple types of consonants – with age. One really interesting thing you can see in the above figure is that HL infants at 18-20 months had a babble complexity pattern that was similar to the NH infants at 10-12 months (I highlighted these in the red boxes above) – this is a pretty substantial delay. However, by the time the HL infants were 22-24 months old, the pattern resembled that of the NH infants at 18-20 months (highlighted in the green boxes above), indicating that the HL infants were closing the gap! This could be the result of amplification for the HL infants, early intervention services, and the fact that three of the HL infants received cochlear implants during this time period.

Phonetic Inventory

The researchers then looked at whether NH infants and HL infants differed in the rates at which they started saying vowels and different types of consonants. FIG. 5 of Moeller et al. (reproduced below) shows the infants’ progression in acquiring both vowels and consonants broken into different classes based on place of articulation. A consonant’s place of articulation indicates what part of the mouth is involved in obstructing the vocal tract – I wrote more about it here. Here’s a quick overview of the different classes of consonants shown in FIG. 5 below:

  1. bilabials – these are consonants produced with the lips pressed together (e.g., p, b, m, and w).
  2. labiodentals & interdentals – labiodentals are produced with the lower lip against the upper teeth (e.g., f and v); interdentals are produced with the tongue between the teeth (e.g., th).
  3. alveolars – these are produced with the tip of the tongue behind the top teeth (e.g., d and t).
  4. palatals – these are produced with the body of the tongue raised against the hard palate (e.g., j).
  5. velars – these are produced with the back part of the tongue against the soft palate (e.g., k and g).

Each panel in FIG. 5 shows the percent of sounds within a given category that the infants produced at a particular age. So, for example, there are 4 bilabial consonants (p, b, m, and w), and infants who could produce 2 out of the 4 at a particular age would receive a score of 50% for that age.


Adapted from FIG. 5 of Moeller, et al. – % of sounds produced in different phonetic categories as a function of age.

One thing that was interesting to me is that bilabial consonants seemed to be one of the “easier” sounds to produce in general (look at the top row, middle panel) – for both NH and HL infants, scores were fairly high at every age range, and the gap between NH and HL infants was fairly small as well. The researchers said that this might be because bilabial consonants tend to be very visually salient compared to other places of articulation – it’s pretty easy to see lips pressed together compared to where your tongue is inside your mouth! This might make it easier for infants to acquire bilabial consonants, since they can more easily see how they are formed.

Another interesting thing about Fig. 5 – the researchers found that acquisition of these different classes of sounds generally fell into 3 different categories, which I’ve highlighted by color in the above figure. For vowels and alveolar consonants, the HL infants were generally delayed relative to the NH infants, but their rate of acquisition was parallel (this is highlighted in blue above). For bilabial consonants and velar consonants, the HL infants seemed to be closing an initial gap relative to the NH infants – that is, their acquisition of these classes of consonants was converging with the NH infants (this is highlighted in green above). Conversely, for palatal consonants and labiodentals/interdentals, the HL infants seemed to be acquiring consonants in these classes at a slower rate than the NH infants – that is, over time, the gap between the HL infants and the NH infants widened. One thing to note is that, for both NH and HL infants, palatal and labiodental/interdental consonants (highlighted in red above) occurred less often in general compared to other consonants – regardless of hearing, children tend to take longer to produce these types of sounds, perhaps because they tend to be less common in English.

The researchers then broke the consonants up in a different way – into fricatives and non-fricatives. Fricatives are consonants that are produced by forming a small opening with the mouth and forcing air through – they include sounds like “ssss,” “shhhh,” “f,” and “zzz” – fricatives are the ones that sound kind of “hissy”! This hissiness also makes fricatives generally hard for people with hearing loss to hear – fricatives tend to have a lot of high frequency components and are often low in intensity. FIG. 6 of Moeller, et al. (reproduced below) shows the rate of acquisition of non-fricatives (left) and fricatives (right) for both NH and HL infants.


FIG. 6 of Moeller, et al. – Acquisition of non-fricative and fricative consonants.

As you can see, acquisition of the non-fricative consonants was parallel for both the HL and NH infants – both groups had a steady increase in production of non-fricative sounds. However, for fricatives, while the NH infants steadily increased their production of these sounds, the HL infants didn’t – they seemed sort of stuck from 10 months to 24 months and, in general, didn’t really add many consonants from this group into their repertoire. As I mentioned above, this might be because fricatives tend to be really hard to hear for people with hearing loss, so the HL infants might have not had enough exposure to these types of sounds to begin producing them.

My Reflections

I was particularly interested to read this study since T’s consonant inventory seems to have grown a lot just in the past 2 weeks. Although he’s been saying “da” for a while (EVERYTHING is “dada”!), he’s started more consistently saying “ba” and “ma” (both are bilabial) and, just in the past few days, has started saying “la” (I think this is alveolar). From the data presented in this study, it seems like bilabials tend to be one of the “easiest” categories of consonants – babies tend to produce the highest proportion of consonants in this class at earlier ages relative to other categories, and this might be because of how easy it is to see the lips pressed together when producing bilabial consonants. Although T’s preferred consonants (the ones we hear more often) are “da” (alveolar) and “ga” (velar), I think we’ve heard him produce most of the bilabial consonants at least a few times now. And, lately, if we really emphasize the position of our lips while saying “pa,” “ba,” or “mmm,” he’ll try to imitate us!

One of the things I think I gained from reading this study was an appreciation for the activities we do at speech therapy and a deeper understanding of how those activities will help T acquire different speech sounds. One thing we really focus on is drawing T’s attention to different sounds by pairing the sound with something interesting and visually salient – this gets him to really listen to the sound rather than just have it be background noise that he might not pay attention to. We’ll do this in different ways, for example, pointing at our mouths, waving toys or ribbons around as we make the sound, etc. I think that, especially for children with hearing loss, merely passively hearing different sounds isn’t quite enough, and having their attention drawn to the sound and the way your mouth looks when you make the sound can help tie everything together.

Once again, this study highlighted the importance of T wearing his hearing aids! I think it’s really important for him to get as much good, high-quality exposure to all these different speech sounds so that he can start to produce them, and this is especially important for fricatives (like, “sss,” “shhh,” “f,” etc.). The “s” sound in particular is really important for English grammar – it’s what turns a singular noun into a plural – and the study that I wrote about here found that children with hearing loss tend to have more trouble with this grammar rule than normally-hearing children.

Finally, on a happy (for me) note – there are a few bad words I’ve been known to accidentally say in front of T that start with fricatives (I’ll let you figure out what they are) – I’ve been thinking I need to clean up my language, since I’ve been worried that once T really starts talking, he’ll out me by repeating something he’s heard me say totally out of the blue. But, from the results of this study, it looks like children, whether normally-hearing or with hearing loss, don’t tend to really start producing fricatives until they are quite a bit older than T is now – so it looks like I have a little while before I have to be worried about T surprising me by dropping a fricative-bomb!

Article Review – “Voice Emotion Recognition by Cochlear-Implanted Children and Their Normally-Hearing Peers”

This week, I’m going to talk about a new study (PDF available for free through the link) by Chatterjee et al. (2015) that looked at how well adults and children with cochlear implants can identify vocal emotion, and how each group compares to their normally-hearing peers. (Chatterjee, M., Zion, D.J., Deroche, M.L., Burianek, B.A., Limb, C.J., Goren, A.P., Kulkarni, A.M., and Christensen, J.A. “Voice Emotion Recognition by Cochlear-Implanted Children and Their Normally-Hearing Peers.” Hearing Research, Vol. 322, 151-162. 2015).


Detecting and identifying emotions in speech is really important for communication and social interaction. For example, if you’re talking with someone, and they mention that they just bought new pants, it’s important to be able to identify any subtext underlying their statement. Are they excited that they finally had time to go shopping? Are they angry that they spilled coffee all over their old pants? Are they sad to admit a favorite pair will no longer button? Identifying the emotion behind the statement is crucial to knowing how to respond appropriately! And, identifying the emotion isn’t just important for following-up; one study has even found that the ability of children to identify vocal emotion is correlated with their assessment of quality of life [1].

In a face-to-face conversation, facial expressions can aid in identifying vocal emotions. However, it’s harder in non-face-to-face conversation, such as on the phone. In those situations, we rely entirely on acoustic cues to distinguish different emotions from each other. These acoustic cues can include stuff like how fast we talk, pitch, how our pitch changes over the course of a sentence, and loudness.

Cochlear Implants (CIs) convey some of these cues better than other cues. For example, CIs tend to convey speaking rate very well but they are pretty bad at conveying pitch and changes in pitch accurately. (This is a fairly complex topic, and I don’t want to get too into the weeds here, so for now I’ll leave it at that).

Since identifying vocal emotion could potentially rely on many different acoustic cues, some of which are not accurately conveyed by CIs, Chatterjee et al. wanted to measure how well CI users could identify vocal emotion in speech. They looked at both children (who were pre-lingually deafened), and adults (who were, for the most part, post-lingually deafened, and therefore acquired language as children prior to receiving a CI).

The Study

The researchers studied 4 groups of people: normally-hearing children, children with CIs, normally-hearing adults, and adults with CIs. All of the participants were asked to listen to several sentences, and, for each sentence, identify whether the emotion underlying the sentence was happy, sad, scared, angry, or neutral. Although the sentences were neutral in content (an example is “her coat is on the chair”), the sentences were spoken by one of two talkers who were instructed to speak the sentence using one of the five emotions, and to really exaggerate the emotion. Sentences were recorded by one male talker and one female talker.

This article has a mountain of interesting results, but I’m going to focus on a few results that I found particularly interesting – I definitely encourage you to check out the article and look at the rest of the results yourself!

CI users (children and adults) had more trouble identifying vocal emotions than their normally-hearing peers


FIG. 5 of Chatterjee, et al – vocal emotion recognition scores for all test subject groups

The above figure (FIG. 5 from the article) shows the performance of each group (adults with normal hearing [aNH]; adults with cochlear implants [aCI]; children with normal hearing [cNH]; and children with cochlear implants [cCI]). Since there were 5 choices of emotion for each sentence, if a participant had guessed randomly, they would have scored 20% correct (this is marked in the figure by the black horizontal line). As you can see, on average, all of the groups did well above chance. However, while the normally-hearing participants, both adults and children, got almost 100% correct, the CI users had more trouble. The researchers found that the children with cochlear implants performed worse than both adults and children with normal hearing and, in general, similarly to adults with cochlear implants.

Another interesting thing you can see in the figure is the effect of the gender of the talker – in particular, CI users did worse identifying emotion for the male talker compared to the female talker. This was especially true for the adult CI users. One note of caution on this result though – the study only used sentences spoken by 1 male and 1 female, so this data isn’t enough to extrapolate CI users’ ability to recognize emotion for male talkers vs. female talkers in general.

Emotions that were easily confused & corresponding acoustic cues

The graph above (FIG. 5 from the article) shows that CI users did worse at identifying emotions than the normal hearing participants, but that’s for all emotions lumped together. The researchers also looked at what emotions the participants were likely to confuse for each other – for example, is happy often mistaken for scared?

One way to look at which emotions are confused for each other is by constructing a confusion matrix from the responses. Here’s an example of the confusion matrices for the male talker for adults (top matrix) and children (bottom matrix) with CIs (adapted from FIG. 10 of Chatterjee et al.):


Adapted from FIG. 10 of Chatterjee, et al. – confusion matrices for adult (top) and children (bottom) CI users for the male talker.

Each block in the confusion matrix indicates the number of times the emotion indicated in the column header was identified as the emotion indicated in the row header (averaged over all participants in each group). There were 12 sentences spoken with each emotion, so if a particular group (for example, adults with CIs) were to get a perfect score, the diagonal entries would all say “12.” Instead, in the two confusion matrices shown above, you can see that the diagonal values are higher than the off-diagonal values, but none of the entries are 12, indicating that none of the emotions were correctly identified by CI users 100% of the time.

If we look at off-diagonal entries with relatively high values, we can see which emotions were often confused with one another. I highlighted one example in red – “happy” and “scared.” (“Angry” and “neutral” is another pair that tended to be confused by CI users for the male talker). Note that these are only the responses for the male talker – FIG. 10 in the article shows confusion matrices for both male and female talkers and for both CI users and normally-hearing participants.
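If you’ve never built a confusion matrix, here’s a minimal Python sketch of the idea, using made-up responses rather than the study’s data. As in the figure, columns are the presented emotion and rows are what the listener responded:

```python
EMOTIONS = ["happy", "sad", "scared", "angry", "neutral"]

def confusion_matrix(trials):
    """Build a confusion matrix from (presented, responded) pairs.
    Rows = responded emotion, columns = presented emotion."""
    idx = {e: i for i, e in enumerate(EMOTIONS)}
    m = [[0] * len(EMOTIONS) for _ in EMOTIONS]
    for presented, responded in trials:
        m[idx[responded]][idx[presented]] += 1
    return m

# Hypothetical responses: one correct "happy", one "happy" heard as
# "scared", and one "angry" heard as "neutral"
trials = [("happy", "happy"), ("happy", "scared"), ("angry", "neutral")]
m = confusion_matrix(trials)
# The happy-heard-as-scared trial lands off the diagonal:
print(m[EMOTIONS.index("scared")][EMOTIONS.index("happy")])  # 1
```

Correct responses pile up on the diagonal, and off-diagonal cells with big counts are exactly the easily-confused pairs discussed above.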

After looking at which emotions tended to be confused with each other, I think it’s interesting to see which acoustic cues tend to differentiate the easily confused emotions to see if it makes sense that CI users would confuse them. In this study, the authors looked at how 5 different acoustic cues vary for different emotions. Before I talk about those results, I’ll quickly explain the cues that the study analyzed:

  1. Mean F0 Height – F0 stands for “fundamental frequency.” Mean F0 height basically means the average pitch of the talker’s voice. So, a bass’s mean F0 height is lower than a soprano’s, and male mean F0 height tends to be lower than female mean F0 height.
  2. F0 Range – This indicates how much the pitch of a talker’s voice varies over a sentence. If, over the course of the sentence, the speaker’s voice goes up and down a lot, they’d have a relatively high F0 range. Conversely, if they speak in a monotone, they’d have a lower F0 range.
  3. Duration – This is pretty simple – more quickly spoken sentences will have a shorter duration.
  4. Intensity Range – This indicates how much the speaker’s voice varies in loudness over the sentence.
  5. Mean dB SPL – This indicates the average loudness over the course of the sentence.
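The five cues above are all simple summary statistics once you have per-frame pitch and intensity tracks for a sentence. Here’s a toy Python sketch with invented numbers – the study, of course, extracted these from actual recordings, and the frame length here is just an assumption:

```python
# Hypothetical per-frame tracks for one sentence (voiced frames for F0)
frame_s = 0.01                       # 10 ms analysis frames (assumed)
f0 = [110, 118, 132, 125, 140, 150]  # pitch track, Hz
intensity = [62, 65, 70, 68, 61]     # intensity track, dB SPL

mean_f0 = sum(f0) / len(f0)                        # 1. mean F0 height
f0_range = max(f0) - min(f0)                       # 2. F0 range
duration = len(intensity) * frame_s                # 3. duration (seconds)
intensity_range = max(intensity) - min(intensity)  # 4. intensity range
mean_spl = sum(intensity) / len(intensity)         # 5. mean dB SPL (simple average)

print(f0_range, intensity_range)  # 40 9
```

A monotone talker would have a small `f0_range`, while an animated one would have a large one, which is exactly the kind of difference that separates some emotions below.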

And here are graphs (adapted from FIG. 1 of Chatterjee, et al.) showing how the acoustic cues vary for the different emotions. Although there’s a lot of interesting information in here, I’m just going to focus on the male talker’s duration and F0 range for the “happy” and “scared” sentences, since those two tended to be confused, as discussed above.


Acoustic cues for different emotions – adapted from FIG. 1 of Chatterjee, et al.

As you can see from the red boxes in the figure above, the male talker tended to speak “happy” and “scared” sentences with similar durations (look at the red boxes in the panel in the middle row, left column). However, he tended to vary pitch a lot more for “happy” sentences than for “scared” sentences (look at the red boxes in the top right panel labeled “F0 range”). Recall that duration tends to be conveyed well through the CI. However, variations in a speaker’s pitch (how much their voice goes up and down) tend to not be conveyed well through the CI. So, for the male talker, “happy” and “scared” were very similar to each other in a cue that is easy for CI users to use (duration), but they varied a lot in a cue that is hard for CI users to use (F0 range).

This suggests that CI users tend to confuse emotions that vary primarily in acoustic cues that are not well-conveyed by the CI. (I want to be careful to not overstate this too much: I’m only looking at one pair of emotions that were easily confused for one of the two talkers. Also, the data in the article were produced based on just one male talker and just one female talker, so it’s possible that other talkers vary acoustic cues differently for different emotions – the authors have since collected data from many more talkers, so hopefully we will know more about acoustic cues underlying different emotions soon!)

Comparison of CI users to their peers using a CI-simulator

Chatterjee et al. tested normally-hearing adults and children using a CI simulator to compare the performance in the CI simulation to the actual performance by the CI users. This might sound sort of strange – why simulate the CI users when they collected actual data from the CI users?! One reason is that this particular type of CI simulation, the vocoder, lets us look at a particular type of deficit faced by CI users called spectral resolution. Here’s one way to think about spectral resolution – imagine banging on a piano with a ball – using a smaller ball corresponds to having better spectral resolution (because the smaller ball hits fewer keys), and using a larger ball corresponds to having worse spectral resolution (because the larger ball hits more keys). Using the vocoder, we can see how having better or worse spectral resolution affects performance on a particular task, in this case, identifying vocal emotion. This lets us see whether spectral resolution is important at all for performing the task, as well as how improving spectral resolution might improve performance.

One of the main parameters we can vary in the vocoder is the “number of channels.” Let’s go back to the ball example – 4 channels in the vocoder might correspond to banging on the piano with a basketball (worse spectral resolution), whereas 16 channels might correspond to using a golf ball (better spectral resolution). Although neither ball sounds great, you can imagine that the golf ball is better. This link has examples of what vocoded speech sounds like for different numbers of channels (scroll down to section 2) – if you listen to the sentences there, you’ll notice that it’s pretty easy to understand the sentence with 15 channels, but it’s really hard with 1 or 5 channels.
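To get a feel for what “fewer channels = worse spectral resolution” means, here’s a toy Python sketch. To be clear, this is not a real vocoder (a real one filters the waveform into bands and replaces each band’s fine structure with noise or tones); it just averages a magnitude spectrum within each channel, which is the “bigger ball hits more keys” idea in miniature:

```python
def smear_spectrum(spectrum, n_channels):
    """Collapse a fine-grained magnitude spectrum into n_channels equal
    bands by averaging -- fewer channels means coarser spectral detail."""
    out = []
    width = len(spectrum) / n_channels
    for ch in range(n_channels):
        lo, hi = round(ch * width), round((ch + 1) * width)
        band = spectrum[lo:hi]
        out.extend([sum(band) / len(band)] * len(band))
    return out

spectrum = [0, 1, 8, 1, 0, 0, 5, 0]  # two sharp spectral peaks
print(smear_spectrum(spectrum, 4))   # [0.5, 0.5, 4.5, 4.5, 0.0, 0.0, 2.5, 2.5]
print(smear_spectrum(spectrum, 2))   # [2.5, 2.5, 2.5, 2.5, 1.25, 1.25, 1.25, 1.25]
```

With 4 channels the two peaks are still distinguishable; with 2 channels they blur into broad lumps – which is roughly what losing spectral resolution does to the cues (like F0) that distinguish emotions.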

Ok, so back to the study – Chatterjee et al. tested normally-hearing adults and children using the vocoder with different numbers of channels – adults listened to 4 (worst spectral resolution), 8, and 16 (best spectral resolution) channels, and children only listened to 8 channels. Here’s a figure (adapted from FIG. 6 of Chatterjee, et al.) showing the results:


Performance with a CI simulation – adapted from FIG. 6 of Chatterjee, et al.

If you look at the red and blue boxes in the figure above, you can see that both adults and children with CIs performed similarly to normally-hearing adults listening to a simulator with 8 channels (a medium amount of spectral resolution), and that a simulator with 16 channels (making the spectral resolution better) would have improved performance for at least the female talker.

I think the most interesting thing about this figure is how poorly normally-hearing children listening to the CI-simulator did! Notice that their scores (highlighted by the green box) are much worse than the adults listening to the 8-channel simulator, AND, interestingly, much worse than the children with CIs! This indicates the huge benefit that children with CIs are receiving – they are performing, at least with respect to vocal emotion identification, like adults with CIs, and much better than normally-hearing children listening to a CI-simulator (probably because the children with CIs hear everything in daily life through the CI, whereas it probably takes time for children listening to a simulator to adapt to the sound of the simulations).

My Takeaways

If you’ve read this far – thank you! (Or maybe you’re my husband reading this under duress? Hi, G!)

I think this study has interesting implications for speech therapy for children with CIs – it’s clear from this data that at least some children have trouble identifying different vocal emotions, and focusing on this in some way might go a long way towards overcoming this deficit.

This study only looked at children with CIs, so it’s not clear from this whether children with milder hearing loss who wear hearing aids face the same problems. From interacting with T (9 months, with a mild hearing loss), I think he definitely notices different vocal emotions – for example, he will look up very attentively if I start talking in an angry or frustrated way (umm, not that that happens a lot!), and he’ll stare at me with huge eyes. Also, if my husband and I start talking in an excited way, he’ll sometimes “join in” by smiling and squealing. Although he of course can’t yet label different emotions, I think he’s definitely picking up on some of the acoustic cues underlying them (although, in all of these examples, he’s also certainly picking up on our facial expressions and body language, as well.).


[1] Schorr, J.A., Roth, F.P., and Fox, N.A. “Quality of Life for Children with Cochlear Implants: Perceived Benefits and Problems and the Perception of Single Words and Emotional Sounds.” Journal of Speech, Language, and Hearing Research, Vol. 52, 141-152. 2009.


Article Review – “Statistical Learning by 8-Month-Old Infants”

Between last week’s article review and this week’s article review, I seem to be on a bit of a kick talking about infant language development. Don’t worry, I have something totally different in mind for next week!

Like last week, the article I’ll talk about here is a classic, and it describes an elegant body of research that changed how scientists think about how infants acquire language. The article is “Statistical Learning by 8-Month-Old Infants”  and is available for free as a PDF through the link. (Saffran, J.R., Aslin, R.N., Newport, E.L. “Statistical Learning by 8-Month-Old Infants.” Science, Vol. 274, No. 5294, 1926-1928, 1996). Note that the infants studied had normal hearing, and I’m not sure how the results would change with infants with hearing loss.


The study described in the article looked at how infants learn to segment a stream of speech into words – that is, identifying which chunks in a stream of speech constitute a word (rather than a syllable, a phrase, a sentence, etc.). When I first heard about the question underlying this study, my initial reaction was that this is silly – aren’t the words marked by pauses on either side (like how words are marked by spaces in written text)? It turns out this isn’t true at all!

For example – here’s the sound signal from me speaking the sentence “I really like Mississippi.” (I chose this sentence because of the variability in the number of syllables per word, not out of any particular fondness for Mississippi; I’ve actually never been to Mississippi!).


As you can see there, sometimes the word boundaries line up with the pauses in the signal, such as between the words “I” and “really.” Other times, there’s pretty much no pause between words, such as between the words “really” and “like.” And other times, there are large pauses within a word, such as in the word “Mississippi.” So, pauses or gaps are really not a good indicator of word boundaries!

Before I keep going, I can’t resist sharing a little anecdote – I had never considered how difficult the problem of identifying word boundaries in speech is until I watched my husband try and learn Tamil, the language that my family speaks. He would hear someone say something like “I ate an ice cream cone last Tuesday,” (but in Tamil), and he would ask me questions that were the equivalent of “what’s an eamco? what does asttue mean?” I would get so frustrated, because I had no idea what he was asking! (I’m a little embarrassed to admit that more than one visit with my family included me yelling “THAT’S NOT A WORD!” at my husband.).

All of this to say – segmenting a stream of speech into word chunks is actually really hard, even though we seem to learn to do this effortlessly in our native languages! Saffran et al. studied how infants learn to segment a stream of speech into words.

The Study

One potentially powerful cue that could be used to identify word boundaries is the statistical regularity in how frequently one sound tends to follow another in a stream of speech. Saffran et al. give the example of the phrase “pretty baby” – over a huge corpus of speech (like what you might hear spoken over many days), the sound “ty” is more likely to follow the sound “pre” than the sound “ba” is to follow “ty.” If you were to keep track of how likely different sounds are to follow other sounds, over time, you might figure out that “pretty” is one word chunk and “baby” is another word chunk (of course, just knowing how likely one sound is to follow another doesn’t give you any idea of the meaning of the word; that’s a different problem to solve!).

Saffran et al. had previously shown that adults can use these probabilities to learn to segment speech into words, so they wanted to extend this work to see if infants can also use this information.

In Experiment 1, the researchers created a made-up language that had 4 nonsense words that each had 3 syllables. The words were tupiro, golabu, bidaku, and padoti. (These were the words for one of two conditions; they tested two groups of infants with two different sets of words, to make sure that the infants didn’t have a bias for particular nonsense words.) They then played a continuous stream of speech that consisted of these 4 words repeated in random order for 2 minutes. So, the speech might have sounded like “tupirogolabubidakupadotigolabubidaku…” The words were spoken in a monotone, and there were no pauses or stresses on particular syllables that might have indicated where word boundaries were. Note that, since there were no pauses between words and no other acoustic cues (tone, stress, etc.), the only difference between words and non-words was how frequently one syllable followed another. So, for example, “ku” always followed “da” (in bidaku) with 100% probability, but “pa” would only follow “ku” in the case where the word padoti followed bidaku (with 33% probability) – this would indicate that “da” and “ku” go together, whereas “ku” and “pa” generally don’t.
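To make the transitional-probability idea concrete, here’s a small sketch (my own illustration, not the authors’ code). It builds a random stream from the four nonsense words – assuming, as the 33% figure above implies, that the same word never occurs twice in a row – and then computes how often one syllable follows another:

```python
# Sketch of the statistical-learning computation (my own illustration, not
# the authors' code). Build a random stream from the four nonsense words,
# assuming no word repeats back-to-back, then compute transitional
# probabilities between adjacent syllables.
import random
from collections import Counter

words = ["tupiro", "golabu", "bidaku", "padoti"]
syllables = {w: [w[0:2], w[2:4], w[4:6]] for w in words}  # e.g. tu-pi-ro

random.seed(0)
stream, prev = [], None
for _ in range(300):  # ~300 word tokens, roughly a 2-minute stream
    w = random.choice([x for x in words if x != prev])
    stream.extend(syllables[w])
    prev = w

pairs = Counter(zip(stream, stream[1:]))
firsts = Counter(stream[:-1])

def tp(a, b):
    """P(next syllable is b | current syllable is a)."""
    return pairs[(a, b)] / firsts[a]

print(tp("da", "ku"))  # within a word (bidaku): always 1.0
print(tp("ku", "pa"))  # across a word boundary: roughly 1/3
```

A learner tracking these probabilities could posit a word boundary wherever the transitional probability between syllables dips.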

They then tested whether the infants (8 months old) could distinguish the nonsense words in this made-up language from non-words (that is, 3-syllable groups of sounds that weren’t any of the 4 words in the made-up language). The researchers created a test set that consisted of two of the nonsense words (tupiro and golabu) and two similar non-words that the infants had never heard in the stream of speech (dapiku and tilado). Note that all of the syllables in the non-words were present in the words in the stream of speech, but not in the same order. For example, “da” and “pi” were both syllables that were heard in the stream, but “pi” never followed “da.”

The infants were then tested to see whether they could discriminate between the words and non-words. They did this by seeing whether infants paid attention for longer after hearing the non-words (that weren’t present in the stream of speech they had listened to earlier) compared to the words that they had heard – the idea here is that infants pay attention longer to stimuli (sounds, visual objects, etc.) that they haven’t heard or seen before compared to stimuli that they’re familiar with. And, the researchers found that the infants did in fact pay attention for longer after hearing the non-words (on average, almost a full second longer). This indicates that they had learned what syllables should follow each other (in the example above, that “ku” should follow “da”), even after listening to the stream of speech for only 2 minutes!

But, merely knowing the order in which syllables should go isn’t enough to segment a stream of speech into words. For example, with the stream “tupirogolabubidaku…,” knowing the syllable order doesn’t tell you whether a word is golabu or bubida. The researchers conducted a second experiment, where the test set consisted of two words and two “part-words.” Each part-word was also three syllables, created by joining the final syllable of a word with the first two syllables of a different word (e.g., bubida – a combination of golabu and bidaku). In this case, the infants might have heard the part-words in the stream of speech – for example, there is some chance that bidaku could follow golabu, and in that case, the infant would hear bubida. However, they would hear bubida far less frequently than they would hear either golabu or bidaku, because “bi” is relatively unlikely to follow “bu,” since it spans a word boundary.
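Here’s a quick count over a simulated stream (again my own sketch, not the authors’ materials) showing that a part-word like bubida really does occur in the stream, just much less often than a word like golabu:

```python
# Count occurrences of the word "golabu" vs. the part-word "bubida" in a
# simulated stream (my own sketch; word order randomized with no
# back-to-back repeats, matching the probabilities described above).
import random
from collections import Counter

words = ["tupiro", "golabu", "bidaku", "padoti"]
syllables = {w: [w[0:2], w[2:4], w[4:6]] for w in words}

random.seed(1)
stream, prev = [], None
for _ in range(300):
    w = random.choice([x for x in words if x != prev])
    stream.extend(syllables[w])
    prev = w

# Slide a 3-syllable window over the stream and tally each sequence.
trigrams = Counter("".join(stream[i:i + 3]) for i in range(len(stream) - 2))
print(trigrams["golabu"], trigrams["bubida"])  # the word is far more frequent
```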

Experiment 2 was trickier than Experiment 1, since the part-words were combinations that the infants would have heard in the 2-minute stream, albeit less often than the words. Even with the increased difficulty of the task, the infants were still able to distinguish the words from the part-words!

These two experiments show that infants track the statistics underlying speech sounds (how frequently one sound tends to follow another) to build up a mental representation of language, and they can use that representation to help them learn where word boundaries occur in speech. What’s more, they can do this VERY rapidly (in this case, after listening to just 2 minutes of a stream of nonsense words).

My Takeaways

Reading this study, it’s totally amazing to me that such young infants (just 8 months old) were able to glean so much information after listening to a stream of speech consisting of words they had never heard before, for such a short time.

One of the things T’s speech therapist first said to us when T was only 2 months old is that we should talk to him A LOT – we should narrate what we’re doing, have “conversations” with him even though he wasn’t really responding, etc. We tried hard to do this, but honestly, it gets a little tiresome to basically just be talking to yourself – and I kind of wondered, what’s the point? Is T getting anything out of this? After reading this study, I think all of that talking must be really important! After all, to be able to build up a statistical model of language, T needs to hear lots and lots of examples of lots of different words. This also highlights the importance of T wearing his hearing aids – if it’s hard for T to hear the difference between two sounds (e.g., “sa” and “fa”), it will be hard for him to build up a mental model of how often one of those sounds follows another.






Article Review – “Cross-Language Speech Perception”

I wanted to talk about a really cool article that I first read in grad school about infant language development. Although this article was published in 1984, it’s from a classic series of experiments, and is very relevant to T at his current age (8.5 months). Note that the infants studied in the article all had normal hearing, and I’m not sure how the results would change for infants with hearing loss!

The article is “Cross-Language Speech Perception: Evidence for Perceptual Reorganization During the First Year of Life” and is available for free as a PDF through the link. (Werker and Tees. “Cross-Language Speech Perception: Evidence for Perceptual Reorganization During the First Year of Life.” Infant Behavior and Development, Vol. 7, Pages 49-63, 1984).


All languages have consonants – consonants are speech sounds that are articulated with a partial or full closure of the vocal tract. One defining feature of a consonant is its place of articulation, or where in the vocal tract the obstruction occurs. For example, the consonant sounds “p” and “b” are called “bilabial” because both lips close to form the obstruction. Another example is “alveolar” consonants, where the tongue presses against the gum ridge just behind the upper teeth – examples of alveolar consonants are “d” and “t”.

Place of articulation can be a defining feature for distinguishing two consonants – for example, a key difference between “ba” and “da” is that “ba” is made with the lips pressed together and “da” is made with the tongue pressed up against the alveolar ridge (just behind the upper teeth). “Ba” and “da” are also considered “contrastive” in English – this means that if you substitute one for the other in a word, the meaning changes (for example, “bang” and “dang”). So, in English, a bilabial articulation (“ba”) is contrastive with an alveolar articulation (“da”), but there are other pairs of places of articulation that are not contrastive in English.

For example, Hindi has a “retroflex” place of articulation – this position is created by curling the tongue backward toward the hard palate, and is made in conjunction with sounds that are similar to the English consonants “t” and “d.” In Hindi, the retroflex articulation is contrastive with a “dental” articulation, where the tongue is pressed just behind the upper teeth. So, in Hindi, a “t” sound can be made with the tongue just behind the upper teeth (“dental”) or with the tongue curled way back (“retroflex”) – these two types of “t” are different letters, and substituting one for the other in a word creates a different word.

English doesn’t have a retroflex consonant (only 11% of languages have retroflex consonants!) – in fact, native adult English speakers can’t really even hear the difference between a retroflex “t” and a dental “t” (which are different sounds and different letters in Hindi) – they tend to label both the retroflex “t” and the dental “t” as an alveolar articulation, which corresponds to the “t” sound in English.

What’s really interesting is that young infants (under 6-8 months old) in English-speaking families can hear the difference between a retroflex “t” and a dental “t” – and sometime between early infancy and adulthood, they lose this ability. The general idea is that all babies, regardless of the language spoken at home, are born with the ability to hear the difference between all of these different consonant contrasts, and based on the language they hear around them, the brain “prunes” away the ability to hear the contrasts that aren’t needed to learn their native language. Werker and Tees studied when infants lose this ability.

The Experiment

In this post, I’ll focus on Experiment 2 described in the article. The authors had previously found that 6-8 month old infants could discriminate (or, hear the difference between) the Hindi retroflex “t” (which I’ll label here as tr) and the Hindi dental “t” (which I’ll label here as td). In a pilot study, they found that English-speaking 4-year-old children performed similarly to English-speaking adults – that is, they couldn’t hear the difference. So, in Experiment 2, the authors looked at whether 8-10 month old and 10-12 month old infants being raised in English-speaking homes could hear the difference between tr and td. They also compared the results to those of babies being raised in Hindi-speaking homes.

To test whether the babies could hear the difference between tr and td, they used a conditioned head-turn procedure. A string of one of the consonants was played in a loop, and then suddenly changed to be the other consonant (for example, “tr tr tr tr td“). The babies were conditioned to turn their head to look at a toy animal when they detected a change in the consonant being played. (This is actually kind of similar to the procedure for Visually Reinforced Audiometry used to measure audiograms in babies!)

The authors found that, of the babies being raised in English-speaking homes, most of the 6-8 month old infants could discriminate tr and td, some of the 8-10 month old infants could, and only a few of the 10-12 month old infants could. The 10-12 month old infants were significantly worse at discriminating the two consonants than either the 6-8 month old infants or the 8-10 month old infants. Additionally, the authors found that all of the 10-12 month old babies being raised in Hindi-speaking homes could discriminate tr and td. The figure below shows the proportions of infants in the different age groups that could discriminate these two consonants (it’s FIG. 4 from the article).


(Note that the graphs in the above figure show results using both the Hindi consonants and two consonants from a different language, Salish. Additionally, the top row of graphs (labeled “cross-sectional data”) shows results from different babies, and the bottom row (labeled “longitudinal data”) shows results from the same group of babies followed over time from 6-8 months through 10-12 months.)

The results of this study show that infants up until 6-8 months of age can hear the difference between consonants that aren’t contrastive in their native language, but that they lose this ability somewhere between 8-12 months of age. A lot of important language development happens in the first year of life!

Testing Myself, My Husband, and My Baby

I had originally read this article because I was interested in T’s ability to discriminate retroflex and dental consonants, since he’s in the interesting 8-10 month old age range where he might or might not be able to hear the difference.

The Adults – Me and My Husband

Before testing T, I decided to see whether my husband and I could hear the difference. I found synthesized audio files of retroflex and dental consonants here (although the synthesized consonants were the retroflex and dental “d” consonants rather than “t” consonants as used in the article; the retroflex/dental “d” consonants are also present in Hindi). If you click on that link, you can hear the consonants that I used (they’re labeled CV1 and CV7) – can you hear the difference?

To test myself and my husband, I randomly played either two of the same consonant (both retroflex or both dental) or two different consonants, and asked whether they were the same or different. If we were just guessing randomly, we’d get about 50% correct.

Let’s start with me – after testing myself on two sets of 20 comparisons, I got 47% correct, just as if I’d guessed randomly. It turns out I can’t hear the difference between retroflex and dental consonants at all! And now for my husband – over two sets of 20 comparisons, he got 90% correct! He said that the difference was in the beginning part of the consonant – they sounded like they had different “attacks” – and that this difference was very clear to him.
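As a sanity check on those scores, here’s some back-of-envelope binomial arithmetic (my own, not from any article): over 40 same/different trials with a 50% guess rate, a score around 47% (about 19 of 40) is exactly what guessing looks like, while 90% (36 of 40) is essentially impossible by chance.

```python
# How likely is each score if someone were purely guessing on 40
# same/different trials? (My own back-of-envelope arithmetic.)
from math import comb

def p_at_least(k, n, p=0.5):
    """Probability of getting at least k correct out of n trials when guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(p_at_least(19, 40))  # ~47% correct: entirely consistent with guessing
print(p_at_least(36, 40))  # 90% correct: vanishingly unlikely by chance
```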

It kind of irritated me that my husband was so much better than me at hearing the difference! So, I did a little digging – this study by Pruitt et al. [1] found that adult native Japanese speakers are better than adult native English speakers at hearing the difference between Hindi retroflex and dental consonants. They hypothesized that this is because Japanese contains a consonant contrast that is similar to the retroflex/dental contrast in Hindi (in Japanese, it’s /d/ vs. the flapped /r/, which is sometimes produced as a retroflex consonant). My husband went to a Japanese immersion school for several years as a child, so my current hypothesis is that the reason he could discriminate the Hindi consonants so much better than me is his early exposure to Japanese!

The Baby

To test T, I first tried to see if he could tell the difference between the English “ba” and “da.” These two consonants also differ in place of articulation, but since T hears this contrast every day, he should be able to tell the difference, regardless of age. I produced a stream of “ba”s and then switched to “da” (or vice versa), and he quickly looked up, indicating that he’d noticed the difference. (This is different from the way the babies were tested in the article, since I don’t have the same equipment that they had!)

I then repeated this with the Hindi retroflex consonants. A few times, it seemed like his attention shifted coincident with the change in the consonant, indicating that he might have heard the difference. However, there were also several instances where his attention didn’t shift at all. It’s hard to say whether this is because he didn’t hear the difference, or because his attention was drawn elsewhere (for example, to the laptop producing the sounds!).

Overall, I can’t say whether or not T can discriminate between Hindi retroflex and dental consonants – it might be that he can, but that the way I tested him wasn’t thorough enough to detect his ability to discriminate the two consonants. Alternatively, he might not be able to discriminate the two consonants, but we don’t know whether that’s because he previously had this ability and has since lost it with age (as in the case of the babies studied in the article), or whether he was never able to discriminate them (since I never tested him on this when he was younger). I wish I had done this earlier, when he was 6 months old, to see how he would have reacted then!


[1] Pruitt J.S., Jenkins J.J., and Strange W. “Training the perception of Hindi dental and retroflex stops by native speakers of American English and Japanese.” J. Acoust. Soc. Am., 119, 1684, 2006.