Article Review – “Voice Emotion Recognition by Cochlear-Implanted Children and Their Normally-Hearing Peers”

This week, I’m going to talk about a new study (PDF available for free through the link) by Chatterjee et al. (2015) that looked at how well adults and children can identify vocal emotion and how each group compares to their peers. (Chatterjee, M. Zion, D.J., Deroche, M.L., Burianek, B.A., Limb, C.J., Goren, A.P., Kulkarni, A.M., and Christensen, J.A. “Voice Emotion Recognition by Cochlear-Implanted Children and Their Normally-Hearing Peers.” Hearing Research (322), 2015, 151-162).


Detecting and identifying emotions in speech is really important for communication and social interaction. For example, if you’re talking with someone, and they mention that they just bought new pants, it’s important to be able to identify any subtext underlying their statement. Are they excited that they finally had time to go shopping? Are they angry that they spilled coffee all over their old pants? Are they sad to admit a favorite pair will no longer button? Identifying the emotion behind the statement is crucial to knowing how to respond appropriately! And, identifying the emotion isn’t just important for following-up; one study has even found that the ability of children to identify vocal emotion is correlated with their assessment of quality of life [1].

In a face-to-face conversation, facial expressions can aid in identifying vocal emotions. However, it’s harder in non-face-to-face conversation, such as on the phone. In those situations, we rely entirely on acoustic cues to distinguish different emotions from each other. These acoustic cues can include stuff like how fast we talk, pitch, how our pitch changes over the course of a sentence, and loudness.

Cochlear Implants (CIs) convey some of these cues better than other cues. For example, CIs tend to convey speaking rate very well but they are pretty bad at conveying pitch and changes in pitch accurately. (This is a fairly complex topic, and I don’t want to get too into the weeds here, so for now I’ll leave it at that).

Since identifying vocal emotion could potentially rely on many different acoustic cues, some of which are not accurately conveyed by CIs, Chatterjee et al. wanted to measure how well CI users could identify vocal emotion in speech. They looked at both children (who were pre-lingually deafened), and adults (who were, for the most part, post-lingually deafened, and therefore acquired language as children prior to receiving a CI).

The Study

The researchers studied 4 groups of people: normally-hearing children, children with CIs, normally-hearing adults, and adults with CIs. All of the participants were asked to listen to several sentences, and, for each sentence, identify whether the emotion underlying the sentence was happy, sad, scared, angry, or neutral. Although the sentences were neutral in content (an example is “her coat is on the chair”), the sentences were spoken by one of two talkers who were instructed to speak the sentence using one of the five emotions, and to really exaggerate the emotion. Sentences were recorded by one male talker and one female talker.

This article has a mountain of interesting results, but I’m going to focus on a few results that I found particularly interesting – I definitely encourage you to check out the article and look at the rest of the results yourself!

CI users (children and adults) had more trouble identifying vocal emotions than their normally-hearing peers


FIG. 5 of Chatterjee, et al – vocal emotion recognition scores for all test subject groups

The above figure (FIG. 5 from the article) shows the performance of each group (adults with normal hearing [aNH]; adults with cochlear implants [aCI]; children with normal hearing [cNH]; and children with cochlear implants [cCI]). Since there were 5 choices of emotion for each sentence, if a participant had guessed randomly, they would have scored 20% correct (this is marked in the figure by the black horizontal line). As you can see, on average, all of the groups did well above chance. However, while the normally-hearing participants, both adults and children, got almost 100% correct, the CI users had more trouble. The researchers found that the children with cochlear implants performed worse than both adults and children with normal hearing and, in general, similarly to adults with cochlear implants.

Another interesting thing you can see in the figure is the effect of the gender of the talker – in particular, CI users did worse identifying emotion for the male talker compared to the female talker. This is especially true for the adult CI users. One note of caution on this result though – the study only used sentences spoken by 1 male and 1 female, so this data isn’t enough to extrapolate CI users ability to recognize emotion for male talkers vs. female talkers in general.

Emotions that were easily confused & corresponding acoustic cues

The graph above (FIG. 5 from the article) shows that CI users did worse at identifying emotions than the normal hearing participants, but that’s for all emotions lumped together. The researchers also looked at what emotions the participants were likely to confuse for each other – for example, is happy often mistaken for scared?

One way to look at which emotions are confused for each other is by constructing a confusion matrix from the responses. Here’s an example of the confusion matrices for the male talker for adults (top matrix) and children (bottom matrix) with CIs (adapted from FIG. 10 of Chatterjee et al.)


Adapted from FIG. 10 of Chatterjee, et al. – confusion matrices for adult (top) and children (bottom) CI users for the male talker.

Each block in the confusion matrix indicates the number of times the emotion indicated in the column header was identified as the emotion indicated in the row header (averaged over all participants in each group). There were 12 sentences spoken with each emotion, so if a particular group (for example, adults with CIs) were to get a perfect score, the diagonal entries would all say “12.” Instead, in the two confusion matrices shown above, you can see that the diagonal values are higher than the off-diagonal values, but none of the entries are 12, indicating that none of the emotions were correctly identified by CI users 100% of the time.

If we look at off-diagonal entries with relatively high values, we can see which emotions were often confused with one another. I highlighted one example in red – “happy” and “scared.” (“Angry” and “neutral” is another pair that tended to be confused by CI users for the male talker). Note that these are only the responses for the male talker – FIG. 10 in the article shows confusion matrices for both male and female talkers and for both CI users and normally-hearing participants.

After looking at which emotions tended to be confused with each other, I think it’s interesting to see which acoustic cues tend to differentiate the easily confused emotions to see if it makes sense that CI users would confuse them. In this study, the authors looked at how 5 different acoustic cues vary for different emotions. Before I talk about those results, I’ll quickly explain the cues that the study analyzed:

  1. Mean F0 Height – F0 stands for “fundamental frequency.” Mean F0 height basically means the average pitch of the talker’s voice. So, a bass mean F0 height is lower than a soprano’s and male mean F0 height tends to be lower than female mean F0 height.
  2. F0 Range – This indicates how much the pitch of a talker’s voice varies over a sentence. If, over the course of the sentence, the speaker’s voice goes up and down a lot, they’d have a relatively high F0 range. Conversely, if they speak in a monotone, they’d have a lower F0 range.
  3. Duration – This is pretty simple – more quickly spoken sentences will have a shorter duration.
  4. Intensity Range – This indicates how much the speaker’s voice varies in loudness over the sentence
  5. Mean dB SPL – This indicates the average loudness over the course of the sentence

And here are graphs (adapted from FIG. 1 of Chatterjee, et al.) showing how the acoustic cues vary for the different emotions. Although there’s a lot of interesting information in here, I’m just going to focus on the male talker’s duration and F0 range for the “happy” and “scared” sentences, since those two tended to be confused, as discussed above.

acoustic cues.jpg

Acoustic cues for different emotions – adapted from FIG. 1 of Chatterjee, et al.

As you can see from the red boxes in the figure above, the male talker tended to speak “happy” and “scared” sentences with similar durations (look at the red boxes in the panel in the middle row, left column). However, he tended to vary pitch a lot more for “happy” sentences than for “scared” sentences (look at the red boxes in the top right panel labeled “F0 range”). Recall that duration tends to be conveyed well through the CI. However, variations in a speaker’s pitch (how much their voice goes up and down) tend to not be conveyed well through the CI. So, for the male talker, “happy” and “scared” were very similar to each other in a cue that is easy for CI users to use (duration), but they varied a lot in a cue that is hard for CI users to use (F0 range) .

This suggests that CI users tend to confuse emotions that vary primarily in acoustic cues that are not well-conveyed by the CI. (I want to be careful to not overstate this too much: I’m only looking at one pair of emotions that were easily confused for one of the two talkers. Also, the data in the article were produced based on just one male talker and just one female talker, so it’s possible that other talkers vary acoustic cues differently for different emotions – the authors have since collected data from many more talkers, so hopefully we will know more about acoustic cues underlying different emotions soon!)

Comparison of CI users to their peers using a CI-simulator

Chatterjee et al. tested normally-hearing adults and children using a CI simulator to compare the performance in the CI simulation to the actual performance by the CI users. This might sound sort of strange – why simulate the CI users when they collected actual data from the CI users?! One reason is that this particular type of CI simulation, the vocoder, lets us look at a particular type of deficit faced by CI users called spectral resolution. Here’s one way to think about spectral resolution – imagine banging on a piano with a ball – using a smaller ball corresponds to having better spectral resolution (because the smaller ball hits fewer keys), and using a larger ball corresponds to having worse spectral resolution (because the larger ball hits more keys). Using the vocoder, we can see how having better or worse spectral resolution affects performance on a particular task, in this case, identifying vocal emotion. This lets us see whether spectral resolution is important at all for performing the task, as well as how improving spectral resolution might improve performance.

One of the main parameters we can vary in the vocoder is the “number of channels.” Let’s go back to the ball example – 4 channels in the vocoder might correspond to banging on the piano with a basketball (worse spectral resolution), whereas 16 channels might correspond to using a golf ball (better spectral resolution). Although neither ball sounds great, you can imagine that the golf ball is better. This link has examples of what vocoded speech sounds like for different numbers of channels (scroll down to section 2) – if you listen to the sentences there, you’ll notice that it’s pretty easy to understand the sentence with 15 channels, but it’s really hard with 1 or 5 channels.

Ok, so back to the study – Chatterjee et al. tested normally-hearing adults and children using the vocoder with different numbers of channels – adults listened to 4 (worst spectral resolution), 8, and 16 (best spectral resolution) channels, and children only listened to 8 channels. Here’s a figure (adapted from FIG. 6 of Chatterjee, et al.) showing the results:


Performance with a CI simulation – adapted from FIG. 6 of Chatterjee, et al.

If you look at the red and blue boxes in the figure above, you can see that both adults and children with CIs performed similarly to normally-hearing adults listening to a simulator with 8 channels (a medium amount of spectral resolution), and that a simulator with 16 channels (making the spectral resolution better) would have improved performance for at least the female talker.

I think the most interesting thing about this figure is how poorly normally-hearing children listening to the CI-simulator did! Notice that their scores (highlighted by the green box) are much worse than the adults listening to the 8-channel simulator, AND, interestingly, much worse than the children with CIs! This indicates the huge benefit that children with CIs are receiving – they are performing, at least with respect to vocal emotion identification, like adults with CIs, and much better than normally-hearing children listening to a CI-simulator (probably because the children with CIs hear everything in daily life through the CI, whereas it probably takes time for children listening to a simulator to adapt to the sound of the simulations).

My Takeaways

If you’ve read this far – thank you! (Or maybe you’re my husband reading this under duress? Hi, G!)

I think this study has interesting implications for speech therapy for children with CIs – it’s clear from this data that at least some children have trouble identifying different vocal emotions, and focusing on this in some way might go a long way towards overcoming this deficit.

This study only looked at children with CIs, so it’s not clear from this whether children with milder hearing loss who wear hearing aids face the same problems. From interacting with T (9 months, with a mild hearing loss), I think he definitely notices different vocal emotions – for example, he will look up very attentively if I start talking in an angry or frustrated way (umm, not that that happens a lot!), and he’ll stare at me with huge eyes. Also, if my husband and I start talking in an excited way, he’ll sometimes “join in” by smiling and squealing. Although he of course can’t yet label different emotions, I think he’s definitely picking up on some of the acoustic cues underlying them (although, in all of these examples, he’s also certainly picking up on our facial expressions and body language, as well.).


[1] Schorr, JA. Roth, FP. Fox, NA. “Quality of Life for Children with Cochlear Implants: Perceived Benefits and Problems and the Perception of Single Words and Emotional Sounds.” Journal of Speech, Language, and Hearing Research. Vol. 52, 141-152. 2009.



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s