Humans are fairly good at classification. We begin learning, at a very young age, to differentiate between the sound of a door closing, a sibling laughing, a dog barking, etc. Nobody describes to us what these things sound like, or gives us a guidebook on how to differentiate between them. Parents don’t “teach” these skills, per se. Rather, we learn by example. And even though every dog’s bark is different, we pick up on the underlying similarities between a poodle and a Great Dane, and thereby develop the skill of classification. After about 1,000-2,000 days on earth, humans can distinguish, with a very high degree of accuracy, between a bark and a non-bark.
Until very recently, computers were not very good at this. Algorithms were guided by explicit, pre-set rules, incapable of learning through unstructured examples. But machine learning has changed this. “Deep learning”, or “neural networks” (the name and method are inspired by how the human brain itself works), have not only caught up with humans in terms of classification abilities – in many cases they have surpassed humans. Whereas a child can only listen to so many barks before she grows bored (or old), a computer doesn’t mind the tedious nature of learning by example. And thanks to the combination of large datasets and ever-increasing processing power, computers can learn to distinguish patterns which are undetectable to us mere mortals.
Image classification is the area in which there has been the most progress in recent years. You’ve probably seen it, for example, when Facebook suggests that you tag a specific person based on a photo. You’ve also probably contributed to it, for example, when a Google security check asks you to identify photos with certain characteristics – they’re not only checking that you’re human, they’re also using you to do free labor (classifying images so as to “teach” their computers). Image classification can handle seemingly trivial tasks – like distinguishing between two faces in an Instagram post – but it can also detect tumors invisible to the eye or guide a self-driving car.
Audio is a bit trickier than image classification. But image classification techniques can be leveraged on sound by converting sound to sight. That is, one can convert an utterance (a word, a cough, a song) to a visual representation, and use that visual representation in machine learning.
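As a rough sketch of what that conversion can look like in practice – here in Python with the librosa library, using a placeholder file name and illustrative parameters:

```python
import librosa
import numpy as np

# Load an audio clip (the file name here is just a placeholder)
samples, sample_rate = librosa.load("cough.wav", sr=22050)

# Short-time Fourier transform: how much energy each frequency band
# carries in each small slice of time
stft = librosa.stft(samples, n_fft=2048, hop_length=512)

# Convert amplitude to decibels so quieter details remain visible
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# The result is a 2D array (frequency bins x time frames): effectively
# an image that image-classification techniques can work with
print(spectrogram_db.shape)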
A Picture of Sound
The most common way to visualize sound is the spectrogram. It shows pitch (or frequency) on the y-axis, time on the x-axis, and uses color to represent intensity (volume). It’s a fairly simple visualization, but also very unfamiliar to those of us who are used to hearing, not seeing, sound.
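To actually draw such a picture, one might do something like the following – again a sketch, assuming the librosa and matplotlib libraries and a placeholder audio file:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load a clip and compute its spectrogram (placeholder file name)
samples, sample_rate = librosa.load("piano.wav")
spec_db = librosa.amplitude_to_db(np.abs(librosa.stft(samples)), ref=np.max)

fig, ax = plt.subplots(figsize=(8, 4))
img = librosa.display.specshow(
    spec_db,
    sr=sample_rate,
    x_axis="time",  # time on the x-axis
    y_axis="hz",    # frequency (pitch) on the y-axis
    ax=ax,
)
fig.colorbar(img, ax=ax, format="%+2.0f dB")  # color encodes intensity (volume)
plt.show()
```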
Pretty cool, right? Sound is kind of beautiful to see. But it’s also very information-rich. When you hear a piano, you perceive the melody, the up and down movements of pitch, the silence between the keystrokes, and perhaps the tone of the instrument itself. But you only perceive a small subset of all of the sound data that is generated. In fact, human cognition actively filters out most information – it’s a bandwidth-throttling function, and without it we would simply be overwhelmed by our senses. Perhaps you’ve seen this video of a basketball game, which shows just how good (or bad) our brains are at filtering information.
What Does a Cough Look Like? – Some Examples
Okay, so sound is pretty to look at, and it contains a lot of information in its visual form. But what does that have to do with coughing and what does a cough look like?
A lot, actually. Converting the sound of a cough to a visual representation allows for the use of advanced image recognition techniques. This has applications not only in differentiating between a cough and background noise, but also potentially in using each cough’s acoustic signature to better understand the health of the person who coughs.
By teaching AI how to distinguish between a cough and ambient noise, we can track cough frequency over time and space. And since visualized audio offers so much data (which we over-efficient humans don’t even notice), we can also teach AI to recognize patterns in coughs. Whereas a human might categorize a cough in simplistic terms like “wet” or “dry”, a machine can generate categories which are imperceptible to humans and unbounded by our linguistic limitations.
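As an illustration of what that “teaching” might look like in code, here is a minimal convolutional network sketch in Keras; the architecture, input size, and variable names are illustrative assumptions, not the exact model behind any particular cough-detection system:

```python
from tensorflow.keras import layers, models

# Spectrograms treated as single-channel "images":
# 128 frequency bins x 128 time frames (an illustrative size)
model = models.Sequential([
    layers.Input(shape=(128, 128, 1)),
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # output: probability of "cough"
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Training would then look something like this, where the labeled
# spectrogram arrays (hypothetical variable names) come from recordings
# of coughs and of ambient noise:
# model.fit(train_spectrograms, train_labels, epochs=10,
#           validation_data=(val_spectrograms, val_labels))
```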
But to get there, we have to turn sounds into pictures. Let’s have a look.
Keyboard Clickety-Clackety
Here is some clickety-clackety on a keyboard:
You’re probably used to seeing the wave-form of sounds (on the left), and not the spectrogram (right). In both, one can clearly distinguish the four clicks. But volume alone is not sufficient for high-quality classification. We also need frequency (pitch). Unlike a wave-form plot, a spectrogram shows a third dimension (through color).
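A side-by-side plot like this could be produced roughly as follows – a sketch, with the file name standing in for a recording of the four clicks:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder file: a short recording of four keyboard clicks
samples, sr = librosa.load("keyboard_clicks.wav", sr=None)
spec_db = librosa.amplitude_to_db(np.abs(librosa.stft(samples)), ref=np.max)

fig, (ax_wave, ax_spec) = plt.subplots(1, 2, figsize=(10, 4))

# Left: wave-form, i.e. amplitude (volume) over time
time = np.arange(len(samples)) / sr
ax_wave.plot(time, samples)
ax_wave.set_title("Wave-form")
ax_wave.set_xlabel("Time (s)")

# Right: spectrogram, adding frequency on the y-axis and intensity as color
librosa.display.specshow(spec_db, sr=sr, x_axis="time", y_axis="hz", ax=ax_spec)
ax_spec.set_title("Spectrogram")

plt.tight_layout()
plt.show()
```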
A Baby’s Squeal
Let’s have a look at another sound: a baby’s squeal:
In terms of decibels only, it looks like this:
But the primary differentiator between a baby’s squeal and an adult’s (do adults squeal?) is not volume, but pitch. Thus, the utility of the spectrogram.
Coughs
Let’s have a look at three coughs, from three different people. You’ll note some similarities across all of them, both when listening and when looking. They start with an explosive increase in volume and fade out more slowly than they fade in.
Cough 1 (below) is a prolonged cough with a slight uptick in volume towards the end, with one final expiratory contraction from the diaphragm.
Cough 2 (below) is more archetypal: a steep, abrupt explosion of sound followed by a diminuendo.
Cough 3 (below) is also fairly typical, albeit less pronounced than cough 2 in both duration and decibel variation.
What’s notable between coughs 2 and 3 is how similar they are in terms of decibel profiles, but how different they are in terms of frequency.
Wrap-up
Maybe by now the novelty of seeing sounds as spectrograms has worn off. But that feeling you have – that desire to go do something else rather than keep staring at spectrograms – computers don’t get that feeling. And that’s why we use computers, and not humans, to do deep learning. They can look at thousands and thousands of images of sounds and detect patterns in them, patterns which we are neither patient nor detail-oriented enough to perceive. Once they’ve seen enough examples, they can “predict” on images (made from sounds) they’ve never seen. Just as a child can hear a bark from a dog breed they’ve never encountered and still say “that’s a bark”, a computer can be trained to detect a cough and, with time and sufficient examples, perhaps differentiate between different kinds of coughs. There are a lot of practical applications to this, ranging from diagnostics, to medication adherence, to public health surveillance. But it all starts with turning sound into pictures.