The sound of impending failure

Sound is an incredibly valuable means of communicating information. Most motorists are familiar with the alarming noise of a slipping belt drive. My grandfather could diagnose issues with the brakes on heavy rail cars with his ears. And many other experts can detect problems with common machines in their respective fields just by listening to the sounds they make.

If we could find a way to automate listening itself, we would be able to monitor our world and its machines more intelligently, day and night. We could predict the failure of engines, rail infrastructure, oil drills and power plants in real time, notifying humans the moment an acoustical anomaly occurs.

This has the potential to save lives, but despite advances in machine learning, we struggle to make such technologies a reality. We have loads of audio data, but lack critical labels. In the case of deep learning models, “black box” problems make it hard to determine why an acoustical anomaly was flagged in the first place. We are still working the kinks out of real-time machine learning at the edge. And sounds often come packaged with more noise than signal, limiting the features that can be extracted from audio data.

The great chasm of sound

Most researchers in the field of machine learning agree that artificial intelligence will rise from the ground up, built block-by-block, with occasional breakthroughs. Following this recipe, we have slain image captioning and conquered speech recognition, yet the broader range of sounds still falls on the deaf ears of machines.

Behind many of the greatest breakthroughs in machine learning lies a painstakingly assembled dataset: ImageNet for object recognition, and resources like the Linguistic Data Consortium and GOOG-411 for speech recognition. But finding an adequate dataset to juxtapose the sound of a car door shutting and a bedroom door shutting is quite challenging.

“Deep learning can do a lot if you build the model correctly, you just need a lot of machine data,” says Scott Stephenson, CEO of Deepgram, a startup helping companies search through their audio data. “Speech recognition 15 years ago wasn’t that great without datasets.”

Crowdsourced labeling of dogs and cats on Amazon Mechanical Turk is one thing. Collecting 100,000 sounds of ball bearings and labeling the loose ones is something entirely different.

And while these problems plague even single-purpose acoustical classifiers, the holy grail of the space is a generalizable tool for identifying all sounds, not simply a model that differentiates the sounds of those doors.

Appreciation through introspection

Our human ability to generalize makes us particularly adept at classifying sounds. Think back to the last time you heard an ambulance rushing down the street from your apartment. Even with the Doppler effect, which shifts the frequency of the sound waves and therefore the pitch of the siren you hear as the vehicle passes, you can easily identify the vehicle as an ambulance.
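To put rough numbers on that pitch shift, here is a minimal sketch of the standard Doppler formula for a moving source and a stationary listener. The 700 Hz siren tone and the 20 m/s vehicle speed are made-up values for illustration, not measurements.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees Celsius


def observed_frequency(source_hz, source_speed, approaching=True):
    """Frequency heard by a stationary listener from a moving source."""
    if source_speed >= SPEED_OF_SOUND:
        raise ValueError("formula only holds below the speed of sound")
    # An approaching source compresses the wavefronts; a receding one stretches them.
    relative = -source_speed if approaching else source_speed
    return source_hz * SPEED_OF_SOUND / (SPEED_OF_SOUND + relative)


siren_hz = 700.0   # hypothetical siren tone
speed = 20.0       # hypothetical ambulance speed in m/s (72 km/h)

approach_hz = observed_frequency(siren_hz, speed, approaching=True)
recede_hz = observed_frequency(siren_hz, speed, approaching=False)
print(f"approaching: {approach_hz:.1f} Hz, receding: {recede_hz:.1f} Hz")
```

The same siren lands at two noticeably different pitches depending on direction, which is why a classifier keyed to one exact frequency would miss what a human hears as a single, familiar sound.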

Yet researchers trying to automate this process have to get creative. The features that can be extracted from a stationary sensor collecting information about a moving object are limited.

A lack of source separation can further complicate matters. This is a problem that even humans struggle with. If you’ve ever tried to pick out a single table conversation at a loud restaurant, you have an appreciation for how difficult it can be to make sense of overlapping sounds.

Researchers at the University of Surrey in the U.K. were able to use a deep convolutional neural network to separate vocals from backing instruments in a number of songs. Their trick was to train models on 50 songs split up into tracks of their component instruments and voices. The tracks were then cut into 20-second segments to create spectrograms. Trained on these alongside spectrograms of the fully mixed songs, the model was able to separate vocals from backing instruments in new songs.
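For readers unfamiliar with the preprocessing step, the sketch below shows one way to cut audio into fixed-length segments and turn each into a magnitude spectrogram. It is not the Surrey team's code: the function names, the tiny frame size and the synthetic 125 Hz test tone are all assumptions chosen so the example runs quickly with nothing but the standard library.

```python
import cmath
import math


def segment(samples, sample_rate, seconds):
    """Cut a signal into non-overlapping segments of a fixed duration."""
    step = int(sample_rate * seconds)
    return [samples[i:i + step] for i in range(0, len(samples) - step + 1, step)]


def spectrogram(samples, frame_size=64, hop=32):
    """Magnitude spectrogram from a naive short-time Fourier transform."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        for k in range(frame_size // 2 + 1):  # keep non-negative frequencies only
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# Synthetic stand-in for a track: one second of a 125 Hz tone sampled at 1 kHz.
sample_rate = 1000
tone = [math.sin(2 * math.pi * 125 * n / sample_rate) for n in range(sample_rate)]

segments = segment(tone, sample_rate, seconds=0.5)  # the article's segments are 20 s
spectrograms = [spectrogram(s) for s in segments]

# Each row is one time frame; the loudest bin should sit at 125 Hz.
bin_hz = sample_rate / 64
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spectrograms[0]]
print("peak frequency per frame:", [b * bin_hz for b in peak_bins])
```

A real pipeline would use an FFT library rather than this slow loop, but the resulting time-by-frequency grid is the kind of image-like input a convolutional network can consume.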

But it’s one thing to divide up a five-piece song with easily identifiable components; it’s another to record the sound of a nearly 60-foot-high MAN B&W 12S90ME-C Mark 9.2 type diesel engine and ask a machine learning model to chop up its acoustic signature into component parts.

Acoustic frontiersman

Spotify is one of the more ambitious companies toying with the applications of machine learning to audio signals. Though Spotify still relies on heaps of other data, the signals held within songs themselves are a factor in what gets recommended on its popular Discover feature.

Music recommendation has traditionally relied upon the clever heuristic of collaborative filtering. These rudimentary models skirt acoustical analysis by recommending you songs played by other users with similar listening patterns.
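A toy version of user-based collaborative filtering makes the point that no audio is involved at all. This is a minimal sketch under invented data, not Spotify's system: the listeners, songs and play counts are hypothetical, and cosine similarity is just one common way to compare listening patterns.

```python
from math import sqrt

# Hypothetical play counts per listener; no acoustic features anywhere.
plays = {
    "ana": {"song_a": 5, "song_b": 3, "song_c": 4},
    "ben": {"song_a": 4, "song_b": 2, "song_c": 5, "song_d": 4},
    "cy": {"song_e": 5, "song_f": 4},
}


def cosine(u, v):
    """Cosine similarity between two sparse play-count vectors."""
    shared = set(u) & set(v)
    dot = sum(u[s] * v[s] for s in shared)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def recommend(user, plays):
    """Rank songs the user has not played, weighted by similar listeners."""
    scores = {}
    for other, history in plays.items():
        if other == user:
            continue
        weight = cosine(plays[user], history)
        if weight <= 0:
            continue  # listeners with nothing in common contribute nothing
        for song, count in history.items():
            if song not in plays[user]:
                scores[song] = scores.get(song, 0.0) + weight * count
    return sorted(scores, key=scores.get, reverse=True)


recommendations = recommend("ana", plays)
print(recommendations)
```

Here "ana" is offered "song_d" purely because "ben" has similar habits. A brand-new song with no play history could never surface this way, which is the gap that analyzing the audio signal itself is meant to fill.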

Filters pick up harmonic context as red and blue bands at different frequencies. Slanting indicates rising and falling pitches that can detect human voices, according to Spotify.

Outside of the controlled environment of music, engineers have proposed solutions that broadly fall into two categories. The first I’m going to call the “custom solutions” model, which essentially involves a company collecting data from a client with the sole purpose of identifying a pre-set range of sounds. Think of it like Build-A-Bear but considerably more expensive and typically for industrial applications.

The second approach is a “catch-all” deep learning model that can flag any acoustical anomaly. These models typically require a human in the loop to manually classify the flagged sounds, and those labels then further train the model on what to look for. Over time these systems require less and less human intervention.
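The feedback loop can be illustrated with something far simpler than deep learning. In the sketch below, which is an assumption-laden toy rather than any vendor's product, a sound is reduced to a two-number feature vector, anything unlike previously labeled examples is sent to a person, and the answer is stored so that similar sounds are handled automatically next time. The `ReviewLoop` class, the distance threshold and the stand-in `fake_operator` are all hypothetical.

```python
import math


class ReviewLoop:
    """Nearest-neighbor labeler that asks a human only about unfamiliar sounds."""

    def __init__(self, threshold):
        self.threshold = threshold  # max distance still considered "familiar"
        self.examples = []          # (features, label) pairs confirmed by a human
        self.reviews = 0            # how many times a human was consulted

    def classify(self, features, ask_human):
        nearest = min(self.examples,
                      key=lambda ex: math.dist(ex[0], features),
                      default=None)
        if nearest is not None and math.dist(nearest[0], features) <= self.threshold:
            return nearest[1]       # close to something already labeled
        label = ask_human(features)  # anomaly: escalate to a person
        self.reviews += 1
        self.examples.append((features, label))
        return label


def fake_operator(features):
    """Stand-in for a human expert; the rule here is invented for the demo."""
    return "loose bearing" if features[0] > 0.5 else "normal"


loop = ReviewLoop(threshold=0.2)
stream = [(0.1, 0.1), (0.12, 0.09), (0.9, 0.8), (0.88, 0.82), (0.11, 0.1)]
labels = [loop.classify(f, fake_operator) for f in stream]
print(labels, "human reviews:", loop.reviews)
```

Five sounds arrive but the operator is consulted only twice, once per genuinely new kind of sound, which is the sense in which such systems need less human intervention over time.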
