Voice Capture Challenges: The Threshold of the Golden Age of Voice

Michael Klasco
May 10, 2020

Updated: Feb 19, 2022

In this article, Mike Klasco provides an introduction to the challenges of voice capture, signal processing, and voice recognition.

Voice processing and recognition leading to the development of voice-based personal assistants is one of the decade’s greatest achievements and provides a unique opportunity for the audio industry.

Audio product development, and audio magazines for that matter, have traditionally been all about music reproduction, but recently voice has taken center stage. Audio signal processing has been divided into two separate camps: voice and music processing. I think we can credit the iPhone with combining telephony and music in one device and making that combination mainstream.

Bluetooth protocols with enhanced sound quality and stereo support brought us wireless headphones paired to smartphones, which added fuel to the fire. Now we are reaching full ignition, with voice assistance emerging in everything from smart speakers to soundbars, smart wall outlets, and now smart TVs.

Voice processing has been with us since the start of Bell Labs, which began as a telephony research department set up at AT&T’s Western Electric in 1907. While Bell Labs has undertaken a wide range of research, voice was its primary focus, from encryption vocoders to compressor/limiters, uplink and downlink noise filtering, and transducer refinements.

The radio broadcast industry has always been obsessed with extracting the maximum range from its transmitters, as advertising rates are calculated by the listener coverage area. In the 1950s, the CBS Volumax was the first effective broadcast automatic gain control (AGC), but its original purpose was to eliminate the signal level watchman, a technician whose only job was to prevent transmitter over-modulation. Over-modulation could blow up the expensive transmitter tubes, not to mention incur the wrath of the Federal Communications Commission (FCC).

It did not take long for radio stations to also realize that the dynamically compressed signal traveled to the fringe reception areas more clearly. Decades later, it was multi-band limiters such as Orban’s Optimod and the Aphex Compellor that ruled the broadcast waves. In the 1970s, the back rooms of intelligence agencies were using sophisticated forensic voice processors such as those from Rockwell (now Rockwell Collins) to extract intelligible speech from noisy signals.

Today, you don’t have to look behind secret doors to find some of the most sophisticated voice processing: it is in your pocket, in your smartphone. More than 1.6 billion cell phones are sold each year, and that kind of number is hard to beat in the consumer electronics industry.

From telephones to smartphones with voice recognition, the evolution of voice processing technologies is now the result of more than a century of research.

Smartphones are big on signal processing: voice for telephony, and also music playback, gaming, and movies. Voice processing is the unsung hero in mobile audio devices, significantly enhancing voice call quality, intelligibility, and speech recognition (command) accuracy. It enables natural full-duplex communication in speakerphones by suppressing echo and acoustic feedback, and it beamforms the mics for clear conversations in noisy places.
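To make the beamforming idea concrete, here is a minimal delay-and-sum sketch in Python. It is an illustration under stated assumptions, not how any particular phone does it: the two-mic setup, the 16-sample test tone, the integer sample delays, and the helper names `delayed` and `delay_and_sum` are all hypothetical. Each mic signal is time-aligned for the talker's direction and the results are averaged, so sound from that direction adds up while sound arriving with a different inter-mic delay partly cancels.

```python
import math

def delayed(signal, d):
    """Return a copy of signal arriving d samples late (zero-padded at the start)."""
    return [0.0] * d + signal[:len(signal) - d]

def delay_and_sum(mic_signals, delays):
    """Advance each mic by its steering delay (in samples), then average.

    delays are hypothetical values here; a real array would derive them
    from mic spacing, sample rate, and the direction of the talker.
    """
    n = len(mic_signals[0])
    out = [0.0] * n
    for sig, d in zip(mic_signals, delays):
        for i in range(n):
            j = i + d  # read ahead so the later-arriving mic lines up
            if j < n:
                out[i] += sig[j]
    return [v / len(mic_signals) for v in out]

# Hypothetical test tone with 16 samples per cycle.
tone = [math.sin(2 * math.pi * i / 16) for i in range(64)]

# Talker: reaches mic 1 three samples after mic 0, so we steer with [0, 3].
steering = [0, 3]
steered = delay_and_sum([tone, delayed(tone, 3)], steering)

# Interferer from another direction: reaches mic 1 eleven samples late.
# After steering, the two copies are half a cycle apart and cancel.
# (Full cancellation is specific to this tone frequency and geometry.)
rejected = delay_and_sum([tone, delayed(tone, 11)], steering)
```

In this toy case `steered` reproduces the tone (apart from the last three samples, where one mic has run out of data), while `rejected` settles to essentially zero. Production beamformers use fractional delays, more microphones, and adaptive weights, but the align-then-sum principle is the same.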

The vendors of these voice processing technologies operate in a fiercely competitive market, which drives continuous product design improvements. Further innovative R&D efforts will no doubt lead to better, faster, less expensive solutions that will benefit mobile device users in years to come.

Some of the processing takes place in the voice co-processor/codec (e.g., from Qualcomm and Cirrus Logic), but the processing smarts are distributed around a bit. Some processing, such as active noise cancelation (ANC), is starting to appear in the USB-C audio signal chain. And, most smartphones drive the receiver (earpiece) and speaker (speakerphone) with a smart amp, which monitors excursion and voice coil heat and limits operation to the safe operating area (SOA) of the speaker system.
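The thermal half of what a smart amp does can be sketched as a simple control loop. The Python below is a rough illustration only, not any vendor's algorithm: it assumes a first-order thermal model of the voice coil, and every number in it (8 ohm coil, 40 °C/W thermal resistance, 2 s time constant, 100 °C limit, 20 °C knee) is a hypothetical placeholder. Excursion limiting, which needs a model of the speaker's mechanics, is left out.

```python
def protection_gain(temp_c, t_max_c, knee_c):
    """Full gain below the knee, then a linear ramp down to zero at t_max_c."""
    if temp_c <= t_max_c - knee_c:
        return 1.0
    if temp_c >= t_max_c:
        return 0.0
    return (t_max_c - temp_c) / knee_c

def simulate(v_rms, steps, protect=True, dt=0.1, r_e=8.0, r_th=40.0,
             tau=2.0, t_amb=25.0, t_max=100.0, knee=20.0):
    """Track estimated coil temperature while driving a constant RMS voltage.

    All default parameters are hypothetical placeholders:
    r_e is coil resistance (ohms), r_th is thermal resistance (deg C per W),
    tau is the thermal time constant (s), dt is the update interval (s).
    """
    temp = t_amb
    temps, gains = [], []
    for _ in range(steps):
        gain = protection_gain(temp, t_max, knee) if protect else 1.0
        power = (gain * v_rms) ** 2 / r_e
        # First-order model: heating from coil power, cooling toward ambient.
        temp += dt * (power * r_th - (temp - t_amb)) / tau
        temps.append(temp)
        gains.append(gain)
    return temps, gains

# Loud drive level, with and without protection, and a quiet level.
loud_temps, loud_gains = simulate(6.0, 200)
raw_temps, _ = simulate(6.0, 200, protect=False)
quiet_temps, quiet_gains = simulate(1.0, 200)
```

With these made-up numbers, the unprotected run heads far past the 100 °C limit, while the protected run backs the gain off and levels out below it. The quiet run never reaches the knee, so its gain stays at 1.0 throughout.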

The Dolby Conference Phone with a Dolby voice-enabled service provides participants the benefit of voice placement: each speaker’s voice is presented from a distinct location, making it easier to follow the conversation.

The bottom line is smart amps enable manufacturers to get more sound out of tiny speakers crammed into shrink-wrapped enclosures without distortion, death, and destruction. Most of the leading semiconductor amp chip vendors such as NXP, Qualcomm, Texas Instruments (TI), and Maxim now offer smart amps.

Read the complete article, originally published in audioXpress, December 2018, now available online