Voice control is the future. Just as touch-screen technology revolutionized smartphones, and gesture recognition launched new gaming and fitness applications, the possibilities for hands-free operation are just beginning to emerge. This new interface will not only influence smartphones and tablets; it will also become part of the user experience for TVs and other electronic devices.
The increased availability of voice control applications can be attributed to a number of factors: advances in automatic speech recognition (ASR); cloud computing, which provides very high processing power for more robust and reliable algorithms; noise cancellation, which reduces ambient noise; and language processing technologies.
However, while ASR-enabled applications work well in quiet environments, their performance tends to degrade drastically in the presence of background noise, for example in noisy cafés, on public transportation or even when walking on a busy street. Without intelligible speech, automatic voice recognition can’t function properly or be considered a reliable input method.
Three-dimensional voice processing enables ASR to achieve far better accuracy by suppressing background noise while preserving the natural voice of the speaker with only minor distortion. The degradation is so minor that the user experience is hardly affected when operating applications such as Siri or a text message dictation application in noisy public venues.
How 3D processing works
Traditional noise cancellation suffers from a trade-off between the degree of noise reduction and voice quality: the higher the noise reduction level, the greater the potential for voice distortion. In an attempt to minimize this trade-off, engineers have developed noise reduction algorithms that perform well mainly in stationary noise but poorly in non-stationary noise such as street noise.
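The trade-off between noise reduction and voice distortion can be illustrated with a minimal sketch. It is not the algorithm used in any particular product: it assumes a single-band power-subtraction gain, a noise level estimated from a few leading noise-only frames (which only makes sense for stationary noise), and synthetic signals. The names `frame_powers`, `suppression_gains`, `alpha` and `floor` are hypothetical.

```python
import math
import random

def frame_powers(signal, frame_len):
    """Mean power of each non-overlapping frame."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]

def suppression_gains(powers, noise_power, alpha, floor=0.05):
    """Power-subtraction gain per frame. A larger alpha removes more noise
    but also attenuates frames that contain speech."""
    return [
        max(1.0 - alpha * noise_power / p, floor) if p > 0 else floor
        for p in powers
    ]

random.seed(0)
frame_len = 160  # 20 ms at an assumed 8 kHz sample rate
noise = [random.gauss(0.0, 0.1) for _ in range(frame_len * 20)]
# Synthetic stand-in for speech: a 200 Hz tone present only in frames 10..19
speech = [
    0.5 * math.sin(2 * math.pi * 200 * n / 8000) if n >= frame_len * 10 else 0.0
    for n in range(len(noise))
]
noisy = [s + v for s, v in zip(speech, noise)]

powers = frame_powers(noisy, frame_len)
# Assumption: the first five frames contain noise only and the noise is stationary
noise_power = sum(powers[:5]) / 5

for alpha in (1.0, 2.0, 4.0):
    gains = suppression_gains(powers, noise_power, alpha)
    noise_gain = sum(gains[:10]) / 10
    speech_gain = sum(gains[10:]) / 10
    print(f"alpha={alpha}: noise-frame gain {noise_gain:.2f}, "
          f"speech-frame gain {speech_gain:.2f}")
```

Raising `alpha` pushes the noise-only frames toward the gain floor, but it also lowers the gain applied to the speech frames, which is the distortion side of the trade-off. Because the noise estimate is fixed after the first frames, a noise level that changes later (non-stationary noise) would be handled poorly by this kind of scheme.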
Noise cancellation techniques took a leap forward with the introduction of a second microphone in smartphones, which enables the two microphones to operate in a manner similar to the human auditory system. However, this capability does not provide sufficient noise cancellation to eliminate all background noise for voice calls or voice control while driving or riding on public transportation, or even at home when, for instance, music is turned up loud.
Advanced noise-cancellation technology uses a sensor in addition to the two standard audio microphones, and then applies a 3D-vocal algorithm to perform multiple voice processing tasks, including background noise cancellation, loudness equalization and general voice enhancement. Removing background noise significantly improves the accuracy of automatic speech recognition and voice-call applications for smartphones, tablets and other mobile devices.
An example of how the advanced noise cancellation affects noisy speech is shown in Figure 1. The upper waveform illustrates the noisy speech, which is the superposition of speech and ambient noise (S+N), while the lower waveform shows the clean speech signal that results from 3D voice processing.
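One textbook way to exploit a second microphone is adaptive noise cancellation with a least-mean-squares (LMS) filter. The sketch below is an illustration of that general idea only, not the method used in any specific smartphone. It assumes, for simplicity, that the second microphone picks up noise but no speech, and that the noise reaches the primary microphone through a short, fixed acoustic path; the function name, the filter length `taps` and the step size `mu` are hypothetical choices.

```python
import math
import random

def lms_noise_canceller(primary, reference, taps=8, mu=0.01):
    """Subtract the part of `primary` that is predictable from `reference`."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    output = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - noise_estimate  # the error doubles as the cleaned sample
        weights = [w + 2 * mu * error * h for w, h in zip(weights, history)]
        output.append(error)
    return output

def mean_square(xs):
    return sum(x * x for x in xs) / len(xs)

random.seed(1)
n_samples = 4000
# Synthetic stand-in for speech: a 200 Hz tone at an assumed 8 kHz sample rate
speech = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(n_samples)]
reference = [random.gauss(0.0, 1.0) for _ in range(n_samples)]  # noise-only mic
# Assumed acoustic path: the noise reaches the primary mic scaled and delayed
noise_at_primary = [
    0.8 * reference[n] + (0.3 * reference[n - 1] if n > 0 else 0.0)
    for n in range(n_samples)
]
primary = [s + v for s, v in zip(speech, noise_at_primary)]
cleaned = lms_noise_canceller(primary, reference)

# Compare residual noise power over the second half, after the filter has adapted
half = n_samples // 2
before = mean_square([p - s for p, s in zip(primary[half:], speech[half:])])
after = mean_square([c - s for c, s in zip(cleaned[half:], speech[half:])])
print(f"residual noise power: before {before:.3f}, after {after:.3f}")
```

In a real handset both microphones capture speech as well as noise, and the acoustic paths change as the user moves, which is one reason a two-microphone arrangement alone may not remove all background noise.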
Figure 1: Typical 3D voice processing results on speech and ambient noise
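The effect of removing noise from the S+N superposition is commonly quantified as a signal-to-noise ratio (SNR) in decibels. The following sketch shows only the arithmetic. The signals are synthetic, and the tenfold reduction in noise amplitude is an assumed, hypothetical processing result, not a measurement of 3D voice processing.

```python
import math
import random

def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)

def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))

random.seed(2)
fs = 8000  # assumed sample rate in Hz
speech = [0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]
noise = [random.gauss(0.0, 0.3) for _ in range(fs)]
noisy = [s + v for s, v in zip(speech, noise)]  # the S+N superposition

# Hypothetical result: noise amplitude reduced tenfold, speech left untouched
residual_noise = [0.1 * v for v in noise]

snr_before = snr_db(speech, noise)
snr_after = snr_db(speech, residual_noise)
print(f"SNR before: {snr_before:.1f} dB, after: {snr_after:.1f} dB")
```

A tenfold reduction in noise amplitude corresponds to a hundredfold reduction in noise power, so the SNR rises by 20 dB regardless of the starting level.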
Figure 2 shows the corresponding spectrograms: the upper graph presents the spectrogram of the noisy speech (S+N), and the lower graph shows the resulting speech signal after 3D voice processing.
Figure 2: Spectrogram of 3D processing on speech and ambient noise
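For readers unfamiliar with spectrograms, the sketch below shows how one can be computed: the signal is cut into short overlapping frames, each frame is windowed, and the magnitude of its discrete Fourier transform is taken. This is a generic short-time Fourier transform written with the standard library for clarity rather than speed; the frame length, hop size, test tone and noise level are all assumptions, and it is not the tool used to produce Figure 2.

```python
import cmath
import math
import random

def spectrogram(signal, frame_len=64, hop=32):
    """Return one magnitude spectrum (bins 0..frame_len/2) per frame."""
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)  # Hann window
        for n in range(frame_len)
    ]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            acc = sum(
                seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames

random.seed(3)
fs = 8000  # assumed sample rate in Hz
# Synthetic S+N: a 1 kHz tone standing in for speech, plus white noise
noisy = [
    math.sin(2 * math.pi * 1000 * n / fs) + random.gauss(0.0, 0.1)
    for n in range(512)
]

spec = spectrogram(noisy)
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
bin_width_hz = fs / 64
print(f"{len(spec)} frames, strongest bin per frame: {sorted(set(peak_bins))}, "
      f"bin width {bin_width_hz} Hz")
```

Each row of `spec` is one time slice and each column one frequency bin. In a plot such as Figure 2, noise shows up as energy spread across many bins, while speech energy is concentrated in comparatively few.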
Improved quality for voice communication applications using 3D voice processing
Incorporating advanced noise cancellation capabilities into smartphones can significantly improve voice quality for voice communication, from “poor” to “very good.”
The audio quality of 3D voice processing was compared with standard 2D noise cancellation techniques using the ETSI EG 202 396-1 standard, which defines a method to test the quality of noise reduction algorithms objectively. The general quality score (GMOS) is measured on a scale of 1 to 5, with 1 being “bad” and 5 being “excellent.”
Voice quality was compared according to the MOS scores using a standard smartphone with built-in 2D processing in different types of noisy environments. As shown in Figure 3, the scores for 3D voice processing are significantly higher than those for standard 2D voice processing.
Figure 3: GMOS as a function of noise type for 3D voice processing and standard 2D voice processing
3D voice processing value added
Three-dimensional voice processing provides a variety of benefits to consumers in addition to improved voice control. It takes the strain out of hearing, being heard and being understood in any surroundings, whether speaking directly into the microphone or in speaker mode. Conference calls can be taken on the go, in an office or in a noisy public venue without compromising voice quality or intelligibility. Background sound replacement, with music or other sounds, could become a revenue generator for operators, offered as a premium service much like ring tones.
For safety and convenience reasons, hands-free operation will often be the first choice for consumers. And yet voice control is just beginning to realize its true potential. Test results indicate that 3D voice processing can significantly improve the reliability and usability of voice control, enabling it to become a valuable differentiator. With the latest technology, consumers can realize these additional benefits while operators and consumer electronics manufacturers gain a new series of revenue-generating products and services.