At the core of the smart speaker is the Intelligent Virtual Assistant (IVA), which enables the use of voice commands to direct the device to do everything from playing audio content (news, music, podcasts, etc.) to controlling home automation systems or even placing online shopping orders. It's worth noting that this same IVA technology, supported by microphones and loudspeakers, is being added to all sorts of home appliances (thermostats, set-top boxes, refrigerators), enabling voice control and thus turning them into "smart devices." Of course, most smartphones can also play the role of a smart speaker.
While this section of AP.com, and the variety of resources found here, primarily address smart speaker testing, most of the content applies equally to the broader category of smart devices and the measurement of their audio performance.
Measuring the performance of a smart speaker presents a variety of challenges, whether the testing is focused on a subsystem or the entire device. Many of these challenges relate to the IVA, the complexities of the various subsystems, and the resulting audio signal paths.
Smart Speaker IVAs
An interaction with a smart speaker begins with a specific "wake word" or phrase, followed by a command. In their normal operating mode, smart speakers are in a semi-dormant state but are always "listening" for the wake word, which triggers them to acquire and process a spoken command. In terms of speech recognition, the smart speaker itself is only capable of recognizing the wake word (or phrase); the more computationally intensive speech recognition and subsequent processing is done by the Intelligent Virtual Assistant on a connected server. Depending upon the evaluation being performed, the wake word may be an integral part of the test process.
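In software terms, this wake-word gating amounts to a simple capture loop: listen in short frames, run a small local detector, and only acquire and upload a command once the trigger fires. The Python sketch below illustrates that structure; `wake_word_detected` and `send_to_iva` are hypothetical placeholders, since the actual keyword spotter and IVA upload protocol are device-specific.

```python
import numpy as np
import sounddevice as sd   # third-party: pip install sounddevice

SAMPLE_RATE = 16000        # a common capture rate for speech
FRAME_SAMPLES = 1600       # 100 ms analysis frames
COMMAND_SECONDS = 5        # fixed command window, for simplicity

def wake_word_detected(frame: np.ndarray) -> bool:
    """Placeholder for the on-device keyword spotter.

    Real devices run a small, low-power model here; this stub simply
    returns False so the loop is runnable as-is.
    """
    return False

def send_to_iva(audio: np.ndarray) -> None:
    """Placeholder for uploading the digitized command to the IVA service."""
    print(f"Uploading {len(audio)} samples to the IVA...")

# Semi-dormant loop: always listening, but only the wake word triggers
# command capture and upload. Heavy speech recognition stays server-side.
with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
    while True:
        frame, _ = stream.read(FRAME_SAMPLES)
        if wake_word_detected(frame[:, 0]):
            command, _ = stream.read(COMMAND_SECONDS * SAMPLE_RATE)
            send_to_iva(command[:, 0])
```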
Audio Subsystems
Smart speakers contain several distinct audio subsystems, including:

- a microphone array, with its associated A/D converters, that captures speech on the input side;
- on-device signal processing, including the wake-word detector;
- a playback chain of D/A converters, amplification, and a loudspeaker system on the output side.
Audio Signal Paths
The primary audio paths for a smart speaker run between the device and the IVA, over the Internet via a Wi-Fi or wired connection. On the input side, a speech signal containing a spoken command is captured by the device's microphone array, digitized, and uploaded to the IVA for signal processing and command interpretation. On the output side, digital audio content is transmitted from a web server to the device, where it is converted from digital to analog and finally to an acoustic signal as it is played over the device's loudspeaker system. Smart speakers may also have several secondary audio paths (e.g., analog output and input jacks, network connections to other smart speakers, etc.).
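As a concrete illustration of the output path, the sketch below fetches a piece of audio content from a server, decodes it to PCM, and plays it through the local output device. The URL is a placeholder, and the content is assumed to be in a format libsndfile can decode (FLAC or WAV here); a real smart speaker would more likely be decoding an MP3 or AAC stream.

```python
import io
import requests            # pip install requests
import soundfile as sf     # pip install soundfile
import sounddevice as sd   # pip install sounddevice

# Hypothetical URL standing in for whatever content service the IVA
# directs the device to.
CONTENT_URL = "https://example.com/content/track.flac"

resp = requests.get(CONTENT_URL, timeout=10)
resp.raise_for_status()

# Decode the downloaded bitstream to PCM samples (libsndfile handles
# FLAC/WAV; commercial devices typically decode MP3 or AAC instead)...
audio, rate = sf.read(io.BytesIO(resp.content), dtype="float32")

# ...then convert digital to analog and on to an acoustic signal via
# the default output device's D/A converter and loudspeaker.
sd.play(audio, samplerate=rate)
sd.wait()
```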
Audio Testing
The audio subsystems of smart speakers have a multitude of components that contribute to overall performance and audio quality. At some stage, each of these components and systems must be tested, followed eventually by an end-to-end performance evaluation of the overall smart speaker system.
Testing a smart speaker’s primary input and output audio paths can be quite challenging for the following reasons:
1. Input to, and output from, a smart speaker are both acoustic, and acoustic testing is by its nature more complex than electronic (analog or digital) audio testing. Acoustic tests require calibrated measurement microphones, usually an anechoic test chamber, and a high-quality loudspeaker system to stimulate the microphones of the device under test (DUT).
2. Smart speakers are inherently open-loop devices. On the input side, a signal (typically speech) is captured, digitized, and transmitted to a server somewhere as a digital audio file. To assess input path performance, that audio file must be retrieved from the server and compared with the signal that was originally generated. On the output side, audio content that originates as a file on a server is streamed to the device, where it is converted to analog and played on the device's loudspeaker system. To assess output path performance, the loudspeaker output must be measured with a measurement microphone and compared with the original signal from the server. The original signal is often in encoded form (e.g., MP3 or AAC) and must be decoded before analysis. (A minimal output-path comparison is sketched after this list.)
3. The A/D and D/A converters in the device run from a different clock than the audio analyzer, so their actual sample rates will invariably differ (even when the nominal rates match), requiring some form of rate compensation during analysis. (A resampling sketch follows the comparison example below.)
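To make item 2 concrete, here is a minimal sketch of an open-loop output-path comparison in Python using NumPy, SciPy, and soundfile. It assumes the server-side reference has already been decoded to a WAV file, that both recordings are mono and share a sample rate, and that the file names are placeholders.

```python
import numpy as np
import soundfile as sf
from scipy import signal   # pip install scipy

# Placeholder file names: the decoded server-side reference and the
# measurement-microphone capture of the DUT's loudspeaker output.
ref, fs = sf.read("reference_decoded.wav")       # assumed mono
meas, fs_meas = sf.read("mic_capture.wav")       # assumed mono
assert fs == fs_meas, "resample first if the rates differ (see next sketch)"

# The paths are open loop, so the capture must be time-aligned to the
# reference; the cross-correlation peak gives the delay.
xcorr = signal.correlate(meas, ref, mode="full")
lag = int(np.argmax(np.abs(xcorr))) - (len(ref) - 1)
meas_aligned = meas[lag:lag + len(ref)]   # assumes capture started early (lag >= 0)

# Compare Welch magnitude estimates of the two signals to get the
# device's relative response in dB.
f, p_ref = signal.welch(ref, fs=fs, nperseg=4096)
_, p_meas = signal.welch(meas_aligned, fs=fs, nperseg=4096)
response_db = 10 * np.log10((p_meas + 1e-20) / (p_ref + 1e-20))
```

A real audio analyzer measures far more than a relative magnitude response (level, distortion, noise, weighting), but the align-then-compare structure shown here is the essential consequence of the open loop.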
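For item 3, one common form of compensation is to estimate the true clock-rate ratio from a pilot tone of known frequency and then resample the capture accordingly. The sketch below uses a deliberately coarse FFT-peak estimate and scipy.signal.resample_poly; the file name and pilot frequency are placeholders, and a production analyzer would estimate the ratio far more precisely than a single FFT bin allows.

```python
import numpy as np
import soundfile as sf
from fractions import Fraction
from scipy.signal import resample_poly

FS_NOMINAL = 48000    # both clocks claim this nominal rate
PILOT_HZ = 1000.0     # a pilot tone of known frequency in the test signal

capture, _ = sf.read("analyzer_capture.wav")   # placeholder; assumed mono

# Locate the pilot's apparent frequency via the FFT peak. Bin k of an
# N-point transform corresponds to k * fs / N hertz.
windowed = capture * np.hanning(len(capture))
spectrum = np.abs(np.fft.rfft(windowed))
apparent_hz = np.argmax(spectrum) * FS_NOMINAL / len(capture)

# The apparent/expected ratio reflects the clock mismatch between DUT
# and analyzer; resampling by its rational approximation compensates,
# shifting the pilot back to its expected frequency.
ratio = Fraction(apparent_hz / PILOT_HZ).limit_denominator(1 << 20)
compensated = resample_poly(capture, ratio.numerator, ratio.denominator)
```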