Like so many of you, I visited family over the holidays to find their home riddled with talkative but useful voice control devices from Google and Amazon. It’s still striking to me how good the speech recognition layer has gotten, especially the acoustic interface. Sure, the underlying neural networks cutting HMMs off at the knees while humming along on banks of GPUs are very interesting, but do you even know how many microphones the Amazon Echo has? Seven! There’s a lot of DSP that happens before the audio gets to Amazon’s servers: beamforming, de-reverberation / echo removal, noise canceling, voice activity detection, localization. The magic that lets the Echo talk with you from across the room owes as much to acoustic & DSP engineering as to “machine learning.” And those microphones have a lot to do with it.
I’ve given a lot of thought to voice & natural language interfaces over the years and multi-microphone voice activity detection (VAD) in particular; I’ve been working on a related hardware project that I hope I can describe soon. In the meantime, I thought I’d see what I can learn about the state of the art in multi-microphone consumer hardware. It’s sadly a bit too hard to crack open an Echo or Home and fiddle with the individual microphones or the DSP processor. But a few months ago, the ReSpeaker Kickstarter caught my attention. I’m innately suspicious of most crowd-funded electronics but the company behind it, Seeed, has shipped before. I knew I’d get something relatively on time, even if it’d be a just-out-of-alpha board with no documentation or useful code. And I did: although delivery was promised for November, I got the package from Shenzhen just last week.
The ReSpeaker package is two things: a MediaTek WiFi MIPS processor running Linux, based on their MT7688, combined with the more interesting ReSpeaker Mic Array that optionally fits on top. The mic array appears to be an almost perfect clone of the Echo microphone module, with the same number of microphones arranged in the same manner. It’s powered by an XMOS xCORE chip. XMOS makes their own dev boards (also shaped like the Echo board) that go for $500-1500, so for only $79 once available in February, you’ve got a relatively workable seven-microphone far-field system to do your bidding. I hope.
The first thing I did with the ReSpeaker system is try to find any documentation. There really isn’t much — a very spartan landing page and an ill-attended forum. I attached the mic array board to the top of the MediaTek board, fingers crossed that I had the right orientation, plugged the micro-USB cable into my Mac, and watched some LEDs spin around. Very pretty! But I wanted to dig in a little more than that. I noticed my Mac had gained a new serial port, so I blindly connected to it using screen at 115200 bps and got this magical login screen:
Looks like the MediaTek board is running OpenWrt. Poking around, you can see that the mic array is attached over USB, and there is a Python library for getting audio from the board, as well as controlling the LEDs, over USB HID. The Python library led me to an in-progress but official getting-started page, so I took the time to set up WiFi on the board and try out some of the examples.
It’s clear that the ReSpeaker as sold is not going to fully replace an Echo or Google Home device for you. You can run a speech recognition (ASR) kit (PocketSphinx) on the MediaTek’s MIPS chip, but it’s very slow, with obvious buffer under-runs. The way the team at Seeed would like this used is for the on-chip ASR to perform only “wake word” processing — listening for a short phrase only, then passing actual ASR duties along to a third-party remote API like Bing / Cortana, Google Voice, Amazon, etc. This is a reasonable approach and in line with how the other hardware devices work, but if you were hoping for an “offline Echo,” this board will not help you. I was able to get wake word processing running using their Python library & PocketSphinx, but the lag in detection would be a deal killer for anything more than toy examples. That’s fine, though — the MediaTek processor is not the exciting part of the ReSpeaker package.
If you simply connect the mic array alone to your computer over USB (the array has its own micro-USB port), you get an audio-class-compliant microphone input that reports itself as 2-channel, 16 kHz, 24-bit audio. By default, the mic array is flashed with what appears to be custom-built XMOS firmware — no source, binary only — that has it doing beamforming, automatic gain control, de-reverberation, and noise reduction using all seven microphones. The output is a single audio stream (it supports stereo, but it looks like both channels are always the same) over USB that will work in any normal audio software / toolkit that supports 16 kHz recording. That input sample rate is odd enough to trip up Adobe Audition, for example (my Mac laptop’s output does not support 16 kHz playback, so it cannot set up the stream), but Audacity and PortAudio work fine. So without doing anything you can get a great far-field USB microphone tuned for voice control across a room, with much better acoustic specifications than the internal microphones on your laptop.
There’s a lot of DSP power on the XMOS chip, and we should be able to tweak parameters and re-configure the seven microphones to perform under different circumstances. It turns out the binary firmware installed on the ReSpeaker mic array is set up with a series of HID registers for parameterized control and data access. Out of the box, you can ask the mic array for statistics about the voice input, or change features of the acoustic processing in real time. To demonstrate, I built a simple C or Python (your choice!) script that wraps a USB HID library to get access to the registers of the mic array. For example, if you run the Python example, you can record audio while also seeing the detected angle of the voice (where the voice source is in space) as well as when the device detects speech (“voice activity detection,” aka VAD). Or you can make all the LEDs glow different colors, change the gain control, bypass the DSP, and a lot more. Check it out!
The code uses the hidapi library to access USB HID registers on the mic array device. You can write to or read from USB HID registers using a pretty straightforward socket-style approach. For example, reading the status of something (say, the automatic gain control’s current dB, or the state of an onboard LED) involves writing to a request register and then reading it back:
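Here’s a minimal sketch of that exchange in Python. The packet layout (one byte each for register address, a read flag, and payload length) is a simplified assumption for illustration, not the firmware’s documented framing; the `hid` device calls are shown commented out since they need the hardware attached, and the product ID is a placeholder.

```python
import struct

# NOTE: this framing (register, read flag, length, payload) is a simplified
# assumption for illustration. Check the register spreadsheet and the
# official Python library for the firmware's real packet format.
REPORT_LEN = 64     # full HID report size, zero-padded
READ_FLAG = 0x80    # assumed "this is a read request" marker

def build_read_request(register, length):
    """Pack a hypothetical 'read register' HID report."""
    header = struct.pack('<BBB', register & 0xFF, READ_FLAG, length & 0xFF)
    return header + b'\x00' * (REPORT_LEN - len(header))

def parse_response(report):
    """Unpack (register, payload bytes) from a hypothetical response report."""
    register, _, length = struct.unpack_from('<BBB', report)
    return register, bytes(report[3:3 + length])

# With the hidapi bindings (the `hid` package on PyPI), the round trip
# looks roughly like this. 0x2886 is Seeed's USB vendor ID; the product
# ID here is a placeholder:
#
#   import hid
#   dev = hid.device()
#   dev.open(0x2886, 0x0007)                  # VID, PID (PID assumed)
#   dev.write(build_read_request(0x10, 4))    # request register 0x10
#   reg, value = parse_response(bytes(dev.read(REPORT_LEN)))
```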
Luckily, the developers behind the ReSpeaker uploaded a Microsoft Excel file to one of their GitHub repositories with all of the existing registers that their firmware supports. It’s a bit hard to read, but here it is in CSV form.
You can see you have access to a lot of LED control, and then all sorts of parameters involving beamforming, reverb, echo removal, noise removal, gain control, delay estimation, VAD status, and voice angle. Voice angle and VAD are great demos of a microphone array: from the time difference of arrival (TDOA) of the sound at each microphone, one can estimate the angle the sound is coming from. Likely, the XMOS firmware is using a variant of GCC-PHAT. Here’s a run of the Python script where I stood around the microphone at different positions in my office:
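For the curious, the core of GCC-PHAT is only a few lines of numpy: whiten the cross-power spectrum so that only phase (i.e., delay) information survives, then find the peak of the inverse transform. This is a generic sketch of the technique, not the firmware’s actual implementation; the speed-of-sound constant and the angle helper for a single mic pair are my own additions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, at room temperature

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    # PHAT weighting: normalize away magnitude, keeping only phase.
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Rearrange so negative lags come first, then find the peak.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

def tdoa_to_angle(tau, mic_distance):
    """Convert a pairwise delay into an angle of arrival (radians)."""
    return np.arcsin(np.clip(SPEED_OF_SOUND * tau / mic_distance, -1.0, 1.0))
```

A seven-microphone array gives you many such pairs, so the firmware can combine pairwise estimates into a single robust 360° voice angle.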
Note that in this Python example I’m using an “auto report” register: on register 0xFF, the USB HID device (the ReSpeaker mic array) sends data whether or not it is asked for. In this case, the mic array broadcasts the very useful combination of VAD status (“is there voice coming in right now?”) and voice angle (“where is the voice coming from?”) as soon as the VAD status changes, without the USB host having to ask for it. You can also simply ask for the angle or VAD status by querying the registers whenever you want.
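Decoding that broadcast is just a matter of watching for reports tagged 0xFF and unpacking the payload. The byte layout below (one VAD byte followed by a little-endian 16-bit angle in degrees) is an assumption for illustration; the real packing is spelled out in the register spreadsheet.

```python
import struct

AUTO_REPORT_REG = 0xFF

def parse_auto_report(report):
    """Decode a hypothetical auto-report: [0xFF, vad, angle_lo, angle_hi, ...].

    Returns (vad, angle_degrees), or None if the report is not an
    auto-report. The byte layout here is assumed, not documented.
    """
    if not report or report[0] != AUTO_REPORT_REG:
        return None
    vad = bool(report[1])
    (angle,) = struct.unpack_from('<H', report, 2)
    return vad, angle

# A polling loop with the hidapi bindings might then look like:
#
#   while True:
#       report = bytes(dev.read(64, timeout_ms=500))
#       decoded = parse_auto_report(report)
#       if decoded:
#           vad, angle = decoded
#           print('speech' if vad else 'silence', 'at', angle, 'degrees')
```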
For those who want to dig even deeper, XMOS maintains an Eclipse-based IDE, xTIMEcomposer, for building new firmware. The ReSpeaker team has also released a DFU flasher to install new firmware on the array. This could make building new types of microphone processing easy.
Using this mic array alone with a more powerful computer (or even a more powerful embedded Linux board like a Raspberry Pi 3) could get you much closer to a home Echo that doesn’t have to “phone home” (to Seattle or Mountain View). Or you could transmit the voice audio to your own servers. I look forward to the community’s exploration of solid acoustic hardware applied to homegrown ASR & natural language understanding applications, going beyond what the current voice control devices let us do.