Unemployment Diaries: WiFi ePaper display

As you may have heard I’ve been having some on-purpose downtime to catch up a bit on some personal projects. My time off has been a bit more existential than I initially planned on:

But it’s been going well, thank you for asking! I’ve been keeping busy / distracted by finally clearing out a long stack of personal projects, none with any commercial potential. I’ve noticed something interesting — when others use the word “hobbies” to describe what I’m up to, I recoil in fear, but I simultaneously downplay everything I work on in my studio in Greenpoint as temporary and “just for fun.” I should probably just own what I’m doing a bit more: it is OK to do things that won’t end up changing the world. I see it as catch-up education after not building something I could touch for many, many years. 

I put together a nice WiFi-enabled ePaper / e-Ink display (EPD) to hang in the kitchen and show @cookbook recipes to inspire us to try different recipes. It was surprisingly fun and easy. I wrote up a HOWTO guide on GitHub, with all the connections, code and equipment you need (not much, and you can buy almost all of it on Sparkfun or Amazon.) It updates once an hour and picks a random tweet from a web service hosted on Google App Engine, but you can have it do whatever you’d like. I learned a lot about the low power mode of the ESP8266; combine that with the EPD and you’ve got a screen that can last years hanging on a wall with a 2000mAh battery.
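To get a feel for why "years" is plausible, here is a rough back-of-the-envelope sketch. Only the 2000mAh capacity and the once-an-hour wake-up come from this build; the sleep current, awake current and awake time are illustrative guesses rather than measurements, and the estimate ignores battery self-discharge and regulator losses.

```python
def battery_life_days(capacity_mah, sleep_ua, active_ma, active_s, period_s):
    """Estimate battery life for a device that wakes once per period."""
    sleep_s = period_s - active_s
    # Time-weighted average current (mA) over one wake/sleep cycle.
    avg_ma = (active_ma * active_s + (sleep_ua / 1000.0) * sleep_s) / period_s
    hours = capacity_mah / avg_ma
    return hours / 24.0


# Assumed numbers: 20 uA asleep, 80 mA awake for 3 s, once per hour.
days = battery_life_days(capacity_mah=2000, sleep_ua=20,
                         active_ma=80, active_s=3, period_s=3600)
print(f"roughly {days / 365:.1f} years")
```

The awake time dominates: with these guesses, every extra second spent on WiFi each hour costs more than the whole hour of sleep, which is why the low power mode matters so much.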

To be honest, I started down this path because I want an actual computer with an e-Ink display. I was hoping for a “FreeWrite” style keyboard & EPD combo with a larger work area and more functionality. I ended up buying an embarrassingly large number of random EPD development boards off of eBay and DigiKey, and found that none yet are big enough or have a good enough refresh rate to be even slightly usable for a text-based interactive computer display. You can attempt to hack up a recent Kindle or Nook but it’s terribly fiddly. The Pervasive Display kits are close, as the newer models allow sub-region updates (you can tell the EPD to only update a bounding box instead of the whole thing, good for interactive text editing / display), but they are not big enough and are geared towards Raspberry Pi “shields,” with fiddly I2C or SPI wiring arranged for their 40 pin headers. The board I used for this uses a simple serial protocol that anything can control, but takes almost a second to update the screen. So I toned down my ambitious design and just made a nice object. More of that to come.

“Build your own Echo?” The ReSpeaker mic array

Like so many of you, I visited family over the holidays to find their home riddled with talkative but useful voice control devices from Google and Amazon. It’s still striking to me how good the speech recognition layer has gotten, especially the acoustic interface. Sure, all the underlying neural networks cutting HMMs off at the knees humming along on banks of GPUs are very interesting, but do you even know how many microphones the Amazon Echo has? Seven! There’s a lot of DSP that happens before the audio gets to Amazon’s servers: beamforming, de-reverberation / echo removal, noise canceling, voice activity detection, localization. The magic that lets the Echo talk with you from across the room owes as much to acoustic & DSP engineering as “machine learning.” And those microphones have a lot to do with it.

I’ve given a lot of thought to voice & natural language interfaces over the years and multi-microphone voice activity detection (VAD) in particular; I’ve been working on a related hardware project that I hope I can describe soon. In the meantime, I thought I’d see what I can learn about the state of the art in multi-microphone consumer hardware. It’s sadly a bit too hard to crack open an Echo or Home and fiddle with the individual microphones or the DSP processor. But a few months ago, the ReSpeaker Kickstarter caught my attention. I’m innately suspicious of most crowd-funded electronics but the company behind it, Seeed, has shipped before. I knew I’d get something relatively on time, even if it’d be a just-out-of-alpha board with no documentation or useful code. And I did: although delivery was promised for November, I got the package from Shenzhen just last week.

The ReSpeaker package is two things: a MediaTek WiFi MIPS processor running Linux, based on their MT7688, combined with the more interesting ReSpeaker Mic Array that optionally fits on top. The mic array appears to be an almost perfect clone of the Echo microphone module, with the same number of microphones arranged in the same manner. It’s powered by an XMOS xCORE chip. XMOS makes their own dev boards (also shaped like the Echo board) and they go for $500-1500, so for only $79 once available in February, you’ve got a relatively workable 7 microphone far field system to do your bidding. I hope.

The first thing I did with the ReSpeaker system is try to find any documentation. There really isn’t much: a very spartan landing page, and an ill-attended forum. I attached the mic array board to the top of the MediaTek board, fingers crossed I had the right orientation, plugged in the micro USB cable to my Mac, and watched some LEDs spin around. Very pretty! But I wanted to dig in a little more than that. I noted my Mac gained a new serial port, so I blindly tried to connect to it using screen at 115200 bps, and got this magical login screen:


Looks like the MediaTek board is running OpenWrt. Poking around, you can see that the mic array is attached over USB, and there is a Python library for getting audio from the board as well as control over the LEDs over USB HID. The Python library led me to an in-progress but official getting started page, so I took the time to set up the WiFi on the board and try out some of the examples.

It’s clear that the ReSpeaker as sold is not going to fully replace an Echo or Google Home device for you. You can run a speech recognition (ASR) kit (PocketSphinx) on the MIPS chips of the MediaTek, but it’s very slow with obvious buffer under-runs. The way the team at Seeed would like this used is for the on-chip ASR to perform only “wake word” processing — listening for a short phrase only, then passing along actual ASR duties to a 3rd party remote API like Bing / Cortana, Google Voice, Amazon, etc. This is a reasonable request and in line with how the other hardware devices work, but if you were hoping for an “offline Echo,” this board will not help you. I was able to get wake word processing running using their Python library & PocketSphinx, and the lag in detection would be a deal killer for anything more than toy examples. But that’s fine — the MediaTek processor is not the exciting part of the ReSpeaker package. 

If you simply connect the mic array alone to your computer over USB (the array has its own micro-USB port), you get an audio class compliant microphone input that reports itself as 2 channel, 16 kHz, 24-bit audio. By default, the mic array is flashed with what appears to be custom built XMOS firmware (no source, binary only) that has it doing beamforming, automatic gain control, de-reverberation and noise reduction using all seven microphones. The output is a single audio stream (it supports stereo but it looks like both channels are always the same) over USB that will work in any normal audio software / toolkit that supports 16 kHz recording. That input sample rate is odd enough to trip up Adobe Audition, for example (my Mac laptop’s output does not support 16 kHz playback so it cannot set up the stream), but Audacity and Portaudio work fine. So without doing anything you can get a great far field USB microphone tuned for voice control across a room, with much better acoustic specifications than the internal microphones on your laptop.

There’s a lot of DSP power on the XMOS chip, and we should be able to tweak parameters and re-configure the seven microphones to perform under different circumstances. It turns out the binary firmware installed on the ReSpeaker mic array is set up with a series of HID registers for parameterized control and data access. Out of the box, you can ask the mic array for statistics about the voice input, or change features of the acoustic processing in real time. To demonstrate, I built a simple C or Python (your choice!) script that wraps a USB HID library to get access to the registers of the mic array. For example, if you run the Python example, you can record audio while also seeing the detected angle of the voice (where in space the voice source is) as well as when the device detects speech (“voice activity detection” aka VAD.) Or you can make all the LEDs glow different colors, change the gain control, bypass the DSP, and a lot more. Check it out!

The code uses the hidapi library to access USB HID registers on the mic array device. You can write to or read from USB HID registers using a pretty straightforward socket-style approach. For example, reading the status of something (say, the automatic gain control’s current dB, or the state of an onboard LED) involves writing to a request register and then reading it back:

# Read length data from a register, return the data
def read_register(register, length):
    # To read a register you send reg & 0x80, and then read it back
    # If you have blocking off the read will return none if it's too soon after
    send_data = [0, register, 0x80, length, 0, 0, 0]
    what = _dev.write(send_data)
    ret = _dev.read(len(send_data) + length)
    return ret[4:4+length] # Data comes in at the 4th byte
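
The request/response shape in that snippet can be tried without the hardware attached. This is only a sketch built from the snippet itself: the 7-byte request layout and the payload offset of 4 come from it, while `FakeDevice`, its 4-byte response header and register `0x10` are hypothetical stand-ins invented here for offline testing, not part of hidapi or of the ReSpeaker firmware.

```python
READ_FLAG = 0x80  # value taken from the read_register snippet


def build_read_request(register, length):
    """Build the 7-byte report that asks for `length` bytes of `register`."""
    return [0, register, READ_FLAG, length, 0, 0, 0]


def parse_read_response(response, length):
    """Return the payload, which starts at byte index 4, or None if short."""
    if response is None or len(response) < 4 + length:
        return None  # e.g. a non-blocking read that came back too early
    return list(response[4:4 + length])


class FakeDevice:
    """Hypothetical stand-in exposing write()/read() like the real device."""

    def __init__(self, registers):
        self.registers = registers
        self._pending = None

    def write(self, data):
        self._pending = (data[1], data[3])  # remember register and length
        return len(data)

    def read(self, size):
        register, length = self._pending
        # Guessed 4-byte header, then the payload at index 4.
        return [register, READ_FLAG, length, 0] + self.registers[register][:length]


def read_register(dev, register, length):
    request = build_read_request(register, length)
    dev.write(request)
    return parse_read_response(dev.read(len(request) + length), length)


dev = FakeDevice({0x10: [30, 0]})  # 0x10 is a made-up register number
print(read_register(dev, 0x10, 2))
```

Separating packet building from parsing keeps the device-specific parts small, and the short-read check gives one place to handle a non-blocking read that returns too soon.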

Luckily, the developers behind the ReSpeaker uploaded a Microsoft Excel file to one of their GitHub repositories with all of the existing registers that their firmware supports. It’s a bit hard to read, but here it is in CSV form.

You can see you have access to a lot of LED control, and then all sorts of parameters involving beamforming, reverb, echo removal, noise removal, gain control, delay estimation, VAD status, and voice angle. Voice angle and VAD are great demos of a microphone array: from the small differences in when a sound arrives at each microphone (the time difference of arrival, or TDOA), one can estimate the angle the sound is coming from. Likely, the XMOS firmware is using a variant of GCC-PHAT. Here’s a run of the Python script where I stood around the microphone at different positions in my office:

carry:respeaker-xmos-hid bwhitman$ python listen_and_get_position.py
Mic gain is set to 30
time 0.88428 angle: 30 vad: 2
time 3.54835 angle: 30 vad: 0
time 3.97157 angle: 330 vad: 2
time 5.57193 angle: 330 vad: 0
time 6.38029 angle: 150 vad: 2
time 7.85256 angle: 150 vad: 0
time 8.37984 angle: 150 vad: 2
time 8.97196 angle: 150 vad: 0
time 9.16385 angle: 150 vad: 2
time 9.81978 angle: 150 vad: 0

Note that in this Python example I’m using an “auto report” register: this is data sent by the USB HID device (the ReSpeaker mic array) on register 0xFF whether or not it is asked for. In this case, the mic array is broadcasting the very useful data of VAD status (“is there voice coming in right now”) and voice angle (“where is the voice coming from?”) as soon as the VAD status changes, without the USB host having to ask for it. You can also simply ask for the angle or VAD status by querying the registers whenever you want.
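For intuition about how an angle can come out of arrival times, here is a toy GCC-PHAT delay estimate for a single pair of microphones. It is a sketch of the textbook technique, not the XMOS firmware (which is binary only): the naive DFT, the synthetic noise signal and the 0.1 m mic spacing are all assumptions made for illustration, and only the 16 kHz sample rate comes from the device. A single pair can only resolve angles between -90 and +90 degrees, whereas the real array uses seven microphones to report a full 360.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; fine for short demo signals."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(sig, ref, max_lag):
    """Delay of sig relative to ref in samples (positive: sig arrives later)."""
    n = len(sig) + len(ref)  # zero-pad so circular correlation does not wrap
    a = dft(list(sig) + [0.0] * (n - len(sig)))
    b = dft(list(ref) + [0.0] * (n - len(ref)))
    cross = [x * y.conjugate() for x, y in zip(a, b)]
    # PHAT weighting: discard magnitude, keep only phase.
    cross = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    corr = [v.real for v in dft(cross, inverse=True)]
    # Negative lags live at the end of the circular correlation.
    return max(range(-max_lag, max_lag + 1), key=lambda lag: corr[lag % n])


def lag_to_angle_deg(lag, sample_rate, mic_spacing_m, speed_of_sound=343.0):
    """Convert a sample delay between two mics into an arrival angle."""
    path_diff = speed_of_sound * lag / sample_rate
    ratio = max(-1.0, min(1.0, path_diff / mic_spacing_m))
    return math.degrees(math.asin(ratio))


rng = random.Random(0)
ref = [rng.uniform(-1, 1) for _ in range(32)]  # synthetic "voice" at mic A
sig = [0.0] * 3 + ref                          # same sound, 3 samples later at mic B
lag = gcc_phat_delay(sig, ref, max_lag=8)
print(lag, round(lag_to_angle_deg(lag, 16000, 0.1), 1))
```

At 16 kHz one sample of delay corresponds to about 2 cm of extra path, so a small array only has a handful of distinguishable lags per microphone pair. That may be part of why the reported angles in the run above land on coarse steps such as 30, 150 and 330.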

For those that want to dig even deeper, XMOS maintains an Eclipse IDE based tool to build new firmware called xTIMEcomposer. The ReSpeaker team also released a DFU flasher to install new firmware on the array. This could make building new types of microphone processing easy.

Using this mic array alone with a more powerful computer (or even a more powerful embedded Linux board like a Raspberry Pi 3) could get you much closer to a home Echo that doesn’t have to “phone home” (to Seattle or Mountain View.) Or you could transmit the voice audio to your own servers. I look forward to the community’s exploration of solid acoustic hardware applied to homegrown ASR & natural language understanding applications, going beyond what the current voice control devices let us do. 

Leaving Spotify & The Echo Nest

My last day at Spotify was last week. For the first time since May 2000, I’m not working on music discovery. I’m going to take some time off to finish some personal projects and start something new in 2017. I love Spotify: the people, the product, the creators, the users, and the mission. Their acquisition of my company The Echo Nest in 2014 was excellent for everyone involved but especially for artists and listeners. We’ve changed the landscape of music. It took me a long time to come to this decision, but it’s now time for me to learn more things and try something new.

My professional & personal life for so long has revolved around scalably helping artists find fans and fans finding new artists. I was a musician and computer science grad student in NYC at the top of the millennium, trying to tie together all the fast changes in distribution, recommendation, machine learning, signal processing and natural language understanding. Over the next five years I became an academic at MIT working to fully explore the connection between the sound of music, the way people described it, and how it was received. With my labmate Tristan’s focus on musical signal processing we started The Echo Nest in 2005. We quickly built our research into products and grew our team to include our CEO Jim, Paul, offices in Somerville, NYC, SF, London and 70 amazing engineers, scientists and music-crazy people. Over the next 9 years, we powered the world of music for practically every single online service out there with a novel developer platform strategy.

The acquisition by Spotify in March 2014 was simply perfect. Both sides put endless amounts of effort into making it work, and within a few months we had fully integrated teams with a stunning new focus on making the best recommendation and music understanding products. The personalization, retrieval and knowledge graph team is now one of the biggest at Spotify. Almost every single former Echo Nest employee is still there after close to 3 years and loving the opportunity – very rare for technology acquisitions. I moved to our NYC office, still regularly visited “Spotify Boston” and was very lucky to sit next to the combined team as they built out our now-tentpole discovery products: Fresh Finds, Discover Weekly, Release Radar, and Daily Mix. Independent artists write me every day with a beautiful story about their appearance on Fresh Finds or an editorial playlist that then scaled to hundreds of thousands or millions of listeners via Discover Weekly or Daily Mix. We’re scaling with them: we’ve heavily invested in the future of discovery through research, machine learning, curation and data engineering, and there are so many amazing things yet to come. The fight for care & scale in music discovery is far from over, but I can now step back and let their magic happen. It’s extraordinary to be able to watch an entire field form up from under you and even more amazing to be able to walk away to see where it goes.

I’m taking about three months off to rest, regroup, visit companies and friends and finish up some long-simmering personal projects. After that, the only thing I know right now is that I’ll be ready to do it all over again. Like many of my friends, I’m especially reflective these days on the role of prediction, privacy, information retrieval and machine learning on our culture. Music at its best acts as both a lens towards as well as a projection on the rest of our society. We’ve made great strides increasing the diversity of styles and the musicians themselves that people are listening to through careful editorial & algorithmic approaches. We take bad results very personally. We do everything we can to help surface creativity of all types and scale that beautiful moment when a true message hits its receiver. I need to do more, particularly beyond music.

Please reach out if you’d like to chat. You’re all great.

Brian brian@variogr.am

Greenpoint NYC

Nov 16 2016

Understanding the brand new

Fresh Finds

Today, my favorite personal project at Spotify since the acquisition is getting soft-launched alongside a great long piece in Fast Company about discovery at Spotify: “Fresh Finds,” a weekly updated playlist of music that no one has heard yet but will break out soon. The playlist is powered by the careful and passionate work of a small team at Spotify: Kurt Jacobson, Athena Koumis, Jason Steinbach, Dan Stowell, and myself.


“Fresh Finds” is made possible by a scalable analysis of the musical activity happening outside Spotify: daily, we automatically find artists people are talking about on music blogs and news sites more often and with more intensity than their playcounts should suggest. These are the artists that find fans through word-of-mouth, shows and the hard work of making unique music that connects to at least one person. We then filter those artists through a real time analysis of Spotify listening behavior and weekly generate a list of brand new songs that we think will gain in popularity the next week.

Here’s Fresh Finds, updated weekly on Wednesdays:

The listener activity that happens to music deep in this brand-new, unheard part of the spectrum is hard to automatically understand. It comes from nowhere, and people discover it from other people, often outside of our platform. They read music blogs and press, or a friend in the know passes on a link. It’s not based on popularity or audio or likes or clicks. Some of these Fresh Finds had virtually no plays when data indicated we should publish them on the playlist. Watching a brand new artist release a brand new track with no connection or external push and seeing it at first slowly, then rapidly gain in plays on Spotify has been life-affirming.

I see recommendation, filtering, or prediction of this “brand new” as a new artistic frontier in music understanding. Every music data scientist wants to help artists and listeners, but the quantity of known good that a precise recommendation for a well known band earns all but vanishes when stacked up against the connection between a new unknown and her new fan base. I’m ecstatic to help that process even a little bit.

My favorite jams from the past 6 months of Fresh Finds:

Since Fresh Finds started working internally for us early this year, I’ve discovered more music there than from any other technology or web site or service I’ve ever used. It’s been a great pleasure to hear new musicians’ fresh new work every morning. I’ve seen more local shows, I’ve helped new artists get booked, I awkwardly excitedly tweet about artists, and I’ve never felt better about the future of music.

I hope you enjoy Fresh Finds, and I can’t wait for the next thing we can do with its potential.

10 Years

10 years ago today, in 2005, Tristan and I signed the documents to incorporate The Echo Nest Corporation in Delaware, making me a “co-founder” right out of grad school. We were fanatical about music discovery and wanted our unique blend of technology to help listeners and artists. I’ve since worked harder than I ever thought imaginable, overseen spectacular successes and overcome massive failures, turned equal parts jerk, compassionate, and anxious ball of wire; I questioned my life’s worth untold times, went from comfortable to destitute to overwhelmed; and I met all of my best friends. And we’re still fanatical about music discovery, and our unique blend of technology is helping listeners and artists.

We made a big change last year, at the peak of our powers: we were generating recommendations and playlists over 300 times a second for our customers and had grown to over 60 attractive employees in early 2014. And since then, of course, it’s gotten even better. But I never trust any startup person that says they’ve won. There’s always more to do. Nonetheless, I’m going to take a breath today to remember all the amazing people that got us here.

To the hundreds of you who were ever a part of this: I thank you to pieces. You both made this thing work, and were the thing itself. I’m sorry if I asked you who paid your salary that one time, or called you at midnight because a service went down, or reminded you how to spell our name, or got too emotional during an all-hands speech, or rewrote one of your lines of code. You have to understand: my dominant feeling throughout the course of The Echo Nest’s life was surprise. I was surprised that we could start a company from our dissertations. I was surprised we could hire people. I was surprised Jim wanted to join us as CEO. I was surprised we could raise money. I was surprised we got our first customers. I was surprised the people we hired cared so much. I was surprised everyone was working so hard, and that the company was becoming so successful. I was surprised Spotify was interested and that it’s worked so well. I remain surprised that everyone’s still with us, happier than ever in our much bigger new family, working even harder on the next big thing.

I was always standing as far ahead of the boat as I could, eyes wide in awe that we somehow hadn’t run aground, but barking behind me to try to ensure we wouldn’t. Maybe you all should have tied me to the mast instead; we did great.

The empty office, Davis Square, August 2005

The early board

The first board of directors, including Barry, Don Rose & McLagan, Bethe, Andre & Dorsey, and Elliot, 2008

Jim and T and Tim

Jim and Tristan and me and Tim, 2008

Introduction in Amsterdam

My hosts’ introduction of me before a talk in Amsterdam, 2009

Boston Phoenix

Early article in the Boston Phoenix

21 days until Echoprint release

Usual Brian management style, here of poor Alastair pre-Echoprint release, 2010

Team Ghost tracks

Our London team settles in in Somerville

Elissa and Amanda

Early 2013 group photo

Telling the office

Telling the office what had just happened (including 6am SF office), March 6 2014

I want to know the size and temperature of bread while it rises


We built a device to measure bread as it rises and in the process I gave a lot of thought to tools and inventors.

If you create meticulous tools and platforms, only a tiny fraction of the world is going to have the desire and knowledge to latch on. Your best hope is to be bright enough for other tool-builders to swarm to, and then hope the weight of the pyramid stacked on top of you won’t kill you. It might be that the startup fan fiction of “prosumer builders” simply doesn’t have a quorum strong enough to fund an industry – maybe their one-off projects remain just that: jumper wires in a shoe box. The technologists and inventors I know tend to avoid reliance on existing platforms; we desire total control, we want to burrow down from the physical interface (pins, interfaces, connectivity) to the electrons swirling around the transistors until we’re comfortable that what we make is suitably ours.

Early sketch

I thought a lot about this natural law when my friend asked me if a device she had envisioned was possible. She bakes a lot of bread, and wanted a way to visualize the rise: both the size of the growing dough and its internal temperature. She wanted it to show on her phone so she could be somewhere else during a long rise and check in on it once in a while. So figure the continuum: we could have bought some “internet of things” kit from Best Buy in a white plastic case that graphed temperature on our phones. And maybe a camera that uploaded live video to Google to check the size. That’s expensive, overwhelming, and although it would work, it doesn’t fit what we wanted. It would feel wasteful. A technologist sees that as potential: it could be smaller, it could cost a lot less, and tries to fill it: I could source some microcontrollers, design a PCB, 3D print a case, design a reliable web service, experiment with range and temperature sensors and instead make a bona-fide product where only one would ever be made.

I instead ended up with something new, in the middle, that surprised me: this piece of hardware, the Sparkfun Thing, can ship to you overnight, and costs $16. It can talk over WiFi. It’s got GPIO, ADC, I2C, and a li-poly charger circuit. It can be programmed over USB using the Arduino software. We built the Bread Detector using The Thing, and I was simply impressed with the balance of control and “batteries-included” (literally.)

The difference wasn’t the onboard specs or even cost, but the service stack: The Thing comes pre-packaged with instructions and examples for posting the output of sensor data to the Phant data logging service, which is free with limitations – you can send data roughly every 10 seconds, and are limited to 50 MB before it rolls over. Phant was that missing layer: with one line of code I can push bits to a reliable service with a simple API, ready for analysis, graphing, alerts.
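That one-line push is essentially a single HTTP request. Here is a minimal sketch of what such a request looks like, assuming the input URL pattern that Phant documented at the time (`/input/PUBLIC_KEY?private_key=...&field=value`). The keys and the field names `temp_c` and `height_mm` are placeholders rather than the Bread Detector's actual stream, and the code only builds the URL instead of sending it.

```python
from urllib.parse import urlencode


def build_phant_url(public_key, private_key, fields, base="http://data.sparkfun.com"):
    """Build a Phant-style logging URL; `base` and the keys are assumptions."""
    params = [("private_key", private_key)] + list(fields.items())
    return f"{base}/input/{public_key}?{urlencode(params)}"


# Placeholder keys and hypothetical field names for one bread reading.
url = build_phant_url("PUBLIC_KEY", "PRIVATE_KEY", {"temp_c": 27.5, "height_mm": 62})
print(url)
```

On the device itself, the equivalent would be an HTTP GET to that URL issued from the Arduino sketch, spaced out to stay within the rate limit mentioned above.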

The simplicity of the platform also means something like The Bread Detector can be built by anyone, with simple wiring and low-cost sensors. I wonder if the software layer was our problem all along: I can teach a less-inclined person to solder and plug in wires, but would have a much harder time getting them to set up a web service to store data and respond to queries.

When I mentioned the Phant service and the Thing on Twitter, Antonio, someone I don’t normally bet against, said he thinks it will be free one day. Put on your capitalist hat for a second. How will that work? The Amazon Dash buttons have a clear proposition: you’ll buy more stuff, so give them away. Will the data sitting on Phant soon earn Sparkfun greater revenue than the parts? Will every 5th bread rise give me an ad for King Biscuit?

Perhaps in the near future, the prosumer builders’ products themselves become a very different kind of product.

Detecting some bread

Walk to work

I recorded my walk to work on binaural microphones. You should hear it on headphones sometime, it’s strangely soothing.

I made this all with a (relatively) cheap kit: the Tascam DR-05 recorder and a pair of Soundman OKM II Studio Binaural Microphones. Soundman only appears to sell their microphones, new, on eBay, these days.

When you walk with these on you feel like a performance. You listen just as carefully as the microphones are. You move fluidly and try not to rustle, cough, or act aggressively. I walk this hour (including the 10 minute ferry ride) twice a day, whenever I’m in NYC during the week. Even if I never shared it, or deleted the WAV file immediately, I’d want to keep doing it this way.

The Echo Nest joins Spotify

We’re very excited to announce that The Echo Nest is joining Spotify, starting today! We can’t imagine a better partner for our next chapter. Spotify shares the intense care for the music experience that was the founding principle of our company, and it’s clearly winning the hearts and minds of music fans around the globe. Our dedicated team of engineers, scientists, music curators, business, and product people are utterly electrified with the potential of bringing our world-leading music data, discovery, and audience understanding technology directly to the biggest music streaming audience out there.

Together, we’re going to change how the world listens.

We started this company nine years ago in a kitchen at the MIT Media Lab, our dissertation defenses looming. We never wanted to do anything but fix how people were discovering music. None of the technologies in those days were capable of understanding music at scale. We were each working on separate approaches that, when combined, could really do that. All the while, we were watching the world of music change around us. We knew some version of Spotify was to come, and that the real power was in that beautiful moment when you found a new band or song to love. Every decision we’ve made since then, including today’s announcement, was made from that vantage point of care and often insane passion.

Starting a company is a bit crazy. You get the idea you can build a family from scratch and let them loose on the problem that drives you. We moved into an empty room in Somerville, MA in 2005, were soon joined by our CEO Jim Lucchese, and then grew a team of around 70 people, all through the power of communicating our one big idea. It’s hard to overstate how special this place is. With the team we have, we always have every expectation we can do whatever it takes in the service of music. We’ve written a lot of code, we’ve invented technology that will power the future of music for decades to come, we manage reams of data, and we work with everyone in the business. But the true power of this place stems from the people: an amazing family, fully dedicated to building the future of music.

We had such great help on the way. Tristan & Brian’s advisor at MIT and one of the fathers of computer music, Barry Vercoe, supported us through seed investment when we graduated, and when Jim joined, we brought on our dear friends Andre and Dorsey Gardner at Fringe Partners. As we grew, we tapped the great support of Elliot at Commonwealth, Antonio at Matrix and then Jeff at Norwest. And in between was the help and support from dozens of family and friends. We couldn’t have done it without them.

Obviously, moving from behind the curtain to the front stage comes with its own share of questions and challenges. We’ve been lucky enough to work with a wide range of creative companies and independent developers who showed the world what could be done with our technology. They helped us craft and refine our product to where it is today. We look forward to working with partners to embrace the new opportunity to build apps and services using The Echo Nest and Spotify. As we explore this new direction, we’ll help each other move forward.

When we began talking with our longtime friends at Spotify about working together, it became clear how much they share our vision: care for the cause of music at scale. We spent our first weeks together just giddy at the potential of all that special Echo Nest magic working directly with the world’s best place for music. You’re about to see some great stuff from the new Echo Nest-enabled Spotify, and we’re excited to hear what you think. We’re all staying in town, our API stays up, and every single person at our company will continue to focus on building the future of music. Talk to you soon; we’ve got some work to do.

For more information, see our press release.

Brian, Tristan, and Jim

with Aaron, Elissa, Tim, Paul, Matt, Mark, Joe, Eliot, Kurt, David, Dave, Amanda, David, Connor, Shane, Ned, Owen, Ellis, Andreas, Glenn, Joe, Dan, Nick, Aaron, Chris, Aaron, Stu, Kevin, Jason, Ajay, Michelle, Jyotsna, James, Hunter, Erich, Andrew, Nicola, Scott, John, Matt, Matt, Eric, Dylan, Eli, Michael, Adam, Alex, Colin, Jonathan, Marni, Smith, Krystle, Eric, Ben, Conor, Victor, Ryan, Bo, Michael, Athena, Chris, Gurhan, Peter, Kate, Bo, Scott, Jared, Darien, Matt, and Wayne.

Talk about A Singular Christmas at the Automatic Music Hackathon

I gave a talk about my A Singular Christmas at the Automatic Music Hackathon last week. Here’s what it looked like and what I said.

A Singular Christmas

Pretend that you’re new here, and you want to know what a bird is. You’re lucky: lots of people know what a bird is. They can show you a bird. This is Hilary Putnam’s linguistic division of meaning, semantic externalism. If you see enough things labeled “Bird,” you start to get a handle on what makes a bird a bird. They’ve got a beak or a certain color, they land on a branch, they spread their wings and fly.

The way I’ve ever understood anything is by endlessly imagining all its forms and presentations. Watch what’s similar and what surprises you. See enough of the same thing, and you can make a little machine to describe it. Snowflakes maybe are circles, except when they’re not. Sometimes fractal edges, sometimes straight, sometimes a number describing the fractalness. Sparkles on the edges, a water droplet from the microscope? So any new snowflake is a set of machines you can add up. Circle plus fractal edge plus sparkles equals your own snowflake.

We’ve all done this. We treat pictures like this, movies, touch. And of course every sound you hear these days is a series of multipliers of a basis function, spit out a speaker so fast you can’t hear the buzz. Add a bunch of component bits together to get your creativity or expression. Rehydrate the vectors into a speaker or screen again, and you probably don’t even notice.
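To make the "multipliers of a basis function" idea concrete, here is a minimal sketch, assuming a plain Fourier basis (sine and cosine waves). The talk doesn't say which basis any particular codec or speaker chain uses, and the names `analyze` and `resynthesize` are made up for illustration. It breaks a toy sound into one multiplier per basis function, then adds the weighted pieces back up.

```python
import cmath
import math

def analyze(samples):
    """Return one complex multiplier per sinusoidal basis function (a plain DFT)."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)) / n
        for k in range(n)
    ]

def resynthesize(weights, keep=None):
    """Add the weighted basis functions back up, optionally using only the `keep` strongest."""
    n = len(weights)
    if keep is not None:
        strongest = set(sorted(range(n), key=lambda k: abs(weights[k]), reverse=True)[:keep])
        weights = [w if k in strongest else 0 for k, w in enumerate(weights)]
    return [
        sum(w * cmath.exp(2j * math.pi * k * t / n) for k, w in enumerate(weights)).real
        for t in range(n)
    ]

# A toy "sound": a loud sine wave plus a quieter, faster one.
n = 64
signal = [
    math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 7 * t / n)
    for t in range(n)
]

weights = analyze(signal)               # the multipliers
rebuilt = resynthesize(weights)         # all components: matches the input
rough = resynthesize(weights, keep=2)   # two strongest components: only the loud sine is left
```

Keeping fewer components is the lossy part: `rough` drops the quieter sine entirely, which is the kind of thing you "probably don’t even notice" when it is done well.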

It’s in Pentland’s eigenfaces, so many years ago. You probably walked in the path of a dozen cameras trying this trick on your own face on the way over here tonight. Your phone has it built in, tries to tell if you’re smiling or maybe if you’re someone the government should know about.

But it fails more often than you can imagine. Vision guys call this registration. For a computer to get what something is, it’s got to line up. Keep the eyes in the same pixel. If someone is bigger than someone else. Or an outlier, like Facebook deciding your fishbowl is your grandmother. This is where we’re still better. We don’t normally confuse people with objects, and you only need to do that once. It turns out computers like skipping over repetitive things, and we appreciate those. It turns out computers get confused by loud noises.

I try to make this work better. I like when it fails, often, better than when it works really well. The algorithm annealing into a steady state has to be our culture’s greatest art. That we even had the hubris to encode our senses into a square floating point matrix of numbers. And that we even think that representation is good enough to understand the underlying thing.

I mostly do it with music. People know pretty much everything about every song ever, and there’s databases where you can get the pitch of the tenth guitar note, and what people said about it. Imagine the entire universe describing a song. And then you have the audio, too, and some computer-understandable description of all the events in the song.

I’ve been doing it for a while. This was 2003, a thing called “Eigenradio.” It took every radio station I could get in a live stream at the time, all at once, and figured out how to do basis computation and resynthesis in a sort of live stream back. The idea was to be “computer music.” Not music made on a computer, because everything is. But music for computers. What they think music actually is. It mostly sounded like this:

It took a lot of effort to do something like this. I taught myself how cluster computing worked, and scammed MIT into spending far too much money on something that would be a free tier on a cloud provider these days. The power kept going out. But the project was my favorite kind of irony, the one where the joke is nowhere near as funny as the reality it pokes at.

I have this whole other life that I’m not going to get into, but it involves knowing about music. Consider Christmas song detection. Thought experiment: imagine someone that doesn’t know Christmas, and you play them a bunch of Christmas songs, will they see a connection? Is there something innately Christmas about the music? Bells? Wide open melodies like a rabbit hopping on a piano? My theory was, if I could synthesize Christmas music from an analysis of all the Christmas music I could find, and people thought it sounded Christmas-y, we’ve cracked the code, we can have a Singular Christmas.

Do you want to know the magic trick? But doing this taught me one important lesson: synthesis is just fast composition. Computer people love to hate themselves because everything is so easy. But we all make things, often beautiful things, even if we didn’t mean to. Even if “the data did it” or you just threw a bunch of Matlab functions together or it only started sounding good when you started panning the sine waves into different channels. You’re composing.

This thing got everywhere. By far the most successful creative thing I’ve done. I was on the BBC on Christmas Eve, exasperatedly spelling out “eigenanalysis.” Pitchfork reviewed it, I got 4 stars. The MIT sysadmins and I had a big fight over its bandwidth. This excited Canadian man, on the radio.

My favorite things are the emails. Every December, right around now, they start slowly rolling in. How this album is the only thing they listen to during the holidays. How it means Christmas to them. I’m still working on this stuff, as a sort of hedge against my more mundane realities. I want to show the world there’s beauty in the act of understanding.

Very large scale music understanding talk @ NAE Frontiers

A few years ago I gave this talk at the very impressive NAE “Frontiers of Engineering” conference at the invitation of my more successful academic friends, and noticed they had published the transcript. A rare look at one of the reasons The Echo Nest exists, from my perspective:

Presented at NAE Frontiers of Engineering, 2010


Scientists and engineers around the world have been attempting something undeniably impossible– and yet, no one could ever question their motives. Laid bare, the act of “understanding music” by a computational process feels offensive. How can something so personal, so rooted in context, culture and emotion, ever be discretized or labeled by any autonomous process? Even the ethnographical approach – surveys, interviews, manual annotation – undermines the raw effort by the artists, people who will never understand or even perhaps take advantage of what is being learned and created with this research. Music by its nature resists analysis. I’ve led two lives in the past ten years– first as a “very long-tail” musician and artist, and second as a scientist turned entrepreneur that currently sells “music intelligence” data and software to almost every major music streaming service, social network and record label. How we got there is less interesting than what it might mean for the future of expression and what we believe machine perception can actually accomplish.

In 1999 I moved to New York City to begin graduate studies at Columbia working on a large “digital government” grant, parsing decades of military documents to extract the meaning of the acronyms and domain-specific words. At night I would swap the laptops in my bag and head downtown to perform electronic music at various bars and clubs. As much as I tried to keep them separate, the walls came down between them quickly when I began to ask my fellow performers and audience members how they were learning about music. “We read websites,” “I’m on this discussion board,” “A friend emailed me some songs.” Alongside the concurrent media frenzy on peer to peer networks (Napster was just ramping up) was a real movement in music discovery– technology had obviously been helping us acquire and make music, but all of a sudden it was being used to communicate and learn about it as well. With the power of the communicating millions and the seemingly limitless potential of bandwidth and attention, even someone like me could get noticed. Suitably armed with an information retrieval background alongside an almost criminal naiveté regarding machine learning and signal processing I quit my degree program and began to concentrate full time on the practice of what is now known as “music information retrieval.”

The fundamentals of music retrieval descend from text retrieval. You are faced with a corpus of unstructured data: time-domain samples from audio files or score data from the composition. The tasks normally involve extracting readable features from the input and then learning a model from the features. In fact, the data is so unstructured that most music retrieval tasks began as blind roulette wheels of prediction: “is this audio file rock or classical” [Tzanetakis 2002] or “does this song sound like this one” [Foote 1997]. The seductive notion that a black box of some complex nature (most with hopeful success stories baked into their names– “neural networks,” “Bayesian belief networks,” “support vector machines”) could untangle a mess of audio stimuli to approach our nervous and perceptual systems’ response is intimidating enough. But that problem is so complex and so hard to evaluate that it distracts the research from the much more serious elephantine presence of the emotional connection underlying the data. A thought experiment: the science of music retrieval is rocked by a massive advance in signal processing or machine learning. Our previous challenges in label prediction are solved– we can now predict the genre of a song with 100% accuracy. What does that do for the musician, what does that do for the listener? If I knew a song I hadn’t heard yet was predicted “jazz” by a computer, it would perhaps save me the effort of looking up the artist’s information, who spent years of their life defining their expression in terms of or despite these categories. But it doesn’t tell me anything about the music, about what I’ll feel when I hear it, about how I’ll respond or how it will resonate with me individually and within the global community. We’ve built a black box that can neatly delineate other black boxes, at no benefit to the very human world of music.
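For readers who haven't seen the "extract features, then learn a model" pipeline, a toy version might look like the sketch below. Everything in it is an assumption made for illustration: the two features (zero-crossing rate and RMS energy), the nearest-centroid "model," and the synthetic "tone" and "noise" labels standing in for genres. It is not the method of any of the cited papers, only the shape of the label-prediction black box the paragraph describes.

```python
import math
import random

def features(samples):
    """Two hand-picked features: zero-crossing rate and RMS energy."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return (crossings / (len(samples) - 1), rms)

def train(labeled):
    """Nearest-centroid 'model': the average feature vector for each label."""
    sums = {}
    for samples, label in labeled:
        f = features(samples)
        total, count = sums.get(label, ((0.0, 0.0), 0))
        sums[label] = ((total[0] + f[0], total[1] + f[1]), count + 1)
    return {label: (t[0] / c, t[1] / c) for label, (t, c) in sums.items()}

def predict(model, samples):
    """Return the label whose centroid is closest to this clip's features."""
    f = features(samples)
    return min(model, key=lambda label: math.dist(f, model[label]))

# Synthetic stand-ins for audio clips (hypothetical data, not real recordings).
def tone(cycles, n=1000):
    return [math.sin(2 * math.pi * cycles * t / n) for t in range(n)]

def noise(seed, n=1000):
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(n)]

model = train([(tone(3), "tone"), (tone(5), "tone"), (noise(1), "noise"), (noise(2), "noise")])
print(predict(model, tone(8)))    # tone
print(predict(model, noise(3)))   # noise
```

The sketch gets its labels right and, as the thought experiment above argues, tells you nothing about what either clip would feel like to hear.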

The way out of this feedback loop is to somehow automatically understand reaction and context the same way we could with perception. The ultimate contextual understanding system would be able to gauge my personal reaction and mindset to music. It would know my history, my influences and also understand the larger culture hovering around the content. We are all familiar with the earliest approaches to contextual understanding of music – collaborative filtering, a.k.a. “people who buy this also buy this” [Shardanand 1995] – and we are also just as familiar with its pitfalls. Sales or activity based recommenders only know about you in relationship to others– their meaning of your music is not what you like but what you’ve shared with an anonymous hive. The weakness of the filtering approaches becomes vivid when you talk to engaged listeners: “I always see the same bands,” “there’s never any new stuff” or “this thing doesn’t know me.” As a core reaction to the senselessness of the filtering approaches I ended up back at school and began applying my language processing background to music– we started reading about music, not just trying to listen to it. The idea was that if we could somehow approximate even one percent of the data that communities generate about music on the internet– they review it, they argue about it on forums, they post about shows on their blog, they trade songs on peer to peer networks– we could start to model cultural reaction at a large scale [Whitman 2005]. The new band that collaborative filtering would never touch (because they don’t have enough sales data yet) and acoustic filtering would never get (because what makes them special is their background, or their fanbase, or something else impossible to calculate from the signal) could be found in the world of music activity, autonomously and anonymously.
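The “people who buy this also buy this” idea, and its blind spot for new bands, can be sketched in a few lines. This is a bare co-occurrence counter with made-up band names and purchase baskets, assumed purely for illustration. It is not the algorithm from [Shardanand 1995] or anything The Echo Nest ships.

```python
from collections import Counter
from itertools import permutations

def co_purchases(baskets):
    """For every item, count how often each other item was bought alongside it."""
    counts = {}
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts

def also_bought(counts, item, n=3):
    """Items most often bought together with `item`, strongest first."""
    return [other for other, _ in counts.get(item, Counter()).most_common(n)]

# Hypothetical purchase histories, one list per anonymous listener.
baskets = [
    ["band_a", "band_b"],
    ["band_a", "band_b", "band_c"],
    ["band_b", "band_c"],
    ["band_a", "band_b"],
]
counts = co_purchases(baskets)

print(also_bought(counts, "band_a"))    # ['band_b', 'band_c']
print(also_bought(counts, "new_band"))  # []  (no sales data, so nothing to say)
```

A band with no sales yet gets no recommendations of its own and can never show up in anyone else's, which is the cold-start gap that reading what communities write about music is meant to fill.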

Alongside my co-founder, whose expertise is in musical approaches to signal analysis [Jehan 2005], I left the academic world to start a private enterprise, “The Echo Nest.” We are now thirty people, a few hundred computers, one and a half million artists, over ten million songs. The scale of this data has been our biggest challenge: each artist has an internet footprint of, on average, thousands of blog posts, reviews and forum discussions, all in different languages. Each song is composed of thousands of indexable events and the song itself could be duplicated thousands of times in different encodings. Most of our engineering work is in dealing with this magnitude of data– although we are not an infrastructure company we have built many unique data storage and indexing technologies as a byproduct of our work. The set of data we collect is necessarily unique: instead of storing the relationships between musicians and listeners, or only knowing about popular music, we compute and aggregate a sort of internet-scale cache of all possible points of information about a song, artist, release, listener or event. We began the company with the stated goal to index everything there is about music. And over these past five years we have built a series of products and technologies that take the best and most practical parts from our music retrieval dissertations and package them cleanly for our customers. We sell a music similarity system that compares two songs based on their acoustic and their cultural properties. We provide tempo, key and timbre data (automatically generated) to mobile applications and streaming services. We track artists’ “buzz” on the internet and sell reports to labels and managers.

The core of the Echo Nest remains true to our dogma: we strongly believe in the power of data to enable new music experiences. Since we crawl and index everything, we’re able to level the playing field for all types of musicians by taking advantage of the information given to us by any community on the internet. Work in music retrieval and understanding requires a sort of wide-eyed passion combined with a large dose of reality. The computer is never going to fully understand what music is about, but we can sample from the right sources and do it often enough and at a large enough scale that the only thing in our way is a leap of faith from the listener.