web statistics

notes.variogr.am

My name is Brian Whitman. I am a lapsed scientist and sound artist currently co-founder/CTO at The Echo Nest, a music intelligence company in Somerville, MA. As I work on various scaling and media search problems with detours into art projects I'll be posting details here in the hopes that I can learn from others. I'd always like to hear from you if you are working on similar things.

Jan 3rd, 2009 @ 1:40 pm

Distributing Solr Nicely

Sometime in early 08 it became clear to LEXATEXY that our biggest client’s solr servers had too many documents on them. [Rule of thumb for “how many documents is too many?”: 15 million. I can upgrade, you bargain. What if I have a super-machine bought with my recently closed VC round? 

10 million documents ONLY

15 million.]

You’ve already got a pretty reliable thing handling 15 million documents. It can sort, facet, do autocomplete, metaphone spelling search, it uses an ID scheme you thought was clever, it doesn’t restart itself manually with OOM KILLER messages. It has a URL you thoughtfully CNAMEd to solr.mycompanyname.com. So now you’re telling me it’s time for another server? Thanks a lot, eschatology.

SOLR-303 has been around since mid-07. It wasn’t really ready at all for anyone to use until mid-08, and then, of course, LEXATEXY snapped up on it like a zamboni running on fumes. We’ve used it to successfully go from 1 million to 20 million to then space for 100 million (but straining hard.) Like the rest of Solr, it disarmingly just works. You can set your shards in your solrconfig or at query time (we just wrote it into our python client.) Queries can hit any random shard (even one that is not in the shards list!) and the shard will then ask each server in the shards param for the answer. The host you asked initially will do the query blending and return your result (hopefully with the shard the doc came from in the result). Sort and range queries work OOTB. Big queries with start and rows set are doomed as to ensure you get all results back it sets start to 0 across all shards and rows to start. (You can hack around this.)

If it’s so awesome, why doesn’t LEXATEXY marry it?

I have received some good advice about how to fix some of these problems. They are still in thought phase, but it comes down to a few little things.

  1. Move our doc -> shard map to a consistent hash: see libketama @ last.fm for example (and reaffirming proof that our stupid decisions were also made in larger more inspiring places.) This lessens but does not remove the pain of adding/removing shards.
  2. Query blend on our own when necessary. This is not so scary. You’ll get results immediately (even if they won’t be the top results) and you can optimize for the kinds of things you want to do without getting stuck waiting on the last answer or perhaps a dead host.
  3. Please figure out how to reliably instance and image a solr server across a group of machines. Something has to register them for the snapshooting stuff and for the shards list. Nagios can ping them via JMX but that’s not enough.

I hope people that deal with large Solr and Lucene indexes somehow get pointed here so they can laugh and swap stories. If anyone wants to chime in, I’ll give you some column inches to school us all. But failing that you’re all about to see a few guys and their paid handholders (METALEXATEXY) flail around trying to figure this out for real over the next few months.

Archive · RSS · Theme by Novembird