My name is Brian Whitman. I am a lapsed scientist and sound artist currently co-founder/CTO at The Echo Nest, a music intelligence company in Somerville, MA. As I work on various scaling and media search problems with detours into art projects I'll be posting details here in the hopes that I can learn from others. I'd always like to hear from you if you are working on similar things.
(Part 1 of many from my Fake Solr Consultancy Service, which I plan to call Lexatexy or BARISMO. Oh, and I will only consult for coffee companies or yacht club membership sites.) In my dream world LEXATEXY does not do actual work; he says “Hm” a lot and is not under any deadlines or commitments. He is the text retrieval kin of Steve Miller’s titular “Joker.”
YOUR COMMIT RULE OF THUMB
LEXATEXY: Add a single test document to your index. Commit. Does it take longer than 10s?
Happy customer: No, it took 10 minutes.
LEXATEXY: Hm. How many documents do you have?
Happy customer: 10 million.
LEXATEXY: That’s a lot for a single index, but it still shouldn’t take 10 minutes. Do you ever optimize?
Happy customer: I tried that once and I couldn’t query for four hours.
LEXATEXY: Rsync your data somewhere else, boot a new server on it, optimize it there, then sync it back.
Happy customer: Wow, that is annoying. And hey, when I do it I get merge exceptions.
LEXATEXY: Yeah, that sucks but now your commits are better I bet. I bet what happened is the server crashed and restarted while you were adding lots of documents, and you have duplicates and the only way out is to re-index. (Did you know this bug is fixed as of Solr 1.3? That wasn’t my problem today.) You can try the java org.apache.lucene.index.CheckIndex tool with -fix on. Then you should be fine for a while. Until it happens again.
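If you want to try LEXATEXY's rule of thumb yourself, here is a rough sketch of the "add one document, commit, time it" check. It assumes things the conversation above never specifies: a Solr server at `http://localhost:8983/solr/update`, a schema whose unique key field is named `id`, and the plain XML update messages (`<add>`, `<commit/>`). The names `post_xml`, `time_commit`, and the test document id are made up for illustration, so adjust all of it to your own setup.

```python
import time
import urllib.request

# Assumption: a single-core Solr listening here; change to match your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

# Assumption: the schema's unique key field is called "id".
TEST_DOC = '<add><doc><field name="id">lexatexy-commit-test</field></doc></add>'


def post_xml(body, url=SOLR_UPDATE_URL):
    """POST an XML update message to Solr and return the raw response bytes."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def time_commit(post=post_xml, threshold_s=10.0):
    """Add one test document, then time the commit on its own.

    Returns (elapsed_seconds, too_slow), where too_slow is True when the
    commit took longer than threshold_s (10 seconds, per the rule of thumb).
    """
    post(TEST_DOC)                 # the add itself is not timed
    start = time.perf_counter()
    post("<commit/>")              # only the commit is timed
    elapsed = time.perf_counter() - start
    return elapsed, elapsed > threshold_s
```

Calling `time_commit()` with no arguments talks to the server at `SOLR_UPDATE_URL` and leaves the test document in the index, so you may want to delete it afterwards. The `post` parameter exists only so the function can be exercised without a live server.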