notes.variogr.am

My name is Brian Whitman. I am a lapsed scientist and sound artist currently co-founder/CTO at The Echo Nest, a music intelligence company in Somerville, MA. As I work on various scaling and media search problems with detours into art projects I'll be posting details here in the hopes that I can learn from others. I'd always like to hear from you if you are working on similar things.

May 28th, 2009 @ 3:36 pm

A list of things that are about parallel matrix storage and computation

I want a thing that stores sparse matrices over N computers and I can do math on those matrices.

My Dream API:

matrix = new(matrix_name, cols, rows)

matrix.put(col, row, value)  # or matrix.put_col(row, data), matrix.put_row(col, data)

value = matrix.get(col, row)

new_matrix = matrix.multiply(matrix_B)

matrix.transpose()

new_matrix = matrix.invert(iterations=0)

[U, S, V] = matrix.svd(iterations=0)

… etc. Why not use MATLAB hurf durf, right. The trick: the backend has to be some distributed thingy where I can boot new machines at any time to support both storage and compute. The matrix I have in my tiny brain is about 100K cols x 1bn rows, with about 1% nonzeros. And that’s just one of them. That’s more data than can fit on a computer and more compute than can run on a computer.
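To make the shape of that API concrete, here is a minimal single-process sketch in Python. It is only a stand-in: the class name `SparseMatrix` and the dict-of-dicts storage are assumptions made for illustration, nothing is distributed, and `invert` and `svd` are left out. It keeps the `(col, row)` argument order from the list above and treats `transpose()` as an in-place operation.

```python
class SparseMatrix:
    """Hypothetical single-machine stand-in for the dream API (nonzeros only)."""

    def __init__(self, name, cols, rows):
        self.name = name
        self.cols = cols
        self.rows = rows
        self._data = {}  # row index -> {col index -> value}

    def put(self, col, row, value):
        if not (0 <= col < self.cols and 0 <= row < self.rows):
            raise IndexError("(col, row) out of range")
        if value == 0:
            # Writing a zero just drops any stored entry.
            self._data.get(row, {}).pop(col, None)
        else:
            self._data.setdefault(row, {})[col] = value

    def get(self, col, row):
        # Anything not stored is an implicit zero.
        return self._data.get(row, {}).get(col, 0.0)

    def transpose(self):
        # In place: swap the roles of rows and columns.
        flipped = {}
        for i, row in self._data.items():
            for j, v in row.items():
                flipped.setdefault(j, {})[i] = v
        self._data = flipped
        self.cols, self.rows = self.rows, self.cols

    def multiply(self, other):
        # (rows x cols) times (other.rows x other.cols), touching nonzeros only.
        if self.cols != other.rows:
            raise ValueError("inner dimensions do not match")
        result = SparseMatrix(self.name + "*" + other.name, other.cols, self.rows)
        for i, row in self._data.items():
            for k, a in row.items():
                for j, b in other._data.get(k, {}).items():
                    result.put(j, i, result.get(j, i) + a * b)
        return result


# A = [[1, 0, 2], [0, 3, 0]] and B = [[4, 0], [0, 0], [0, 5]]
A = SparseMatrix("A", cols=3, rows=2)
A.put(0, 0, 1.0)
A.put(2, 0, 2.0)
A.put(1, 1, 3.0)
B = SparseMatrix("B", cols=2, rows=3)
B.put(0, 0, 4.0)
B.put(1, 2, 5.0)
C = A.multiply(B)
```

The hard part is everything this sketch skips: spreading `_data` over N machines and running `multiply` where the data lives.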

I would think this is something the world has been furiously working on in the four years since I last logged out of a lamboot’d terminal at MIT. It was pretty bad back then: you had to have the right Fortran compilers, you had to know what rcp did, set up your own NFS servers on FireWire drives, get BLACS working before CBLAS before ScaLAPACK, etc. etc. So when I fired up my searches this a.m. I was hoping to see a sea of rounded corners, drop shadows and product names with missing vowels all solving my problems. I was expecting Erlang implementations of Block Lanczos, maybe acts_as_golub. No dice.

My first stop was Hadoop, Doug Cutting’s post-Lucene MapReduce impl:

Next I tenderly looked up my old friends that use MPI. These get at the compute part but not the storage part.

Non-parallel eigenvalue problem “solvers” include:

So it looks like I am going to end up using MPI. Everything serious relies on it. Now I’ve got to figure out how to manage this without having to deal with configuration cruft. AMZN EC2-to-MPI bridges include:

Now, what about storage? Could I really hack up SLEPc or SVDLIBC to use our DHT key-value store to get at matrix values? Is that insane? I have no interest in putting 8TB of data on an EBS volume and NFS-exporting it to the rest of the cluster. And what, do I use an ASCII sparse matrix format
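For a rough feel of the key-value idea, here is a small Python sketch. It works out the storage arithmetic for the matrix described above and packs one sparse row into a binary blob stored under a per-row key. Everything in it is an assumption made for illustration: a plain dict stands in for the DHT, the key layout `name:row:N` is made up, and each nonzero is stored as a 4-byte column index plus an 8-byte double. None of this is SLEPc's or SVDLIBC's actual storage interface.

```python
import struct

# Sizes from the post: 100K columns, 1bn rows, about 1% nonzeros.
COLS = 100_000
ROWS = 1_000_000_000
nonzeros = COLS * ROWS // 100      # 1% of all cells
value_bytes = nonzeros * 8         # assuming 8-byte doubles, values only
pair_bytes = nonzeros * (4 + 8)    # assuming a uint32 column index per value

ENTRY = "<Id"  # little-endian uint32 column + float64 value, 12 bytes, no padding


def row_key(matrix_name, row):
    # Hypothetical key layout: one key per matrix row.
    return f"{matrix_name}:row:{row}"


def pack_row(entries):
    """Serialize {col: value} into consecutive (col, value) records, sorted by col."""
    out = bytearray()
    for col in sorted(entries):
        out += struct.pack(ENTRY, col, entries[col])
    return bytes(out)


def unpack_row(blob):
    """Inverse of pack_row: bytes back to {col: value}."""
    return {col: value for col, value in struct.iter_unpack(ENTRY, blob)}


def put_row(store, matrix_name, row, entries):
    store[row_key(matrix_name, row)] = pack_row(entries)


def get_row(store, matrix_name, row):
    blob = store.get(row_key(matrix_name, row))
    return unpack_row(blob) if blob is not None else {}


store = {}  # plain dict standing in for the key-value store
put_row(store, "m", 7, {3: 1.5, 99_999: -2.0})
row = get_row(store, "m", 7)
```

Under those assumptions the values alone come to 8 × 10^12 bytes, which lines up with the 8TB figure, and the column indices add roughly half as much again. A solver would then need its matrix-vector product rewritten to pull rows through something like `get_row` instead of reading a local file.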