I manage tech development at Numerate, where we use a lot of compute to analyze molecules. We've been using hundreds of cores to do this for years now, since ~2001, i.e., before the rise of Hadoop and other distributed frameworks. Because we started early, we had to develop our own infrastructure. It was clear that the cloud was coming, and we were not enjoying managing our own hardware, but it wasn't clear what the cloud would look like, so we had to design something with minimal constraints. The upshot is that we built something interesting. It's different from Hadoop in a few ways. Perhaps the most important is that, in contrast with Hadoop, it is designed for low IO:CPU ratios (lots of computation per byte moved) and for low latency. In fact, I think of Hadoop as a bandwidth platform rather than a compute platform.
Anyway, we've decided to open source this platform and are thinking about how to do it. I've been reading http://producingoss.com/ and found it quite helpful. The biggest challenge, though, is to distinguish what we've done from what other people have done and from what Hadoop does well. Why would people use our platform and not Hadoop? In particular, what's a good example application, other than our own, which we can't open source?
This turns out to be somewhat challenging to lay out succinctly.
I'm not going to get into all of the details here; we hope to get this up and running in a few weeks, so the arguments will be laid out then. The upshot, though, is that Monte Carlo methods and/or particle filters may be good example problems.
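To give a feel for why Monte Carlo fits the low IO:CPU profile, here is a minimal Python sketch. It is not our platform's API; the names `count_hits` and `estimate_pi` are made up for illustration, and estimating pi is just a stand-in for a real workload. Each task receives only a seed and a sample count, does all of its work on the CPU, and returns a single integer.

```python
import random


def count_hits(task):
    """Run one independent Monte Carlo task.

    Input is a tiny (seed, n_samples) tuple and output is one integer,
    so almost all of the cost is computation rather than data movement.
    """
    seed, n_samples = task
    rng = random.Random(seed)  # per-task generator, so tasks are independent
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            hits += 1
    return hits


def estimate_pi(n_tasks=8, n_samples=100_000, map_fn=map):
    """Fan tasks out through map_fn and combine the small results.

    map_fn defaults to the built-in serial map; any function with the
    same (function, iterable) shape could be substituted.
    """
    tasks = [(seed, n_samples) for seed in range(n_tasks)]
    total_hits = sum(map_fn(count_hits, tasks))
    return 4.0 * total_hits / (n_tasks * n_samples)


if __name__ == "__main__":
    print(estimate_pi())
```

Because the tasks share nothing, the serial `map` could be swapped for a parallel one, for example the `map` method of a `concurrent.futures.ProcessPoolExecutor`, without touching `count_hits`. A particle filter is a less trivial case, since the particles have to be reweighted and resampled together between steps.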
Saturday, September 17, 2011