Saturday, September 17, 2011

Naive Monte Carlo and Map Reduce

Think about map reduce and Monte Carlo methods. Specifically, think about the prototypical application of Monte Carlo methods: measuring a volume defined by a function. In this application one samples a set of points and counts what proportion fall within the volume. The simplest case is to sample uniformly over some bounding rectangle whose volume is known; the measured volume is then this known volume times the calculated proportion. The Wikipedia article is pretty good: http://en.wikipedia.org/wiki/Monte_Carlo_integration
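To make the setup concrete, here's a minimal single-machine sketch in Python (the names are mine, not from any framework): sample uniformly in a bounding box, count what fraction of points the indicator accepts, and scale by the box volume.

```python
import random

def estimate_volume(indicator, bounds, n_samples, seed=0):
    """Naive Monte Carlo: sample uniformly in a bounding box and
    count what proportion of points fall inside the volume."""
    rng = random.Random(seed)
    box_volume = 1.0
    for lo, hi in bounds:
        box_volume *= hi - lo
    hits = sum(
        indicator([rng.uniform(lo, hi) for lo, hi in bounds])
        for _ in range(n_samples)
    )
    # Measured volume = known box volume * proportion of hits.
    return box_volume * hits / n_samples

# Volume of the unit sphere in 3D; the true value is 4/3 * pi.
in_sphere = lambda p: sum(x * x for x in p) <= 1.0
vol = estimate_volume(in_sphere, [(-1.0, 1.0)] * 3, 100000)
```

Nothing here is specific to map reduce yet; the point of the rest of the post is that the sampling loop, the indicator, and the averaging can each be pulled apart and distributed.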

Naive Monte Carlo is a one-shot deal. The sampling distribution doesn't change over time and, specifically, doesn't depend on any prior samples. This would seem perfect for map reduce. You have some source that generates random numbers uniformly over the bounding rectangle. The mapping function is the indicator function for the volume being measured. The reducer is the mean. Let's consider each of these in turn.

In a more standard map reduce application the source would be some file or set of data. Here, though, it's a generator. The generator is generating a sequence of n-dimensional random vectors. Unless the dimensionality is very large the bandwidth requirements are probably pretty low, although that must really be considered relative to the time required to evaluate the indicator function. Random number generation can be relatively CPU intensive, so we should probably distribute that generation. It's not exactly clear how to do this, especially if the number of live remote nodes varies over time, or if there is some variability in the time required to evaluate the indicator function. I don't know how to do this ... yet, but I think it's interesting.
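The easy half of the problem, giving each worker its own stream when the set of workers is fixed and known up front, can at least be sketched. All names below are hypothetical, and this deliberately ignores the hard cases raised above (nodes joining or leaving, uneven indicator cost): each worker derives its own generator from a shared base seed plus its worker id.

```python
import random

def worker_stream(worker_id, dim, base_seed=12345):
    """One independent-looking uniform stream per worker, derived
    from a shared base seed and the worker's id. Sketch only:
    assumes a fixed, known set of workers, and says nothing about
    the statistical quality of seed-derived stream independence."""
    # Seeding with a distinct string per worker gives each worker a
    # distinct, reproducible stream.
    rng = random.Random("%d-%d" % (base_seed, worker_id))
    while True:
        yield [rng.uniform(-1.0, 1.0) for _ in range(dim)]

# Each worker draws from its own stream; no coordination needed.
stream0 = worker_stream(0, dim=3)
stream1 = worker_stream(1, dim=3)
point = next(stream0)
```

With a static pool this sidesteps shipping random vectors over the network entirely; the open question in the paragraph above, rebalancing when the pool is dynamic, is exactly what this sketch does not address.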

The indicator function is a pretty standard map operation. Nothing too interesting here. At least until we consider sampling from a distribution other than uniform. The one interesting thing is that the output of this function is small, i.e., a single bit.

The reduction operation is also pretty straightforward. Some observations are worth making though. The inputs and outputs of the reducer are small, as is its state. The reducer, the mean, is commutative but not associative, and associativity is really what's required in order to distribute the reduction. But the mean is just sum / count, and both sum and count are associative and commutative, so we really want to reduce to the pair (sum, count) and just calculate the mean at the end.
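A sketch of that (sum, count) trick, with names of my own choosing: because the combine step is associative and commutative, partial results can be merged in any grouping and any order across nodes, and the mean is computed once at the very end.

```python
def map_one(indicator, point):
    """Map: the indicator bit, paired with a count of one."""
    return (1 if indicator(point) else 0, 1)

def combine(a, b):
    """Reduce: merge two (sum, count) pairs. Associative and
    commutative, so partial reductions can happen on any node
    in any order."""
    return (a[0] + b[0], a[1] + b[1])

def mean(pair):
    """Final step, applied once after all merging is done."""
    s, n = pair
    return s / n

# Two "nodes" each reduce their own samples, then merge:
in_interval = lambda p: abs(p) <= 0.5
node_a = combine(map_one(in_interval, 0.1), map_one(in_interval, 0.9))
node_b = combine(map_one(in_interval, -0.2), map_one(in_interval, 2.0))
result = mean(combine(node_a, node_b))  # 2 hits out of 4 -> 0.5
```

Taking the mean of per-node means would go wrong as soon as the nodes process different numbers of samples; carrying the count along is what makes the merge order irrelevant.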

All in all we have something that doesn't ever need the filesystem and has minimal bandwidth requirements. The challenges are with regard to how we distribute the generator and how we distribute the reduction.

This problem is one that should be perfect for Numerate's distributed infrastructure that we're about to open source.

Considering Open Sourcing some code

I manage tech development at Numerate, where we use a lot of compute to analyze molecules. We've been using 100s of cores to do this for years now, since ~2001, i.e., before the rise of Hadoop and other distributed frameworks. Because we did this early we had to develop our own infrastructure. Because it was clear that the cloud was coming, and we were not enjoying managing our own hardware, but it wasn't yet clear what form the cloud would take, we had to design something with minimal constraints. The upshot is that we built something interesting. It's different from Hadoop in a few ways; perhaps the most important is that, in contrast with Hadoop, it is designed for low IO:CPU ratios and for low latency. In fact, I think of Hadoop as a bandwidth platform rather than a compute platform.

Anyway, we've decided to open source this platform and are thinking about how to do this. I've been reading http://producingoss.com/ and found it quite helpful. The biggest challenge though is to distinguish what we've done from what other people have done and from what Hadoop does well. Why would people use our platform and not Hadoop? In particular, what's a good example application -- other than ours which we can't open source?

This turns out to be somewhat challenging to lay out succinctly.

I'm not going to get into all of the details here; we hope to get this up and running in a few weeks, so the arguments will be laid out then. The upshot, though, is that Monte Carlo methods and/or particle filters may be good example problems.

My second look at Scala

Second impressions are not so good, I'm afraid. Unfortunately Scala is full of power tools, so it's very easy to hurt yourself or others. Take operator overloading, for example.

I like operator overloading when the operator itself carries sufficient context to understand what's going on, e.g., + means addition, and addition has certain properties that are well understood. What does >> mean? Frankly, I have no idea. So how is naming a function >> helpful? I've found it pretty useful for functions to have meaningful names; sometimes a name is the only documentation you get, and good names can lead to self-documenting code. But >> is not helpful.

Implicit type conversions are great, if used carefully. But they give me nightmares of completely illegible code. Code that you cannot read outside of an IDE. Code that requires context sensitive help just to know what type you have. Mixins are similarly scary.

All of these feel like ways that inexperienced developers could destroy the productivity of experienced developers by writing unreadable code, and by implicitly modifying the behavior of code written by others.

Anyway, the upshot for me is that Scala still seems very interesting but absent a set of coding conventions I would be terrified of using it for a big project. I wonder if such a set of conventions exists.