Saturday, September 17, 2011

Naive Monte Carlo and Map Reduce

Think about map reduce and Monte Carlo methods. Specifically, think about the prototypical application of Monte Carlo methods: measuring a volume defined by a function. In this application one samples a set of points and counts what proportion fall within the volume. The simplest case is to sample uniformly over some bounding rectangle whose volume is known; the measured volume is then this known volume times the calculated proportion. The Wikipedia article is pretty good: http://en.wikipedia.org/wiki/Monte_Carlo_integration
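Concretely, here's a minimal sketch of the naive approach in Python. The names (in_unit_ball, estimate_volume) are mine, and the unit ball is just a stand-in for whatever volume the indicator function defines:

```python
import random

def in_unit_ball(point):
    """Indicator: is the point inside the unit ball?"""
    return sum(x * x for x in point) <= 1.0

def estimate_volume(indicator, bounds, n_samples, rng=random):
    """Naive Monte Carlo: volume ~= (fraction of hits) * (bounding box volume)."""
    box_volume = 1.0
    for lo, hi in bounds:
        box_volume *= hi - lo
    hits = 0
    for _ in range(n_samples):
        point = [rng.uniform(lo, hi) for lo, hi in bounds]
        hits += indicator(point)
    return box_volume * hits / n_samples

# Volume of the unit ball in 3 dimensions (exact value: 4*pi/3, about 4.19).
print(estimate_volume(in_unit_ball, [(-1.0, 1.0)] * 3, 100_000))
```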

Naive Monte Carlo is a one-shot deal. The sampling distribution doesn't change over time and, in particular, doesn't depend on any prior samples. This would seem perfect for map reduce. You have some source that generates random points uniformly over the bounding rectangle. The mapping function is the indicator function for the volume being measured. The reducer is the mean. Let's consider each of these in turn.
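Factored that way, the same computation looks something like the sketch below. It still runs in a single process, and the names are again my own; the point is just the shape of the pipeline:

```python
import random

def source(n_samples, bounds, rng=random):
    """Source: a generator of uniform points over the bounding rectangle."""
    for _ in range(n_samples):
        yield tuple(rng.uniform(lo, hi) for lo, hi in bounds)

def mapper(point):
    """Map: indicator for the unit ball; emits a single bit per point."""
    return 1 if sum(x * x for x in point) <= 1.0 else 0

def reducer(bits):
    """Reduce: the mean of the mapped bits."""
    total, count = 0, 0
    for b in bits:
        total += b
        count += 1
    return total / count

bounds = [(-1.0, 1.0)] * 3
box_volume = 8.0  # volume of [-1, 1]^3
proportion = reducer(map(mapper, source(100_000, bounds)))
print(box_volume * proportion)
```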

In a more standard map reduce application the source would be some file or set of data. Here, though, it's a generator. The generator is generating a sequence of n-dimensional random vectors. Unless the dimensionality is very large the bandwidth requirements are probably pretty low, although that must really be considered relative to the time required to evaluate the indicator function. Random number generation can be relatively CPU intensive, so we should probably distribute the generation itself. It's not exactly clear how to do this, especially if the number of live remote nodes varies over time, or if there is some variability in the time required to evaluate the indicator function. I don't know how to do this ... yet, but I think it's interesting.
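If the set of workers were fixed up front, one plausible approach is to spawn statistically independent streams from a single root seed, e.g. with numpy's SeedSequence. This is only a sketch: the "workers" here are a plain loop, and it doesn't address the harder case where nodes come and go:

```python
from numpy.random import SeedSequence, default_rng

def worker(seed_seq, n_samples, dim=3):
    """One worker: draws its own points locally and returns (hits, count)."""
    rng = default_rng(seed_seq)
    points = rng.uniform(-1.0, 1.0, size=(n_samples, dim))
    hits = int(((points ** 2).sum(axis=1) <= 1.0).sum())
    return hits, n_samples

# One independent child seed per worker, all derived from a single root seed.
n_workers = 4
child_seeds = SeedSequence(12345).spawn(n_workers)

# In practice each call would run on a remote node; here they just run in sequence.
partials = [worker(s, 25_000) for s in child_seeds]
hits = sum(h for h, _ in partials)
count = sum(c for _, c in partials)
print(8.0 * hits / count)  # bounding box volume times the hit fraction
```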

The indicator function is a pretty standard map operation. Nothing too interesting here, at least until we consider sampling from a distribution other than uniform, in which case the map would emit a weight rather than a bare indicator. Either way, the one interesting thing is that the output of this function is small: a single bit (or a single float).
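As a sketch of the non-uniform case: if each coordinate were drawn from, say, a triangular distribution on [-1, 1] (an arbitrary choice, just easy to write down), the map would emit indicator / density, and the mean of those weights estimates the volume directly:

```python
import random

def q_density(point):
    """Density of the sampler: each coordinate is triangular on [-1, 1],
    with per-coordinate density 1 - |x|."""
    d = 1.0
    for x in point:
        d *= 1.0 - abs(x)
    return d

def weighted_mapper(point):
    """Map under non-uniform sampling: emit indicator(point) / density(point),
    a float weight instead of a single bit; the reduction is still a mean."""
    inside = sum(x * x for x in point) <= 1.0
    return 1.0 / q_density(point) if inside else 0.0

rng = random.Random(0)
n = 200_000
points = ([rng.triangular(-1.0, 1.0, 0.0) for _ in range(3)] for _ in range(n))
print(sum(map(weighted_mapper, points)) / n)  # roughly 4.19 for the 3-d unit ball
```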

The reduction operation is also pretty straightforward, but some observations are worth making. The inputs and outputs of the reducer are small, as is its state. The mean, viewed as a binary operation, is commutative but not associative, and associativity is really what's required to distribute the reduction. But the mean is just sum / count, and both sum and count are associative and commutative, so we really want to reduce to the pair (sum, count) and only calculate the mean at the end.
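Something like this, where the partial (sum, count) pairs are made-up numbers standing in for what each remote node would report:

```python
from functools import reduce

def combine(a, b):
    """Associative, commutative combine of two (sum, count) partial results."""
    return (a[0] + b[0], a[1] + b[1])

# One (hits, samples) pair per node; because combine is associative and
# commutative they can be merged in any order and any grouping.
partials = [(5236, 10_000), (5180, 10_000), (5301, 10_000)]

total, count = reduce(combine, partials, (0, 0))
print(8.0 * total / count)  # the mean is computed only once, at the very end
```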

All in all, we have something that never needs the filesystem and has minimal bandwidth requirements. The challenges are in how we distribute the generator and how we distribute the reduction.

This problem should be a perfect fit for Numerate's distributed infrastructure, which we're about to open source.
