Tuesday, January 17, 2012

Monte Carlo beyond Map Reduce

I've written a couple of posts here and here about Monte Carlo methods and Map Reduce. Both have considered naive Monte Carlo, in the sense that the sampling distribution is fixed and does not depend on any prior evaluations. However, more sophisticated Monte Carlo methods do change the sampling distribution over time. Again, the Wikipedia article is pretty good: http://en.wikipedia.org/wiki/Monte_Carlo_integration. In these cases we iterate over the kinds of map reduce evaluations discussed previously, updating the distribution each iteration. The iterative nature of these methods makes them more latency sensitive than the typical MapReduce task. If we are performing many iterations, each iteration must not incur a significant start-up delay; otherwise the total time may be dominated by that delay. Again, this is a setting in which Numerate's distributed framework should shine.
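To make that concrete, here's a minimal single-machine sketch in Python of adaptive importance sampling, with a toy integrand and a simple Gaussian proposal of my own choosing (nothing Numerate-specific). The sampling-and-evaluation step inside the loop is what each map reduce round would distribute; the proposal update at the end of the round is the feedback that makes each round depend on the previous one.

    import numpy as np

    def f(x):
        # Hypothetical integrand standing in for an expensive model evaluation.
        return np.exp(-x ** 2)

    def adaptive_mc(n_iters=10, n_per_iter=10_000, seed=0):
        rng = np.random.default_rng(seed)
        mu, sigma = 0.0, 5.0  # deliberately poor initial Gaussian proposal
        for _ in range(n_iters):
            # "Map": draw from the current proposal and evaluate the integrand.
            # In a distributed run each worker would take a slice of these samples.
            x = rng.normal(mu, sigma, n_per_iter)
            q = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
            w = f(x) / q  # importance weights

            # "Reduce": aggregate the weighted samples into the current estimate
            # of the integral of f (about sqrt(pi) here).
            estimate = w.mean()

            # Feedback: refit the proposal to the weighted samples before the
            # next round. This dependence is why each round must wait for the
            # previous one, and why per-iteration start-up cost matters.
            p = w / w.sum()
            mu = float(np.sum(p * x))
            sigma = float(max(np.sqrt(np.sum(p * (x - mu) ** 2)), 1e-3))
        return estimate

    print(adaptive_mc())  # roughly 1.77, i.e. sqrt(pi)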

Thursday, January 12, 2012

More Monte Carlo and Map Reduce

In a previous post I discussed naive Monte Carlo and Map Reduce. I observed that the key challenges were how to distribute the random number generator and the reducer. Here I'd like to look at these issues a little more closely, specifically at how they become more challenging with even slightly more advanced Monte Carlo methods.

Suppose our method no longer uses samples chosen uniformly, but rather samples drawn from some other distribution. There are many ways of transforming a uniform distribution into another one. Some of them require additional samples, and some even consume a random number of them. So the question arises: what are we distributing to each node? A fixed number of samples, or a fixed number of evaluations? Either way, how do we deal with the variable number of samples?
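Rejection sampling is the simplest example: each accepted sample consumes a random number of uniform draws, so neither the raw random-number count nor the work per sample is known in advance. A tiny Python sketch (the truncated-normal target is purely illustrative):

    import math
    import random

    def sample_truncated_normal(rng):
        # Rejection sampling for a standard normal restricted to [0, 1]:
        # propose uniformly on [0, 1] and accept with probability exp(-x^2 / 2).
        # The number of uniform draws consumed before acceptance is itself
        # random, which is exactly the bookkeeping problem described above.
        draws = 0
        while True:
            x = rng.random()
            u = rng.random()
            draws += 2
            if u <= math.exp(-x * x / 2):
                return x, draws

    rng = random.Random(0)
    print([sample_truncated_normal(rng) for _ in range(5)])  # varying draw counts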

Suppose further that we modify our method to also maintain an estimate of its accuracy and to stop early once that estimate is good enough. Now the mapping onto Map Reduce breaks down somewhat. Map Reduce (as I understand it) consists of three distinct phases. In the first phase the data is scattered to a number of evaluation nodes. In the second phase these nodes evaluate some operations on the data. In the third phase the results are gathered and aggregated by the reducer. Typically these phases are conceptually distinct. Here, though, they overlap; in fact, the scatter phase isn't fully specified until we see some incremental results from the reducer. I haven't looked at what it would take to do this in Hadoop, but I'm guessing it would be somewhat painful.
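Here's a rough single-machine sketch of that overlap in Python, using a thread pool as a stand-in for remote map tasks (the toy integrand, batch size, and tolerance are all arbitrary). The reducer folds in partial results as they arrive and only schedules further batches while its running error estimate is still too large, so the scatter is driven by incremental reduce results:

    import math
    import random
    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def map_task(seed, n=50_000):
        # "Map" phase: one batch of evaluations, returning only partial sums.
        rng = random.Random(seed)
        s = s2 = 0.0
        for _ in range(n):
            y = math.sqrt(1.0 - rng.random() ** 2)  # toy integrand (quarter circle)
            s += y
            s2 += y * y
        return n, s, s2

    def estimate(tol=2e-4, in_flight=4):
        # "Reduce" phase, interleaved with scatter: accumulate partial sums as
        # they arrive and request another batch only while the running error
        # estimate is still above the tolerance.
        n_tot = s_tot = s2_tot = 0.0
        seed = in_flight
        with ThreadPoolExecutor(max_workers=in_flight) as pool:
            pending = {pool.submit(map_task, i) for i in range(in_flight)}
            while pending:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for fut in done:
                    n, s, s2 = fut.result()
                    n_tot, s_tot, s2_tot = n_tot + n, s_tot + s, s2_tot + s2
                mean = s_tot / n_tot
                stderr = math.sqrt(max(s2_tot / n_tot - mean * mean, 0.0) / n_tot)
                if stderr > tol:
                    pending.add(pool.submit(map_task, seed))
                    seed += 1
        return 4.0 * mean, 4.0 * stderr  # estimate of pi and its standard error

    print(estimate())

Note that when the estimate crosses the threshold there may still be batches in flight; their results simply get folded in too, which anticipates the point below.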

This also gives rise to a more general question: how do you express this concept in a way that doesn't destroy opportunities for parallelism? You can't just loop until accuracy > x, because that doesn't capture the fact that any value of accuracy > x is sufficient, not just the first one reached. Typically, looping until variable > x means looping until the condition first becomes true, so every time we evaluate the loop body we must also evaluate the guard, which in general makes the loop sequential. In the Monte Carlo case any value > x is fine, i.e., we can do extra iterations of the loop, and so the loop is really parallel. How do we express this generally? What is the general case outside of Monte Carlo?
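One way to exploit that freedom, sketched here under the same toy setup as above, is to check the guard once per round rather than once per loop iteration: each round launches a block of batches in parallel, tests the accumulated accuracy afterwards, and doubles the round size if it falls short. The last round may overshoot the target accuracy, but that only wastes some work.

    import math
    import random
    from concurrent.futures import ThreadPoolExecutor

    def batch(seed, n=50_000):
        # One independent batch; running extra batches never invalidates the result.
        rng = random.Random(seed)
        vals = [math.sqrt(1.0 - rng.random() ** 2) for _ in range(n)]
        return sum(vals), sum(v * v for v in vals), n

    def estimate(tol=2e-4, k=4):
        # Guard is evaluated per round, not per batch: run k batches in
        # parallel, test the accumulated accuracy, and double k if it is
        # still insufficient.
        s = s2 = n = 0.0
        seed = 0
        with ThreadPoolExecutor() as pool:
            while True:
                for bs, bs2, bn in pool.map(batch, range(seed, seed + k)):
                    s, s2, n = s + bs, s2 + bs2, n + bn
                seed += k
                mean = s / n
                stderr = math.sqrt(max(s2 / n - mean * mean, 0.0) / n)
                if stderr <= tol:
                    return 4.0 * mean, 4.0 * stderr
                k *= 2

    print(estimate())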