In a previous post I discussed Naive Monte Carlo and Map Reduce. I observed that the key challenges were how to distribute the random number generator and the reducer. Here I'd like to look at these issues a little more, specifically I'd like to look at how they become more challenging with even slightly more advanced Monte Carlo methods.
Suppose our method no longer uses samples chosen uniformly, but rather that they're from some other distribution. There are many ways of transforming a uniform distribution into another one. Some of them require additional sampling, some even require a random number of samples. So the question now arises, what are we distributing to each node? A fixed number of samples, or a fixed number of evaluations. Either way how do we deal with the variable number of samples?
Suppose further that we modify our method to also maintain an estimate of its accuracy and to stop early once the estimated accuracy is good enough. Now the mapping to Map Reduce breaks somewhat. Map reduce (as I understand it) consists of three distinct phases. In the first phase the data is scattered to a number of evaluation nodes. In the second phase these nodes evaluate some operations on the data. In the third phase the results are gathered and aggregated by the reducer. Typically these phases are conceptually distinct. Here though they overlap, in fact, the scatter phase isn't fully specified until we see some incremental results from the reducer. I haven't looked at what it would take to do this in Hadoop, but I'm guessing it would be somewhat painful.
This also gives rise to a more general question of how you express this concept in a general way that doesn't destroy opportunities for parallelism. You can't just loop until accuracy > x. This doesn't express the notion that any value for accuracy > x is sufficient rather than just the first. Typically, looping until variable > x implies that we loop until this is first true. This means every time we evaluate the loop body we have to evaluate the guard which in general makes the loop sequential. In the Monte Carlo case any value > x is fine, i.e., we can do extra iterations of the loop and so the loop is really parallel. How do we express this generally? What is the general case outside of Monte Carlo?
Thursday, January 12, 2012
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment