Tuesday, January 17, 2012

Monte Carlo beyond Map Reduce

I've written a couple of posts here and here about Monte Carlo methods and Map Reduce. Both have considered naive Monte Carlo, in the sense that the sampling distribution is fixed and does not depend on any prior evaluations. However, more sophisticated Monte Carlo methods do change the sampling distribution over time. Again, the Wikipedia article is pretty good: http://en.wikipedia.org/wiki/Monte_Carlo_integration. In these cases we iterate over the kinds of map reduce evaluations discussed previously, updating the distribution each iteration. The iterative nature of these methods makes them more latency sensitive than the typical MapReduce task. If we are performing many iterations, each iteration must not incur a significant start-up delay; otherwise the total time may be dominated by this delay. Again, this is a setting in which Numerate's distributed framework should shine.
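To make the iterate-then-update structure concrete, here's a minimal sketch of adaptive importance sampling. Each round is one "map-reduce" pass: sample from the current proposal (map), aggregate the weighted results (reduce), then refit the proposal from the aggregates before the next round. The target density, function names, and update rule are all my own choices for illustration, not from any particular framework; the target exp(-(x - 3)^2 / 2) has known integral sqrt(2*pi), so we can check the answer.

```python
import math
import random

def adaptive_is(rounds=5, n=20_000, seed=0):
    """Estimate the integral of f by importance sampling, refitting the
    Gaussian proposal to the weighted sample moments after each round."""
    rng = random.Random(seed)
    f = lambda x: math.exp(-0.5 * (x - 3.0) ** 2)
    mu, sigma = 0.0, 3.0  # deliberately poor initial proposal
    for _ in range(rounds):
        q = lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        xs = [rng.gauss(mu, sigma) for _ in range(n)]   # "map": draw and evaluate
        ws = [f(x) / q(x) for x in xs]                  # importance weights
        z = sum(ws) / n                                 # "reduce": current integral estimate
        # The same reduce pass yields the moments used to refit the proposal:
        wsum = sum(ws)
        mu = sum(w * x for w, x in zip(ws, xs)) / wsum
        var = sum(w * (x - mu) ** 2 for w, x in zip(ws, xs)) / wsum
        sigma = max(math.sqrt(var), 0.1)
    return z

estimate = adaptive_is()
```

The point is the shape of the loop: each round can be distributed exactly like naive Monte Carlo, but the rounds themselves are sequential, which is what makes per-round start-up latency so costly.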

Thursday, January 12, 2012

More Monte Carlo and Map Reduce

In a previous post I discussed Naive Monte Carlo and Map Reduce. I observed that the key challenges were how to distribute the random number generator and the reducer. Here I'd like to look at these issues a little more closely; specifically, at how they become more challenging with even slightly more advanced Monte Carlo methods.

Suppose our method no longer uses samples chosen uniformly, but rather samples from some other distribution. There are many ways of transforming a uniform distribution into another one. Some of them require additional sampling; some even require a random number of samples. So the question now arises: what are we distributing to each node? A fixed number of samples, or a fixed number of evaluations? Either way, how do we deal with the variable number of samples?
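Rejection sampling is the simplest example of the problem. Here's a minimal sketch (the density and function names are mine, chosen for illustration): each accepted sample consumes a random number of uniform draws, which is exactly what makes a fixed "sample budget" per node awkward to define.

```python
import random

def rejection_sample(rng, pdf, pdf_max):
    """Draw one sample from a density on [0, 1] by rejection: propose
    uniformly, accept with probability pdf(x) / pdf_max. The number of
    uniform draws consumed is itself random."""
    draws = 0
    while True:
        x = rng.uniform(0.0, 1.0)
        u = rng.uniform(0.0, pdf_max)
        draws += 2
        if u <= pdf(x):
            return x, draws

rng = random.Random(1)
pdf = lambda x: 2.0 * x  # triangular density on [0, 1]; E[X] = 2/3
samples = [rejection_sample(rng, pdf, 2.0)[0] for _ in range(10_000)]
mean = sum(samples) / len(samples)
```

If each node is handed "10,000 samples worth" of uniform random numbers, some nodes will run out before producing 10,000 accepted samples and others will have numbers left over.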

Suppose further that we modify our method to also maintain an estimate of its accuracy and to stop early once the estimated accuracy is good enough. Now the mapping to Map Reduce breaks somewhat. Map Reduce (as I understand it) consists of three distinct phases. In the first phase the data is scattered to a number of evaluation nodes. In the second phase these nodes evaluate some operation on the data. In the third phase the results are gathered and aggregated by the reducer. Typically these phases are conceptually distinct. Here, though, they overlap; in fact, the scatter phase isn't fully specified until we see some incremental results from the reducer. I haven't looked at what it would take to do this in Hadoop, but I'm guessing it would be somewhat painful.
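In sequential form the early-stopping idea is simple; a sketch (the stopping rule, batch size, and names are my own assumptions, not from any framework): keep running sums of the values and their squares, and stop once the estimated standard error of the mean falls below a tolerance.

```python
import math
import random

def mc_mean_until(f, sample, tol, batch=1_000, max_batches=1_000, seed=0):
    """Estimate E[f(X)] in batches, stopping once the estimated standard
    error of the running mean falls below tol."""
    rng = random.Random(seed)
    s = sq = n = 0.0
    for _ in range(max_batches):
        for _ in range(batch):
            y = f(sample(rng))
            s += y
            sq += y * y
            n += 1
        mean = s / n
        var = max(sq / n - mean * mean, 0.0)
        if math.sqrt(var / n) < tol:  # estimated standard error of the mean
            break
    return mean, int(n)

# Estimate pi as E[4 * indicator of the unit disc] over the square.
f = lambda p: 4.0 * (p[0] ** 2 + p[1] ** 2 <= 1.0)
sample = lambda rng: (rng.uniform(-1, 1), rng.uniform(-1, 1))
estimate, used = mc_mean_until(f, sample, tol=0.01)
```

The awkward part is that the guard is computed by the reducer, yet it controls how much more data the scatter phase needs to emit.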

This also gives rise to a more general question: how do you express this concept in a general way that doesn't destroy opportunities for parallelism? You can't just loop until accuracy > x. That doesn't express the notion that any value of accuracy > x is sufficient, rather than just the first. Typically, looping until variable > x implies that we loop until it is first true. This means every time we evaluate the loop body we have to evaluate the guard, which in general makes the loop sequential. In the Monte Carlo case any value > x is fine, i.e., we can do extra iterations of the loop, so the loop is really parallel. How do we express this generally? What is the general case outside of Monte Carlo?
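One way to exploit the "any value > x is fine" property is to run whole waves of batches speculatively and only check the guard between waves. A sketch, assuming the pi-estimation example from above (wave size, tolerance, and names are illustrative choices of mine):

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def batch(seed, n=5_000):
    """One independent batch of the pi estimate; safe to run speculatively,
    since extra batches only tighten the estimate."""
    rng = random.Random(seed)
    hits = sum(rng.uniform(-1, 1) ** 2 + rng.uniform(-1, 1) ** 2 <= 1.0
               for _ in range(n))
    return 4.0 * hits, n  # (partial sum of 4 * indicator, count)

def run_in_waves(tol=0.02, wave=4, max_waves=50):
    s = n = 0.0
    seed = 0
    with ThreadPoolExecutor(max_workers=wave) as pool:
        for _ in range(max_waves):
            # Launch a whole wave before looking at the guard: the extra
            # batches are wasted work at worst, never a wrong answer.
            for partial_sum, count in pool.map(batch, range(seed, seed + wave)):
                s += partial_sum
                n += count
            seed += wave
            mean = s / n
            if math.sqrt(mean * (4.0 - mean) / n) < tol:  # rough standard error
                return mean, int(n)
    return mean, int(n)

estimate, used = run_in_waves()
```

The guard is evaluated once per wave rather than once per batch, so the batches within a wave run in parallel, at the cost of possibly overshooting the required sample count by up to one wave.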

Saturday, September 17, 2011

Naive Monte Carlo and Map Reduce

Think about map reduce and Monte Carlo methods. Specifically, think about the prototypical application of Monte Carlo methods: measuring a volume defined by a function. In this application one samples a set of points and counts what proportion fall within the volume. The simplest case is to sample uniformly over some bounding rectangle whose volume is known; the measured volume is then this known volume times the calculated proportion. The Wikipedia article is pretty good: http://en.wikipedia.org/wiki/Monte_Carlo_integration
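The whole method fits in a few lines; a minimal sketch (the function names and the unit-disc example are mine, for illustration):

```python
import random

def estimate_volume(indicator, bounds, n, seed=0):
    """Estimate the volume of the region where indicator(point) is true,
    by uniform sampling over the bounding box given as (lo, hi) pairs."""
    rng = random.Random(seed)
    box_volume = 1.0
    for lo, hi in bounds:
        box_volume *= hi - lo
    hits = 0
    for _ in range(n):
        point = [rng.uniform(lo, hi) for lo, hi in bounds]
        hits += indicator(point)
    return box_volume * hits / n  # known volume times hit proportion

# Area of the unit disc (~pi) from its bounding square [-1, 1] x [-1, 1].
in_disc = lambda p: p[0] ** 2 + p[1] ** 2 <= 1.0
estimate = estimate_volume(in_disc, [(-1.0, 1.0), (-1.0, 1.0)], 100_000)
```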

Naive Monte Carlo is a one shot deal. The sampling distribution doesn't change over time and specifically doesn't depend on any prior samples. This would seem perfect for map reduce. You have some source that generates random numbers uniformly over the bounding rectangle. The mapping function is the indicator function for the volume being measured. The reducer is the mean. Let's consider each of these in turn.

In a more standard map reduce application the source would be some file or set of data. Here, though, it's a generator, producing a sequence of n-dimensional random vectors. Unless the dimensionality is very large the bandwidth requirements are probably pretty low, although that must really be considered relative to the time required to evaluate the indicator function. Random number generation can be relatively CPU intensive, so we should probably distribute that generation. It's not exactly clear how to do this, especially if the number of live remote nodes varies over time, or if there is some variability in the time required to evaluate the indicator function. I don't know how to do this ... yet, but I think it's interesting.
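One common approach is to ship each node its own seed rather than shipping the random numbers themselves. A sketch of the idea (the seed-derivation scheme here is a simple assumption; drawing child seeds from a root generator is not guaranteed to give statistically independent streams, and a production version would want a purpose-built mechanism for parallel streams):

```python
import random

def worker_seeds(root_seed, n_workers):
    """Derive one distinct seed per node from a single root seed, so each
    node generates locally with no further coordination."""
    root = random.Random(root_seed)
    return [root.getrandbits(64) for _ in range(n_workers)]

def worker(seed, n_samples):
    """What one node runs: generate locally, evaluate, return a partial result."""
    rng = random.Random(seed)
    hits = sum(rng.uniform(-1, 1) ** 2 + rng.uniform(-1, 1) ** 2 <= 1.0
               for _ in range(n_samples))
    return hits, n_samples

seeds = worker_seeds(root_seed=42, n_workers=8)
partials = [worker(s, 10_000) for s in seeds]  # each call could run on a different node
hits = sum(h for h, _ in partials)
count = sum(c for _, c in partials)
estimate = 4.0 * hits / count  # area of the unit disc
```

This sidesteps the bandwidth question entirely, but note it assumes a fixed set of workers; handling nodes that join or die mid-run is exactly the part that remains unclear.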

The indicator function is a pretty standard map operation. Nothing too interesting here. At least until we consider sampling from a distribution other than uniform. The one interesting thing is that the output of this function is small, i.e., a single bit.

The reduction operation is also pretty straightforward. Some observations are worth making, though. The inputs and outputs of the reducer are small, as is its state. The reducer is commutative but not associative, and associativity is really required in order to distribute the reduction too. But the mean is just sum / count, and both sum and count are associative and commutative, so we really want to reduce to the pair (sum, count) and just calculate the mean at the end.
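In code the fix is tiny; a sketch: the pair (sum, count) combines by elementwise addition, which is associative and commutative, so partial results can be merged in any order and therefore in parallel across nodes.

```python
from functools import reduce

def combine(a, b):
    """Merge two (sum, count) partial results; order doesn't matter."""
    return (a[0] + b[0], a[1] + b[1])

partials = [(3.0, 4), (1.0, 2), (6.0, 6)]  # (partial sum, partial count) per node
total_sum, total_count = reduce(combine, partials)
mean = total_sum / total_count  # computed once, at the very end
```

Averaging the per-node means directly would be wrong whenever the nodes processed different numbers of samples, which is why the division has to wait until the end.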

All in all we have something that doesn't ever need the filesystem and has minimal bandwidth requirements. The challenges are with regard to how we distribute the generator and how we distribute the reduction.

This problem is one that should be perfect for Numerate's distributed infrastructure that we're about to open source.

Considering Open Sourcing some code

I manage tech development at Numerate, where we use a lot of compute to analyze molecules. We've been using 100s of cores to do this for years now, since ~2001, i.e., before the rise of Hadoop and other distributed frameworks. Because we did this early, we had to develop our own infrastructure. Because it was clear that the cloud was coming (and we were not enjoying managing our own hardware) but it wasn't clear what the cloud would be, we had to design something with minimal constraints. The upshot is that we built something interesting. It's different from Hadoop in a few ways; perhaps the most important is that it is designed for low IO:CPU ratios and for low latency, in contrast with Hadoop. In fact, I think of Hadoop as a bandwidth platform rather than a compute platform.

Anyway, we've decided to open source this platform and are thinking about how to do this. I've been reading http://producingoss.com/ and found it quite helpful. The biggest challenge, though, is to distinguish what we've done from what other people have done and from what Hadoop does well. Why would people use our platform and not Hadoop? In particular, what's a good example application -- other than ours, which we can't open source?

This turns out to be somewhat challenging to lay out succinctly.

I'm not going to get into all of the details here; we hope to get this up and running in a few weeks, so the arguments will be laid out then. The upshot, though, is that Monte Carlo methods and/or particle filters may be good example problems.

My second look at Scala

Second impressions are not so good, I'm afraid. Unfortunately, Scala is full of power tools, so it's very easy to hurt yourself or others. Take operator overloading, for example.

I like operator overloading when the operator itself carries sufficient context to understand what's going on, e.g., + means addition, and addition has certain properties that are well understood. But what does >;> mean? Frankly, I have no idea. So how is naming a function >;> helpful? I've found it pretty useful for functions to have meaningful names; sometimes it's the only documentation you get, and good names can lead to self-documenting code. But >;> is not helpful.

Implicit type conversions are great, if used carefully. But they give me nightmares of completely illegible code. Code that you cannot read outside of an IDE. Code that requires context sensitive help just to know what type you have. Mixins are similarly scary.

All of these feel like ways that inexperienced developers could destroy the productivity of experienced developers by writing unreadable code, and by implicitly modifying the behavior of code written by others.

Anyway, the upshot for me is that Scala still seems very interesting, but absent a set of coding conventions I would be terrified of using it for a big project. I wonder if such a set of conventions exists.

Thursday, August 25, 2011

Semantic Web

I mentioned previously that I'm working on a little side project to see what's involved in replicating Watson.

(Generally, I find these kinds of side projects a good way to get exposed to new technologies and to really learn things. Also, it's a lot of fun.)

Anyway, I'm looking around to see what kinds of resources are available to be brought to bear. I started by using webharvest to scrape the Jeopardy Archive. Now I'm looking at semantic web resources. I wasn't really aware how far this stuff has been pushed. I should note that I'm fairly skeptical of the more ambitious goals of the semantic web. But there are some useful tools available. For example, DBpedia has extracted a lot of information from wikipedia into RDF format, that is, semi-structured data. There are some free triplestores that can query this data quickly, e.g., OWLIM. Some of these even do minimal inference. And they implement a standard query language.

DBpedia has a pretty good database of people. It includes things like place/date of birth, parents, books written, positions held, etc. So my short term goal now is to see if I can classify the answers of Jeopardy questions into person / not person. Then I'll see if I can classify the questions into ones that have a person as an answer or not. Then I'll try to extract enough structure from the questions to see if I can generate reasonable queries against DBpedia to answer them. An alternative at that point would be to do a web search for key components of the question and look for co-occurrences of people's names. I may need to narrow the answer categories even more, e.g., male/female, politician/author/actor, etc.

Right now, it looks like no "rocket science" is required here. But it does look like a lot of cases would have to be handled -- it's a little too brute force. Generally, though, I think it will be instructive to do at least a couple of cases by this kind of brute force.

Anyway, should be fun.

My first look at Scala

I took my first look at Scala today and I like it. The mixins idea is great. I've missed operator overloading. I love functional languages. The type inference seems cool. "Everything is an object" is great. Lower and upper type bounds make total sense. Partial functions are very interesting.

I don't like that they kept exceptions and that side-effects are allowed. Probably pragmatic choices, but I still hope for an efficient pure functional language.

Overall, I'm pretty interested in trying it out. I might even propose that the next new non-GUI project we do at work use Scala.