Saturday, September 17, 2011

Naive Monte Carlo and Map Reduce

Think about map reduce and Monte Carlo methods. Specifically, think about the prototypical application of Monte Carlo methods: measuring a volume defined by a function. In this application one samples a set of points and counts what proportion fall within the volume. The simplest case is to sample uniformly over some bounding rectangle whose volume is known; the measured volume is then this known volume times the calculated proportion. The Wikipedia article is pretty good: http://en.wikipedia.org/wiki/Monte_Carlo_integration
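As a minimal sketch of the idea (names and structure are mine, purely illustrative), here's the uniform-sampling case in Python, estimating the area of the unit disk:

```python
import random

def mc_volume(indicator, bounds, n, seed=0):
    """Estimate the volume of the region where `indicator` is true by
    sampling uniformly inside the bounding box `bounds`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        point = [rng.uniform(lo, hi) for lo, hi in bounds]
        if indicator(point):
            hits += 1
    box_volume = 1.0
    for lo, hi in bounds:
        box_volume *= hi - lo
    # Known box volume times the measured proportion of hits.
    return box_volume * hits / n

# Area of the unit disk; should come out close to pi.
est = mc_volume(lambda p: p[0]**2 + p[1]**2 <= 1.0,
                bounds=[(-1.0, 1.0), (-1.0, 1.0)], n=100_000)
```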

Naive Monte Carlo is a one-shot deal. The sampling distribution doesn't change over time and, specifically, doesn't depend on any prior samples. This would seem perfect for map reduce. You have some source that generates random points uniformly over the bounding rectangle. The mapping function is the indicator function for the volume being measured. The reducer is the mean. Let's consider each of these in turn.

In a more standard map reduce application the source would be some file or set of data. Here, though, it's a generator producing a sequence of n-dimensional random vectors. Unless the dimensionality is very large the bandwidth requirements are probably pretty low, although that must really be considered relative to the time required to evaluate the indicator function. Random number generation can be relatively CPU intensive, so we should probably distribute the generation itself. It's not exactly clear how to do this, especially if the number of live remote nodes varies over time, or if there is some variability in the time required to evaluate the indicator function. I don't know how to do this ... yet, but I think it's interesting.
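One simple (though not bulletproof) approach to distributing generation is to derive each worker's RNG from a shared base seed plus a worker id, so streams are reproducible and distinct. A sketch, with all names my own invention; for stronger non-overlap guarantees you'd want a generator with explicit stream or jump-ahead support:

```python
import random

def worker_stream(worker_id, base_seed=12345):
    """Each worker gets its own RNG seeded from (base_seed, worker_id).
    String seeds keep this valid across Python versions."""
    return random.Random(f"{base_seed}:{worker_id}")

# Four hypothetical workers, each drawing 3-dimensional points.
streams = [worker_stream(i) for i in range(4)]
samples = [[s.uniform(-1.0, 1.0) for _ in range(3)] for s in streams]
```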

The indicator function is a pretty standard map operation. Nothing too interesting here, at least until we consider sampling from a distribution other than uniform. One noteworthy property is that the output of this function is small, i.e., a single bit.

The reduction operation is also pretty straightforward, but some observations are worth making. The inputs and outputs of the reducer are small, as is its state. The reducer, the mean, is commutative but not associative, and associativity is really required in order to distribute the reduction. But the mean is just sum / count, and both sum and count are associative and commutative, so we really want to reduce to (sum, count) pairs and just calculate the mean at the end.
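The (sum, count) trick can be sketched concretely; merging partial pairs is associative and commutative, so partial reductions can be combined in any order across workers (the helper names here are mine):

```python
def local_reduce(bits):
    """Reduce one worker's stream of indicator bits to a (sum, count) pair."""
    s = c = 0
    for b in bits:
        s += b
        c += 1
    return (s, c)

def merge(a, b):
    """Merging (sum, count) pairs is associative and commutative,
    unlike taking a mean of means."""
    return (a[0] + b[0], a[1] + b[1])

# Three hypothetical workers' indicator outputs.
partials = [local_reduce(bits) for bits in [[1, 0, 1], [0, 0, 1, 1], [1]]]
total_sum, total_count = partials[0]
for p in partials[1:]:
    total_sum, total_count = merge((total_sum, total_count), p)
mean = total_sum / total_count  # computed once, at the very end
```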

All in all we have something that doesn't ever need the filesystem and has minimal bandwidth requirements. The challenges are with regard to how we distribute the generator and how we distribute the reduction.

This problem is one that should be perfect for Numerate's distributed infrastructure that we're about to open source.

Considering Open Sourcing some code

I manage tech development at Numerate, where we use a lot of compute to analyze molecules. We've been using hundreds of cores to do this for years now, since ~2001, i.e., before the rise of Hadoop and other distributed frameworks. Because we did this early, we had to develop our own infrastructure. Because it was clear that the cloud was coming and we were not enjoying managing our own hardware, but it wasn't clear what the cloud would be, we had to design something with minimal constraints. The upshot is that we built something interesting. It's different from Hadoop in a few ways; perhaps the most important is that it is designed for low IO:CPU ratios and for low latency. In fact, I think of Hadoop as a bandwidth platform rather than a compute platform.

Anyway, we've decided to open source this platform and are thinking about how to do this. I've been reading http://producingoss.com/ and found it quite helpful. The biggest challenge, though, is to distinguish what we've done from what other people have done and from what Hadoop does well. Why would people use our platform and not Hadoop? In particular, what's a good example application -- other than ours, which we can't open source?

This turns out to be somewhat challenging to lay out succinctly.

I'm not going to get into all of the details here; we hope to get this up and running in a few weeks, so the arguments will be laid out then. The upshot, though, is that Monte Carlo methods and/or particle filters may be good example problems.

My second look at Scala

Second impressions are not so good, I'm afraid. Unfortunately, Scala is full of power tools, so it's very easy to hurt yourself or others. Take operator overloading, for example.

I like operator overloading when the operator itself carries sufficient context to understand what's going on; e.g., + means addition, and addition has certain properties that are well understood. But what does >> mean? Frankly, I have no idea. So how is naming a function >> helpful? I've found it pretty useful for functions to have meaningful names; sometimes names are the only documentation you get, and good names can lead to self-documenting code. But >> is not helpful.
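In Python terms (standing in for Scala here, since this blog's examples use Python), the distinction might look like this sketch of mine:

```python
class Vec:
    """Overloading + is readable because + already means addition,
    and vector addition has the properties readers expect."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __add__(self, other):
        return Vec(self.x + other.x, self.y + other.y)

v = Vec(1, 2) + Vec(3, 4)  # obvious: component-wise addition
# By contrast, exposing the same operation under a cryptic symbolic
# name tells the reader nothing; a descriptive method name (say,
# `translate`) would at least document itself.
```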

Implicit type conversions are great, if used carefully. But they give me nightmares of completely illegible code. Code that you cannot read outside of an IDE. Code that requires context sensitive help just to know what type you have. Mixins are similarly scary.

All of these feel like ways that inexperienced developers could destroy the productivity of experienced developers by writing unreadable code, and by implicitly modifying the behavior of code written by others.

Anyway, the upshot for me is that Scala still seems very interesting, but absent a set of coding conventions I would be terrified of using it for a big project. I wonder if such a set of conventions exists.

Thursday, August 25, 2011

Semantic Web

I mentioned previously that I'm working on a little side project to see what's involved in replicating Watson.

(Generally, I find these kinds of side projects a good way to get exposed to new technologies and to really learn things. Also, it's a lot of fun.)

Anyway, I'm looking around to see what kinds of resources are available to be brought to bear. I started by using webharvest to scrape the Jeopardy Archive. Now I'm looking at semantic web resources. I wasn't really aware how far this stuff has been pushed. I should note that I'm fairly skeptical of the more ambitious goals of the semantic web. But there are some useful tools available. For example, DBpedia has extracted a lot of information from Wikipedia into RDF format, that is, semi-structured data. There are some free triplestores that can query this data quickly, e.g., OWLIM. Some of these even do minimal inference. And they implement a standard query language.
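To make the "semi-structured data" idea concrete, here is a toy sketch of triple-pattern matching in Python. The triples and names are made up for illustration, not actual DBpedia identifiers, and a real triplestore would be queried with SPARQL rather than list comprehensions:

```python
# RDF-style (subject, predicate, object) triples; contents illustrative.
triples = [
    ("Abraham_Lincoln", "birthPlace", "Kentucky"),
    ("Abraham_Lincoln", "type", "Person"),
    ("Kentucky", "type", "Place"),
]

def match(pattern, store):
    """Return triples matching a pattern where None is a wildcard,
    analogous to a single SPARQL triple pattern."""
    return [t for t in store
            if all(p is None or p == v for p, v in zip(pattern, t))]

# "Find all the people" -- the kind of query a triplestore answers fast.
people = match((None, "type", "Person"), triples)
```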

DBpedia has a pretty good database of people. It includes things like place/date of birth, parents, books written, positions held, etc. So my short-term goal now is to see if I can classify the answers of Jeopardy questions as person or not-person. Then I'll see if I can classify the questions into ones that have a person as an answer or not. Then I'll try to extract enough structure from the questions to see if I can generate reasonable queries against DBpedia to answer them. An alternative at that point would be to do a web search for key components of the question and look for co-occurrences of people's names. I may need to narrow the answer categories even further, e.g., male/female, politician/author/actor, etc.

Right now, it looks like no "rocket science" is required here. But it does look like a lot of cases would have to be handled -- it's a little too brute force. Generally, though, I think it will be instructive to do at least a couple of cases by this kind of brute force.

Anyway, should be fun.

My first look at Scala

I took my first look at Scala today and I like it. The mixins idea is great. I've missed operator overloading. I love functional languages. The type inference seems cool. "Everything is an object" is great. Lower and upper type bounds make total sense. Partial functions are very interesting.

I don't like that they kept exceptions and that side-effects are allowed. Probably pragmatic choices, but I still hope for an efficient pure functional language.

Overall, I'm pretty interested in trying it out. I might even propose that the next new non-GUI project we do at work use Scala.


Saturday, August 20, 2011

Don't tweak the DB

Generally, I hate "tweaking the DB" as a performance strategy. I realized that not everyone shares this hatred and I've been trying to verbalize my rationale.

I think the reason I hate it is because it often breaks the DB abstraction. Unfortunately, the most implicit parts of any interface are the performance considerations. In fact, performance is not even really part of the interface. It remains unspecified, undocumented and unreliable.

All of this is especially the case with a database. Nowadays we write DB code to be portable -- we insert layers and develop narrow APIs so that we can switch the underlying database at some future time to address some future consideration.

We run databases on a range of hardware, and we carefully tune the hardware to achieve the desired performance.

Our applications do not function without our DB tweaks.

What we've done is implicitly rely on the implicit performance API of a large piece of software. The performance of that software depends greatly on the underlying hardware, on what else is running on the machine, and on what other DB users are doing. None of this is well documented.

So, given two solutions to a performance issue that require "equivalent" effort -- say, software development versus DB tweaking -- where one is implemented across the system boundary between your application and the DB, and the other resides solely in your application, the latter should win EVERY time IMO.

What is a CTO?

As I've noted in some other posts, I'm the CTO at a Bay Area startup. My company is somewhat noteworthy for the technical and scientific depth of our problem domain. Specifically, we apply modern machine learning and cutting edge cloud computing to predict the properties of drugs. Ok, so I'm getting to the point ...

At a startup you wear many hats, and as the company grows you need to start passing some of those hats on to others. I've been thinking about how to do this and how to separate and define the role of CTO versus V.P. of engineering.

For me the key difference is vision versus execution. The role of the CTO is to develop and effectively communicate, evangelize, and sell the company's technical vision. The role of a V.P. of engineering is to efficiently implement that vision.