Wednesday, June 27, 2012

The funny thing about gravity

We've seen a lot of talk about Data Gravity recently. Essentially the idea is that data has inertia: once your repository gets big enough, it's very hard to move. I buy the concept wholeheartedly. It makes a lot of sense. This obviously puts the owners of the repositories (think Amazon's S3) in a great position. They could start charging usurious rents, or, more likely, they will end up sucking all of the applications and developers onto their platform. The seemingly obvious conclusion for any big-data startup, then, is that you must hold the data. You must start accumulating mass. You must require your customers to import their data into your service in order to use it. That way they're trapped, and you've gained mass and inertia. And this seems key to the business model of many new big-data startups.

Of course, this neglects the fact that data has mass NOW, not just in the future. Big corporations already have lots of data, and they cannot move it to your service. You'll have to move your service to their data.

Well, maybe big corporations aren't your customer base. Then you're only competing with Amazon and the other providers who already have a lot of mass. I, for one, am turned off by services that won't let me store my data in S3. You see, I can't keep a separate copy of my data for every application vendor / service provider. So I need you to use my data in some central place where you can all get at it.

The accretion of mass has this positive feedback to it. Once there are a few massive entities, they will accumulate essentially all of the mass. What are the odds that a new planet will form in the asteroid belt? Zero. So the question is how many planets are there going to be in this big data universe and how many tiny little asteroids?

Of course, I'm biased because Numatix (www.numatix.org) is designed to run anywhere.

Thursday, June 7, 2012

The problem with Kaggle

I love Kaggle. I think it's great in terms of its democratization and evangelism of machine learning. I think it's great in that it demonstrates the breadth of applicability of machine learning.

Unfortunately, I think it trivializes much of what is truly hard about applying machine learning, because many of its problems come in "pre-baked" form. In other words, someone has already translated the real problem into a particular abstraction. This is done by choosing a representation and a loss function. It is also done by holding back information regarding the problem and the final test data.

What this means is that the most interesting and challenging aspects of applied machine learning have been abstracted away -- often poorly. The remaining challenges are not that interesting. All that's really left is tuning the parameters of a variety of learning algorithms.

The Netflix challenge had some of the same issues. Actually, I haven't seen any "competitions" like this that don't abstract away most of the interesting issues. I'm not certain that it can be done, but there's a lot of space to explore between where we are now and perfect.

Update: Kaggle just came out with Kaggle Prospect, which certainly takes some steps to address this concern. I'm looking forward to seeing how it goes.

Monday, June 4, 2012

Evaluating ligand-based algorithms: Noise

We as a community are acutely aware that the data we have with which to build models can be extremely noisy. There are two approaches to dealing with this. First, only use consistent, low-noise sets of data. Second, develop noise-robust fitting procedures. I don't intend to really argue for one of these here, except to note that I believe we have to take the second approach and leverage all available data.

With regard to evaluating algorithms, the first approach does not require much further discussion here. The second approach is more interesting.
The correct way to approach developing noise-robust algorithms is to begin by assessing the typical nature and quantity of noise on our problem of interest. We then develop techniques to deal with these particular kinds of noise. The reason this is the correct approach is that general robustness to arbitrary noise is hard to do, very hard. We always want to leverage our knowledge and understanding of the problem at hand to make it easier. Dealing with noise is an issue where this approach can provide significant leverage.

However, if we have developed algorithms to deal with the kinds of idiosyncratic noise that arise in our data, we can't evaluate them on data that doesn't present similar kinds of noise. Suppose, for example, that we have two components to our fitting procedure: a learning algorithm A, and a noise-robustness module B. Now suppose we use a set of data to evaluate the relative performance of A alone against A combined with B (AB). Clearly, if the evaluation data is noise-free, or exhibits different kinds of noise than those for which B was developed, we should not expect AB to perform better. In fact, we can reasonably expect it to perform worse.
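
To illustrate, here is a made-up toy simulation (my own sketch, with scikit-learn's Huber loss standing in for a noise-robustness module B): fit an ordinary least-squares model (A) and a robust variant (AB) on clean and on outlier-contaminated training labels, then compare error on clean test data. On clean training data the robust variant buys you nothing and may do slightly worse; on contaminated data it helps.

```python
# Illustrative sketch only: compare a plain learner "A" (ordinary least squares)
# with a noise-robust variant "AB" (Huber loss) on clean vs. contaminated labels.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, d = 200, 5
w_true = rng.normal(size=d)

def make_data(noisy):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    if noisy:
        # contaminate 10% of the labels with large outliers
        idx = rng.choice(n, size=n // 10, replace=False)
        y[idx] += 10 * rng.normal(size=idx.size)
    return X, y

X_test = rng.normal(size=(1000, d))
y_test = X_test @ w_true

for noisy in (False, True):
    X, y = make_data(noisy)
    for name, model in [("A  (OLS)  ", LinearRegression()),
                        ("AB (Huber)", HuberRegressor())]:
        model.fit(X, y)
        err = mean_squared_error(y_test, model.predict(X_test))
        print(f"noisy={noisy!s:5}  {name}  test MSE = {err:.3f}")
```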

More generally, it is not reasonable to compare algorithms perfected for noise-free data versus algorithms perfected for noisy data on noise-free data. We must decide a priori which case is most relevant to the problem at hand and evaluate that case.

(Note that when I refer to evaluation data I mean both the training and testing data used for the evaluation.)

Evaluating ligand-based algorithms

What is the best way to evaluate or compare ligand-based algorithms? I've found myself spending quite a lot of time on this question recently. It's both interesting and hard. And most of what is done in the computational chemistry literature isn't right.


So, I'm going to write a series of blog posts aiming to clarify the issues.
Generally speaking, there are three main axes:
1) What is the problem we want to solve? In other words, what are we going to use our predictions for?
2) What are the key features of the data available to us for building and evaluating models?
3) What are the appropriate statistical techniques/tests for assessing the results?


I'm going to address some of these issues in detail in later posts. Here I'd just like to say why each of these is important.


First, understanding how you're going to use your model is key. It's essential to understand that any model you build is going to make mistakes. It's going to have errors. The key to problem definition, then, is deciding where to put those errors, or alternatively which errors to penalize and by how much. For example, suppose you're fitting atom-centered partial charges for a force field based on quantum data. You could minimize the mean squared error, you could minimize the mean absolute error, or you could minimize the maximum deviation. You will get vastly different results in these different cases. Furthermore, the appropriate algorithm, representation, and data for fitting each objective function will be different. So, if what you really care about is the maximum deviation, it would be a mistake to evaluate alternative fitting algorithms by measuring mean squared error.
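
To see how much the objective matters, consider the simplest possible fit: a single constant fit to a handful of made-up reference values (the numbers below are placeholders, not real charges). The three objectives give three different answers.

```python
# Hedged toy example (not an actual charge-fitting workflow): fit one constant
# to a set of reference values under three different objectives.
import numpy as np

ref = np.array([0.1, 0.2, 0.25, 0.3, 1.5])    # hypothetical reference values

mse_fit = ref.mean()                          # minimizes mean squared error
mae_fit = np.median(ref)                      # minimizes mean absolute error
minimax = (ref.min() + ref.max()) / 2         # minimizes the maximum deviation

print(mse_fit, mae_fit, minimax)              # 0.47, 0.25, 0.8
```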


Second, the distributional and noise properties of your data and of the underlying problem are critical to the effectiveness of algorithms and models. Therefore the experimental design needs to carefully reflect these properties. For example, if the underlying problem exhibits a particular kind of bias, then the key determinant of effectiveness may be the ability to deal with that bias. If the evaluation experiment does not exhibit similar biases, it will be incapable of differentiating appropriately between algorithms along this key axis.


Finally, different test designs will yield performance measures with different distributional properties, and so the appropriate statistical tests to determine effectiveness will differ. For example, Spearman's rank correlation computed on a sample is a biased estimate of the population value, while (under appropriate assumptions) the same is not true for Kendall's tau. This must be taken into account.
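
One quick way to convince yourself is a simulation along these lines (a sketch, assuming bivariate normal data and the standard population identities for Kendall's tau and Spearman's rho under that model):

```python
# Draw many small samples from a bivariate normal with known correlation and
# compare the average sample statistic to its population counterpart.
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(1)
rho, n, trials = 0.7, 10, 20000
cov = [[1.0, rho], [rho, 1.0]]

pop_tau = 2 / np.pi * np.arcsin(rho)          # population Kendall tau
pop_rho_s = 6 / np.pi * np.arcsin(rho / 2)    # population Spearman rho

taus, rhos = [], []
for _ in range(trials):
    x, y = rng.multivariate_normal([0, 0], cov, size=n).T
    tau, _ = kendalltau(x, y)
    rs, _ = spearmanr(x, y)
    taus.append(tau)
    rhos.append(rs)

print("Kendall:  mean sample tau   =", np.mean(taus), " population =", pop_tau)
print("Spearman: mean sample rho_s =", np.mean(rhos), " population =", pop_rho_s)
```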


Designing appropriate experiments and evaluating them correctly is truly hard. I don't believe that our current state of knowledge or tools allows us to solve this problem fully. I do believe, though, that standard approaches make significant errors, and that these errors are sufficient to make many of the conclusions drawn incorrect. We will have real trouble advancing the field of computational chemistry until we address these shortcomings.

Thursday, May 31, 2012

What is NoSql really?

What I think NoSql is really about is that people are starting to figure out how to horizontally scale data structures other than row-structured tables. To me that's the key difference between "Sql" and "NoSql": it has less to do with how queries are structured, and much more to do with how data and indices are structured.

Of course, different data structures are only interesting because they enable different algorithms, and different algorithms imply different "query" patterns and different constraints / trade-offs with regard to, e.g., consistency.

But to me the real key is that it has become easier, and better understood, how to horizontally scale different data structures. So far the range of structures that has been tackled is still rather limited: graphs, column stores, triple/quad stores, key-value stores. And of course, the access patterns of different algorithms on these stores can have huge implications for the "right" way to implement them. It seems to me that we are just at the beginning of this effort, and that the universe of data-stores will only explode further. What excites me is the idea that we can factor out the key components to enable the rapid implementation of new data structures and new access patterns. Obviously, we can't afford/expect everyone to implement 50 or 100 different stores, but we can expect them to configure 5 orthogonal technologies to enable those 50 to 100 data structures. So what are the key components? How about these for starters:

1) Replication
2) Sharding
3) Storage with clear/configurable and possibly hierarchical notions of locality
4) The ability to move "queries" to the data.
5) A very general framework for expressing "queries". This needs to be general because the data structures are not useful without the ability to express the relevant algorithms against them.

Obviously, all of this needs to be configurable and manageable with minimal overhead.
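
As a toy illustration of components 4 and 5 in the list above (the names and interface here are mine, purely for illustration, not any particular system's API): the shard holds the data, and the client ships an arbitrary "query" function to it.

```python
# Sketch of "move the query to the data" plus a general query framework.
# Real systems must also handle serialization, placement, replication,
# failures, and consistency; this only illustrates the shape of the interface.
from functools import reduce
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

class Shard:
    def __init__(self, records):
        self.records = list(records)          # data local to this shard

    def run(self, query: Callable[[Iterable], T]) -> T:
        # the query executes next to the data, not on the client
        return query(self.records)

def scatter_gather(shards, query, combine):
    partials = [s.run(query) for s in shards]  # in reality: remote calls
    return reduce(combine, partials)

# usage: count records matching a predicate across shards
shards = [Shard(range(0, 50)), Shard(range(50, 100))]
count = scatter_gather(shards,
                       query=lambda recs: sum(1 for r in recs if r % 7 == 0),
                       combine=lambda a, b: a + b)
print(count)   # 15
```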

So how much of this does Hadoop do?

Tuesday, January 17, 2012

Monte Carlo beyond Map Reduce

I've written a couple of posts here and here about Monte Carlo methods and Map Reduce. Both have considered naive Monte Carlo in the sense that the sampling distribution is fixed and does not depend on any prior evaluations. However, more sophisticated Monte Carlo methods do change the sampling distribution over time. Again, the Wikipedia article is pretty good: http://en.wikipedia.org/wiki/Monte_Carlo_integration. In these cases we iterate over the kinds of map-reduce evaluations discussed previously, updating the distribution each iteration. The iterative nature of these methods makes them more latency-sensitive than the typical MapReduce task. If we are performing many iterations, each iteration must not incur a significant start-up delay; otherwise the total time may be dominated by that delay. Again, this is a setting in which Numerate's distributed framework should shine.
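
To make the iteration structure concrete, here is a minimal, framework-agnostic sketch of one adaptive importance-sampling scheme; the integrand, proposal family, batch size, and number of iterations are all placeholders. Each loop iteration corresponds to one scatter/evaluate/reduce round.

```python
# Adaptive (iterative) Monte Carlo: each iteration evaluates a batch of samples
# ("map"), reduces them to an estimate ("reduce"), then updates the proposal.
import numpy as np

rng = np.random.default_rng(2)

def f(x):                        # integrand; its true integral is sqrt(pi)
    return np.exp(-(x - 3.0) ** 2)

mu, sigma = 0.0, 5.0             # initial proposal N(mu, sigma)
for it in range(8):
    # "map" phase: draw a batch and evaluate the weighted integrand
    x = rng.normal(mu, sigma, size=10_000)
    q = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    w = f(x) / q                 # importance weights

    # "reduce" phase: current estimate of the integral
    estimate = w.mean()

    # adapt the proposal toward where the integrand has mass
    mu = np.average(x, weights=w)
    sigma = np.sqrt(np.average((x - mu) ** 2, weights=w))
    print(f"iter {it}: estimate = {estimate:.4f}, proposal = N({mu:.2f}, {sigma:.2f})")
```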

Thursday, January 12, 2012

More Monte Carlo and Map Reduce

In a previous post I discussed naive Monte Carlo and Map Reduce. I observed that the key challenges were how to distribute the random number generator and the reducer. Here I'd like to look at these issues a little more; specifically, I'd like to look at how they become more challenging with even slightly more advanced Monte Carlo methods.

Suppose our method no longer uses samples chosen uniformly, but rather draws them from some other distribution. There are many ways of transforming a uniform distribution into another one. Some of them require additional sampling; some even require a random number of samples. So the question arises: what are we distributing to each node, a fixed number of samples or a fixed number of evaluations? Either way, how do we deal with the variable number of samples?
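
Rejection sampling is the classic example (the Beta(2,5) target below is just a placeholder): it turns uniform draws into draws from another distribution, but the number of uniforms consumed per accepted sample is itself random.

```python
# Sketch: rejection sampling from Beta(2,5) using only uniform draws.
# M bounds the target density on [0, 1]; expected cost is 2*M uniforms/sample.
import numpy as np

rng = np.random.default_rng(3)
M = 2.5                                      # upper bound on the Beta(2,5) pdf

def beta25_pdf(x):
    return 30.0 * x * (1.0 - x) ** 4

def sample_beta25():
    """One Beta(2,5) draw by rejection; returns (sample, uniforms used)."""
    used = 0
    while True:
        x, u = rng.uniform(), rng.uniform()
        used += 2
        if u * M <= beta25_pdf(x):           # accept with probability pdf(x)/M
            return x, used

samples, costs = zip(*(sample_beta25() for _ in range(10_000)))
print("sample mean (should be ~2/7 = 0.286):", np.mean(samples))
print("average uniforms consumed per accepted sample:", np.mean(costs))
```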

Suppose further that we modify our method to also maintain an estimate of its accuracy and to stop early once the estimated accuracy is good enough. Now the mapping to Map Reduce breaks somewhat. Map Reduce (as I understand it) consists of three distinct phases. In the first phase the data is scattered to a number of evaluation nodes. In the second phase these nodes evaluate some operations on the data. In the third phase the results are gathered and aggregated by the reducer. Typically these phases are conceptually distinct. Here, though, they overlap; in fact, the scatter phase isn't fully specified until we see some incremental results from the reducer. I haven't looked at what it would take to do this in Hadoop, but I'm guessing it would be somewhat painful.

This also gives rise to a more general question: how do you express this concept in a general way that doesn't destroy opportunities for parallelism? You can't just loop until accuracy > x. That doesn't express the notion that any value of accuracy > x is sufficient, rather than just the first. Typically, looping until variable > x means we loop until this first becomes true, which means every time we evaluate the loop body we also have to evaluate the guard, and that in general makes the loop sequential. In the Monte Carlo case any value > x is fine, i.e., we can do extra iterations of the loop, so the loop is really parallel. How do we express this generally? What is the general case outside of Monte Carlo?
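
Here is the contrast in miniature (a toy sketch, not tied to any framework; the integrand and accuracy target are made up): the sequential loop checks the guard after every evaluation, while the batched version happily does "extra" work because any accuracy better than the target is acceptable.

```python
# Sequential vs. batched accumulation of Monte Carlo samples until a
# target standard error of the mean is reached.
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: x * x                           # placeholder integrand on [0, 1]
target_sem = 1e-2                             # desired standard error of the mean

# Sequential: guard evaluated between every pair of samples (hard to parallelize)
vals = [f(rng.uniform()) for _ in range(100)]
while np.std(vals) / np.sqrt(len(vals)) > target_sem:
    vals.append(f(rng.uniform()))

# Batched: add whole batches (each one a map-reduce round) and overshoot freely
vals2 = list(f(rng.uniform(size=100)))
while np.std(vals2) / np.sqrt(len(vals2)) > target_sem:
    vals2.extend(f(rng.uniform(size=1000)))   # the batch could run on many nodes

print(len(vals), np.mean(vals))
print(len(vals2), np.mean(vals2))
```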