Wednesday, June 27, 2012

The funny thing about gravity

We've seen a lot of talk about Data Gravity recently. Essentially the idea is that data has inertia: once your repository gets big enough, it's very hard to move. I buy the concept wholeheartedly. It makes a lot of sense. This obviously puts the owners of the big repositories (think Amazon's S3) in a great position. They could start charging usurious rents or, more likely, they will end up sucking all of the applications and developers onto their platform. The seemingly obvious conclusion for any big-data startup, then, is that you must hold the data. You must start accumulating mass. You must require your customers to import their data into your service in order to use it. That way, they're trapped and you've gained mass and inertia. And this seems key to the business model of many new big-data startups.

Of course, this neglects the fact that data has mass NOW, not just in the future. Big corporations already have lots of data, and they cannot move it to your service. You'll have to move your service to their data.

Well, maybe big corporations aren't your customer base. Then you're only competing with Amazon and the other providers who already have a lot of mass. I, for one, am turned off by services that won't let me store my data in S3. You see, I can't have a copy of my data for every application vendor / service provider. So I need you to use my data in some central place where you can all get at it.

The accretion of mass has a positive feedback to it. Once there are a few massive entities, they will accumulate essentially all of the mass. What are the odds that a new planet will form in the asteroid belt? Zero. So the question is: how many planets are there going to be in this big-data universe, and how many tiny little asteroids?

Of course, I'm biased because Numatix (www.numatix.org) is designed to run anywhere.

Thursday, June 7, 2012

The problem with Kaggle

I love Kaggle. I think it's great in terms of its democratization and evangelism of machine learning. I think it's great in that it demonstrates the breadth of applicability of machine learning.

Unfortunately, I think it trivializes much of what is truly hard about applying machine learning, because many of its problems come in "pre-baked" form. In other words, someone has already translated the real problem into a particular abstraction. This is done by choosing a representation and a loss function. It is also done by holding back information regarding the problem and the final test data.

What this means is that the most interesting and challenging aspects of applied machine learning have been abstracted away -- often poorly. The remaining challenges are not that interesting. All that's really left is tuning the parameters of a variety of learning algorithms.

The Netflix challenge had some of the same issues. Actually, I haven't seen any "competitions" like this that don't abstract away most of the interesting issues. I'm not certain that it can be done, but there's a lot of space to explore between where we are now and perfect.

Update: Kaggle just came out with Kaggle Prospect, which certainly takes some steps to address this concern. I'm looking forward to seeing how it goes.

Monday, June 4, 2012

Evaluating ligand-based algorithms: Noise

We as a community are acutely aware that the data we use to build models can be extremely noisy. There are two approaches to dealing with this. First, only use consistent, low-noise data sets. Second, develop noise-robust fitting procedures. I don't intend to argue for either of them here, except to note that I believe we have to take the second approach and leverage all available data.

With regard to evaluating algorithms, the first approach does not require much further discussion here. The second approach is more interesting.

The correct way to approach developing noise-robust algorithms is to begin by assessing the typical nature and quantity of noise in our problem of interest. We then develop techniques to deal with those particular kinds of noise. The reason this is the correct approach is that general robustness to arbitrary noise is hard, very hard. We always want to leverage our knowledge and understanding of the problem at hand to make things easier, and dealing with noise is an area where that knowledge can provide significant leverage.

However, if we have developed algorithms to deal with the kinds of idiosyncratic noise that arise in our data, we can't evaluate them on data that doesn't present similar kinds of noise. Suppose, for example, that we have two components to our fitting procedure: a learning algorithm A and a noise-robustness module B. Now suppose we use a set of data to evaluate the relative performance of A alone against A combined with B (AB). Clearly, if the evaluation data is noise-free, or exhibits different kinds of noise than those for which B was developed, we should not expect AB to perform better. In fact, we can reasonably expect it to perform worse.
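
To make this concrete, here is a toy sketch of my own (the estimators and the noise model are made up purely for illustration; this is not a ligand-based workflow). Take the "learning algorithm" A to be the sample mean of a measured value, and the noise-robustness module B to be a switch to the sample median, which is robust to the heavy-tailed outliers B is meant to handle. On outlier-contaminated evaluation data AB should come out ahead; on clean Gaussian data A alone should.

    import numpy as np

    rng = np.random.default_rng(0)
    true_mu = 1.0
    n_trials, n_obs = 5000, 25

    def mean_squared_errors(contaminated):
        """Average squared error of A (mean) and AB (median) over many trials."""
        err_A = err_AB = 0.0
        for _ in range(n_trials):
            x = rng.normal(true_mu, 1.0, n_obs)
            if contaminated:
                # Replace roughly 10% of the observations with wild outliers.
                mask = rng.random(n_obs) < 0.10
                x[mask] = rng.normal(true_mu, 20.0, mask.sum())
            err_A += (x.mean() - true_mu) ** 2        # A alone
            err_AB += (np.median(x) - true_mu) ** 2   # A plus robustness module B
        return err_A / n_trials, err_AB / n_trials

    for label, flag in [("noise-free evaluation", False), ("contaminated evaluation", True)]:
        mse_A, mse_AB = mean_squared_errors(flag)
        print(f"{label}: MSE(A) = {mse_A:.4f}, MSE(AB) = {mse_AB:.4f}")

The particular estimators don't matter; the point is that the character of the evaluation data decides which of A and AB looks better.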

More generally, it is not reasonable to take algorithms perfected for noise-free data and algorithms perfected for noisy data and compare them on noise-free data. We must decide a priori which case is most relevant to the problem at hand and evaluate that case.

(Note that when I refer to evaluation data I mean both the training and testing data used for the evaluation.)

Evaluating ligand-based algorithms

What is the best way to evaluate or compare ligand-based algorithms? I've found myself spending quite a lot of time on this question recently. It's both interesting and hard. And most of what is done in the computational chemistry literature isn't right.


So, I'm going to write a series of blog posts aiming to clarify the issues.

Generally speaking, there are three main axes:
1) What is the problem we want to solve? In other words, what are we going to use our predictions for?
2) What are the key features of the data available to us for building and evaluating models?
3) What are the appropriate statistical techniques/tests for assessing the results?


I'm going to address some of these issues in detail in later posts. Here I'd just like to say why each of these is important.


First, understanding how you're going to use your model is key. It's essential to understand that any model you build is going to make mistakes; it's going to have errors. The key to problem definition, then, is deciding where to put those errors, or alternatively, which errors to penalize and by how much. For example, suppose you're fitting atom-centered partial charges for a force field based on quantum data. You could minimize the mean squared error, you could minimize the mean absolute error, or you could minimize the maximum deviation. You will get vastly different results in these different cases. Furthermore, the appropriate algorithm, representation, and data for fitting each objective function will be different. So, if what you really care about is the maximum deviation, it would be a mistake to evaluate alternative fitting algorithms by measuring mean squared error.
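
As a toy illustration of the point (a sketch of my own, not an actual charge-fitting workflow), fit the same two-parameter linear model to the same synthetic "reference" data under the three objectives. Once the data contain a couple of large deviations, the fitted parameters come out quite different.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)

    # Synthetic reference data: a linear trend, modest noise, and two large deviations.
    x = np.linspace(0.0, 10.0, 40)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.size)
    y[[5, 30]] += 15.0

    def residuals(p):
        a, b = p
        return a * x + b - y

    objectives = {
        "mean squared error":  lambda p: np.mean(residuals(p) ** 2),
        "mean absolute error": lambda p: np.mean(np.abs(residuals(p))),
        "maximum deviation":   lambda p: np.max(np.abs(residuals(p))),
    }

    for name, obj in objectives.items():
        # Nelder-Mead copes with the non-smooth objectives; fine for two parameters.
        fit = minimize(obj, x0=[1.0, 0.0], method="Nelder-Mead")
        print(f"{name:>20s}: a = {fit.x[0]:.3f}, b = {fit.x[1]:.3f}")

The mean-absolute and minimax fits treat the two big deviations very differently from the least-squares fit, and deciding which of those treatments you actually want is exactly what problem definition is about.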


Second, the distributional and noise properties of your data, and of the underlying problem, are critical to the effectiveness of algorithms and models. The experimental design therefore needs to reflect these properties carefully. For example, if the underlying problem exhibits a particular kind of bias, then the key determinant of effectiveness may be the ability to deal with that bias. If the evaluation experiment does not exhibit similar biases, it will be incapable of differentiating appropriately between algorithms along this key axis.


Finally, different test designs will yield performance measures with different distributional properties, and so the appropriate statistical tests for assessing effectiveness will differ. For example, Spearman's rank correlation computed on a sample is a biased estimate of the true population correlation, while the same is not true of Kendall's tau (under appropriate assumptions). This must be taken into account.
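
A quick way to see the kind of thing I mean (again a sketch of my own, assuming bivariate normal data so that the population values have known closed forms): compare the average of each statistic over many small samples with its population value.

    import numpy as np
    from scipy.stats import kendalltau, spearmanr

    rng = np.random.default_rng(2)
    rho, n, n_trials = 0.7, 15, 20000

    # Population values for a bivariate normal with correlation rho.
    tau_true = 2.0 / np.pi * np.arcsin(rho)
    spearman_true = 6.0 / np.pi * np.arcsin(rho / 2.0)

    cov = [[1.0, rho], [rho, 1.0]]
    taus, spearmans = [], []
    for _ in range(n_trials):
        sample = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        tau, _ = kendalltau(sample[:, 0], sample[:, 1])
        rs, _ = spearmanr(sample[:, 0], sample[:, 1])
        taus.append(tau)
        spearmans.append(rs)

    print(f"Kendall:  population {tau_true:.4f},  mean of sample estimates {np.mean(taus):.4f}")
    print(f"Spearman: population {spearman_true:.4f},  mean of sample estimates {np.mean(spearmans):.4f}")

Any gap you see for Spearman is the small-sample bias at issue; Kendall's tau (with continuous data and no ties) is a U-statistic and is unbiased here.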


Designing appropriate experiments and evaluating them correctly is truly hard. I don't believe that our current state of knowledge or tools allows us to solve this problem fully. I do believe, though, that standard approaches make significant errors, and that these errors are sufficient to make many of the conclusions drawn incorrect. We will have real trouble advancing the field of computational chemistry until we address these shortcomings.