Wednesday, June 27, 2012
The funny thing about gravity
Of course, this neglects the fact that data has mass NOW, not just in the future. Big corporations already have lots of data, and they cannot move it to your service. You'll have to move your service to their data.
Well, maybe big corporations aren't your customer base. Then you're only competing with Amazon and the other providers who already have a lot of mass. I, for one, am turned off by services that won't let me store my data in S3. You see, I can't keep a separate copy of my data for every application vendor or service provider. So I need you to use my data in some central place where you can all get at it.
The accretion of mass has a positive feedback to it: once there are a few massive entities, they will accumulate essentially all of the mass. What are the odds that a new planet will form in the asteroid belt? Zero. So the question is: how many planets are there going to be in this big data universe, and how many tiny little asteroids?
Of course, I'm biased because Numatix (www.numatix.org) is designed to run anywhere.
Thursday, June 7, 2012
The problem with Kaggle
Unfortunately, I think it trivializes much of what is truly hard about applying machine learning, because many of its problems come in "pre-baked" form. In other words, someone has already translated the real problem into a particular abstraction: they have chosen a representation and a loss function, and they have held back information about the problem and the final test data.
What this means is that the most interesting and challenging aspects of applied machine learning have been abstracted away -- often poorly. The remaining challenges are not that interesting. All that's really left is tuning the parameters of a variety of learning algorithms.
The Netflix challenge had some of the same issues. Actually, I haven't seen any "competitions" like this that don't abstract away most of the interesting issues. I'm not certain that it can be done, but there's a lot of space to explore between where we are now and perfect.
Update: Kaggle just came out with Kaggle Prospect, which certainly takes some steps to address this concern. I'm looking forward to seeing how it goes.
Monday, June 4, 2012
Evaluating ligand-based algorithms: Noise
With regard to evaluating algorithms, the first approach does not require much further discussion here. The second approach is more interesting.
The correct way to develop noise-robust algorithms is to begin by assessing the typical nature and quantity of noise in our problem of interest. We then develop techniques to deal with those particular kinds of noise. The reason this is the correct approach is that general robustness to arbitrary noise is very hard to achieve. We always want to leverage our knowledge and understanding of the problem at hand to make it easier, and dealing with noise is an issue where this approach can provide significant leverage.
However, if we have developed algorithms to deal with the kinds of idiosyncratic noise that arise in our data, we can't evaluate them on data that doesn't exhibit similar kinds of noise. Suppose, for example, that our fitting procedure has two components: a learning algorithm A and a noise-robustness module B. Now suppose we use a set of data to evaluate the relative performance of A alone against A combined with B (AB). Clearly, if the evaluation data is noise-free, or exhibits different kinds of noise than those for which B was developed, we should not expect AB to perform better. In fact, we can reasonably expect it to perform worse.
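To make this concrete, here's a minimal sketch of that A-versus-AB comparison. Ordinary least squares plays the role of A, and a Huber (outlier-robust) regression stands in for A combined with a noise-robustness module B; the synthetic data and the sparse-outlier noise model are illustrative assumptions of mine, not anything from a real ligand dataset.

```python
# Sketch: compare A (plain least squares) against AB (least squares plus a
# robustness component, here a Huber loss) on noise-free vs. noisy evaluations.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(0)
n_train, n_test, d = 200, 1000, 5
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
w_true = rng.normal(size=d)

def add_outliers(y):
    """Corrupt ~5% of the measurements with large errors (the noise B targets)."""
    y = y.copy()
    idx = rng.choice(len(y), size=len(y) // 20, replace=False)
    y[idx] += rng.normal(scale=5.0, size=len(idx))
    return y

for label, noisy in [("noise-free evaluation", False), ("noisy evaluation", True)]:
    y_train = X_train @ w_true + rng.normal(scale=0.1, size=n_train)
    y_test = X_test @ w_true + rng.normal(scale=0.1, size=n_test)
    if noisy:
        y_train, y_test = add_outliers(y_train), add_outliers(y_test)
    A = LinearRegression().fit(X_train, y_train)    # A: plain least squares
    AB = HuberRegressor().fit(X_train, y_train)     # AB: A + robustness module
    def mae(model):
        return np.mean(np.abs(model.predict(X_test) - y_test))
    print(f"{label}: A MAE = {mae(A):.3f}, AB MAE = {mae(AB):.3f}")
```

On the noisy evaluation AB should come out ahead; on the noise-free one it has nothing to fix, so at best it matches A.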
More generally, it is not reasonable to compare algorithms perfected for noise-free data versus algorithms perfected for noisy data on noise-free data. We must decide a priori which case is most relevant to the problem at hand and evaluate that case.
(Note that when I refer to evaluation data I mean both the training and testing data used for the evaluation.)
Evaluating ligand-based algorithms
So, I'm going to write a series of blog posts aiming to clarify the issues.
Generally speaking, there are three main axes:
1) What is the problem we want to solve? In other words, what are we going to use our predictions for?
2) What are the key features of the data available to us for building and evaluating models?
3) What are the appropriate statistical techniques/tests for assessing the results?
I'm going to address some of these issues in detail in later posts. Here I'd just like to say why each of these is important.
First, understanding how you're going to use your model is key. Any model you build is going to make mistakes; it's going to have errors. The key to problem definition, then, is deciding where to put those errors, or alternatively which errors to penalize and by how much. For example, suppose you're fitting atom-centered partial charges for a force field based on quantum data. You could minimize the mean squared error, the mean absolute error, or the maximum deviation, and you will get vastly different results in each case. Furthermore, the appropriate algorithm, representation, and data for fitting each objective function will be different. So, if what you really care about is the maximum deviation, it would be a mistake to evaluate alternative fitting algorithms by measuring mean squared error.
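To see how much the choice of objective matters, here's a toy sketch, not a real charge-fitting setup: the same three-parameter linear model is fit under the three objectives, on synthetic data with heavy-tailed errors (both of which are assumptions of the illustration).

```python
# Sketch: fit the same linear model under three different loss functions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))            # stand-in design matrix
w_true = np.array([0.4, -0.2, 0.1])     # "true" parameters for the toy problem
y = X @ w_true + 0.05 * rng.standard_t(df=2, size=50)  # heavy-tailed errors

def fit(loss):
    """Minimize the given loss over the residuals with a derivative-free method."""
    return minimize(lambda w: loss(X @ w - y), x0=np.zeros(3),
                    method="Nelder-Mead").x

objectives = {
    "mean squared error":  lambda r: np.mean(r ** 2),
    "mean absolute error": lambda r: np.mean(np.abs(r)),
    "maximum deviation":   lambda r: np.max(np.abs(r)),
}
for name, loss in objectives.items():
    print(f"{name:>20s}: w = {np.round(fit(loss), 3)}")
```

The three fits come out noticeably different: the max-deviation fit chases the worst residuals, the squared-error fit is pulled around by the heavy tails, and the absolute-error fit largely ignores them. That is exactly why the evaluation metric has to match the objective you actually care about.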
Second, the distributional and noise properties of your data and of the underlying problem are critical to the effectiveness of algorithms and models. Therefore the experimental design needs to carefully reflect these properties. For example, if the underlying problem exhibits a particular kind of bias then the key determiner of effectiveness may be the ability to deal with that bias. If the evaluation experiment does not exhibit similar biases it will be incapable of differentiating appropriately between algorithms along this key axis.
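As an illustration of that last point, here's a sketch in which the "particular kind of bias" is a systematic per-assay offset (the offsets, the data, and both models are assumptions of the sketch). Algorithm P ignores assay identity; algorithm Q adds per-assay indicator features so it can absorb the offsets. The gap between them only appears when the evaluation data actually contains the offsets.

```python
# Sketch: an evaluation without the bias cannot distinguish P from Q;
# an evaluation with the bias can.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
d, n_assays, n_per = 4, 5, 60
w_true = rng.normal(size=d)
offsets = rng.normal(scale=2.0, size=n_assays)    # systematic per-assay shifts

def make_data(with_bias):
    X = rng.normal(size=(n_assays * n_per, d))
    assay = np.repeat(np.arange(n_assays), n_per)
    y = X @ w_true + rng.normal(scale=0.1, size=len(assay))
    if with_bias:
        y += offsets[assay]
    return X, np.eye(n_assays)[assay], y          # features, assay one-hots, target

for label, with_bias in [("evaluation without the bias", False),
                         ("evaluation with the bias", True)]:
    X_tr, H_tr, y_tr = make_data(with_bias)
    X_te, H_te, y_te = make_data(with_bias)
    P = LinearRegression().fit(X_tr, y_tr)                     # ignores assays
    Q = LinearRegression().fit(np.hstack([X_tr, H_tr]), y_tr)  # models assay offsets
    mae_P = np.mean(np.abs(P.predict(X_te) - y_te))
    mae_Q = np.mean(np.abs(Q.predict(np.hstack([X_te, H_te])) - y_te))
    print(f"{label}: P MAE = {mae_P:.3f}, Q MAE = {mae_Q:.3f}")
```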
Finally, different test designs will yield performance measures with different distributional properties, so the appropriate statistical tests to determine effectiveness will differ. For example, Spearman's rank correlation computed on a sample is a biased estimate of the population value, while the same is not true of Kendall's tau (under appropriate assumptions). This must be taken into account.
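Here's a small simulation sketch of that Spearman-versus-Kendall claim; the bivariate normal population, the correlation value, and the sample size are illustrative choices of mine. Drawing many small samples and averaging the estimates, the Kendall average sits essentially on the population value while the Spearman average drifts away from it.

```python
# Sketch: small-sample bias of Spearman's rho vs. Kendall's tau under a
# bivariate normal population with known rank correlations.
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(2)
r = 0.7                                        # Pearson correlation of the population
pop_spearman = 6 / np.pi * np.arcsin(r / 2)    # population Spearman rho
pop_kendall = 2 / np.pi * np.arcsin(r)         # population Kendall tau

n, reps = 10, 5000
s_est, k_est = [], []
for _ in range(reps):
    x, y = rng.multivariate_normal([0.0, 0.0], [[1.0, r], [r, 1.0]], size=n).T
    s_est.append(spearmanr(x, y)[0])
    k_est.append(kendalltau(x, y)[0])

print(f"Spearman: population {pop_spearman:.3f}, mean of sample estimates {np.mean(s_est):.3f}")
print(f"Kendall:  population {pop_kendall:.3f}, mean of sample estimates {np.mean(k_est):.3f}")
```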
Designing appropriate experiments and evaluating them correctly is truly hard. I don't believe that our current state of knowledge or tools allows us to solve this problem fully. I do believe, though, that standard approaches make significant errors, and that these errors are sufficient to make many of the conclusions drawn from them incorrect. We will have real trouble advancing the field of computational chemistry until we address these shortcomings.