Monday, June 4, 2012

Evaluating ligand-based algorithms

What is the best way to evaluate or compare ligand-based algorithms? I've found myself spending quite a lot of time on this question recently. It's both interesting and hard. And most of what is done in the computational chemistry literature isn't right.


So, I'm going to write a series of blog posts aiming to clarify the issues.
Generally speaking, there are three main axes:
1) What is the problem we want to solve? In other words, what are we going to use our predictions for?
2) What are the key features of the data available to us for building and evaluating models?
3) What are the appropriate statistical techniques/tests for assessing the results?


I'm going to address some of these issues in detail in later posts. Here I'd just like to say why each of these is important.


First, understanding how you're going to use your model is key. Any model you build is going to make mistakes; it's going to have errors. The heart of problem definition is deciding where to put those errors, or, alternatively, which errors to penalize and by how much. For example, suppose you're fitting atom-centered partial charges for a force field against quantum-mechanical reference data. You could minimize the mean squared error, the mean absolute error, or the maximum deviation, and you will get very different charges in each case. Furthermore, the appropriate algorithm, representation, and data for fitting each objective function will differ. So, if what you really care about is the maximum deviation, it would be a mistake to evaluate alternative fitting algorithms by measuring mean squared error.
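To make this concrete, here is a minimal sketch (not a real charge-fitting protocol) that fits three point charges to a synthetic electrostatic potential under each of the three objectives. The geometry, grid, noise level, and "reference" charges are all invented for illustration; the point is simply that the three fits return different charges and trade off errors differently.

```python
import numpy as np
from scipy.optimize import minimize

# Toy sketch: fit three point charges to a made-up "QM" electrostatic potential
# under three different objectives. Geometry, grid, and reference charges are
# invented for illustration only.
rng = np.random.default_rng(0)
atoms = np.array([[0.0, 0.0, 0.0],
                  [1.2, 0.0, 0.0],
                  [0.0, 1.1, 0.0]])
grid = rng.uniform(-4.0, 4.0, size=(400, 3))
dist = np.linalg.norm(grid[:, None, :] - atoms[None, :, :], axis=2)
keep = dist.min(axis=1) > 1.5                 # drop grid points too close to an atom
A = 1.0 / dist[keep]                          # ESP of unit point charges (atomic units)
v_ref = A @ np.array([-0.4, 0.25, 0.15]) + rng.normal(scale=0.01, size=keep.sum())

objectives = {
    "mean squared error":  lambda q: np.mean((A @ q - v_ref) ** 2),
    "mean absolute error": lambda q: np.mean(np.abs(A @ q - v_ref)),
    "maximum deviation":   lambda q: np.max(np.abs(A @ q - v_ref)),
}
for name, obj in objectives.items():
    q = minimize(obj, x0=np.zeros(3), method="Nelder-Mead").x
    resid = A @ q - v_ref
    print(f"{name:>20}: q = {np.round(q, 3)}, "
          f"RMSE = {np.sqrt(np.mean(resid**2)):.4f}, max = {np.max(np.abs(resid)):.4f}")
```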


Second, the distributional and noise properties of your data, and of the underlying problem, are critical to how effective algorithms and models turn out to be, so the experimental design needs to reflect these properties carefully. For example, if the underlying problem exhibits a particular kind of bias, then the key determinant of effectiveness may be the ability to deal with that bias. If the evaluation experiment does not exhibit similar biases, it will be incapable of differentiating appropriately between algorithms along this key axis.
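One familiar manifestation of this in ligand data is series (analogue) bias: compounds come in tight clusters of close analogues, and whether an evaluation respects that structure changes its conclusions. The toy simulation below is only an illustration of the general point (the cluster sizes, noise levels, and the choice of a nearest-neighbour baseline versus a ridge regression are all invented): a random cross-validation split cannot tell apart a model that merely interpolates within a series from one that generalizes to new series, while a series-held-out split can.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

# Toy data with "series bias": 20 tight clusters of analogues, with activity driven
# mainly by one underlying property of each series. All settings are invented.
rng = np.random.default_rng(1)
n_series, per_series = 20, 15
centers = rng.normal(size=(n_series, 5))
X = np.vstack([c + 0.1 * rng.normal(size=(per_series, 5)) for c in centers])
y = np.repeat(centers[:, 0], per_series) + rng.normal(scale=0.3, size=n_series * per_series)
groups = np.repeat(np.arange(n_series), per_series)

models = {"3-NN": KNeighborsRegressor(n_neighbors=3), "ridge": Ridge(alpha=1.0)}
splits = {"random CV":       (KFold(5, shuffle=True, random_state=0), None),
          "series held out": (GroupKFold(5), groups)}
for split_name, (cv, grp) in splits.items():
    for model_name, model in models.items():
        r2 = cross_val_score(model, X, y, cv=cv, groups=grp).mean()
        print(f"{split_name:>15} | {model_name:>5} | mean R^2 = {r2:.2f}")
```

In this toy setting the random split flatters the neighbour-based model and hides the difference between the two; the series-held-out split exposes it.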


Finally, different test designs will yield performance measures with different distributional properties, and so the appropriate statistical tests to determine effectiveness will differ. For example, the sample Spearman rank correlation is a biased estimator of its population value, whereas the sample Kendall's tau is not (under appropriate assumptions). This must be taken into account.
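This is easy to check by simulation. The sketch below assumes bivariate normal data and uses the standard normal-theory expressions for the population values (Kendall's tau = (2/π)·arcsin(r), Spearman's rho = (6/π)·arcsin(r/2)); the sample size and correlation are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

# Toy check of small-sample bias, assuming bivariate normal data with correlation r.
# Population values under normality: tau = (2/pi)*arcsin(r), rho_S = (6/pi)*arcsin(r/2).
rng = np.random.default_rng(2)
r, n, n_trials = 0.8, 10, 20000
tau_pop = (2 / np.pi) * np.arcsin(r)
rho_pop = (6 / np.pi) * np.arcsin(r / 2)

taus, rhos = [], []
for _ in range(n_trials):
    x, y = rng.multivariate_normal([0.0, 0.0], [[1.0, r], [r, 1.0]], size=n).T
    taus.append(kendalltau(x, y)[0])
    rhos.append(spearmanr(x, y)[0])

print(f"Kendall tau : population {tau_pop:.3f}, mean sample estimate {np.mean(taus):.3f}")
print(f"Spearman rho: population {rho_pop:.3f}, mean sample estimate {np.mean(rhos):.3f}")
```

At small sample sizes the average Spearman estimate falls visibly short of its population value, while the average Kendall estimate matches its population value closely.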


Designing appropriate experiments and evaluating them correctly is truly hard. I don't believe that our current state of knowledge or tools allows us to solve this problem fully. I do believe, though, that standard approaches make significant errors, and that these errors are sufficient to make many of the conclusions drawn incorrect. We will have real trouble advancing the field of computational chemistry until we address these shortcomings.
