Thursday, June 7, 2012

The problem with Kaggle

I love Kaggle. I think it's great in terms of its democratization and evangelism of machine learning. I think it's great in that it demonstrates the breadth of applicability of machine learning.

Unfortunately, I think it trivializes much of what is truly hard about applying machine learning, because many of its problems come in "pre-baked" form. In other words, someone has already translated the real problem into a particular abstraction by choosing a representation and a loss function, and by holding back information about the underlying problem and the final test data.

What this means is that the most interesting and challenging aspects of applied machine learning have been abstracted away -- often poorly. The remaining challenges are not that interesting. All that's really left is tuning the parameters of a variety of learning algorithms.
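To make that concrete, here's a minimal sketch of what the remaining work tends to look like once the representation and the loss are fixed: a hyperparameter search against the given metric. This is my own illustration, not any particular competition's setup; it uses scikit-learn's GridSearchCV, and the data, features, and parameter grid are made up.

```python
# A minimal sketch of the "pre-baked" competition workflow: the organizers
# have already fixed the representation (X, y) and the loss/metric, so
# what's left is largely tuning hyperparameters. All values here are
# synthetic and purely illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # the pre-chosen representation
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # the pre-chosen target

# Tune parameters against the pre-chosen loss (here, log loss).
search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"n_estimators": [100, 300], "max_depth": [2, 4]},
    scoring="neg_log_loss",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Note that nothing in this loop touches the questions that dominate real applications: whether X is the right representation, whether log loss is the right objective, or what the predictions will actually be used for.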

The Netflix Prize had some of the same issues. Actually, I haven't seen any "competitions" like this that don't abstract away most of the interesting issues. I'm not certain it can be done, but there's a lot of space to explore between where we are now and perfect.

Update: Kaggle just came out with Kaggle Prospect, which certainly takes some steps to address this concern. I'm looking forward to seeing how it goes.
