Saturday, July 16, 2011

Web scraping with Web-Harvest

I've tried my hand at web scraping only a couple of times. Frankly, I didn't enjoy it very much. What I've done in the past was fairly brute-force parsing of web pages or working directly with the DOM. But recently I've learned a little about XPath, XSLT, and XQuery -- a little.

Anyway, I'm booting up a little side project -- really just a way to explore some technologies. The idea is to explore some of the problems and solutions around building a Jeopardy competitor à la Watson (http://www-03.ibm.com/innovation/us/watson/index.html). Obviously, the place to start is to get the Jeopardy questions. The only place I could find that has a large number of these is the Jeopardy Archive (fan sourcing is great). Unfortunately, I couldn't find a downloadable database. I know I should just ask, but that would be too easy (or they might say no). So I set about web scraping. I started by looking for better tools and came across Web-Harvest (http://web-harvest.sourceforge.net/). It struck me as worth a shot.

Downloading and setting it up was easy. Getting started was also easy. I had my first XPath query running within a few minutes and also tried out XSLT. Then I got ambitious. I set about parsing the tables that contain the questions and trying to save them into a MySQL database.

Parsing the tables and extracting them as columns rather than rows worked pretty well. Unfortunately, the documentation and error messages are a little weak so this was somewhat more cumbersome than I thought it would be.
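For illustration, here is a minimal sketch of what column-wise extraction can look like with XPath in plain Java (not Web-Harvest's own configuration syntax). It assumes the page has already been cleaned into well-formed XML; the table contents and the class name ColumnExtractor are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ColumnExtractor {

    /** Returns the text of the n-th cell (1-based) of every row, i.e. one column. */
    public static List<String> column(String xml, int n) throws Exception {
        // Assumption: the input is well-formed XML, not raw tag-soup HTML.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // td[n] picks the n-th td child of each tr, which walks down a column.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList cells = (NodeList) xpath.evaluate(
                "//table//tr/td[" + n + "]", doc, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < cells.getLength(); i++) {
            values.add(cells.item(i).getTextContent().trim());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical table: first row holds categories, later rows hold clues.
        String xml = "<table>"
                + "<tr><td>HISTORY</td><td>SCIENCE</td></tr>"
                + "<tr><td>clue A</td><td>clue B</td></tr>"
                + "</table>";
        System.out.println(column(xml, 1));
        System.out.println(column(xml, 2));
    }
}
```

Real pages are rarely well-formed, so an HTML cleaning step would have to come before this.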

The next problem was that the answers were embedded inside JavaScript and had to be parsed out using regexes. This started to get a little clunky.
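As a rough sketch of that kind of regex extraction in Java: the snippet below assumes the answer is the second quoted argument of a JavaScript call shaped like show('id', 'answer'). That shape, the function name show, and the class name AnswerExtractor are all invented for the example; the real markup would need its own pattern.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnswerExtractor {

    // Hypothetical call shape: show('some_id', 'answer text').
    // The captured argument may contain backslash escapes such as \'.
    private static final Pattern CALL =
            Pattern.compile("show\\('[^']*',\\s*'((?:[^'\\\\]|\\\\.)*)'\\)");

    /** Returns the unescaped second argument of the first matching call, if any. */
    public static Optional<String> extract(String script) {
        Matcher m = CALL.matcher(script);
        if (!m.find()) {
            return Optional.empty();
        }
        // Drop the backslash in front of each escaped character.
        return Optional.of(m.group(1).replaceAll("\\\\(.)", "$1"));
    }

    public static void main(String[] args) {
        String script = "show('clue_1', 'What is O\\'Hare?')";
        System.out.println(extract(script).orElse("(no match)"));
    }
}
```

Returning an Optional makes a clue with no matching call an explicit case to handle rather than a null.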

Finally, I had to format them for insertion into MySQL. This is where things got hairy. It became really clunky to format the strings in order to escape things correctly for MySQL. In the end I got it to work -- but it was a chore.
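In Java, the usual way around hand-escaping is a JDBC PreparedStatement, which binds values as parameters instead of splicing them into the SQL text. A minimal sketch follows; the table name clues and its columns are hypothetical, and the caller is assumed to supply an open Connection.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ClueWriter {

    // Hypothetical table and column names; adjust to the real schema.
    public static final String INSERT_SQL =
            "INSERT INTO clues (category, question, answer) VALUES (?, ?, ?)";

    /** Inserts one clue; the driver handles quoting of the bound values. */
    public static void insert(Connection conn, String category,
                              String question, String answer) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, category);
            ps.setString(2, question);
            ps.setString(3, answer);
            ps.executeUpdate();
        }
    }
}
```

The Connection would typically come from DriverManager.getConnection with a MySQL JDBC URL, which requires the MySQL JDBC driver on the classpath.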

So, I was done. I set about loading my DB. After about 4500 questions it just stopped. I still haven't figured out why. So I'm switching back to Java.

I do love XPath though and will definitely be using that in my Java implementation.

Web-Harvest was great for the simpler version and for prototyping, but I think once things get complicated it's probably not the right tool.


