Saturday, July 16, 2011

Web scraping with webharvest

I've tried my hand at web scraping only a couple of times. Frankly, I didn't enjoy it very much. What I've done in the past was fairly brute-force parsing of web pages or working directly with the DOM. But recently I've learned a little about XPath, XSLT, and XQuery -- a little.

Anyway, I'm booting up a little side project -- really just a way to explore some technologies. The idea is to explore some of the problems and solutions around building a Jeopardy competitor à la Watson (http://www-03.ibm.com/innovation/us/watson/index.html). Obviously, the place to start is to get the Jeopardy questions. The only place I could find that has a large number of these is the Jeopardy Archive (fan sourcing is great). Unfortunately, I couldn't find a downloadable database. I know I should just ask, but that would be too easy (or they might say no). So I set about web scraping. I started by looking for better tools and came across webharvest (http://web-harvest.sourceforge.net/). It struck me as worth a shot.

Downloading and setting it up was easy. Getting started was also easy. I had my first XPath query running within a few minutes and also tried out XSLT. Then I got ambitious. I set about parsing the tables that contain questions and trying to save them into a MySQL database.

Parsing the tables and extracting them as columns rather than rows worked pretty well. Unfortunately, the documentation and error messages are a little weak, so this was somewhat more cumbersome than I thought it would be.

The next problem was that the answers were embedded inside JavaScript and had to be parsed out using regexes. This started to get a little clunky.
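To give a flavor of the regex step, here is a minimal sketch in Java. The markup is an assumption for illustration only: I'm pretending the answer is the second string argument of a hypothetical showAnswer(...) call in an onmouseover attribute. The real pages may look quite different, so the pattern would need adjusting.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnswerExtractor {
    // Hypothetical format: showAnswer('clue_id', 'answer text').
    // The second group accepts backslash-escaped characters such as \' inside the string.
    private static final Pattern ANSWER = Pattern.compile(
            "showAnswer\\('([^']*)',\\s*'((?:[^'\\\\]|\\\\.)*)'\\)");

    /** Returns the answer text from the first showAnswer(...) call, if there is one. */
    static Optional<String> extractAnswer(String html) {
        Matcher m = ANSWER.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Drop the JavaScript backslash escapes, e.g. \' becomes '.
        return Optional.of(m.group(2).replaceAll("\\\\(.)", "$1"));
    }

    public static void main(String[] args) {
        String cell = "<td onmouseover=\"showAnswer('clue_1', 'What is O\\'Hare?')\">";
        System.out.println(extractAnswer(cell).orElse("(no answer found)"));
    }
}
```

Returning an Optional keeps the "no match" case explicit, which matters when a page layout changes and the pattern silently stops matching.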

Finally, I had to format them for insertion into MySQL. This is where things got hairy. It became really clunky to format the strings in order to escape things correctly for MySQL. In the end I got it to work -- but it was a chore.
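In Java, one way to sidestep the hand-escaping is a JDBC PreparedStatement, which passes values as parameters rather than splicing them into the SQL string. This is only a sketch: the table and column names are made up, and it assumes a Connection to the database has already been opened elsewhere.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ClueStore {
    // Hypothetical table and column names, for illustration only.
    static final String INSERT_SQL =
            "INSERT INTO clues (category, question, answer) VALUES (?, ?, ?)";

    /** Inserts one clue; the driver handles quoting of the bound values. */
    static void insertClue(Connection conn, String category, String question, String answer)
            throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(INSERT_SQL)) {
            stmt.setString(1, category);
            stmt.setString(2, question);
            stmt.setString(3, answer);
            stmt.executeUpdate();
        }
    }
}
```

With this shape, quotes and backslashes in a question or answer never need to be escaped by hand.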

So, I was done. I set about loading my DB. After about 4500 questions it just stopped. I still haven't figured out why. So I'm switching back to Java.

I do love XPath, though, and will definitely be using that in my Java implementation.
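As a rough idea of what that might look like, here is a sketch that uses the JDK's built-in javax.xml.xpath package to pull one column out of a table. The sample table is invented, and it assumes the input is already well-formed XML; real web pages usually are not, so some HTML-cleaning step would have to come first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ColumnExtractor {
    /** Returns the text of the index-th cell (1-based) of every table row. */
    static List<String> column(String xml, int index) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // td[n] selects the n-th td child of each tr, i.e. one column.
            NodeList cells = (NodeList) xpath.evaluate(
                    "//table/tr/td[" + index + "]", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                values.add(cells.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a cleaned-up page.
        String xml = "<table>"
                + "<tr><td>CATEGORY A</td><td>first clue</td></tr>"
                + "<tr><td>CATEGORY B</td><td>second clue</td></tr>"
                + "</table>";
        System.out.println(column(xml, 1));
        System.out.println(column(xml, 2));
    }
}
```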

Webharvest was great for the simpler version and for prototyping, but I think once things get complicated it's probably not the right tool.



I love Santa Cruz

Santa Cruz and I go back a long way. I came here first in 1992, spent eight years here from 1996 to 2004 -- loved, lost, loved again. Learned -- a lot.

I don't get down here as often as I would like, but got to spend a few days here this weekend. I don't know what it is about this town but I just love it. The people are cool, the weather is hot, the feeling is warm. Right now I'm sitting in the back of Lulu Carpenter's coffee shop. This place has been here a while (used to be Espresso Royale) and I used to come here to study and take in the world. They have the best espresso -- and right now a blues band.

Earlier, I hiked up the back of campus. When I was here for grad school I used to run up there all the time. The campus that most people know is only the bottom third; the rest is essentially a natural forest. It is so beautiful. Full of redwoods, beautiful meadows and trails that go on forever. It's also empty. On the lower trails you meet some mountain bikers, but as you get up the back and head down towards Highway 9 it's empty. It could be any time in history. God, I love it back there.

I've got to head home tomorrow, and I'm trying to decide what to do in the morning. I think I'll either go ride the old wooden roller coaster or I'll just go for a walk on West Cliff. Actually, I think I'll do both!

Back here again with my girls in September. Awesome.

Location: Santa Cruz, United States

Maintenance Debt

I manage a group of engineers at a startup and am gearing up to start a "sustainability" push. That is, I'm looking at ways to get our codebase on a more sustainable footing -- perhaps more on that later. It occurred to me that the notion of a "maintenance debt" is worth some thought. What I mean by "maintenance debt" is that maintenance, documentation, refactoring, and code quality can be deferred. As you defer them you accumulate a debt of maintenance that may need to be paid back. Obviously, this happens in other circumstances too -- my experience as president of an HOA being a painful illustration. But I think startups may be a special case.
When you start a software company you typically are under-resourced, must ship a product quickly and have little certainty about either the long-term fate of the company or its initial product. In this case, does it make sense to invest your limited resources in order to keep your maintenance debt low? Likely not.

Maintenance debt is like "real" debt in several ways. First, "interest" must be paid: the longer you defer paying down the debt, the larger it grows. Second, it provides leverage on your investment dollars in exactly the same way that "real" debt does.
But maintenance debt is unlike "real" debt in other important ways. First, it doesn't appear on your balance sheet. This can be both good and bad. It's good from a financials perspective, but bad in the sense that your Board may not appreciate the value in paying it off. Second, you may get away without paying it off. That is, should your project or endeavor not succeed, you don't have to pay off the debt (much like a bankruptcy). Third, there can be non-linear effects on maintenance debt. For example, the departure of a key team member may greatly increase the debt burden.


The questions for me now are: Is it time to start paying off the debt? What is the best way to do this? How much of it can be sidestepped?
I've been thinking about ways to avoid paying some of this debt and may write more on this later.

The other important issue is how we manage maintenance debt more explicitly in the future.

Update:
I was just pointed to this related post.
http://www.codinghorror.com/blog/2009/02/paying-down-your-technical-debt.html