Thursday, August 25, 2011

Semantic Web

I mentioned previously that I'm working on a little side project to see what's involved in replicating Watson.

(Generally, I find these kinds of side projects a good way to get exposed to new technologies and to really learn things. Also, it's a lot of fun.)

Anyway, I'm looking around to see what kinds of resources are available to be brought to bear. I started by using webharvest to scrape the Jeopardy Archive. Now I'm looking at semantic web resources. I wasn't really aware of how far this stuff has been pushed. I should note that I'm fairly skeptical of the more ambitious goals of the semantic web. But there are some useful tools available. For example, DBpedia has extracted a lot of information from Wikipedia into RDF format, that is, semi-structured data. There are some free triplestores that can query this data quickly, e.g., OWLIM. Some of these even do minimal inference. And they implement a standard query language.

DBpedia has a pretty good database of people. It includes things like place/date of birth, parents, books written, positions held etc. So my short term goal now is to see if I can classify the answers of Jeopardy questions into person, not person. Then I'll see if I can classify the questions into ones that have a person as an answer or not. Then I'll try to extract enough structure from the questions to see if I can generate reasonable queries against DBpedia to answer them. An alternative at that point would be to do a web search for key components of the question and look for co-occurrences of people's names. I may need to try to narrow the answer categories even more, e.g., male/female, politician/author/actor, etc.
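As a rough illustration of the first step (is this answer a person?), here is a minimal sketch that builds a SPARQL ASK query. It assumes the answer text matches a DBpedia rdfs:label exactly, which will often not hold in practice, and it uses the dbo:Person class from the DBpedia ontology. The function names are made up for this example, and the query is only built here, not sent to any endpoint.

```python
# Sketch only: builds a SPARQL ASK query string; it does not contact DBpedia.
DBPEDIA_PREFIXES = (
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
)


def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def person_ask_query(name, lang="en"):
    """Return an ASK query: is there a dbo:Person whose label is exactly `name`?

    Exact label matching is a simplifying assumption for this sketch.
    """
    label = escape_literal(name)
    return (
        DBPEDIA_PREFIXES
        + "ASK {\n"
        + "  ?entity a dbo:Person ;\n"
        + f'          rdfs:label "{label}"@{lang} .\n'
        + "}\n"
    )


if __name__ == "__main__":
    print(person_ask_query("Nicolaus Copernicus"))
```

The resulting string could be submitted to any SPARQL endpoint or local triplestore loaded with the DBpedia data. An ASK query returns only true or false, which is all the person / not person split needs.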

Right now, it looks like no "rocket science" is required here. But it does look like a lot of cases would have to be handled -- it's a little too brute force. Generally though, I think it will be instructive to do at least a couple of cases by this kind of brute force.

Anyway, should be fun.

My first look at Scala

I took my first look at Scala today and I like it. The mixins idea is great. I've missed operator overloading. I love functional languages. The type inference seems cool. That everything is an object is great. Lower and upper type bounds make total sense. Partial functions are very interesting.

I don't like that they kept exceptions and that side-effects are allowed. Probably pragmatic choices, but I still hope for an efficient pure functional language.

Overall, I'm pretty interested in trying it out. I might even propose that the next new non-GUI project we do at work use Scala.


Saturday, August 20, 2011

Don't tweak the DB

Generally, I hate "tweaking the DB" as a performance strategy. I realized that not everyone shares this hatred and I've been trying to verbalize my rationale.

I think I hate it because it often breaks the DB abstraction. Unfortunately, the most implicit parts of any interface are the performance considerations. In fact, performance is not even really part of the interface. It remains unspecified, undocumented, and unreliable.

All of this is especially the case with a database. Nowadays we write DB code to be portable -- we insert layers and develop narrow APIs so that we can switch the underlying database at some future time to address some future consideration.

We run databases on a range of hardware, and we carefully tune the hardware to achieve the desired performance.

Our applications do not function without our DB tweaks.

What we've done is implicitly rely on the implicit performance API of a large piece of software. The performance of that software depends greatly on the underlying hardware, on what else is running on the machine, and on what other DB users are doing. None of this is well documented.

So, given two solutions to a performance issue that take "equivalent" effort, e.g., software development effort versus DB tweaking, where one is implemented across the system boundary between your application and the DB and the other resides solely in your application -- the latter should win EVERY time, IMO.
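To make "resides solely in your application" concrete, here is a minimal sketch of one such fix: a small read-through cache that sits in front of a slow query. Caching is just one example of an application-side solution and is my assumption, not something the argument depends on. The class name, the loader callback, and the TTL value are all hypothetical.

```python
import time


class ReadThroughCache:
    """Application-side cache in front of a slow lookup (e.g. a DB query).

    Nothing here depends on how the database is tuned: the loader is any
    callable that takes a key and returns a value.
    """

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (time stored, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the database entirely
        value = self._loader(key)  # miss or expired: go to the source
        self._entries[key] = (now, value)
        return value


if __name__ == "__main__":
    def slow_query(key):
        # Hypothetical stand-in for a real database call.
        time.sleep(0.1)
        return key.upper()

    cache = ReadThroughCache(slow_query, ttl_seconds=5.0)
    cache.get("aspirin")  # goes to the "database"
    cache.get("aspirin")  # served from memory
```

The point of the sketch is that the performance behavior is now specified in code you own, and it travels with the application when the underlying database or hardware changes.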

What is a CTO?

As I've noted in some other posts, I'm the CTO at a Bay Area startup. My company is somewhat noteworthy for the technical and scientific depth of our problem domain. Specifically, we apply modern machine learning and cutting-edge cloud computing to predict the properties of drugs. OK, so I'm getting to the point ...
At a startup you wear many hats and as the company grows you need to start to pass some of those hats on to others. I've been thinking about how to do this and how to separate and define the role of CTO versus V.P. of engineering.
For me the key difference is vision versus execution. The role of the CTO is to develop and effectively communicate, evangelize, and sell the company's technical vision. The role of a V.P. of engineering is to efficiently implement that vision.

Saturday, July 16, 2011

Web scraping with webharvest

I've tried my hand at web scraping only a couple of times. Frankly, I didn't enjoy it very much. What I've done in the past was fairly brute-force parsing of web pages or working directly with the DOM. But recently I've learned a little about XPath, XSLT, and XQuery -- a little.

Anyway, I'm booting up a little side project -- really just a way to explore some technologies. The idea is to explore some of the problems and solutions around building a Jeopardy competitor a la Watson (http://www-03.ibm.com/innovation/us/watson/index.html). Obviously, the place to start is to get the Jeopardy questions. The only place that I could find that has a large number of these is the Jeopardy Archive (fan sourcing is great). Unfortunately, I couldn't find a downloadable database. I know I should just ask, but that would be too easy (or they might say no). So I set about web scraping. I started by looking for better tools and came across webharvest (http://web-harvest.sourceforge.net/). It struck me as worth a shot.

Downloading and setting it up was easy. Getting started was also easy. I had my first XPath query running within a few minutes and also tried out XSLT. Then I got ambitious. I set about parsing the tables that contain questions and trying to save them into a MySQL database.

Parsing the tables and extracting them as columns rather than rows worked pretty well. Unfortunately, the documentation and error messages are a little weak so this was somewhat more cumbersome than I thought it would be.
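For illustration, pulling a table out column by column with XPath might look like the sketch below. It uses Python's standard library rather than webharvest, the sample table is made up, and it assumes the page has already been cleaned into well-formed XML (real-world HTML usually needs that step first).

```python
import xml.etree.ElementTree as ET

# Made-up sample: the first row holds categories, later rows hold clues.
SAMPLE = """
<table>
  <tr><td>HISTORY</td><td>SCIENCE</td></tr>
  <tr><td>clue h1</td><td>clue s1</td></tr>
  <tr><td>clue h2</td><td>clue s2</td></tr>
</table>
"""


def column(root, index):
    """Return the text of the index-th cell (1-based) of every row.

    The path .//tr/td[n] selects the n-th <td> child of each <tr>, so one
    XPath expression yields a whole column instead of a row.
    """
    return [(td.text or "").strip() for td in root.findall(f".//tr/td[{index}]")]


if __name__ == "__main__":
    table = ET.fromstring(SAMPLE.strip())
    for i in (1, 2):
        category, *clues = column(table, i)
        print(category, clues)
```

ElementTree supports only a subset of XPath, but positional predicates like td[2] are part of that subset, which is enough for this kind of column extraction.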

The next problem was that the answers were embedded inside JavaScript and had to be parsed out using regexes. This started to get a little clunky.
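A sketch of that kind of regex extraction is below. The markup in SAMPLE is invented for the example (an answer tucked into an HTML-escaped string inside an event handler), and the class name correct_response is an assumption, so the pattern would need adjusting to whatever the real pages contain.

```python
import html
import re

# Invented sample: the answer sits in an HTML-escaped string argument
# passed to a JavaScript call in an attribute.
SAMPLE = (
    '<div onmouseover="toggle(\'clue_1\', \'clue_1_stuck\', '
    "'&lt;em class=&quot;correct_response&quot;&gt;Copernicus&lt;/em&gt;')\">"
)

# Non-greedy match so each answer stops at its own closing tag.
ANSWER_PATTERN = re.compile(r'class="correct_response">(.*?)</em>', re.DOTALL)


def extract_answers(page):
    """Unescape HTML entities first, then pull out each embedded answer."""
    decoded = html.unescape(page)
    return [match.strip() for match in ANSWER_PATTERN.findall(decoded)]


if __name__ == "__main__":
    print(extract_answers(SAMPLE))
```

Unescaping before matching keeps the pattern readable, since the regex can then be written against plain tags rather than against entity-encoded text.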

Finally, I had to format them for insertion into MySQL. This is where things got hairy. It became really clunky to format the strings in order to escape things correctly for MySQL. In the end I got it to work -- but it was a chore.
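One way around hand-escaping is to let the database driver bind the values through a parameterized insert. The sketch below uses Python's built-in sqlite3 module purely as a stand-in so that it runs without a server. The table layout is an assumption, and with a MySQL driver the placeholder would typically be %s instead of ?.

```python
import sqlite3

# Made-up clue containing quotes, an apostrophe, a semicolon and a percent sign.
SAMPLE_CLUES = [
    ("QUOTES", 'He shouted "It\'s alive!"', "Dr. Frankenstein; 100% fictional"),
]


def store_clues(conn, clues):
    """Insert (category, question, answer) rows without manual escaping.

    The driver binds each value separately from the SQL text, so special
    characters in the strings never need to be escaped by hand.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clues (category TEXT, question TEXT, answer TEXT)"
    )
    conn.executemany(
        "INSERT INTO clues (category, question, answer) VALUES (?, ?, ?)",
        clues,
    )
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    store_clues(connection, SAMPLE_CLUES)
    print(connection.execute("SELECT * FROM clues").fetchall())
```

Because the SQL text never contains the data itself, the same insert statement works for every clue regardless of what punctuation it contains.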

So, I was done. I set about loading my DB. After about 4500 questions it just stopped. I still haven't figured out why. So I'm switching back to Java.

I do love XPath though and will definitely be using that in my Java implementation.

Webharvest was great for the simpler version and for prototyping, but I think once things get complicated it's probably not the right tool.



I love Santa Cruz

Santa Cruz and I go back a long way. I came here first in 1992, spent eight years here from 1996 to 2004 -- loved, lost, loved again. Learned -- a lot.

I don't get down here as often as I would like, but got to spend a few days here this weekend. I don't know what it is about this town but I just love it. The people are cool, the weather is hot, the feeling is warm. Right now I'm sitting in the back of Lulu Carpenter's coffee shop. This place has been here a while (used to be Espresso Royale) and I used to come here to study and take in the world. They have the best espresso -- and right now a blues band.

Earlier, I hiked up the back of campus. When I was here for grad school I used to run up there all the time. The campus that most people know is only the bottom third; the rest is essentially a natural forest. It is so beautiful. Full of redwoods, beautiful meadows and trails that go on forever. It's also empty. On the lower trails you meet some mountain bikers, but as you get up the back and head down towards Highway 9 it's empty. It could be any time in history. God I love it back there.

I've got to head home tomorrow, and I'm trying to decide what to do in the morning. I think I'll either go to ride the old wooden roller coaster or I'll just go for a walk on West Cliff. Actually, I think I'll do both!

Back here again with my girls in September. Awesome.

Location: Santa Cruz, United States

Maintenance Debt

I manage a group of engineers at a startup and am gearing up to start a "sustainability" push. That is, I'm looking at ways to get our codebase on a more sustainable footing -- perhaps more on that later. It occurred to me that the notion of a "maintenance debt" is worth some thought. What I mean by "maintenance debt" is that maintenance, documentation, refactoring, and code quality can be deferred. As you defer it you accumulate a debt of maintenance that may need to be paid back. Obviously, this happens in other circumstances too -- my experience as president of an HOA being a painful illustration. But I think startups may be a special case.
When you start a software company you typically are under-resourced, must ship a product quickly and have little certainty about either the long-term fate of the company or its initial product. In this case, does it make sense to invest your limited resources in order to keep your maintenance debt low? Likely not.

Maintenance debt is like "real" debt in several ways. First, "interest" must be paid; that is, the longer you defer paying down the debt, the larger the debt to be paid down. Second, it provides leverage on your investment dollars in exactly the same way that "real" debt does.
But maintenance debt is unlike "real" debt in other important ways. First, it doesn't appear on your balance sheet. This can be both good and bad. It's good from a financials perspective, but bad in the sense that your Board may not appreciate the value in paying it off. Second, you may get away without paying it off. That is, should your project or endeavor not succeed, you don't have to pay off the debt (much like a bankruptcy). Third, there can be non-linear effects on maintenance debt. For example, the departure of a key team member may greatly increase the debt burden.


The questions for me now are: Is it time to start paying off the debt? What is the best way to do this? How much of it can be side-stepped?
I've been thinking about ways to avoid paying some of this debt and may write more on this later.

The other important issue is how do we think about managing maintenance debt more explicitly in the future.

Update:
I was just pointed to this related post.
http://www.codinghorror.com/blog/2009/02/paying-down-your-technical-debt.html