Thursday, August 25, 2011

Semantic Web

I mentioned previously that I'm working on a little side project to see what's involved in replicating Watson.

(Generally, I find these kinds of side projects a good way to get exposed to new technologies and to really learn things. Also, it's a lot of fun.)

Anyway, I'm looking around to see what kinds of resources are available to be brought to bear. I started by using Web-Harvest to scrape the Jeopardy Archive. Now I'm looking at semantic web resources. I wasn't really aware how far this stuff has been pushed. I should note that I'm fairly skeptical of the more ambitious goals of the semantic web. But there are some useful tools available. For example, DBpedia has extracted a lot of information from Wikipedia into RDF format, that is, semi-structured data. There are some free triplestores that can query this data quickly, e.g., OWLIM. Some of these even do minimal inference. And they implement a standard query language, SPARQL.
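To give a feel for what querying this data looks like, here is a minimal sketch that only builds the text of a SPARQL query and does not talk to any triplestore. The helper names (resource_uri, person_facts_query) are made up for illustration, and I'm assuming the DBpedia ontology terms dbo:Person, dbo:birthDate and dbo:birthPlace and the usual convention that resource names replace spaces with underscores; check these against the dataset you actually load.

```python
from urllib.parse import quote

# Assumed DBpedia namespaces; verify against the dump or endpoint in use.
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"
PREFIXES = "PREFIX dbo: <http://dbpedia.org/ontology/>\n"


def resource_uri(label):
    """Guess a DBpedia resource URI from a plain name (hypothetical helper)."""
    name = label.strip().replace(" ", "_")
    return DBPEDIA_RESOURCE + quote(name, safe="_(),'")


def person_facts_query(label):
    """Build a SPARQL query for the birth date and place of a person."""
    uri = resource_uri(label)
    return (
        PREFIXES
        + "SELECT ?birthDate ?birthPlace WHERE {\n"
        + f"  <{uri}> a dbo:Person .\n"
        + f"  OPTIONAL {{ <{uri}> dbo:birthDate ?birthDate }}\n"
        + f"  OPTIONAL {{ <{uri}> dbo:birthPlace ?birthPlace }}\n"
        + "}"
    )


if __name__ == "__main__":
    print(person_facts_query("Mark Twain"))
```

The resulting string could then be pasted into whatever SPARQL interface the triplestore provides. Because the first pattern requires the resource to be typed as a person, a query that returns no rows is a weak hint that the name is not a person, or that my guess at the URI was wrong.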

DBpedia has a pretty good database of people. It includes things like place and date of birth, parents, books written, positions held, etc. So my short-term goal now is to see if I can classify the answers to Jeopardy questions as person or not-person. Then I'll see if I can classify the questions by whether or not they have a person as an answer. Then I'll try to extract enough structure from the questions to see if I can generate reasonable queries against DBpedia to answer them. An alternative at that point would be to do a web search for key components of the question and look for co-occurrences of people's names. I may need to try to narrow the answer categories even more, e.g., male/female, politician/author/actor, etc.
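The co-occurrence alternative can be sketched in a few lines. This is only a toy version under my own assumptions: the snippets are plain strings that some search step has already returned, the list of known people comes from somewhere like DBpedia, and matching is naive case-insensitive substring search. The function name rank_people is made up for illustration.

```python
from collections import Counter


def rank_people(snippets, known_people):
    """Rank candidate names by how many snippets mention them.

    snippets: text fragments returned by a search for the question's key terms.
    known_people: candidate person names (assumed to come from a people list).
    Returns (name, count) pairs, most frequently mentioned first.
    """
    counts = Counter()
    for snippet in snippets:
        text = snippet.lower()
        for name in known_people:
            # Count each name at most once per snippet.
            if name.lower() in text:
                counts[name] += 1
    return counts.most_common()


if __name__ == "__main__":
    snippets = [
        "Mark Twain wrote Adventures of Huckleberry Finn",
        "Huckleberry Finn, a novel by Mark Twain",
        "Charles Dickens wrote Oliver Twist",
    ]
    people = ["Mark Twain", "Charles Dickens", "Jane Austen"]
    print(rank_people(snippets, people))
```

A real version would need to handle name variants (surname only, pen names) and scale past a linear scan of the people list, but this is enough to try on a handful of questions.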

Right now, it looks like no "rocket science" is required here. But it does look like a lot of cases would have to be handled -- it's a little too brute-force. Generally though, I think it will be instructive to do at least a couple of cases by this kind of brute force.

Anyway, should be fun.
