I mentioned previously that I'm working on a little side project to see what's involved in replicating Watson.
(Generally, I find these kinds of side projects a good way to get exposed to new technologies and to really learn things. Also, it's a lot of fun.)
Anyway, I'm looking around to see what kinds of resources are available to be brought to bear. I started by using webharvest to scrape the Jeopardy Archive. Now I'm looking at semantic web resources. I wasn't really aware of how far this stuff has been pushed. I should note that I'm fairly skeptical of the more ambitious goals of the semantic web. But there are some useful tools available. For example, DBpedia has extracted a lot of information from Wikipedia into RDF format, that is, semi-structured data. There are some free triplestores that can query this data quickly, e.g., OWLIM. Some of these even do minimal inference. And they implement a standard query language.
DBpedia has a pretty good database of people. It includes things like place/date of birth, parents, books written, positions held, etc. So my short-term goal now is to see if I can classify the answers to Jeopardy questions as person or not-person. Then I'll see if I can classify the questions into ones that have a person as an answer and ones that don't. Then I'll try to extract enough structure from the questions to see if I can generate reasonable queries against DBpedia to answer them. An alternative at that point would be to do a web search for key components of the question and look for co-occurrences of people's names. I may need to try to narrow the answer categories even more, e.g., male/female, politician/author/actor, etc.
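As a rough sketch of what the person/not-person check might look like, the snippet below builds a SPARQL ASK query that asks whether a resource is typed as a person in the DBpedia ontology. The function names are made up for illustration, and the mapping from an answer string to a resource IRI (spaces to underscores) is a naive assumption; real answers would need disambiguation and redirect handling. It only builds the query text and does not contact any endpoint.

```python
from urllib.parse import quote

# Well-known DBpedia / RDF namespaces.
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"
DBPEDIA_PERSON = "http://dbpedia.org/ontology/Person"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def resource_iri(answer: str) -> str:
    """Guess a DBpedia resource IRI from an answer string.

    Assumption: the answer matches a Wikipedia article title once
    whitespace is collapsed and replaced by underscores.
    """
    title = "_".join(answer.strip().split())
    return DBPEDIA_RESOURCE + quote(title, safe="(),")


def person_ask_query(answer: str) -> str:
    """Build a SPARQL ASK query: is this resource typed as a Person?"""
    return "ASK { <%s> <%s> <%s> }" % (
        resource_iri(answer),
        RDF_TYPE,
        DBPEDIA_PERSON,
    )


if __name__ == "__main__":
    print(person_ask_query("Abraham Lincoln"))
```

The resulting string could be sent to whatever triplestore holds the DBpedia dump. A triplestore with inference turned on should also match subclasses of Person, though how that behaves depends on the store's configuration.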
Right now, it looks like no "rocket science" is required here. But it does look like a lot of cases would have to be handled -- it's a little too brute force. Generally though, I think it will be instructive to do at least a couple of cases by this kind of brute force.
Anyway, should be fun.
Thursday, August 25, 2011
My first look at Scala
I took my first look at Scala today and I like it. The mixins idea is great. I've missed operator overloading. I love functional languages. The type inference seems cool. The fact that everything is an object is great. Lower and upper type bounds make total sense. Partial functions are very interesting.
I don't like that they kept exceptions and that side-effects are allowed. Probably pragmatic choices, but I still hope for an efficient pure functional language.
Overall, I'm pretty interested in trying it out. I might even propose that the next new non-GUI project we do at work use Scala.
Saturday, August 20, 2011
Don't tweak the DB
Generally, I hate "tweaking the DB" as a performance strategy. I realized that not everyone shares this hatred and I've been trying to verbalize my rationale.
I think I hate it because it often breaks the DB abstraction. Unfortunately, the most implicit parts of any interface are the performance considerations. In fact, performance is not even really part of the interface. It remains unspecified, undocumented and unreliable.
All of this is especially the case with a database. Nowadays we write DB code to be portable -- we insert layers and develop narrow APIs so that we can switch the underlying database at some future time to address some future consideration.
We run databases on a range of hardware, and we carefully tune the hardware to achieve the desired performance.
Our applications do not function without our DB tweaks.
What we've done is implicitly rely on the implicit performance API of a large piece of software. The performance of that software depends greatly on the underlying hardware, on what else is running on the machine, and on what other DB users are doing. None of this is well documented.
So, given two solutions to a performance issue that take "equivalent" effort (e.g., software development effort versus DB tweaking), where one is implemented across the system boundary between your application and the DB and the other resides solely in your application, the latter should win EVERY time, IMO.
What is a CTO?
As I've noted in some other posts, I'm the CTO at a Bay Area startup. My company is somewhat noteworthy for the technical and scientific depth of our problem domain. Specifically, we apply modern machine learning and cutting-edge cloud computing to predict the properties of drugs. OK, so I'm getting to the point ...
At a startup you wear many hats and as the company grows you need to start to pass some of those hats on to others. I've been thinking about how to do this and how to separate and define the role of CTO versus V.P. of engineering.
For me the key difference is vision versus execution. The role of the CTO is to develop and effectively communicate, evangelize, and sell the company's technical vision. The role of a V.P. of engineering is to efficiently implement that vision.