I took my first look at Scala today and I like it. The mixins idea is great. I've missed operator overloading. I love functional languages. The type inference seems cool. Everything is an object is great. Lower and upper type bounds make total sense. Partial functions are very interesting.
I don't like that they kept exceptions and that side-effects are allowed. Probably pragmatic choices, but I still hope for an efficient pure functional language.
Overall, I'm pretty interested in trying it out. I might even propose that the next new non-GUI project we do at work use Scala.
Thursday, August 25, 2011
Saturday, August 20, 2011
Don't tweak the DB
Generally, I hate "tweaking the DB" as a performance strategy. I realized that not everyone shares this hatred and I've been trying to verbalize my rationale.
I think the reason I hate it is because it often breaks the DB abstraction. Unfortunately, the most implicit parts of any interface are the performance considerations. In fact, performance is not even really part of the interface. It remains unspecified, undocumented and unreliable.
All of this is especially the case with a database. Nowadays we write DB code to be portable -- we insert layers and develop narrow APIs so that we can switch the underlying database at some future time to address some future consideration.
We run databases on a range of hardware, and we carefully tune the hardware to achieve the desired performance.
Our applications do not function without our DB tweaks.
What we've done is implicitly rely on the implicit performance API of a large piece of software. The performance of that software depends greatly on the underlying hardware, on what else is running on the machine, and on what other DB users are doing. None of this is well documented.
So, given two solutions to a performance issue that take "equivalent" effort (say, software development versus DB tweaking), where one is implemented across the system boundary between your application and the DB and the other resides solely in your application -- the latter should win EVERY time, IMO.
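To make the application-only option concrete, here is a minimal Java sketch of one such fix: a small in-memory cache wrapped around a narrow data-access interface. `RecordStore`, `CachingRecordStore`, and `findById` are hypothetical names invented for this illustration, not part of any real system. The point is only that nothing in it depends on how the underlying DB happens to be tuned.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical narrow API that hides which database sits underneath.
interface RecordStore {
    String findById(String id);
}

// Application-side fix: a least-recently-used cache in front of any RecordStore.
// Not thread-safe; a real service would need synchronization.
class CachingRecordStore implements RecordStore {
    private final RecordStore backend;
    private final Map<String, String> cache;

    CachingRecordStore(RecordStore backend, final int maxEntries) {
        this.backend = backend;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    @Override
    public String findById(String id) {
        String hit = cache.get(id);
        if (hit != null) {
            return hit; // served from memory, no DB round trip
        }
        String value = backend.findById(id);
        if (value != null) {
            cache.put(id, value);
        }
        return value;
    }
}
```

Because the cache sits behind the same interface as the real store, swapping the database later does not change this code, which is the property the tuning approach loses.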
What is a CTO?
As I've noted in some other posts, I'm the CTO at a Bay Area startup. My company is somewhat noteworthy for the technical and scientific depth of our problem domain. Specifically, we apply modern machine learning and cutting edge cloud computing to predict the properties of drugs. Ok, so I'm getting to the point ...
At a startup you wear many hats and as the company grows you need to start to pass some of those hats on to others. I've been thinking about how to do this and how to separate and define the role of CTO versus V.P. of engineering.
For me the key difference is vision versus execution. The role of the CTO is to develop and effectively communicate, evangelize, and sell the company's technical vision. The role of a V.P. of engineering is to efficiently implement that vision.
Saturday, July 16, 2011
Web scraping with webharvest
I've tried my hand at web-scraping only a couple of times. Frankly, I didn't enjoy it very much. What I've done in the past was fairly brute force parsing of web-pages or working directly with the DOM. But recently I've learned a little about XPATH, XSLT, and XQUERY -- a little.
Anyway, I'm booting up a little side project -- really just a way to explore some technologies. The idea is to explore some of the problems and solutions around building a Jeopardy competitor à la Watson (http://www-03.ibm.com/innovation/us/watson/index.html). Obviously, the place to start is to get the Jeopardy questions. The only place I could find that has a large number of these is the Jeopardy Archive (fan sourcing is great). Unfortunately, I couldn't find a downloadable database. I know I should just ask, but that would be too easy (or they might say no). So I set about web-scraping. I started by looking for better tools and came across webharvest (http://web-harvest.sourceforge.net/). It struck me as worth a shot.
Downloading and setting it up was easy. Getting started was also easy. I had my first XPATH query running within a few minutes and also tried out XSLT. Then I got ambitious. I set about parsing the tables that contain questions and trying to save them into a MYSQL database.
Parsing the tables and extracting them as columns rather than rows worked pretty well. Unfortunately, the documentation and error messages are a little weak so this was somewhat more cumbersome than I thought it would be.
The next problem was that the answers were embedded inside javascript and had to be parsed out using regexes. This started to get a little clunky.
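For a sense of what that regex step looks like, here is a rough Java sketch. The `showAnswer('id', 'text')` call shape is invented for illustration, since the real page markup isn't shown here, and `AnswerExtractor` is a hypothetical helper. It pulls the second quoted argument out of each call and undoes the JavaScript backslash escaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnswerExtractor {
    // Matches showAnswer('<id>', '<answer>') and captures the answer,
    // allowing backslash-escaped characters (such as \') inside it.
    // The showAnswer(...) shape is an assumption, not the real site's markup.
    private static final Pattern ANSWER = Pattern.compile(
            "showAnswer\\('[^']*',\\s*'((?:\\\\.|[^'\\\\])*)'\\)");

    static List<String> extract(String html) {
        List<String> answers = new ArrayList<>();
        Matcher m = ANSWER.matcher(html);
        while (m.find()) {
            // Drop the backslash in front of each escaped character.
            answers.add(m.group(1).replaceAll("\\\\(.)", "$1"));
        }
        return answers;
    }
}
```

Even in this toy form the escaping rules dominate the pattern, which is roughly where the clunkiness comes from.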
Finally, I had to format them for insertion into MYSQL. This is where things got hairy. It became really clunky to format the strings in order to escape things correctly for MYSQL. In the end I got it to work -- but it was a chore.
So, I was done. I set about loading my DB. After about 4500 questions it just stopped. I still haven't figured out why. So I'm switching back to Java.
I do love XPATH though and will definitely be using that in my Java implementation.
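As a sketch of what that might look like, the JDK's built-in `javax.xml.xpath` package can pull one column out of a table, which mirrors the columns-rather-than-rows extraction above. `ClueTableReader` and the `clues` class name are made up for this example, and the input is assumed to be well-formed XML already; real pages usually need to be tidied first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ClueTableReader {
    // Returns the text of every cell in the given 1-based column of the
    // table whose class attribute is "clues" (a hypothetical class name).
    static List<String> column(String xhtml, int column) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // td[n] is evaluated per row, so this selects the n-th cell of each row.
        String query = "//table[@class='clues']/tr/td[" + column + "]";
        try {
            NodeList cells = (NodeList) xpath.evaluate(
                    query, new InputSource(new StringReader(xhtml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                values.add(cells.item(i).getTextContent().trim());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath or unparsable input", e);
        }
    }
}
```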
Webharvest was great for the simpler version and for prototyping, but I think once things get complicated it's probably not the right tool.
I love Santa Cruz
Santa Cruz and I go back a long way. I came here first in 1992, spent eight years here from 1996 to 2004 -- loved, lost, loved again. Learned -- a lot.
I don't get down here as often as I would like, but got to spend a few days here this weekend. I don't know what it is about this town but I just love it. The people are cool, the weather is hot, the feeling is warm. Right now I'm sitting in the back of Lulu Carpenter's coffee shop. This place has been here a while (used to be Espresso Royale) and I used to come here to study and take in the world. They have the best espresso -- and right now a blues band.
Earlier, I hiked up the back of campus. When I was here for grad school I used to run up there all the time. The campus that most people know is only the bottom third; the rest is essentially a natural forest. It is so beautiful: full of redwoods, beautiful meadows and trails that go on forever. It's also empty. On the lower trails you meet some mountain bikers, but as you get up the back and head down towards Highway 9 it's empty. It could be any time in history. God, I love it back there.
I've got to head home tomorrow, and I'm trying to decide what to do in the morning. I think I'll either go ride the old wooden roller coaster or just go for a walk on West Cliff. Actually, I think I'll do both!
Back here again with my girls in September. Awesome.
Location: Santa Cruz, United States
Maintenance Debt
I manage a group of engineers at a startup and am gearing up to start a "sustainability" push. That is, I'm looking at ways to get our codebase on a more sustainable footing -- perhaps more on that later. It occurred to me that the notion of a "maintenance debt" is worth some thought. What I mean by "maintenance debt" is that maintenance, documentation, refactoring, and code quality can be deferred. As you defer them you accumulate a debt of maintenance that may need to be paid back. Obviously, this happens in other circumstances too -- my experience as president of an HOA being a painful illustration. But I think startups may be a special case.
When you start a software company you typically are under-resourced, must ship a product quickly and have little certainty about either the long-term fate of the company or its initial product. In this case, does it make sense to invest your limited resources in order to keep your maintenance debt low? Likely not.
Maintenance debt is like "real" debt in several ways. First, "interest" must be paid: the longer you defer paying down the debt, the larger the debt to be paid down. Second, it provides leverage on your investment dollars in exactly the same way that "real" debt does.
But maintenance debt is unlike "real" debt in other important ways. First, it doesn't appear on your balance sheet. This can be both good and bad. It's good from a financials perspective, but bad in the sense that your Board may not appreciate the value in paying it off. Second, you may get away without paying it off. That is, should your project or endeavor not succeed, you don't have to pay off the debt (much like a bankruptcy). Third, there can be non-linear effects on maintenance debt. For example, the departure of a key team member may greatly increase the debt burden.
The questions for me now are: Is it time to start paying off the debt? What is the best way to do this? How much of it can be side-stepped?
I've been thinking about ways to avoid paying some of this debt and may write more on this later.
The other important issue is how do we think about managing maintenance debt more explicitly in the future.
Update:
I was just pointed to this related post.
http://www.codinghorror.com/blog/2009/02/paying-down-your-technical-debt.html
Saturday, November 28, 2009
Nurture Shock
It's funny how you learn the same lesson at the same time from a number of different sources. I just started reading Nurture Shock and the first section is about how inappropriate praise can subvert a child's motivation. The main lesson is that parents need to be careful about the way they praise their children and the extent of that praise. Specifically, praise should provide guidance (as should criticism) so that children are motivated to continue a good behavior or end a bad one. Non-specific praise can teach kids that they have no control. For example, "you're so smart" teaches kids that their success relies on their innate ability. The book runs through several examples of motivating praise and demonstrates how such motivation can encourage kids to work hard and persevere. The same theme of hard work and perseverance runs through Outliers.
I've just started reading the chapter on lack of sleep. Interesting so far ...
