Thursday, May 31, 2012

What is NoSql really?

What I think NoSql is really about is that people are starting to figure out how to horizontally scale data structures other than row-structured tables. To me that's the key difference between "Sql" and "NoSql": it has less to do with how queries are structured, and much more to do with how data and indices are structured.

Of course, different data structures are only interesting because they enable different algorithms, and different algorithms imply different "query" patterns and different constraints and trade-offs, for example with regard to consistency.
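To make that concrete, here is a minimal sketch of the same relationships held in two layouts. The data and the function names (`neighbors_from_rows`, `build_adjacency`, `reachable`) are made up for illustration and not taken from any particular store.

```python
from collections import defaultdict

# Hypothetical toy data: "follows" relationships stored as rows.
rows = [("alice", "bob"), ("alice", "carol"), ("bob", "dave")]


def neighbors_from_rows(rows, node):
    # Row-structured layout: without an index, finding a node's
    # neighbors means scanning every row.
    return sorted(dst for src, dst in rows if src == node)


def build_adjacency(rows):
    # Graph-structured layout: each node maps directly to its outgoing edges.
    adjacency = defaultdict(set)
    for src, dst in rows:
        adjacency[src].add(dst)
    return adjacency


def reachable(adjacency, start):
    # A traversal "query" that is natural on the adjacency layout:
    # follow edges from node to node until nothing new turns up.
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


adjacency = build_adjacency(rows)
print(neighbors_from_rows(rows, "alice"))
print(sorted(reachable(adjacency, "alice")))
```

Both layouts hold the same facts, but the traversal only reads naturally against the second one, which is the sense in which the structure drives the "query" pattern.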

But to me the real key is that it has become easier, and better understood, how to horizontally scale different data structures. So far the range of structures that have been tackled is still rather limited: graphs, column stores, triple/quad stores, key-value stores. And of course, the access patterns of different algorithms on these stores can have huge implications for the "right" way to implement them.

It seems to me that we are just at the beginning of this effort, and that the universe of data stores will only explode further. What excites me is the idea that we can factor out the key components, to enable the rapid implementation of new data structures and new access patterns. Obviously, we can't expect everyone to implement 50 or 100 different stores, but we can expect them to configure 5 orthogonal technologies to enable those 50 to 100 data structures.

So what are the key components? How about these for starters:

1) Replication
2) Sharding
3) Storage with clear/configurable and possibly hierarchical notions of locality
4) The ability to move "queries" to the data.
5) A very general framework for expressing "queries". This needs to be general because the data structures are not useful without the ability to express the relevant algorithms against them.

Obviously, all of this needs to be configurable and manageable with minimal overhead.
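As a rough illustration of how a few of these pieces might compose, here is a toy, in-memory sketch. Everything in it is an assumption made for the example: the `Shard` and `Cluster` classes, the hash-based placement, and the `query` method are hypothetical and not modeled on any real system. It covers sharding, replication, and moving a "query" to the data; it does not attempt hierarchical locality or a general query framework.

```python
import hashlib


class Shard:
    """One node's local slice of the data (plain dicts stand in for storage)."""

    def __init__(self, name):
        self.name = name
        self.primary = {}  # keys this shard owns
        self.replica = {}  # copies of keys owned by other shards

    def run(self, local_fn):
        # "Move the query to the data": the function executes against
        # this shard's own keys only, so replicas are not double counted.
        return local_fn(self.primary)


class Cluster:
    def __init__(self, shard_count=4, replicas=2):
        self.shards = [Shard(f"shard-{i}") for i in range(shard_count)]
        self.replicas = replicas

    def _owners(self, key):
        # Sharding: a stable hash picks the primary shard.
        # Replication: the following shards hold the extra copies.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        first = int(digest, 16) % len(self.shards)
        return [self.shards[(first + i) % len(self.shards)]
                for i in range(self.replicas)]

    def put(self, key, value):
        owners = self._owners(key)
        owners[0].primary[key] = value
        for shard in owners[1:]:
            shard.replica[key] = value

    def get(self, key):
        return self._owners(key)[0].primary.get(key)

    def query(self, local_fn, combine):
        # Ship local_fn to every shard, then combine the partial results.
        return combine(shard.run(local_fn) for shard in self.shards)


cluster = Cluster(shard_count=4, replicas=2)
for word in ["graph", "column", "triple", "key-value"]:
    cluster.put(word, len(word))

# Each shard sums its own values locally; only the partial sums travel.
total = cluster.query(lambda local: sum(local.values()), sum)
print(total)
```

The point of the sketch is only that placement, copying, and query shipping are separable concerns: swapping the dict in `Shard` for a different local structure would leave `Cluster` untouched.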

So how much of this does Hadoop do?