March 01, 2005

building the new database, pt 2

In my last entry on this topic, I discussed Bosworth's blog entry back in December calling for the "new database". In my opinion, the "new database" is perhaps a combination of three trends:

a. Emphasizing probability over logical certainty. This means fewer "queries" and more "searches" with ranking-based approaches. This, by and large, seems to be the fundamental shift underway to deal with infoglut, and it's the hardest one. It completely changes the notion of what a database is. It is no longer primarily a fact-base or 'oracle' (ahem). It becomes (mostly) a predictor, or statistician.

b. Convergence of search operations, logical set-operations, tagged data, and common programming languages. It's very difficult to truly create good abstractions, and even then they still leak. In terms of data, I think this requires a fundamental change of language, though certainly we've tried and failed in this task many times. The closest I've seen to a truly elegant data/language unification was with Gemstone + Smalltalk -- and I think it can be done again, better.

c. A separation of logical from physical data structures. Schemas change a lot; they're much more dynamic than they were in the late 1980's. This means database vendors actually need to implement the relational theory as intended - where one can compose a physical data structure that does not necessarily map 1:1 to its logical structure, instead of tying the two together as almost all databases continue to do today.
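To make point (c) a little more concrete, here is a rough sketch using Python's built-in sqlite3 module. A plain SQL view is only a crude approximation of the separation I'm describing, and the table, view, and function names (customer, customer_store, build, lookup) are made up for illustration. The point is just that the application queries one logical relation while the physical storage underneath is free to differ.

```python
import sqlite3


def build(physical_layout: str) -> sqlite3.Connection:
    """Create an in-memory database exposing a logical `customer` relation.

    The physical storage differs by layout, but the logical view is the same.
    All names here are hypothetical and exist only for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    if physical_layout == "single_table":
        # Physical structure maps 1:1 onto the logical structure.
        conn.executescript("""
            CREATE TABLE customer_store (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
            INSERT INTO customer_store VALUES (1, 'Ada', 'ada@example.com');
            INSERT INTO customer_store VALUES (2, 'Grace', 'grace@example.com');
            CREATE VIEW customer AS
                SELECT id, name, email FROM customer_store;
        """)
    else:
        # Physical structure is split in two; the logical relation is unchanged.
        conn.executescript("""
            CREATE TABLE customer_names (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE customer_emails (id INTEGER PRIMARY KEY, email TEXT);
            INSERT INTO customer_names VALUES (1, 'Ada');
            INSERT INTO customer_names VALUES (2, 'Grace');
            INSERT INTO customer_emails VALUES (1, 'ada@example.com');
            INSERT INTO customer_emails VALUES (2, 'grace@example.com');
            CREATE VIEW customer AS
                SELECT n.id, n.name, e.email
                FROM customer_names n JOIN customer_emails e ON e.id = n.id;
        """)
    return conn


def lookup(conn: sqlite3.Connection, customer_id: int):
    """Application code: it only ever sees the logical `customer` relation."""
    return conn.execute(
        "SELECT name, email FROM customer WHERE id = ?", (customer_id,)
    ).fetchone()


if __name__ == "__main__":
    for layout in ("single_table", "split_tables"):
        print(layout, lookup(build(layout), 2))
```

The lookup function never changes when the storage layout does. A database that took the relational model seriously would let you make that kind of physical choice far more freely than a hand-written view allows.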

I reject claims that XML databases will ascend to become the "new database", for many reasons that one can find elsewhere.

The above three trends are ideals that may take years to realize. But it's my belief that the "new database", in some respects, is already here; culturally, though, I don't believe most developers are capable of understanding it. I'm going to explore why I think this is, along with how today's databases solve the three general problems of a) dynamic schemas, b) massive scalability & data volume, c) better physical/logical separation. Each of these will be in a separate part... let me lead off with a couple of comments on why we're in this predicament.

"If the database vendors ARE solving these problems, then they aren't doing a good job of telling the rest of us."

I think the database vendors are trying quite hard to solve these problems and communicate this. Browse through Oracle's marketing material.

"The customers I talk to who are using the traditional databases are essentially using them as very dumb row stores and trying very hard to move all the logic and searching out into arrays of machines with in memory caches."

And this is completely due to what I would consider closed-minded cultural / historical reasons. It has a seed in reality, back around, say, 1997, when databases weren't as clusterable or built for web access. And it is exacerbated by legions of database developers that haven't unlearned their bad habits from 1995. But it's unnecessary.

My observation is that all three of the major programming camps - .NET, LAMPW (Linux, Apache, MySQL, PHP/Perl/Python, Whatever), or Java - especially the latter two, seem to have an allergic reaction to relational databases, in all aspects. Relational theory & design is loathed, physical design issues are glossed over, and generally the attitude consists of covering one's eyes and yelling "la la la la la la" loudly whenever someone suggests that it might actually be useful to really learn this stuff in-depth.

Each camp seems to have its own neurotic view of the world -- whether being wedded to the one "true" database (SQL Server, MySQL, PostgreSQL, Oracle, or DB2), or being an (object|XML) bigot and turning one's nose up at 30 years of database theory, or believing that databases altogether are a stupid idea.

There's a mix of Not-Invented-Here, hubris, fear, confusion, embarrassment, and a general lack of memory about the debates that took place through the 1980's and 1990's on this stuff. Gurus of yesteryear - Kent, Codd, Date, Darwen, Pascal, etc. - are relics to today's generation of Java and .NET programmers. People complain constantly about how difficult it is to understand XYZ database because it's so different from ABC database, when it would be clear why this is if they read the FINE and FREELY AVAILABLE online documentation for all of the major databases - Oracle, DB2, SQL Server, MySQL.

Oracle has been a FREE download for over 6 years, and people still don't experiment with it, they just fear it.

Databases are perhaps the most complicated pieces of software in use, after operating systems. People spend time to learn operating systems in-depth at college. They don't usually get the same in-depth exposure to databases. Perhaps that's one of the problems? Or is there just a religious fervour in the air?

All of the arguments paraded around Wikis and mailing lists and blogs on the "right" data format and "relations vs. objects vs. XML" were hashed out over 20 years ago, and sadly it seems the intellectual results of that debate are widely scattered in journals, books, and out-of-print articles. All that's left is the observation that relational databases (or "SQL databases") triumphed in the marketplace, while objects triumphed in the minds of developers. Network/object databases are a small niche, and a tremendously entrenched number of hierarchical / flat databases on mainframes continue to demonstrate the incredible power of IT managers who just. don't. care.

But these worlds - programming, data management, and data interchange - don't need to be in opposition to one another. They can be complementary. And hopefully someone will figure out how to find their appropriate strengths and unify them into the next great programming environment.

Posted by stu at March 1, 2005 02:21 AM