January 30, 2008

Relations in the cloud

I've been hearing a lot about how the RDBMS are no longer appropriate for data management on the Web. I'm curious about this.

Future users of megadata should be protected from having to know how the data is organized in the computing cloud. A prompting service which supplies such information is not a satisfactory solution.

Activities of users through web browsers and most application programs
should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed. Changes in data representation will often be needed as a result of changes in query, update, and report traffic and natural growth in the types of stored information.

I didn't write the above, it was (mostly) said 38 years ago. I think the arguments still hold up. Sure, Google and Yahoo! make do with their custom database. But, are these general-purpose? Do they suffer from the same problems of prior data stores in the 60's?

Certainly there's a balance of transparency vs. abstraction here that we need to consider: does a network-based data grid make a logical view of data impossible due to inherent limitations of distribution?

I'm not so sure. To me this is just a matter of adjusting one's data design to incorporate estimates, defaults, or dynamically assessed values when portions of the data are unavailable or inconsistent. If we don't preserve logical relationships in as simple a way as possible, aren't we just making our lives more complicated and our systems more brittle?

I do agree that there's a lot to be said about throwing out the classic RDBMS implementation assumptions of N=1 data sets, ACID constraints at all times, etc.

I do not agree that it's time to throw out the Relational model. It would be like saying "we need to throw out this so-called 'logic' to get any real work done around here".

There is a fad afoot that "everything that Amazon, Google, eBay, Yahoo!, SixApart, etc. does is goodness". I think there is a lot of merit in studying their approaches to scaling questions, but I'm not sure their solutions are always general purpose.

For example, eBay doesn't enable referential integrity in the database, or use transactions - they handle it all in the application layer. But, that doesn't always seem right to me. I've seen cases where serious mistakes were made in the object model because the integrity constraints weren't well thought out. Yes, it may be what was necessary at eBay's scale due to the limits of the Oracle's implementation of these things, but is this what everyone should do? Would it not be better long-term if we improved the underlying data management platform? I'm concerned to see a lot of people talking about custom-integrity, denormalization, and custom-consistency code as a pillar of the new reality of life in the cloud instead of a temporary aberration while we shift our data management systems to this new grid/cloud-focused physical architecture. Or perhaps this is all they've known, and the database never actually enforced anything for them. I recall back in 1997, a room full of AS/400 developers were being introduced to this new, crazy "automated referential integrity" idea, so it's not obvious to everyone.

The big problem is that inconsistency speeds data decay. Increasingly poor quality data leads to lost opportunities and poor customer satisfaction. I hope people remember that the key word in eventual consistency is eventual. Not some kind of caricatured "you can't be consistent if you hope to scale" argument.

Perhaps this is just due to historical misunderstanding. The performance of de-normalization and avoiding joins has nothing to do with the model itself, it has to do with the way the physical databases have been traditionally constrained. On the bright side, column-oriented stores are becoming more popular, so perhaps we're on the cusp of a wave of innovation in how flexible the underlying physical structure is.

I also fear there's a just widespread disdain for mathematical logic among programmers. Without a math background, it takes a long time for one to understand set theory + FOL and relate it to how SQL works, so most just use it as a dumb bit store. The Semantic Web provides hope that the Relational Model will live on in some form, though many still find it scary.

In any case, I think there are many years of debate ahead as to the complexities and architecture of data management in the cloud. It's not as easy as some currently seem to think.

Posted by stu at January 30, 2008 03:11 AM