Wednesday, February 29, 2012

The case for a new open-source relational database

I've been thinking about a new open-source relational database for a while, and here's my thoughts as to why the world needs one, even though PostgreSQL and MySQL would seem to have a lock, with SQLite doing well in small devices.

First, some strengths and weaknesses of each:

1. PostgreSQL is very good at query optimization, and has a large and powerful set of query answering facilities. Where it isn't so wonderful is its storage engine, where the old "versioned" storage manager, complete with its vacuum cleaner - which sucked in 1990 when I worked on it and is still a operational PITA - still resides. It "works", and is serviceable for many applications (particularly large archival apps, where deletes and updates are done rarely or never), but is a huge problem for always-up OLTP apps.

2. OTOH, MySQL has the opposite problem: its parser and executor are lame (gee, will they ever figure out how to cache constant subquery results? How about hashjoins? Type checking? Not defaulting to friggin case-insensitive Swedish as the default collation?), but its open storage architecture allows lots of interesting storage engines, and InnoDB's index-organized structure kicks butt if you know how to use it. We use MySQL/InnoDB for a very large, SaaS database ourselves with both heavy OLTP components and heavy archival search components. To get around its sucky top-level, I have to carefully police all significant queries, make extensive use of optimizer hints, and work with developers to make certain they aren't doing something problematic like using a subquery that is always treated as correlated, etc.

In a dream-world, I could figure out a way to hook up PostgreSQL's top level with the InnoDB storage engine. That would be a powerful relational engine.

That said, there are things neither engine, and no relational db engine in the open-source world that I'm aware of, do very well:

1. Better handling of in-memory tables and in-memory indexes. One thing I did that was very popular in DeviceSQL was an efficient in-memory hashed and ordered index on a persistent table. It is very popular with customers and was extremely fast.
2. Fast bootup, even after ungraceful shutdown. Given that Serious Databases increasingly exist on small, portable devices, fast bootup is needed. Fast bootup has all sorts of implications for transaction handling, so it has to be "designed in" from the beginning.
3. An open storage architecture is a must, both for base storage and for indexes (and the open-storage architecture should allow index-organized base storage as well).
4. "Vector" index query answering facilities. One of the biggest limitations on relational database execution engines today is the inability to use multiple indexes on the same table. A "vector index strategy" would allow large tables to be searched by multiple parameters without needing to "pick a winner" (and often there isn't any good choices to pick...)

More later...