Tuesday, July 6, 2010

Proposal for an open-source database

While there are numerous open-source databases out there, there isn't one that specializes in the areas I'm most interested in exploring, or the areas I see the biggest weaknesses in as far as database usage is concerned.

I. Some ideas I'd like to explore:

1. A "not only SQL" database, but one which still allows full access via a query language.

2. A hybrid of relational table structures and "structured document" access and storage. I'll define "structured documents" shortly, but in general they are largish objects with key-value-pair behavior, but - unlike databases like Berkeley DB - they have engine-visible internal structure beyond the key and can be queried and internally manipulated using the query language.

3. A notion of transactional "ACID-ity", allowing for more interesting low-level storage approaches. A well-defined notion of transaction with predictable semantics and meaningful locking, visibility, and crash recovery is required for any database that will be used for anything resembling mission-critical data, but I want to have the freedom to implement the "structured documents" in such a way that they aren't overly limited by the need for "global COMMIT".

4. As mentioned above, there'll be a query language. It will even resemble SQL in many ways, although it won't attempt to be compatible, except in areas where SQL syntax makes as much sense as any other.

5. I haven't yet decided whether it'll start out fully client-server, or whether I'll implement it as a application-resident C library in the same fashion as SQLite and "server-ize" it later.

II. My "Design Aesthetic"

1. The "embedded" approach

Embedded or small-device databases is where I've spent most of my recent career, so the "embedded aesthetic" of favoring efficiency and control over lots of abstraction and heavy use of third-party tools will be a design principle. But I'm hoping that the database can be used for larger systems as well.

2. Few abstractions, but "good" ones.

Database engines lend themselves well to some obvious abstractions and layering, while being extremely unforgiving to "whiteboardist" approaches.

III. What the heck is a "structured document"?

A structured document is a fancy name for a blobby name-value pair thing, in which the "value" has internal structure that is visible to the database engine. The structure will be in the form of "records" that can be inserted to, deleted from, or queried in the document. Also, the document will have dynamic indexes that will be built when the doc is loaded into memory so it can be searched efficiently - and directly, without needing a join. The entire contents of a structured document can also be fetched into the application so it can be loaded quickly and conveniently for apps interested in doing so. Also, as mentioned above, there'll be a simple relational capability, so that rows in a structured doc *can* be joined against relational table data, etc.

IV. The type system

This database will support the standard SQL datatypes, as well as user-defined "record" types. Nested and ordered records *may* be allowed (not sure yet).

V. Other ideas

1. Internal hierarchies in structured documents

Structured docs won't have syntax-level support for hierarchical representations, but there will be enough infrastructure so that apps can implement hierarchical structures - such as XML docs - if they so desire. I'm still thinking about this one though - supporting nested records does imply support for hierarchies. Note that if I do full nesting and hierarchies, all record contents - including nested contents - will be stored contiguously in the doc as link-chasing is Evil, both semantically and in terms of implementation.

2. Documents will be relatively free-form.

Documents will be members of a query-able set, and all documents in a set must have a similarly-structured key, as well as a small set of other records that all documents are "required" to have, but, beyond that, docs can have records of a particular type added without needing to change the definition of all records. The query language will be sufficiently permissive to allow traversal even if not all docs have a particular type of record.

3. At the storage level, database pages are "owned" by documents, so the design expectation is that they're relatively large (ie, >4K per document). Also, the storage system will attempt to store pages belonging to particular documents contiguously, and there'll be a way for the application to declare how much space to initially assign to a particular doc (or doc-set) in advance when a new doc is created so the trade-off between space efficiency and contiguous storage can be managed by the app developer.

VI. It's still rather half-baked, ain't it?

Yes, it is. As I think about this more, I'll obviously have to rigorize the above very loosey-goosey descriptions into something resembling a spec before I start coding.

Heck, I haven't even named the thing yet! Any help here is appreciated...

No comments: