Saturday, October 29, 2011

MySQL, Postgres, and InnoDB: Some Critiques of Index Organized Tables

This discussion of InnoDB versus Postgres' storage engine (and of index-organized table structures generally, since the same points apply to Oracle's IOTs) is worth a careful read. Our schema makes heavy use of InnoDB's index-organized property - and I have a bit of a fanboy wish to use Postgres, having worked on it for a few years, versus helping to buy Larry another yacht - so here's some discussion of where index-organized structures add value, along with some criticisms of my own.

Even though one uses a primary key definition to define the "index organization structure", in my opinion the real power of index-organized tables is in larger, "bulk" searches rather than the single-record lookups one would typically do with a primary key. Assuming very careful schema design and querying, the advantage in "bulk" searches is that a single B-tree traversal answers the entire search, while secondary indexes require at least some indirection back into the table, which gets expensive when the working set involves millions of records (as it often does, at least in our case). See the below discussion on telemetry schemas for more.
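As an illustration, consider a hypothetical telemetry-style table (the table and column names here are mine, purely for the sketch) clustered on (device_id, recorded_at):

```sql
-- Hypothetical telemetry schema (illustrative names only).
-- In InnoDB the PRIMARY KEY is the clustered (index-organized) key,
-- so rows are physically stored in (device_id, recorded_at) order.
CREATE TABLE telemetry (
    device_id   INT      NOT NULL,
    recorded_at DATETIME NOT NULL,
    reading     DOUBLE   NOT NULL,
    PRIMARY KEY (device_id, recorded_at)
) ENGINE=InnoDB;

-- A "bulk" search over one device's history is a single range scan
-- down one B-tree; no per-row indirection into a separate heap or
-- secondary index is needed.
SELECT recorded_at, reading
FROM   telemetry
WHERE  device_id = 42
  AND  recorded_at >= '2011-10-01'
  AND  recorded_at <  '2011-11-01';
```

The same query against a heap table with a secondary index on (device_id, recorded_at) would touch the index *and* the heap for every matching row.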

Index Organized Table Wishlist

  1. Have an explicit "ORGANIZATION KEY" table declaration (or some such) rather than relying on the PRIMARY KEY. If the organization key doesn't completely define the storage, allow "..." or something similar to mark an incompletely defined organization key. If "ORGANIZATION KEY" is omitted, default to the PRIMARY KEY.
  2. Secondary indexes should have the option of using storage row-ids to "point at" the main rows, since organization keys can be relatively long and require walking two different B-trees to find the records. Implementing this could be hard and could make inserts slow, because otherwise-unchanged rows can "move around" as the B-tree is reorganized - requiring secondary-index pointers to be redone - so there's a design trade-off here. In our environment, we mostly avoid secondary indexes in favor of "search tables", which are basically user-managed indexes.
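A "search table" in this sense might look something like the following sketch (all names are hypothetical, not from any real schema):

```sql
-- Hypothetical user-managed "search table" (illustrative names).
-- It maps a searchable attribute back to the organization key of the
-- main table, and is itself index-organized on the search attribute.
CREATE TABLE reading_by_user (
    user_id     INT      NOT NULL,   -- the attribute we search by
    device_id   INT      NOT NULL,   -- organization key of main table...
    recorded_at DATETIME NOT NULL,   -- ...continued
    PRIMARY KEY (user_id, device_id, recorded_at)
) ENGINE=InnoDB;

-- Application code inserts into the main table and the search table
-- in one transaction, playing the role the storage engine would play
-- in maintaining a secondary index.
```

The trade-off is that the application owns consistency between the two tables, but the "index" rows never need to be re-pointed when the main B-tree reorganizes.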
If I get the energy to do an open-source project, I want to do what I call "vector indexes" and a "vectoring execution engine" that we've basically implemented in user code in our app. If we had "vector indexes", we wouldn't be so reliant on index-organized table structures.

Wednesday, October 26, 2011

The joy of MySQL stored procedures

MySQL stored procedures, or stored functions, can be useful, particularly if you need to do something "iterative" with data that can't easily be done with "simple" SQL, such as looping through data or parsing (simple) XML or name-value-pair strings. First, a few things to know:

1. MySQL stored procedures are dynamically interpreted, so errors in a logical block (such as references to missing tables or misnamed variables) won't surface until that block actually executes. You therefore need to create unit-testing situations in which every logic path is exercised.

2. If stored procedure A calls stored procedure B, and B gets redefined, A will hang unless A is also redefined after B is.

3. Local variables used with the cursor FETCH statement shouldn't have the same names as columns used in the FETCH's query. If they do, the variables will be NULL and random badness will ensue.
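A minimal sketch of the FETCH naming point (the procedure and table names are hypothetical): prefixing local variables, e.g. with v_, keeps them from shadowing the column names in the cursor's query.

```sql
-- Hypothetical example: the cursor variable is named v_reading,
-- NOT "reading", so it can't collide with the column it fetches.
DELIMITER //
CREATE PROCEDURE sum_readings(OUT total DOUBLE)
BEGIN
    DECLARE v_reading DOUBLE;
    DECLARE done INT DEFAULT 0;
    DECLARE cur CURSOR FOR SELECT reading FROM telemetry;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

    SET total = 0;
    OPEN cur;
    read_loop: LOOP
        FETCH cur INTO v_reading;
        IF done THEN
            LEAVE read_loop;
        END IF;
        SET total = total + v_reading;
    END LOOP;
    CLOSE cur;
END //
DELIMITER ;
```

Had the variable been declared as `reading`, the FETCH would silently bind NULLs instead of the column values.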

Debugging stored procedures

As far as I know, there's no source-level debugger available for MySQL stored procedures or functions, so you have to be a bit clever and a bit old-school in debugging.

The easiest way to debug stored procedures is a variation of the "printf" strategy from programming days of yore. Since you can't print a message directly from inside a procedure, an alternative is to create a "debug_output" table with a "debug_txt varchar(128)" column, and use

insert into debug_output values (concat('my message', ',', 'var1=', var1, ..., 'varN=', varN));

After your procedure runs (or dies), just select from your debug_output table. I also like to create an error_output table for error handling with a single column.
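A sketch of the supporting tables and an error handler following the convention above (the names are my own):

```sql
-- Debug and error tables per the convention described above
-- (names are illustrative).
CREATE TABLE debug_output (debug_txt VARCHAR(128));
CREATE TABLE error_output (error_txt VARCHAR(128));

-- Inside a procedure body, a handler can log the failure point
-- before the procedure exits:
--
--   DECLARE EXIT HANDLER FOR SQLEXCEPTION
--       INSERT INTO error_output VALUES ('my_proc: failed in step 2');
```

After a failed run, `select * from error_output;` shows how far the procedure got.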

Note that these inserts can greatly impact runtime in tightly coded stored procedures, so you'll want to comment them out for production. I tried to disable them by switching the tables to the "BLACKHOLE" storage engine, but this doesn't improve performance significantly.

Saturday, October 8, 2011

MySQL strings: oddness with backslash and LIKE

Our application supports wildcard searches of the data, and the data occasionally includes the LIKE "special" characters '%', '_', and '\'. One thing we learned recently is that escaping a backslash - which is LIKE's default escape character - in a LIKE query is particularly troublesome.

Here are some examples. Consider a simple table foo with a single varchar column a:

> create table foo (a varchar(10));
> insert into foo (a) values ('\\');
> insert into foo (a) values ('\\%');

> select * from foo \G
a: '\'
a: '\%'

Now, do a LIKE search for a single backslash character without a wildcard. To match it, you need a total of FOUR backslashes in the string literal: the string parser reduces the four to two, and LIKE's escape processing reduces those two to the one backslash actually compared:

> select * from foo where a like '\\\\' \G
a: '\'

To do a LIKE search for a single backslash followed by a literal (escaped) percent sign, you need six backslashes: four to match the backslash in the data, and two to escape the percent sign:

> select * from foo where a like '\\\\\\%' \G
a: '\%'

Note that if you use simple equality, you need only two backslashes to search for a single backslash, since only the string parser processes them.
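For comparison, here is the equality version against the same foo table:

```sql
-- Equality passes only through the string parser, so two backslashes
-- in the literal compare against the one backslash stored in the data;
-- no LIKE escape processing is involved.
select * from foo where a = '\\';
-- matches only the row containing a single backslash
```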