Saturday, October 29, 2011

MySQL, Postgres, and InnoDB: Some Critiques of Index Organized Tables

This discussion of InnoDB versus Postgres' storage engine (and index organized table structures generally, as this could also apply to Oracle's IOT's) is worth a careful read. Since our schema makes heavy use of the index organized property of InnoDB - and since I have a bit of a fanboy wish to use Postgres as I worked on it for a few years versus helping to buy Larry another yacht - here's some discussion of where index organized structures add value and some criticisms of my own.

Even though one used a primary key definition to define the "index organization structure", in my opinion, the real power of index organized table structures is in larger, "bulk" searches over single-record lookups of the sort one would typically do with a primary key. Assuming very careful schema design and querying, the advantage with "bulk" searches is that a single B-tree ingress is all that is necessary to answer the search, while secondary indexes require at least some indirection into the table, which gets expensive if the working set involves millions of records (as it often does, at least in our case). See the below discussion on telemetry schemas for more.

Index Organized Table Wishlist

  1. Have an explicit "ORGANIZATION KEY" table declaration (or some such) and not rely on the PRIMARY KEY. If the organization key doesn't completely define the storage, allow "..." or something to allow for an incompletely defined organization key. If "ORGANIZATION KEY" is omitted, default to the PRIMARY KEY.
  2. Secondary indexes should have the option of using storage row-ids to "point at" the main rows as organization keys can be relatively long, and require walking two different B-trees to find the recs. Implementing this could be hard and make inserts slow as otherwise unchanged rows can "move around" while the B-tree is having stuff done to it - requiring secondary indexes to have pointers redone - so there's a design trade-off here. In our environment, we mostly avoid secondary indexes in favor of "search tables", which are basically user-managed indexes.
If I get the energy to do an open-source project, I want to do what I call "vector indexes" and a "vectoring execution engine" that we've basically implemented in user code in our app. If we had "vector indexes", we wouldn't be so reliant on index-organized table structures.

Wednesday, October 26, 2011

The joy of MySQL stored procedures

MySQL stored procedures, or stored functions, can be useful, particularly if you need to do something "iterative" with data that can't be easily done with "simple" SQL such as looping through data or (simple) XML or name-value-pair string parsing. First, a few things to know:

1. MySQL stored procedures are dynamically interpreted, so you won't hit syntax or semantic errors in logical blocks until they're actually executed, so you need to create situations in unit-testing where all logic is executed.

2. If stored procedure A calls stored procedure B, and B gets redefined, A will hang unless it gets redefined after B is redefined.

3. Local variables used with the "cursor FETCH" statement shouldn't have the same names as columns used in FETCH queries. If they do, they'll be NULL and random badness will ensue.

Debugging stored procedures

As far as I know, there's no source-level debugger available for MySQL stored procedures or functions, so you have to be a bit clever and a bit old-school in debugging.

The easiest way to debug stored procedures is to use a variation of the "printf" strategy from programming days of yore. Since you can't print an actual message, an alternate way is to create a "debug_output" table with a "debug_txt varchar(128)" column, and use

insert into debug_output values (concat('my message', ',', 'var1=', var1, ..., 'varN=', varN));

After your procedure runs (or dies), just select from your debug_output table. I also like to create an error_output table for error handling with a single column.

Note that these will greatly impact runtime in tightly coded stored procedures, so you'll want to comment them out. I tried to disable them by using the "BLACKHOLE" storage manager, but this doesn't improve performance significantly.

Saturday, October 8, 2011

MySQL strings: oddness with backslash and LIKE

Our application supports wildcard searches of the data, and the data occasionally includes LIKE "special" characters, such as '%', '_', and '\'. One thing we learned recently is that escaping a backslash, which is the LIKE default "escape character" in a LIKE query is particularly troublesome.

Here's some examples. Consider the below simple table:

> insert into foo (a) values ('\\');
> insert into foo (a) values ('\\%');

> select * from foo \G
a: '\'
a: '\%'

Now, do a LIKE search of a single backslash character without a wildcard. To get it searched on, you have to add THREE backslashes to the string, and need a total of four backslashes in the string:

> select * from foo where a like '\\\\' \G
a: '\'

To do a LIKE search of a single backslash and a single "escaped" percent sign, you need six backslashes: four to embed one in the data, and two to escape the percent sign:

> select * from foo where a like '\\\\\\%' \G
a: '\%'

Note that if you use simple equality, you need two backslashes to search on a single backslash.

Thursday, September 22, 2011

Partition pruning and WHERE clauses

If you're using lots of partitions, you'll want to make sure that partition pruning is used, and you aren't always guaranteed that it will be. We had a situation where we have a set of very large (potentially >1B rows, currently ~250M rows) tables that are partitioned on "Unix days" (Unix time rounded to days), that have several hundred partitions, using "hash" partitioning to simplify maintenance. So, to simplify, our table looks something like

create table mytable (id integer, unix_day mediumint, primary key (id, unix_day**)) engine=innodb partition by hash (unix_day) partitions 500;

Let's load up a with 1B records. Assuming even distribution, we'll have 2M records in each partition. Now, let's run the following query, to fetch everything in the past 3 days (ignoring timezones for simplicity):

set @today = unix_timestamp(now()) div 86400;

select id from mytable
where unix_day >= @today - 3;

Since we're using "hash" partitioning, this will end up being a monster table scan on the entire table, taking hours to complete. The problem is this query can't be "pruned" because "hash" partitioning doesn't have an implied order.

However, if we run this query

select id from mytable where unix_day in (@today-3, @today - 2, @today - 1, @today);

we'll get good pruning and only four partitions will be searched. Note that MySQL is also good at exploding small integer ranges (ie, ranges less than the number of partitions) into what amounts to an IN-list for partition pruning purposes, so the above query, which is difficult to code in an app, can be converted to

select id from mytable where unix_day between @today - 3 and @today;

and only four partitions will be searched.

**Note that this "stupid" PRIMARY KEY declaration is necessary to have InnoDB let us use unix_day as a partition key, as partition keys must be part of an existing key (primary or otherwise). Making it a separate KEY - and additional index - would be hugely expensive and unnecessary in our app.

Wednesday, September 21, 2011

InnoDB versus MyISAM: large load performance

The conventional wisdom regarding InnoDB versus MyISAM is that InnoDB is faster in some contexts, but MyISAM is generally faster, particularly in large loads. However, we ran an experiment in which a large bulk load of a mysqldump output file, which is basically plain SQL consisting of some CREATE TABLE and CREATE INDEX statements, and a whole lot of huge INSERT statements, in which InnoDB was the clear winner over MyISAM.

We have a medium-sized database that we have to load every month or so. When loaded, the database is about 6GB in MyISAM, and about 11G in InnoDB, and has a couple hundred million smallish records. The MySQL dump file itself is about 5.6G, and had "ENGINE=MyISAM" in its CREATE TABLE statements that we "sed" substituted to "ENGINE=InnoDB" to do the InnoDB test.

Load time with ENGINE=MyISAM: 203 minutes (3 hours 23 minutes)
Load time with ENGINE=InnoDB: 40 minutes

My guess is that the performance difference is due to the fact that these tables have lots of indexes. Every table has a PRIMARY KEY, and at least one secondary index. InnoDB is generally better at large index loads than MyISAM in our experience, so the extra time MyISAM spends doing index population swamps its advantage in simple load time to the base table storage.

Given our experimental results, we'll now use InnoDB for this table.

Saturday, September 10, 2011

Big Data: "Telemetry" schemas

Many database schemas have similar characteristics, and one common - and important - schema type is what I call the "telemetry" schema. Telemetry schemas have certain things in common:

1. They have a small number of relatively simple tables. I'll call these the "telemetry log tables".
2. They have a time component, and are often queried using the time component.
3. They have at least one score or amount associated with the data, which is also queried frequently.
4. Telemetry log tables are often very large, with hundreds of millions or billions of records.
5. While there may be summary tables that are updated frequently, records in the telemetry log tables are rarely or never updated.

Examples of telemetry schemas include:

1. Actual telemetry data from various types of sensors.
2. Wall street-style trading data.
3. Other transactional data, such as banking activity, point-of-sale activity, etc.

As mentioned above, telemetry schemas often have a time series component, but also need to be queried in interesting ways using approaches other than the simple time component.

Telemetry schemas pose several challenges to standard schema design strategies:

1. Telemetry log tables are typically too large for relational joins, even when well-indexed, to perform well, particularly on "bulk" searches that visit many records.
2. Most database engines will "anchor" a search on one index, use it to fetch records out of the base table, and finish qualifying the search using values from these records, even if other indexes exist on the qualifying rows. This search strategy will perform awfully with huge tables, unless the engine gets lucky and picks a highly selective index, or the search happens to be on a very selective value in the index. Statistics-based query optimizers can help some, but can still "whiff badly", and this can result in what I call the database "death penalty" for queries: a badly-optimized "loser" query that takes days to run.

In a future blog post, I'll talk about a search strategy I've successfully used for a large telemetry schemas.

Tuesday, September 6, 2011

INNODB: When do partitions improve performance?

PARTITIONs are one of the least understood tools in the database design toolkit.

As an index-organized storage mechanism, InnoDB's partitioning scheme can be extremely helpful for both insert/load performance and read performance in queries that leverage the partition key.

In InnoDB, the best way to think of a partitioned table is as a hash table with separate B-trees in the individual partitions, with the individual B-trees structured around the PRIMARY KEY (as is always the case with InnoDB-managed tables). The partition key is the "bucketizer" for the hash table.

Since the main performance issue in loading data into an InnoDB table is the need to figure out where to put new records in an existing B-Tree structure, a partition scheme that allows new data to go into "new" partitions, with "short" B-trees, will dramatically speed up performance. Also, the slowest loads will be bounded by the maximum size of an individual partition.

A partition scheme I generally like to use is one with several hundred partitions, driven by "Unix days" (ie, Unix time divided by 86400 seconds). Since data is typically obsoleted or archived away after a couple of years, this will allow at most a couple of days of data in a partition.

The partition key needs to be explicitly used in the WHERE clauses of queries in order to take advantage of partition pruning. Note that even if you can figure out how to derive the partition key from other data in the WHERE clause, the query optimizer often can't (or at least isn't coded to do so), so it's better to explicitly include it in the WHERE clause by itself, ANDed with the rest of the WHERE.

Some notes on using MySQL partitions:

1. Secondary indexes won't typically go much faster. Shorter B-trees will help some, but not all that much, particularly since it's hard to do both partition pruning and use a secondary index against the same table in a single query.
2. In MySQL, don't create a secondary index on the partition key, as it will effectively "hide" the partition key from MySQL and result in worse performance than if the partition is directly used.
3. Partitions can't be easily "nested". While you can use a computed expression to map your partition in more recent versions of MySQL, it is difficult for MySQL to leverage one well for the purposes of partition pruning. So, keeping your partitions simple and just using them directly in queries is preferable.
4. If you use "Unix day" as a partition, you'll want to explicitly include it in a query, even if you already have a datetime value in the query and the table. This may make queries look slightly ugly and redundant, but they'll run a lot faster.
5. The biggest "win" in searching from partitioning will be searches on the PRIMARY KEY that also use the partition key, allowing for partition pruning. Also, searches that otherwise would result in a table scan can at least do a "short" table scan only in the partitions of interest if the partition key is used in the query.

Saturday, September 3, 2011

MySQL 5.1 to 5.5 upgrade

We are in the midst of a MySQL 5.1 to MySQL 5.5 upgrade. Nobody upgrades the database engine on a running production system without a Darn Good Reason, but we had recently run into one with a bug in MySQL 5.1 that was exploding MySQL RAM use on our production server and threatening to crash the production database.

Ultimately, the problem was we wanted to use "statement batching" with UPDATE, which produces a large number of UPDATE statements in a single long string, dramatically improving performance. After a dialog with MySQL's QA people, it turned out that we were hitting a bug in 5.1's query parser in which a lot of RAM was allocated for every statement, and not freed until the entire set of queries in the query text was parsed. So, this wasn't, strictly speaking, a "memory leak", but still an open-ended memory allocation which caused the RAM assigned to MySQL to grow enormously.

5.5 appears to have solved this problem, and we're well along in our acceptance process. Other than a couple of very small issues with syntax that was deprecated in 5.1, but now an actual error in 5.5, all is going well. We'll go live shortly.

Tuesday, August 30, 2011

MySQL: Fun with MySQL query logs

The MySQL "general query log" is very useful for situations where you're trying to figure out and improve or debug a large, complex app. The problem we often run into is that Hibernate logs, JBOSS logs, or other app-level logs miss queries or don't put constant data into queries, or the app itself has a bug in it that results in queries not being logged correctly or at all.

On the other hand, MySQL's "general query log" captures all queries sent to the server when the


system variable is set to 'ON'. A useful way to capture a particular application dialog is to run the app on a lightly-trafficked database instance, preferably an isolated test instance, enable the general log with

mysql> set global general_log = 'on';

quickly do your operation, and then disable the general log with

mysql> set global general_log = 'off';

After this, look at the log. It will have lines starting with a time, a connection ID, some sort of code indicating what happened, and the query text. Figure out which connection ID(s) are interesting, use "grep" on the log to get the lines belonging to the connection IDs of interest, and redirect this output into a SQL file. This can then be edited down to the actual queries (if you remember to put a semicolon at the end of every line), and you can make a simple script that can be run using mysql's shell-level tools.

These are very useful for debugging, benchmarking of the database accesses done by web or GUI-driven applications, or other SQL-based testing.

Sunday, August 28, 2011

PRIMARY KEYs in INNODB: Choose wisely

PRIMARY KEYs in InnoDB are the primary structure used to organize data in a table. This means the choice of the PRIMARY KEY has a direct impact on performance. And for big datasets, this performance choice can be enormous.

Consider a table with a primary search attribute such as "CITY", a secondary search attribute "RANK", and a third search attribute "DATE".

A simple "traditional" approach to this table would be something like

create table myinfo (city varchar(50),
rank float,
info_date timestamp,
id bigint,
primary key (id)
) engine=innodb;

create index lookup_index
on myinfo (city, rank, info_date);

InnoDB builds the primary table data store in a B-tree structure around "id", as it's the primary key. The index "index_lookup" contains index records for every record in the table, and the primary key of the record is stored as the "lookup key" for the index.

This may look OK at first glance, and will perform decently with up to a few million records. But consider how lookups on myinfo by a query like

select * from myinfo where city = 'San Jose' and rank between 5 and 10 and date > '2011-02-15';

are answered by MySQL:

1. First, the index B-tree is walked to find the records of interest in the index structure itself.
2. Now, for every record of interest, the entire "primary" B-tree is walked to fetch the actual record values.

This means that N+1 B-trees are walked for N result records.

Now consider the following change to the above table:

create table myinfo (city varchar(50),
rank float,
info_date timestamp,
id bigint,
primary key (city, rank, info_date, id)
) engine=innodb;

create index id_lookup on myinfo (id);

The primary key is now a four-column primary key, and since "id" is distinct, it satisfies the uniqueness requirements for primary keys. The above query now only has to walk a single B-tree to be completely answered. Note also that searches against CITY alone or CITY+RANK also benefit.

Let's plug in some numbers, and put 100M records into myinfo. Let's also say that an average search returns 5,000 records.

Schema 1: (Index lookup + Primary Key lookup from index):
Lg (100M) * 1 + 5000 * Lg (100M) = 132903 B-tree operations.

Schema 2: (Primary Key lookup only):
Lg(100M) * 1 = 26 B-tree operations. (Note that this single B-tree ingress operation will fetch 5K records)

So, for this query, Schema 2 is over 5,000 times faster than Schema 1. So, if Schema 2 is answered in a second, Schema 1 will take nearly two hours.

Note that we've played a bit of a trick here, and now lookups on "ID" are relatively expensive. But there are many situations where a table identifier is rarely or never looked up, but used as the primary key as "InnoDB needs a primary key".

See also Schema Design Matters

Development in a large-data environment

One of the biggest problems with application development in the context of "Big Data" is that the developer gets the database-interacting code to "work" in the developer's "playpen" database, but the code collapses when it's put into production. A related, but even more serious problem - since it won't be found as quickly - is code that works early, but has very bad degradation as the production database grows.

There's a couple approaches to this problem:

1. Have the typical "small database" for initial coding and debugging.
2. Have a large developer playpen database. It should be a large fraction of the size of the production database, or if the production database is small, it should contain contrived (but logically consistent) data that is a large fraction of the expected size of the production database.

Developers should unit-test against both databases before committing their changes.

Schema Design Matters

One of the big "black holes" in software engineering is a lack of education and discussion of good, large-scale schema design. Virtually every blog you'll see on databases and performance focuses on tunable parameters and other operational issues, with occasional discussion on particular application-level database features.

This is obviously important, but while "tuning" will improve db performance (and a badly tuned database won't perform well at all), in many applications, the performance home runs are in schema issues.

Good schema design is hard, and in many ways counter-intuitive. To design a good large-scale schema, you'll need knowledge of how the database engine organizes data - so an excessively "abstract" approach is a problem - and you need to know what the applications will do with the schema. Simply representing the data and allowing the composition of "correct" queries is not enough.

And worrying about database normalization is far more trouble than it's worth; a large-scale schema will have to have "search" structures and "fact" or "base storage" structures.

An implication of the popularity of tools like Hadoop has been a tendency to walk away from the problem of general schema design, and focus on purpose-built search approaches that are intended to answer particular inquiries. This is necessary in data mining, but hinders more than helps the design of good, large-scale applications.

Much of the focus on schema design has been on "getting the right answer", which is obviously the zeroth requirement of a useful app, but if the queries effectively never return, are they producing the right answer?

In one of my jobs, I was, in a sense, hired due to this problem. The application was collapsing under the weight of input data, and was imperiling the growth of the company. I was actually hired to create a custom data manager for this application. However, as I looked at the database schemas, I realized that, while they were functional and "correct" schemas that were competently done from the perspective of "good" schema design, they were simply not up to the task of fetching the data that was needed to solve the business problem. In particular, if there were more than a few million entities in the database, the data browser app couldn't answer the sort of pseudo-ad-hoc inquiries users wanted to execute quickly enough to be viable.

Since writing this custom data manager would take at least a year - and the performance issues were hindering the growth of the company - I decided to table the custom data manager and deeply investigate and re-architect the schema design. The schema I came up with borrowed ideas from data warehousing, columnar databases, and Hadoop-style search-specific structures. I also minimized dependence on relational joins in favor of star-schema approaches to searching.

As it turned out, the schema redesign has made the application fast enough that the custom data manager is unnecessary.

The redesigned schema allows the app to currently handle over 100,000,000 entities (and about 2 billion database records), can probably handle at least 10x more, and is far faster than it originally was even with a couple million entities. And since we didn't have to throw hardware and software at the problem, the company's operational costs are very low given how much data (and business) is involved. And since we used MySQL, software costs are minimal.

A focus of this blog will be discussion of good schema design ideas, as well as operational issues around maintaining a good "live" schema.

Monday, May 23, 2011

Loud heroes versus boring guys

One thing that I've always found interesting about the software and IT world is the focus on drama and heroics over "getting it right". There is an odd tendency to praise and value "heroes" who put out frequent fires over setting up processes that minimize the number of fires in the first place.

Having often been such a "hero", it definitely feels good to ride in on the white charger and save the day every so often, but even with good practices, you'll have plenty of opportunity for software heroics, particularly in the startup environments I typically inhabit. Changing requirements, new customers, changing business strategies, and other external forces will create plenty of drama and chances for heroism without the engineering team creating its own.

Often, boring is preferable to loud, even if the latter is more fun.

The problem with columnar databases

Columnar databases are an interesting idea. They're databases where the major data storage structure is a column, not a row (as is the case in the vast majority of relational database engines). A "row" in a "table" in a columnar database is materialized from the columns using join-like associations.

Columnar databases offer lots of interesting storage possibilities, such as native bitmap storage, that aren't easily doable with row-major storage, in which rows are stored contiguously on storage pages.

The problem for schema design with columnar databases is that you often want to qualify the column with a couple of other data elements from the table, such as a date.

Another, more practical problem is that few people know enough about columnar databases to design schemas that leverage their capabilities well.

Friday, May 20, 2011

Some simple schema design mistakes

1. Not paying attention when determining the primary key, particularly if using index-organized primary storage structures like InnoDB or Oracle's "ORGANIZATION INDEX" tables.

2. Thinking that defining an index on multiple columns means that each column is "indexed". Multi-column indexes give you a search structure that is organized from the leftmost column to the rightmost column in the CREATE INDEX (or KEY()) statement, so if the leftmost column(s) aren't used in a given query, columns in the "middle" can't be used.