Sunday, August 28, 2011

Schema Design Matters

One of the big "black holes" in software engineering is a lack of education and discussion of good, large-scale schema design. Virtually every blog you'll see on databases and performance focuses on tunable parameters and other operational issues, with occasional discussion of particular application-level database features.

This is obviously important - "tuning" does improve database performance, and a badly tuned database won't perform well at all - but in many applications, the real performance home runs are in the schema.

Good schema design is hard, and in many ways counter-intuitive. To design a good large-scale schema, you'll need knowledge of how the database engine organizes data - so an excessively "abstract" approach is a problem - and you need to know what the applications will do with the schema. Simply representing the data and allowing the composition of "correct" queries is not enough.
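
As a concrete example of why engine internals matter: in MySQL's InnoDB, rows are physically clustered in primary-key order, so the choice of primary key decides whether a common query reads contiguous pages or scatters random I/O across the disk. A minimal sketch - the "events" table and its columns here are hypothetical, not from any real system:

CREATE TABLE events (
    entity_id  BIGINT UNSIGNED NOT NULL,
    event_time DATETIME        NOT NULL,
    event_id   BIGINT UNSIGNED NOT NULL,  -- tiebreaker so the key stays unique
    payload    VARBINARY(255),
    -- InnoDB clusters rows by primary key, so all of an entity's events
    -- sit together on disk and a per-entity range scan is sequential I/O.
    PRIMARY KEY (entity_id, event_time, event_id)
) ENGINE=InnoDB;

With an auto-increment primary key instead, the same per-entity fetch would bounce through a secondary index and turn into scattered random reads.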

And worrying about strict database normalization is far more trouble than it's worth: a large-scale schema will have to have both "fact" or "base storage" structures and "search" structures, and the search structures are, by design, redundant copies of the base data organized around the queries that matter.
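
A minimal sketch of that split, with invented table and column names: the base structure holds the authoritative rows, while the search structure is a deliberately denormalized copy keyed for the common lookup.

-- "Base storage": one authoritative, reasonably normalized row per entity.
CREATE TABLE entity (
    entity_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    name      VARCHAR(100)    NOT NULL,
    region_id INT UNSIGNED    NOT NULL,
    created   DATETIME        NOT NULL
) ENGINE=InnoDB;

-- "Search" structure: a redundant copy of just the columns the common
-- inquiries filter on, physically ordered the way those inquiries scan.
-- The application (or triggers) keeps it in sync with the base table.
CREATE TABLE entity_search (
    region_id INT UNSIGNED    NOT NULL,
    created   DATETIME        NOT NULL,
    entity_id BIGINT UNSIGNED NOT NULL,
    name      VARCHAR(100)    NOT NULL,
    PRIMARY KEY (region_id, created, entity_id)
) ENGINE=InnoDB;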

One implication of the popularity of tools like Hadoop has been a tendency to walk away from the problem of general schema design and focus on purpose-built search approaches intended to answer particular inquiries. That is necessary in data mining, but it hinders more than helps the design of good, large-scale applications.

Much of the focus on schema design has been on "getting the right answer", which is obviously the zeroth requirement of a useful app, but if the queries effectively never return, are they producing the right answer?

In one of my jobs, I was, in a sense, hired due to this problem. The application was collapsing under the weight of its input data and imperiling the growth of the company. I was actually hired to create a custom data manager for the application. However, as I looked at the database schemas, I realized that while they were functional, "correct", and competently done by the usual standards of schema design, they were simply not up to the task of fetching the data needed to solve the business problem. In particular, once there were more than a few million entities in the database, the data browser app couldn't answer the sort of pseudo-ad-hoc inquiries users wanted to run quickly enough to be viable.

Since writing this custom data manager would take at least a year - and the performance issues were hindering the growth of the company - I decided to table that project and instead deeply investigate and re-architect the schema design. The schema I came up with borrowed ideas from data warehousing, columnar databases, and Hadoop-style search-specific structures. I also minimized dependence on relational joins in favor of star-schema approaches to searching.
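
To give a flavor of the star-schema side of it (the sales domain below is invented for illustration; the actual application's domain isn't described here): small dimension tables surround one wide fact table, so a search resolves dimension keys first and then hits the fact table directly, instead of chaining joins across a web of normalized tables.

-- Small dimension table: cheap to cache, cheap to look keys up in.
CREATE TABLE dim_region (
    region_id   INT UNSIGNED NOT NULL PRIMARY KEY,
    region_name VARCHAR(50)  NOT NULL
) ENGINE=InnoDB;

-- One wide fact table, with a key per common search path.
CREATE TABLE fact_sales (
    region_id  INT UNSIGNED  NOT NULL,
    product_id INT UNSIGNED  NOT NULL,
    sale_date  DATE          NOT NULL,
    quantity   INT           NOT NULL,
    amount     DECIMAL(12,2) NOT NULL,
    KEY (region_id, sale_date),
    KEY (product_id, sale_date)
) ENGINE=InnoDB;

-- A typical inquiry touches one dimension key and the fact table, no joins:
SELECT sale_date, SUM(amount)
FROM fact_sales
WHERE region_id = 7
  AND sale_date >= '2011-01-01'
GROUP BY sale_date;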

As it turned out, the schema redesign has made the application fast enough that the custom data manager is unnecessary.

The redesigned schema currently handles over 100,000,000 entities (about 2 billion database records), can probably handle at least 10x more, and makes the app far faster than it originally was with even a couple million entities. Because we didn't have to throw hardware and software at the problem, the company's operational costs are very low given how much data (and business) is involved; and since we used MySQL, software costs are minimal.

A focus of this blog will be discussion of good schema design ideas, as well as operational issues around maintaining a good "live" schema.
