Saturday, September 10, 2011

Big Data: "Telemetry" schemas

Many database schemas have similar characteristics, and one common - and important - schema type is what I call the "telemetry" schema. Telemetry schemas have certain things in common:

1. They have a small number of relatively simple tables. I'll call these the "telemetry log tables".
2. They have a time component, and are often queried using the time component.
3. They have at least one score or amount associated with the data, which is also queried frequently.
4. Telemetry log tables are often very large, with hundreds of millions or billions of records.
5. While there may be summary tables that are updated frequently, records in the telemetry log tables are rarely or never updated.

Examples of telemetry schemas include:

1. Actual telemetry data from various types of sensors.
2. Wall street-style trading data.
3. Other transactional data, such as banking activity, point-of-sale activity, etc.

As mentioned above, telemetry schemas often have a time series component, but also need to be queried in interesting ways using approaches other than the simple time component.

Telemetry schemas pose several challenges to standard schema design strategies:

1. Telemetry log tables are typically too large for relational joins, even when well-indexed, to perform well, particularly on "bulk" searches that visit many records.
2. Most database engines will "anchor" a search on one index, use it to fetch records out of the base table, and finish qualifying the search using values from these records, even if other indexes exist on the qualifying rows. This search strategy will perform awfully with huge tables, unless the engine gets lucky and picks a highly selective index, or the search happens to be on a very selective value in the index. Statistics-based query optimizers can help some, but can still "whiff badly", and this can result in what I call the database "death penalty" for queries: a badly-optimized "loser" query that takes days to run.

In a future blog post, I'll talk about a search strategy I've successfully used for a large telemetry schemas.

No comments: