What makes a good dataset?

2024-09-23

I run a data engineering consultancy. It's my job to make datasets good. This is an overview of what I do for my clients when they put me and my team on retainer.

Observations

The first thing I check on a new project is whether the dataset is composed of business events. In practice, these business events are usually things like sales, purchases, sign-ups, bookings, enrollments, incidents, and deliveries. It's a broad but prescriptive term that I think is reasonably clear.

This is what data engineers and scientists mean when they look for "observations." Each row in a good dataset should be an observation of some kind of event.

Each row in a good dataset should also be of a similar granularity, or scope, to the other rows in the dataset. If one row in a sales dataset represents a whole receipt, then the other rows in the dataset should each represent a whole receipt. Good datasets do not mix rows that represent single receipts with rows that represent whole days.

It is also best if the granularity of each row is as small as possible. In a sales dataset, the smallest event is usually a line item, not a receipt. No matter the domain, it is best to check if your dataset is really as granular as it can be. Question your assumptions.
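
Here's a quick sketch in pandas, with invented names and numbers, of why the smallest grain wins: you can always roll line items up into receipts, but never the other way around.

```python
import pandas as pd

# A sales dataset at line-item granularity: one row per product on a receipt.
# All names and numbers here are invented for illustration.
line_items = pd.DataFrame({
    "receipt_id": [1, 1, 2, 2, 2],
    "product":    ["vinegar", "soy sauce", "vinegar", "rice", "rice"],
    "quantity":   [2, 1, 1, 3, 2],
    "subtotal":   [50.0, 30.0, 25.0, 120.0, 80.0],
})

# Rolling fine-grained rows up to receipt granularity is easy...
receipts = line_items.groupby("receipt_id", as_index=False)[["quantity", "subtotal"]].sum()
print(receipts)

# ...but if the dataset only stored receipt totals, there would be no way
# to recover which products each receipt contained.
```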

Metrics

Each observation in a dataset usually has some sort of number attached to it. In a sales dataset, per line item, these numbers might be things like quantities and subtotals. These numbers are what we call "metrics," or sometimes "measures."

Metrics should be additive. This just means that it should make sense to add metric values across rows. Bad datasets have non-additive metrics, for example, pre-computed averages: you cannot combine two rows' averages into anything meaningful without also knowing how many observations went into each. You can't do much with non-additive metrics.
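
A small sketch in pandas, with made-up numbers, of why a pre-computed average makes a poor metric:

```python
import pandas as pd

# Two rows carrying a pre-computed average instead of additive metrics.
# The numbers are invented for illustration.
stores = pd.DataFrame({
    "store":     ["A", "B"],
    "avg_sale":  [100.0, 200.0],   # pre-computed average sale value per store
    "num_sales": [9, 1],
})

# Naively averaging the averages ignores how many sales each row covers:
print(stores["avg_sale"].mean())   # 150.0 -- wrong

# With additive metrics (totals and counts), the true average is recoverable:
stores["total_sales"] = stores["avg_sale"] * stores["num_sales"]
print(stores["total_sales"].sum() / stores["num_sales"].sum())   # 110.0 -- correct
```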

Dimensions

Some of you might be wondering whether datasets of non-events like people or places are bad datasets. They're not bad in themselves, but they're most useful as context attached to events. In a good sales dataset, each line item row will usually have context attached to it like who the customer was, who the cashier was, which branch the transaction occurred in, what product was sold, and when the transaction occurred. These supplementary datasets are what we call "dimensions." They are the basis of filters and aggregations, which we will discuss in more detail later.

In some datasets, dimensional data is placed right beside the observation (or "fact", as observations are sometimes called). Even though this duplicates data, it makes the dataset very simple to work with. If space is a concern, dimensional data is sometimes split off from the fact and referenced with a foreign key. The schema of an analytic dataset is supposed to be simple, so good datasets keep each dimension at most one join away from the fact. If dimensions sit beside the fact, the schema is called "one-big-table"; if dimensions are one join away, the schema is called a "star schema." Anything beyond this is considered undesirable.
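
Here's a minimal star-schema sketch in pandas; the table and column names are invented, and a single merge brings the dimension back beside the fact.

```python
import pandas as pd

# Fact table: one row per line item, with a foreign key to the product dimension.
# All names are invented for illustration.
fact_sales = pd.DataFrame({
    "receipt_id": [1, 1, 2],
    "product_id": [10, 11, 10],
    "quantity":   [2, 1, 3],
    "subtotal":   [50.0, 30.0, 75.0],
})

dim_product = pd.DataFrame({
    "product_id": [10, 11],
    "product":    ["vinegar", "soy sauce"],
    "category":   ["Food and Beverages", "Food and Beverages"],
})

# One join away (star schema) -> one-big-table with the dimension beside the fact.
one_big_table = fact_sales.merge(dim_product, on="product_id", how="left")
print(one_big_table)
```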

A good dataset's dimensions are internally consistent. Bad datasets have inconsistencies everywhere. Most dimensional data comes in the form of strings (text). Computers do not tolerate typos in strings well. Most analytical systems will (rightfully) consider the product categories "Food and Beverages" and "F&B" to be completely different. This problem is known as "data duplication" and is extremely difficult to solve downstream from the source of the dataset. It is possible, but it is imperfect and prohibitively expensive.
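
For illustration only, here's the kind of imperfect downstream patch-up I mean, sketched in pandas with an invented mapping; it only fixes the variants you already know about.

```python
import pandas as pd

sales = pd.DataFrame({
    "product":  ["vinegar", "soy sauce", "vinegar"],
    "category": ["Food and Beverages", "F&B", "F and B"],
})

# A hand-maintained mapping of known variants to one canonical label.
# It only covers the inconsistencies you have already seen -- new variants
# will slip through, which is why fixing this at the source is better.
canonical = {"F&B": "Food and Beverages", "F and B": "Food and Beverages"}
sales["category"] = sales["category"].replace(canonical)
print(sales["category"].unique())   # ['Food and Beverages']
```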

Good dimension data is atomic. This just means that it should not be possible to break dimension data into multiple parts. This is how you can tell whether your dataset's granularity is too large. We know by now that a whole receipt is too large to use for a sales dataset. A dataset that uses whole receipts as its granularity would need lists of products as its product dimension, which is a hint that the dataset is too coarse. Breaking sales data into line items allows us to use atomic products as the product dimension. You can perform transformations like this using your data tool's explode/unnest/flatten function.
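
In pandas, that transformation looks roughly like this (column names invented):

```python
import pandas as pd

# Receipt-grain rows: the product "dimension" is a list, a hint that the grain is too coarse.
receipts = pd.DataFrame({
    "receipt_id": [1, 2],
    "products":   [["vinegar", "soy sauce"], ["vinegar", "rice", "rice"]],
})

# explode() turns each list element into its own row, giving line-item grain
# with one atomic product value per row.
line_items = receipts.explode("products").rename(columns={"products": "product"})
print(line_items)
```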

Filters and aggregations

If your dataset has a consistent granularity, has additive metrics, and has consistent dimensions, then it is ready to be used in analytics. By far, the most common operations on data are filtering and aggregation. Filters include or exclude rows based on certain conditions. Aggregations take a subset of rows and apply an aggregation function (a function that takes many inputs and returns one output) to that subset. Most often, these aggregation functions are SUM, AVG, COUNT, MIN, and MAX, though there are others.
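
Sketched in pandas with invented data, the two operations look like this:

```python
import pandas as pd

line_items = pd.DataFrame({
    "store":    ["A", "A", "B", "B"],
    "product":  ["vinegar", "rice", "vinegar", "rice"],
    "quantity": [2, 3, 1, 5],
    "subtotal": [50.0, 120.0, 25.0, 200.0],
})

# Filter: keep only the rows matching a condition.
vinegar = line_items[line_items["product"] == "vinegar"]

# Aggregation: collapse each group of rows into one value per aggregation function.
summary = line_items.groupby("store")["subtotal"].agg(["sum", "mean", "count", "min", "max"])
print(vinegar)
print(summary)
```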

Dimensions are so important because they serve as the basis for filters and aggregations. In a sales dataset, you will typically have to answer questions like "how many units of vinegar from brand XYZ did we sell last week in all our stores in Southeast Asia, grouped by country?" A well-designed dataset will allow analysts to easily slice and dice on all these different conditions and requirements. This example might have dimensional data for product, brand (via product), time, store, country (via store), and region (via store). It should be trivial to filter on product, brand, time, and region, and to group the aggregation by country. It should also be trivial to change the filters and groupings if needed.
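
Here's a rough pandas sketch of that exact question, with invented fact and dimension tables and dates:

```python
import pandas as pd

# Invented fact and dimension tables for the example question.
fact = pd.DataFrame({
    "product_id": [1, 1, 2, 1],
    "store_id":   [10, 11, 10, 12],
    "sale_date":  pd.to_datetime(["2024-09-16", "2024-09-17", "2024-09-17", "2024-09-18"]),
    "quantity":   [3, 5, 2, 4],
})
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product":    ["vinegar", "rice"],
    "brand":      ["XYZ", "ABC"],
})
dim_store = pd.DataFrame({
    "store_id": [10, 11, 12],
    "country":  ["Philippines", "Singapore", "Japan"],
    "region":   ["Southeast Asia", "Southeast Asia", "East Asia"],
})

# "How many units of vinegar from brand XYZ did we sell last week
#  in all our stores in Southeast Asia, grouped by country?"
df = fact.merge(dim_product, on="product_id").merge(dim_store, on="store_id")
answer = (
    df[(df["product"] == "vinegar")
       & (df["brand"] == "XYZ")
       & (df["region"] == "Southeast Asia")
       & (df["sale_date"].between("2024-09-16", "2024-09-22"))]
    .groupby("country")["quantity"].sum()
)
print(answer)
```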

This is why data professionals prefer rows of smaller granularity. You can always compute aggregates from rows that are too fine, but you can never recover detail from rows that are too coarse. Interestingly, some datasets are intentionally made too coarse precisely because coarseness hides detail that the publisher of the dataset might want to protect.

Long and wide

Many of you are likely familiar with pivot tables. These are tables that have one dimension on the x-axis, a second dimension on the y-axis, and an aggregation applied to the metrics of the rows that match each combination of dimension values. In a sales dataset, this might be something like each store's (y-axis) sales (aggregation) per month (x-axis).

Pivot tables might make fine reports, but they are bad datasets. Pivot tables break the first guideline for good datasets: each row should be an observation of an event. This is unfortunate, because a surprising number of "datasets" are published in pivot format.

The opposite of the pivot format is the "long" format, where the values along both axes become values in dimension columns of their own. The pivot format is sometimes called the "wide" format to contrast it with the "long" format. This is difficult to explain with words, so here's an illustration. This is a table in pivot format:

|         | January | February |
|---------|---------|----------|
| Store A | 400     | 350      |
| Store B | 200     | 250      |

This is a table in long format:

| store   | month    | sales |
|---------|----------|-------|
| Store A | January  | 400   |
| Store A | February | 350   |
| Store B | January  | 200   |
| Store B | February | 250   |

Long formats are much preferred for datasets that must be used in further analysis. As much as possible, the set of columns should stay fixed; variation belongs in the row values, so new data arrives as new rows rather than new columns.

If you receive a dataset in wide format, you can sometimes transform it into long format with your data tool's "unpivot" or "melt" function.
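
In pandas, for example, melting the wide table above looks roughly like this:

```python
import pandas as pd

# The wide (pivot-format) table from the illustration above.
wide = pd.DataFrame({
    "store":    ["Store A", "Store B"],
    "January":  [400, 200],
    "February": [350, 250],
})

# melt() turns the month columns into rows of a single "month" dimension.
long = wide.melt(id_vars="store", var_name="month", value_name="sales")
print(long)
```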

Unique values

One common thing to ask of datasets is how many unique values there are. In a sales dataset, this can come up as "how many unique users have ever purchased from each of our brands?"

Unique values are not problematic on their own, but take care when using a dataset that has a unique value count as one of its metrics. These metrics are not necessarily additive. If your business allows customers to purchase from more than one brand, a customer might contribute to the unique customer count of multiple brands. It does not make sense to add the counts of unique customers per brand in this case.
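
A small pandas sketch, with invented purchases, of how summing per-brand unique counts overstates the true number of customers:

```python
import pandas as pd

purchases = pd.DataFrame({
    "customer": ["alice", "alice", "bob", "carol"],
    "brand":    ["XYZ",   "ABC",   "XYZ", "ABC"],
})

per_brand = purchases.groupby("brand")["customer"].nunique()
print(per_brand.sum())                   # 4 -- alice is counted once per brand
print(purchases["customer"].nunique())   # 3 -- the true number of unique customers
```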

Caveats

What I describe here is merely an ideal you can measure datasets against. My team and I are well aware that the real world rarely allows this ideal to be reached, if at all.