What makes a good dataset?

2024-09-23

I run a data engineering consultancy. It's my job to make datasets good. This is an overview of what I do for my clients when they put me and my team on retainer.

Observations

The first thing I check on a new project is whether the dataset is composed of business events. In practice, these business events are usually things like sales, purchases, sign-ups, bookings, enrollments, incidents, and deliveries. It's a broad but prescriptive term that I think is reasonably clear.

This is what data engineers and scientists mean when they look for "observations." Each row in a good dataset should be an observation of some kind of event.

Each row in a good dataset should also be of a similar granularity, or scope, to the other rows in the dataset. If one row in a sales dataset represents a whole receipt, then the other rows in the dataset should each represent a whole receipt. Good datasets do not mix rows that represent single receipts with rows that represent whole days.

It is also best if the granularity of each row is as small as possible. In a sales dataset, the smallest event is usually a line item, not a receipt. No matter the domain, it is best to check if your dataset is really as granular as it can be. Question your assumptions.
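
Here's a quick sketch in pandas, with invented names and numbers, of why the smallest grain wins: you can always roll line items up into receipts, but never the other way around.

```python
import pandas as pd

# A sales dataset at line-item granularity: one row per product on a receipt.
# All names and numbers here are invented for illustration.
line_items = pd.DataFrame({
    "receipt_id": [1, 1, 2, 2, 2],
    "product":    ["vinegar", "soy sauce", "vinegar", "rice", "rice"],
    "quantity":   [2, 1, 1, 3, 2],
    "subtotal":   [50.0, 30.0, 25.0, 120.0, 80.0],
})

# Rolling fine-grained rows up to receipt granularity is easy...
receipts = line_items.groupby("receipt_id", as_index=False)[["quantity", "subtotal"]].sum()
print(receipts)

# ...but if the dataset only stored receipt totals, there would be no way
# to recover which products each receipt contained.
```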

Metrics

Each observation in a dataset usually has some sort of number attached to it. In a sales dataset, per line item, these numbers might be things like quantities and subtotals. These numbers are what we call "metrics," or sometimes "measures."

Metrics should be additive. This just means that it should make sense to add metric values across rows. Bad datasets have non-additive metrics, for example, pre-computed averages: you cannot combine two rows' averages into anything meaningful without also knowing how many observations went into each. You can't do much with non-additive metrics.
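
A small sketch in pandas, with made-up numbers, of why a pre-computed average makes a poor metric:

```python
import pandas as pd

# Two rows carrying a pre-computed average instead of additive metrics.
# The numbers are invented for illustration.
stores = pd.DataFrame({
    "store":     ["A", "B"],
    "avg_sale":  [100.0, 200.0],   # pre-computed average sale value per store
    "num_sales": [9, 1],
})

# Naively averaging the averages ignores how many sales each row covers:
print(stores["avg_sale"].mean())   # 150.0 -- wrong

# With additive metrics (totals and counts), the true average is recoverable:
stores["total_sales"] = stores["avg_sale"] * stores["num_sales"]
print(stores["total_sales"].sum() / stores["num_sales"].sum())   # 110.0 -- correct
```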

Dimensions

Some of you might be wondering whether datasets of non-events like people or places are bad datasets. They're not bad in themselves, but they're most useful as context attached to events. In a good sales dataset, each line item row will usually have context attached to it like who the customer was, who the cashier was, which branch the transaction occurred in, what product was sold, and when the transaction occurred. These supplementary datasets are what we call "dimensions." They are the basis of filters and aggregations, which we will discuss in more detail later.

In some datasets, dimensional data is placed right beside the observation (or "fact", as observations are sometimes called). Even though this duplicates data, it makes the dataset very simple to work with. If space is a concern, dimensional data is sometimes split off from the fact and referenced with a foreign key. The schema of an analytic dataset is supposed to be simple, so good datasets keep each dimension at most one join away from the fact. If dimensions sit beside the fact, the schema is called "one-big-table"; if dimensions are one join away, the schema is called a "star schema." Anything beyond this is considered undesirable.
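
Here's a minimal star-schema sketch in pandas; the table and column names are invented, and a single merge brings the dimension back beside the fact.

```python
import pandas as pd

# Fact table: one row per line item, with a foreign key to the product dimension.
# All names are invented for illustration.
fact_sales = pd.DataFrame({
    "receipt_id": [1, 1, 2],
    "product_id": [10, 11, 10],
    "quantity":   [2, 1, 3],
    "subtotal":   [50.0, 30.0, 75.0],
})

dim_product = pd.DataFrame({
    "product_id": [10, 11],
    "product":    ["vinegar", "soy sauce"],
    "category":   ["Food and Beverages", "Food and Beverages"],
})

# One join away (star schema) -> one-big-table with the dimension beside the fact.
one_big_table = fact_sales.merge(dim_product, on="product_id", how="left")
print(one_big_table)
```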

A good dataset's dimensions are internally consistent. Bad datasets have inconsistencies everywhere. Most dimensional data comes in the form of strings (text). Computers do not tolerate typos in strings well. Most analytical systems will (rightfully) consider the product categories "Food and Beverages" and "F&B" to be completely different. This problem is known as "data duplication" and is extremely difficult to solve downstream from the source of the dataset. It is possible, but it is imperfect and prohibitively expensive.
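
For illustration only, here's the kind of imperfect downstream patch-up I mean, sketched in pandas with an invented mapping; it only fixes the variants you already know about.

```python
import pandas as pd

sales = pd.DataFrame({
    "product":  ["vinegar", "soy sauce", "vinegar"],
    "category": ["Food and Beverages", "F&B", "F and B"],
})

# A hand-maintained mapping of known variants to one canonical label.
# It only covers the inconsistencies you have already seen -- new variants
# will slip through, which is why fixing this at the source is better.
canonical = {"F&B": "Food and Beverages", "F and B": "Food and Beverages"}
sales["category"] = sales["category"].replace(canonical)
print(sales["category"].unique())   # ['Food and Beverages']
```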

Good dimension data is atomic. This just means that it should not be possible to break dimension data into multiple parts. This is how you can tell whether your dataset's granularity is too large. We know by now that a whole receipt is too large to use for a sales dataset. A dataset that uses whole receipts as its granularity would need lists of products as its product dimension, which is a hint that the dataset is too coarse. Breaking sales data into line items allows us to use atomic products as the product dimension. You can perform transformations like this using your data tool's explode/unnest/flatten function.
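
In pandas, that transformation looks roughly like this (column names invented):

```python
import pandas as pd

# Receipt-grain rows: the product "dimension" is a list, a hint that the grain is too coarse.
receipts = pd.DataFrame({
    "receipt_id": [1, 2],
    "products":   [["vinegar", "soy sauce"], ["vinegar", "rice", "rice"]],
})

# explode() turns each list element into its own row, giving line-item grain
# with one atomic product value per row.
line_items = receipts.explode("products").rename(columns={"products": "product"})
print(line_items)
```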

Filters and aggregations

If your dataset has a consistent granularity, has additive metrics, and has consistent dimensions, then it is ready to be used in analytics. By far, the most common operations on data are filtering and aggregation. Filters include or exclude rows based on certain conditions. Aggregations take a subset of rows and apply an aggregation function (a function that takes many inputs and returns one output) to that subset. Most often, these aggregation functions are SUM, AVG, COUNT, MIN, and MAX, though there are others.
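
Sketched in pandas with invented data, the two operations look like this:

```python
import pandas as pd

line_items = pd.DataFrame({
    "store":    ["A", "A", "B", "B"],
    "product":  ["vinegar", "rice", "vinegar", "rice"],
    "quantity": [2, 3, 1, 5],
    "subtotal": [50.0, 120.0, 25.0, 200.0],
})

# Filter: keep only the rows matching a condition.
vinegar = line_items[line_items["product"] == "vinegar"]

# Aggregation: collapse each group of rows into one value per aggregation function.
summary = line_items.groupby("store")["subtotal"].agg(["sum", "mean", "count", "min", "max"])
print(vinegar)
print(summary)
```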

Dimensions are so important because they serve as the basis for filters and aggregations. In a sales dataset, you will typically have to answer questions like "how many units of vinegar from brand XYZ did we sell last week in all our stores in Southeast Asia, grouped by country?" A well-designed dataset will allow analysts to easily slice and dice on all these different conditions and requirements. This example might have dimensional data for product, brand (via product), time, store, country (via store), and region (via store). It should be trivial to filter on product, brand, time, and region, and to group the aggregation by country. It should also be trivial to change the filters and groupings if needed.
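
Here's a rough pandas sketch of that exact question, with invented fact and dimension tables and dates:

```python
import pandas as pd

# Invented fact and dimension tables for the example question.
fact = pd.DataFrame({
    "product_id": [1, 1, 2, 1],
    "store_id":   [10, 11, 10, 12],
    "sale_date":  pd.to_datetime(["2024-09-16", "2024-09-17", "2024-09-17", "2024-09-18"]),
    "quantity":   [3, 5, 2, 4],
})
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product":    ["vinegar", "rice"],
    "brand":      ["XYZ", "ABC"],
})
dim_store = pd.DataFrame({
    "store_id": [10, 11, 12],
    "country":  ["Philippines", "Singapore", "Japan"],
    "region":   ["Southeast Asia", "Southeast Asia", "East Asia"],
})

# "How many units of vinegar from brand XYZ did we sell last week
#  in all our stores in Southeast Asia, grouped by country?"
df = fact.merge(dim_product, on="product_id").merge(dim_store, on="store_id")
answer = (
    df[(df["product"] == "vinegar")
       & (df["brand"] == "XYZ")
       & (df["region"] == "Southeast Asia")
       & (df["sale_date"].between("2024-09-16", "2024-09-22"))]
    .groupby("country")["quantity"].sum()
)
print(answer)
```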

This is why data professionals prefer rows of smaller granularity. You can always compute aggregates from rows that are too fine, but you can never recover detail from rows that are too coarse. Interestingly, some datasets are intentionally made too coarse precisely because coarseness hides detail that the publisher of the dataset might want to protect.

Long and wide

Many of you are likely familiar with pivot tables. These are tables that have one dimension on the x-axis, a second dimension on the y-axis, and an aggregation applied to the metrics of the rows that match each combination of dimension values. In a sales dataset, this might be something like each store's (y-axis) sales (aggregation) per month (x-axis).

Pivot tables might make fine reports, but they are bad datasets. Pivot tables break the first guideline for good datasets: each row should be an observation of an event. This is unfortunate, because a surprising number of "datasets" are published in pivot format.

The opposite of the pivot format is the "long" format, where the values along both axes become values in dimension columns of their own. The pivot format is sometimes called the "wide" format to contrast it with the "long" format. This is difficult to explain with words, so here's an illustration. This is a table in pivot format:

|         | January | February |
|---------|---------|----------|
| Store A | 400     | 350      |
| Store B | 200     | 250      |

This is a table in long format:

| store   | month    | sales |
|---------|----------|-------|
| Store A | January  | 400   |
| Store A | February | 350   |
| Store B | January  | 200   |
| Store B | February | 250   |

Long formats are much preferred for datasets that must be used in further analysis. As much as possible, the set of columns should stay fixed; variation belongs in the row values, so new data arrives as new rows rather than new columns.

If you receive a dataset in wide format, you can sometimes transform it into long format with your data tool's "unpivot" or "melt" function.
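
In pandas, for example, melting the wide table above looks roughly like this:

```python
import pandas as pd

# The wide (pivot-format) table from the illustration above.
wide = pd.DataFrame({
    "store":    ["Store A", "Store B"],
    "January":  [400, 200],
    "February": [350, 250],
})

# melt() turns the month columns into rows of a single "month" dimension.
long = wide.melt(id_vars="store", var_name="month", value_name="sales")
print(long)
```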

Unique values

One common thing to ask of datasets is how many unique values there are. In a sales dataset, this can come up as "how many unique users have ever purchased from each of our brands?"

Unique values are not problematic on their own, but take care when using a dataset that has a unique value count as one of its metrics. These metrics are not necessarily additive. If your business allows customers to purchase from more than one brand, a customer might contribute to the unique customer count of multiple brands. It does not make sense to add the counts of unique customers per brand in this case.
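
A small pandas sketch, with invented purchases, of how summing per-brand unique counts overstates the true number of customers:

```python
import pandas as pd

purchases = pd.DataFrame({
    "customer": ["alice", "alice", "bob", "carol"],
    "brand":    ["XYZ",   "ABC",   "XYZ", "ABC"],
})

per_brand = purchases.groupby("brand")["customer"].nunique()
print(per_brand.sum())                   # 4 -- alice is counted once per brand
print(purchases["customer"].nunique())   # 3 -- the true number of unique customers
```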

Caveats

What I describe here is merely an ideal you can measure datasets against. My team and I are well aware that the real world rarely allows this ideal to be reached, if at all.