My workshop material for SQL

2024-02-29

This article is the blog version of the material that I deliver for SQL workshops. My target audience usually has some familiarity with SQL, so this material skips over the basics.

My general goal with these workshops is to convince the analysts in my audience that they aren't screwed if they don't have Python. SQL is much more powerful than most people realize.

The following sections aren't necessarily meant to be a cohesive whole. Depending on the context of my audience, I select sections to deliver as modules under an overarching theme.

Start with FROM

You may have been thinking about SQL wrong.

If you have written any SQL at all, you know that most read queries grammatically start with SELECT. Among experienced users of SQL, this language design choice is widely considered to be a mistake.

SELECT
    first_name, last_name, dob
FROM customer
WHERE dob < '2005-02-28';

This is a language design mistake because it gives newbies the wrong idea of how SQL works. A query is logically evaluated much like you would transform a data structure in a programming language: any data transformation must begin with the source. In SQL, that source is described by the FROM clause. Once you have specified your source data, you can then do things like filter it (WHERE), aggregate it (GROUP BY), and map it (SELECT).

My all-time favorite article on this subject is this post by Julia Evans. In it, she describes that the basic components of a SQL query are logically evaluated in this order:

1. FROM (and JOINs)
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT (including window functions)
6. ORDER BY
7. LIMIT
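
To make that order concrete, here is the query from above again, annotated with when each clause logically runs:

SELECT                          -- 3. map each remaining row to the requested columns
    first_name, last_name, dob
FROM customer                   -- 1. start with the source data
WHERE dob < '2005-02-28';       -- 2. keep only the rows that pass the filter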

Thinking about SQL in this way makes more sense. Of course you have to start a transform with your source data. That's how it works in every other data-processing language, and it's no different here.

Realize as well that this group of clauses yields a result set. A result set is a data source. You can use the results of one query as the data source of another. I have found that this is where things begin to click for my workshop audiences.

Common table expressions

It is the job of your data engineers to make analytic queries as simple as possible. Sometimes, though, your clients will ask for something that the data model doesn't support at face value -- for example, using data from two different tables that were previously thought to be unrelated.

We had this exact scenario at one of my consulting engagements. The project we were supporting was a loyalty campaign where customers could earn points by transacting. These points could then be redeemed for rewards.

Our data platform had two main fact tables: transaction and redemption. When my client's stakeholders started asking for insights about the redemption rate of points, I had to merge these two into a single fact table, point_event. To greatly oversimplify, after some tweaks to the original fact tables, I defined point_event as follows:

SELECT * FROM "transaction"
UNION
SELECT * FROM "redemption"
ORDER BY "eventTimestamp";

Now, point_event could be used in analytic queries. The most straightforward way to use the result of a query as the source of another is simply to use the first query in the FROM clause of the second. This technique is called "subquerying."

SELECT
    SUM(CASE WHEN "eventType" = 'transaction' THEN 1 ELSE 0 END) AS "transactionCount"
FROM (
    SELECT * FROM "transaction"
    UNION
    SELECT * FROM "redemption"
    ORDER BY "eventTimestamp"
) AS point_event;

If you have many such subqueries, though, it is often more convenient to organize them linearly and give them names. These are called "common table expressions" (CTEs). If you're a programmer: where subqueries are like nesting function calls, CTEs are like storing intermediate results in named variables.

WITH point_event AS (
    SELECT * FROM "transaction"
    UNION
    SELECT * FROM "redemption"
    ORDER BY "eventTimestamp"
)
SELECT
    SUM(CASE WHEN "eventType" = 'transaction' THEN 1 ELSE 0 END) AS "transactionCount"
FROM point_event;

Once a CTE has been defined, you can use it as a data source in the rest of the query. Later CTEs can also reference earlier ones.

WITH point_event AS (
    SELECT * FROM "transaction"
    UNION
    SELECT * FROM "redemption"
    ORDER BY "eventTimestamp"
),
denormalized_point_event AS (
    SELECT * FROM point_event
    LEFT JOIN customer
        ON point_event.customer_id = customer.id
)
SELECT
    province,
    SUM(CASE WHEN "eventType" = 'transaction' THEN 1 ELSE 0 END) AS "transactionCount"
FROM denormalized_point_event
GROUP BY province;

Knowing how to sling around result sets like this almost makes SQL feel like a normal programming language.

Window functions

Most of you are familiar with aggregate functions. These functions take many values and compress them into one.

SELECT
    COUNT(*)
FROM app_transaction;

Most of you are also familiar with groups. Groups tell aggregate functions to partition, or bucket, their calculations. With groups, instead of operating on the whole set of rows, aggregate functions operate on subsets of rows.

SELECT
    customer_id,
    COUNT(*)
FROM app_transaction
GROUP BY customer_id;

The trouble with aggregate functions is that they reduce the detail of your table. This might not be a problem when you're creating reports, but sometimes you need the result of an aggregation available on every detail row for intermediate calculations.

You can use window functions for this. A window function is just a function that is aware of a set of rows beyond the current row. That set of rows is called the window.

The classic example of a window function is on a table of employees and their salaries. If we are interested in getting the average salary of all employees, we can simply do something like this with an aggregate function:

SELECT
    AVG(salary)
FROM employee;

We know, though, that this will collapse the entire table into a single row. That's what aggregate functions do. If we instead want to add the average salary as another column in our result set, we can use AVG as a window function instead of an aggregate function:

SELECT
    *,
    AVG(salary) OVER ()
FROM employee;

The extra syntax here, OVER (), is the window. The details of the window are usually written out within the parens. By default, empty parens mean that the window is the whole result set.

Let's do something a little more useful. If we instead wanted to get the average salary per department, the normal aggregate function query would look like this:

SELECT
    department,
    AVG(salary)
FROM employee
GROUP BY department;

If we instead want to make the average salary of each individual employee's department available to each row in the employee table, we use a window function again. This time, since we have groups, we tell the window to group by department. In a window function, the syntax to group is usually PARTITION BY.

SELECT
    *,
    AVG(salary) OVER (PARTITION BY department)
FROM employee;

I do feel that this standard example undersells the utility of window functions. Window functions are most interestingly used to compare a row against other rows in its partition. For example, let's see how much an employee is being paid relative to other employees in their department.

SELECT
    *,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC)
FROM employee;

Remember: a window function is aware of a set of rows beyond the current row. In this case, RANK() is aware of every other employee within the current employee's department. We can additionally tell the window to order itself by a column, which is necessary here for a ranking to make sense.
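
One way to put this to work, combining it with the CTE pattern from earlier, is to keep only the top-paid employees in each department. Here's a sketch against the same employee table (the ranked_employee name is just one I picked):

WITH ranked_employee AS (
    SELECT
        *,
        -- Rank employees within their department, highest salary first
        RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
    FROM employee
)
SELECT *
FROM ranked_employee
WHERE salary_rank = 1;  -- keep only the top earner(s) in each department

Window functions can't appear directly in a WHERE clause, so the ranking has to be computed in a CTE (or subquery) first and filtered afterward.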

My personal favorite use case for window functions is time series data. I once had to add a column to a transactions table for the time since the customer's previous transaction. This is easy if you know about the LAG window function, which gets the value of a column at a specified offset before the current row within the window. It's really difficult if you don't.
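
A minimal sketch of that query, assuming the app_transaction table from earlier has a transacted_at timestamp column (that column name is a stand-in; yours will differ):

-- Assumes app_transaction has a transacted_at timestamp column (hypothetical name)
SELECT
    *,
    -- LAG(transacted_at) returns the previous row's timestamp within the window:
    -- this customer's transactions, ordered by time
    transacted_at - LAG(transacted_at) OVER (
        PARTITION BY customer_id
        ORDER BY transacted_at
    ) AS time_since_previous_transaction
FROM app_transaction;

Each customer's first transaction has no previous row, so LAG returns NULL there.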

Takeaways

I'll summarize my workshop's theses about SQL here for my own convenience.

1. Read queries start with FROM, not SELECT. Data flows from the source, through filters and aggregations, into the final projection.
2. A result set is a data source. Subqueries and CTEs let you chain queries together, the way intermediate variables chain steps in a program.
3. Window functions give you aggregate-style calculations without collapsing your detail rows.

PS