How many sales did we make this month?
One way of answering this question is to count the sales that we have made in the current calendar month:
SELECT COUNT(*) FROM sales
WHERE DATE_TRUNC('month', sale_date) = DATE_TRUNC('month', NOW())
This query returns the number of rows in the sales table whose sale_date value falls within the current calendar month.
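As an aside, DATE_TRUNC floors a timestamp to the start of the given unit, so both sides of the comparison above end up being the first instant of their respective months. A quick way to see what it returns (the exact value depends on when you run it):
SELECT DATE_TRUNC('month', NOW());
-- e.g. 2019-08-01 00:00:00 if run at any time during August 2019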
Similarly, we can count the sales that we made in the previous calendar month:
SELECT COUNT(*) FROM sales
WHERE DATE_TRUNC('month', sale_date) = DATE_TRUNC('month', NOW() - '1 month' :: INTERVAL)
This query returns the number of rows in the sales table whose sale_date value falls within the previous calendar month.
Depending on which day of the month today is, you may notice that last month we had considerably more sales than this month. We can of course explain this: we are always somewhere in the middle of the current month, and this month has not yet ended.
We can avoid this problem altogether if we interpret "the current month" as the past 30 days rather than the current calendar month:
SELECT COUNT(*) FROM sales
WHERE sale_date >= NOW() - '1 month' :: INTERVAL
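Note that '1 month' :: INTERVAL moves back one calendar month (anywhere from 28 to 31 days, depending on the date). If you literally want the past 30 days, a fixed interval works just as well; here is a small variation on the query above:
SELECT COUNT(*) FROM sales
WHERE sale_date >= NOW() - '30 days' :: INTERVAL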
In this post, I argue that this interpretation is more suitable for data visualization and trend analysis.
We also explore some techniques for generating sample data sets and working with SQL window functions.
Generating Data
Let's start by generating an (albeit boring) sample data set.
We use PostgreSQL's handy generate_series(lower, upper, step) function to create our samples. For example, this snippet generates a time series at hourly intervals between a year ago and now:
SELECT * FROM generate_series(NOW(), NOW() - '1 year' :: INTERVAL , '-1 hour') AS sale_date;
Here we assume that we have had one sale every hour, at regular intervals, for the past year. This assumption is what makes the data set dull (we will look at a more advanced data generation technique later on), but it is enough to make a point about varying month lengths:
WITH sales AS (
  SELECT generate_series(NOW(), NOW() - '1 year' :: INTERVAL, '-1 hour') AS sale_date
)
SELECT
  DATE_TRUNC('month', sale_date) AS sale_month
  , EXTRACT(DAY FROM DATE_TRUNC('month', sale_date + '1 month' :: INTERVAL) - DATE_TRUNC('month', sale_date)) AS month_length
  , COUNT(*)
FROM sales
GROUP BY sale_month, month_length
ORDER BY sale_month
WITH expressions are called Common Table Expressions, or CTEs. We use them to encapsulate our logic.
I avoid creating temporary tables in this post; that's where CTEs come in handy.
Here WITH sales AS ... creates a virtual sales table with one column, sale_date. This table lives in memory for the duration of the query.
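For comparison, this is roughly what the temporary-table version of the same thing would look like; just a sketch to show the extra bookkeeping that the CTE saves us, and it is not used anywhere else in this post:
-- Temporary-table equivalent of the sales CTE (illustration only)
CREATE TEMPORARY TABLE sales AS
  SELECT generate_series(NOW(), NOW() - '1 year' :: INTERVAL, '-1 hour') AS sale_date;

SELECT COUNT(*) FROM sales;

DROP TABLE sales;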
This is the result of the CTE query above (try it in SQL Fiddle):
+------------+--------------+---------+
| sale_month | month_length |   count |
|------------+--------------+---------|
| 2018-08-01 |           31 |     659 |
| 2018-09-01 |           30 |     720 |
| 2018-10-01 |           31 |     745 |
| 2018-11-01 |           30 |     720 |
| 2018-12-01 |           31 |     744 |
| 2019-01-01 |           31 |     744 |
| 2019-02-01 |           28 |     672 |
| 2019-03-01 |           30 |     743 |
| 2019-04-01 |           30 |     720 |
| 2019-05-01 |           31 |     744 |
| 2019-06-01 |           30 |     720 |
| 2019-07-01 |           31 |     744 |
| 2019-08-01 |           31 |      86 |
+------------+--------------+---------+
Ignoring the first and last (incomplete) months, the number of sales in each month is directly related to the number of days in that month. February is usually problematic.
To improve our analysis, let's take advantage of our other interpretation of 'month' as the past 30 days.
The following query finds the average number of sales that we had in the past 30 days, for every day in the data set:
WITH sales AS (
  SELECT generate_series(NOW(), NOW() - '1 year' :: INTERVAL, '-1 hour') AS sale_date
)
, daily_sales AS (
  SELECT
    DATE_TRUNC('day', sale_date) AS sale_day
    , COUNT(*) AS sales
  FROM sales
  GROUP BY sale_day
  ORDER BY sale_day
)
, daily_avgs AS (
  SELECT
    sale_day
    , SUM(sales) OVER W
    , AVG(sales) OVER W
  FROM daily_sales
  WINDOW W AS (ORDER BY sale_day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)
)
SELECT * FROM daily_avgs
ORDER BY sale_day DESC
There's a lot to unpack here, but let's first check the result:
+------------+-------+------+
| sale_day   |   sum |  avg |
|------------+-------+------|
| 2019-08-05 |   710 | 23.7 |
| 2019-08-04 |   720 | 24.0 |
| 2019-08-03 |   720 | 24.0 |
| 2019-08-02 |   720 | 24.0 |
| 2019-08-01 |   720 | 24.0 |
| 2019-07-31 |   720 | 24.0 |
| 2019-07-30 |   720 | 24.0 |
| 2019-07-29 |   720 | 24.0 |
| 2019-07-28 |   720 | 24.0 |
...
The sum in every row is 24 * 30 = 720, except for the latest row. The problem, of course, is that today is not finished yet; we never have the full 24 hours of today in the data set.
Note that we first created a daily_sales
CTE:
daily_sales AS (
  SELECT
    DATE_TRUNC('day', sale_date) AS sale_day
    , COUNT(*) AS sales
  FROM sales
  GROUP BY sale_day
  ORDER BY sale_day
)
This is basically a time series of the number of sales that we made every day.
In the next CTE (daily_avgs), we run statistics on the daily_sales CTE:
daily_avgs AS (
  SELECT
    sale_day
    , SUM(sales) OVER W
    , AVG(sales) OVER W
  FROM daily_sales
  WINDOW W AS (ORDER BY sale_day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)
)
WINDOW W AS (ORDER BY sale_day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) creates a window (named W) that is 30 rows wide. We could have named our window anything (W30, my_window, or whatever); I chose a simple one-letter W because we only have one window in this query.
Here we say: for each row in our data set, select that row's sale_day, and sum all the sales from the rows starting 29 rows before the current row up to and including the current row (30 rows in total).
We can roughly translate this window expression into JavaScript:
// daily_sales is an array of { sale_day, sales } rows
const indexed_daily_sales = daily_sales
  .slice()
  .sort((a, b) => a.sale_day - b.sale_day)  // ORDER BY sale_day
  .map((r, index) => ({ ...r, index }));

const results = [];
for (const row of indexed_daily_sales) {
  const window = indexed_daily_sales
    .filter(r => r.index >= row.index - 29 && r.index <= row.index);  // ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
  const sum = window.reduce((acc, r) => acc + r.sales, 0);            // SUM(sales) OVER W
  const avg = sum / window.length;                                    // AVG(sales) OVER W
  results.push({ sale_day: row.sale_day, sum, avg });
}
If you want to exclude today's data from the result, filter out today at the end.
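For example, appending a filter like this to the final SELECT (a small sketch reusing the daily_avgs CTE from the query above) drops today's incomplete row:
SELECT * FROM daily_avgs
WHERE sale_day < DATE_TRUNC('day', NOW())
ORDER BY sale_day DESC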
And here's the result visualized in a time series:
SQL Window Functions
Let's explore the concept of windows in SQL a bit more.
If I want to sum up all the integers between 1 and 10, I use this query:
SELECT SUM(id) FROM generate_series(1, 10) id
The result is simply a row with one number: 55.
But if I want to find the sum of all the integers from 1 to N for every N <= 10, I use:
SELECT id, SUM(id) OVER (ORDER BY id) FROM generate_series(1, 10) id
The result has 10 rows:
+------+-------+
|   id |   sum |
|------+-------|
|    1 |     1 | = 1
|    2 |     3 | = 1 + 2
|    3 |     6 | = 1 + 2 + 3
|    4 |    10 | = 1 + 2 + 3 + 4
|    5 |    15 | = 1 + 2 + 3 + 4 + 5
|    6 |    21 | = 1 + 2 + 3 + 4 + 5 + 6
|    7 |    28 | = 1 + 2 + 3 + 4 + 5 + 6 + 7
|    8 |    36 | = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8
|    9 |    45 | = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9
|   10 |    55 | = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10
+------+-------+
The sum field in every row is the sum of the integers from 1 to the row's id.
Here OVER (ORDER BY id) creates a window of all the rows up to and including the current row. It is equivalent to OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). (Strictly speaking, PostgreSQL's default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, but since every id is distinct the two behave the same here.)
We don't have to name our window if we use it only once.
SELECT
  SUM(id) OVER (ORDER BY id)
FROM ...
is equivalent to:
SELECT
  SUM(id) OVER W
FROM ...
WINDOW W AS (ORDER BY id)
Let's modify this query to return the sum of the three integers around each number; for example, the row for 5 should return 4 + 5 + 6 = 15:
SELECT
  id
  , SUM(id) OVER (ORDER BY id ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
FROM generate_series(1, 10) id
+------+-------+
|   id |   sum |
|------+-------|
|    1 |     3 | = + 1 + 2 = 3
|    2 |     6 | = 1 + 2 + 3 = 6
|    3 |     9 | = 2 + 3 + 4 = 9
|    4 |    12 | = 3 + 4 + 5 = 12
|    5 |    15 | = 4 + 5 + 6 = 15
|    6 |    18 | = 5 + 6 + 7 = 18
|    7 |    21 | = 6 + 7 + 8 = 21
|    8 |    24 | = 7 + 8 + 9 = 24
|    9 |    27 | = 8 + 9 + 10 = 27
|   10 |    19 | = 9 + 10 + = 19
+------+-------+
The size of the window is 3, with the exception of the first and last rows in the result.
It is usually the case that the window at the beginning and/or the end of the result is smaller than in the middle.
Aggregate functions that operate on windows are aware of this. For example, check the average of the three neighboring numbers (which must be the middle one):
SELECT
  id
  , SUM(id) OVER W
  , AVG(id) OVER W
FROM generate_series(1, 10) id
WINDOW W AS (ORDER BY id ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
+------+-------+-----+
|   id |   sum | avg |
|------+-------+-----|
|    1 |     3 | 1.5 | = ( 1 + 2) / 2 = 1.5 (note division by 2)
|    2 |     6 | 2.0 | = (1 + 2 + 3) / 3 = 2.0 (note division by 3)
|    3 |     9 | 3.0 | = (2 + 3 + 4) / 3 = 3.0
|    4 |    12 | 4.0 | = (3 + 4 + 5) / 3 = 4.0
|    5 |    15 | 5.0 | = (4 + 5 + 6) / 3 = 5.0
|    6 |    18 | 6.0 | = (5 + 6 + 7) / 3 = 6.0
|    7 |    21 | 7.0 | = (6 + 7 + 8) / 3 = 7.0
|    8 |    24 | 8.0 | = (7 + 8 + 9) / 3 = 8.0
|    9 |    27 | 9.0 | = (8 + 9 + 10) / 3 = 9.0
|   10 |    19 | 9.5 | = (9 + 10 ) / 2 = 9.5 (note division by 2)
+------+-------+-----+
Depending on your analysis, you might need to take this fact into consideration.
For us it means that the rolling average has fewer data points at the beginning of our time series, hence AVG(sales) OVER W is noisier at the left of our charts.
One easy workaround is to skip the data points at the beginning by offsetting the result: add OFFSET 29 at the end of the query (SQL Fiddle).
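Concretely, the final SELECT of the rolling-average query could become something like this (a sketch reusing the daily_avgs CTE; ordering ascending so that the 29 rows with partially-filled windows are the ones skipped):
SELECT * FROM daily_avgs
ORDER BY sale_day
OFFSET 29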
Random Walk
We need a more realistic data set in order to put everything that we discussed above to use. Our sales data set has been very boring so far because we generated it at a perfectly uniform rate (one sale every hour).
Here we explore a method for generating a more realistic sales data set.
A random walk is exactly what you think it is. In two dimensions, you can picture a small turtle on a surface that chooses the direction of her next step completely at random.
In one dimension we can only move up or down:
SELECT step_id, (FLOOR((RANDOM() * 3) - 1)) AS step_size
FROM generate_series(1,6000) step_id
+-----------+-------------+
|   step_id |   step_size |
|-----------+-------------|
|         1 |         1.0 |
|         2 |         0.0 |
|         3 |         1.0 |
|         4 |         1.0 |
|         5 |         0.0 |
|         6 |        -1.0 |
...
This snippet generates a uniform distribution of -1, 0, and 1 step_size values.
These numbers model the movement of our cursor at each step: up, down, or not at all.
The sum of all the previous step_size values at each step determines the total distance that we have travelled from the origin:
SELECT
  step_id
  , step_size
  , SUM(step_size) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pos
FROM (
  SELECT step_id, (FLOOR((RANDOM() * 3) - 1)) AS step_size
  FROM generate_series(1,6000) step_id
) _a
+-----------+-------------+-------+
|   step_id |   step_size |   pos |
|-----------+-------------+-------|
|         1 |         1.0 |   1.0 |
|         2 |         0.0 |   1.0 | = 1 + 0
|         3 |         1.0 |   2.0 | = 1 + 0 + 1
|         4 |         1.0 |   3.0 | = 1 + 0 + 1 + 1
|         5 |         0.0 |   3.0 | = 1 + 0 + 1 + 1 + 0
|         6 |        -1.0 |   2.0 | = 1 + 0 + 1 + 1 + 0 - 1
...
Use UNION
if you want to specify a starting point:
SELECT 0 AS step_id, 600 AS step_size
UNION
SELECT step_id, (FLOOR((RANDOM() * 3) - 1)) AS step_size
FROM generate_series(1,6000) step_id
ORDER BY step_id
We use a random walk to generate a more realistic-looking data set. The idea here is that the number of sales on a given day is not completely random; it is actually close to the number of sales that we had on the previous day, plus or minus some random value which we call noise.
WITH noise AS (
  SELECT
    step_id
    , DATE_TRUNC('day', NOW()) - (step_id || ' day') :: INTERVAL AS sale_day
    , SUM(step_size) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pos
  FROM (
    SELECT step_id, (FLOOR((RANDOM() * 3) - 1)) * FLOOR(RANDOM() * 100) AS step_size
    FROM generate_series(1,1000) step_id
  ) _a
)
, daily_sales AS (
  SELECT
    sale_day
    , (CASE WHEN EXTRACT(DAY FROM sale_day) < 8
        THEN FLOOR(RANDOM() * 200)
        ELSE 0 END
      ) + (SELECT ABS(MIN(pos)) FROM noise) + pos AS sales
  FROM noise
  ORDER BY step_id DESC
)
SELECT
  sale_day
  , sales
  , AVG(sales) OVER (ORDER BY sale_day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS avg_daily_sales
FROM daily_sales
The noise CTE is a random walk in which each daily step is a random value between -99 and 99:
SELECT * FROM noise
+-----------+------------+--------+
|   step_id | sale_day   |    pos |
|-----------+------------+--------|
|         1 | 2019-08-04 |   48.0 |
|         2 | 2019-08-03 |   48.0 |
|         3 | 2019-08-02 |   48.0 |
|         4 | 2019-08-01 |   72.0 |
|         5 | 2019-07-31 |   72.0 |
|         6 | 2019-07-30 |  159.0 |
|         7 | 2019-07-29 |  252.0 |
...
In this model, we expect our sales to vary by up to ±99 from the previous day, at random.
daily_sales adjusts the noise series by first making sure that no data point is negative: + (SELECT ABS(MIN(pos)) FROM noise). We also add some seasonality to the series (we assume our sales increase during the first week of each month by up to 200 extra sales per day):
(CASE WHEN EXTRACT(DAY FROM sale_day) < 8
  THEN FLOOR(RANDOM() * 200)
  ELSE 0 END
)
Let's check the final result:
+------------+---------+-------------------+
| sale_day   |   sales |   avg_daily_sales |
|------------+---------+-------------------|
| 2016-11-08 |  1074.0 |            1074.0 |
| 2016-11-09 |  1068.0 |            1071.0 |
| 2016-11-10 |  1118.0 |  1086.66666666667 |
| 2016-11-11 |  1118.0 |            1094.5 |
| 2016-11-12 |  1112.0 |            1098.0 |
| 2016-11-13 |  1177.0 |  1111.16666666667 |
| 2016-11-14 |  1145.0 |            1116.0 |
| 2016-11-15 |  1117.0 |          1116.125 |
...
A sanity check: avg_daily_sales at 2016-11-11 = (1074 + 1068 + 1118 + 1118) / 4 = 1,094.5.
The maximum number of rows that we have in any window is 30. At the beginning we of course have fewer than 30 rows in the window, because there is no row before 2016-11-08.
Compare our latest chart with our original, naïve DATE_TRUNC attempt.
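For reference, the naïve calendar-month version of the same chart would be produced by something roughly like this (a sketch reusing the daily_sales CTE from the random-walk query above):
SELECT
  DATE_TRUNC('month', sale_day) AS sale_month
  , SUM(sales) AS monthly_sales
FROM daily_sales
GROUP BY sale_month
ORDER BY sale_month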
And if you prefer monthly statistics, filter the rows at the end of the query so that only one day per month (the day of the month that matches yesterday) remains:
...
daily_avgs AS (
  SELECT
    sale_day
    , sales
    , SUM(sales) OVER (ORDER BY sale_day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS monthly_sales
  FROM daily_sales
)
SELECT sale_day, monthly_sales
FROM daily_avgs
WHERE EXTRACT(DAY FROM sale_day) = EXTRACT(DAY FROM NOW() - '1 day' :: INTERVAL)
ORDER BY sale_day DESC
Each bar represents the sum of sales during the 30 days up to and including the date associated with the bar.
Missing Values
Often we need to deal with gaps in our time series. In our example, a row would be missing from daily_sales if no sales happened on that day.
So far, our generated data set has had no negative sales because we padded the data with the absolute value of the minimum position reached by our random walk.
In order to produce some gaps, let's first cut this padding in half, + (SELECT ABS(ROUND(MIN(pos)/2)) FROM noise), and then filter out all the rows with zero or negative sales, SELECT * FROM daily_sales_1 WHERE sales > 0:
WITH noise AS (
  SELECT
    step_id
    , DATE_TRUNC('day', NOW()) - (step_id || ' day') :: INTERVAL AS sale_day
    , SUM(step_size) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pos
  FROM (
    SELECT 0 AS step_id, 0 AS step_size
    UNION
    SELECT step_id, (FLOOR((RANDOM() * 3) - 1)) * FLOOR(RANDOM() * 100) AS step_size
    FROM generate_series(1,1000) step_id
    ORDER BY step_id
  ) _a
)
, daily_sales_1 AS (
  SELECT
    sale_day
    , (CASE WHEN EXTRACT(DAY FROM sale_day) < 8
        THEN FLOOR(RANDOM() * 200)
        ELSE 0 END
      ) + (SELECT ABS(ROUND(MIN(pos)/2)) FROM noise) + pos AS sales
  FROM noise
  ORDER BY step_id DESC
)
, daily_sales AS (
  SELECT * FROM daily_sales_1 WHERE sales > 0
)
, calendar AS (
  SELECT generate_series(
    (SELECT MIN(sale_day) FROM daily_sales_1)
    , (SELECT MAX(sale_day) FROM daily_sales_1)
    , '1 day' :: INTERVAL
  ) AS sale_day
)
SELECT
  calendar.sale_day
  , COALESCE(sales, 0) AS sales
  , AVG(COALESCE(sales, 0)) OVER (ORDER BY calendar.sale_day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS avg_daily_sales
FROM calendar
LEFT JOIN daily_sales ON calendar.sale_day = daily_sales.sale_day
This is the result:
The calendar CTE generates a series of dates:
calendar AS (
  SELECT generate_series(
    (SELECT MIN(sale_day) FROM daily_sales_1)
    , (SELECT MAX(sale_day) FROM daily_sales_1)
    , '1 day' :: INTERVAL
  ) AS sale_day
)
We select sale_days from calendar and left join it with the daily_sales table in the final step.
Because of the left join, sales is null for the rows that are missing from the daily_sales table. That's why we use COALESCE(sales, 0): it turns the nulls of the missing data points into 0.
In general, null does not mean 0, so be careful. But we can map nulls (missing data points) to 0 when we are dealing with gaps in a time series like this one.
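As a tiny illustration, COALESCE simply returns its first non-null argument:
SELECT COALESCE(NULL, 0);  -- 0
SELECT COALESCE(42, 0);    -- 42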