
#SQL30 Day 2: Stock Price Variance

zchtodd ・ 3 min read

Welcome to the SQL showdown series!

What is this and how does it work?

I'm committing to publishing a SQL challenge every day for 30 days. In each post I'll describe my solution to the previous day's challenge, followed by a full description of the next day's challenge.

Write your own solution in the comments! Let's see who can come up with the most creative solutions.

I'll add connection details for a PostgreSQL database containing test data for the challenge. Solutions are by no means limited to PostgreSQL, but it's there if you want an easy way to test your query!

Challenge #2: Stock Price Variance

In this challenge we'll use a small slice of the huge stock market dataset on Kaggle. I've loaded the daily trading history for a single company to keep things manageable.

The question for today is:

Can you output the day's closing price, the previous day's closing price, the variance between the two (here, simply the difference), and the average closing price for the current month?

The stockprice table has the following columns:

  • day
  • open
  • high
  • low
  • close
  • volume

Here's an example to give you a better idea of the output you're after:

[Screenshot: example of the expected output]

Sandbox Connection Details

I have a PostgreSQL database ready for you to play with.

[Screenshot: connection details]

The password is the same as the username! To query the stockprice table:

SELECT * FROM day2.stockprice;

Solution for Challenge #1

This is the question we were trying to answer with yesterday's SQL challenge:

What are the details of the largest recorded wildfires (by acreage) for each potential cause of wildfires?

At first you might think: let's do a GROUP BY and a max, right? Job's done, boss. Well, almost. We also want to return the details of each individual fire.

I'm sure you've seen this issue before if you're an old hand at SQL. The problem is that we can't do this:

SELECT w.*, max(w.acres) FROM day1.wildfire w GROUP BY cause;

At least in PostgreSQL, we're greeted with this lovely message.

[Code: 0, SQL State: 42803]  ERROR: column "w.fire_name" 
must appear in the GROUP BY clause or be used in 
an aggregate function

This issue comes up so often that Stack Overflow has a greatest-n-per-group tag dedicated to it.

One possible solution involves joining to a subquery that finds the max value.

SELECT w1.* FROM day1.wildfire w1 JOIN (
  SELECT cause, max(acres) AS max_acres
  FROM day1.wildfire
  GROUP BY cause
) w2 ON w1.cause = w2.cause AND w1.acres = w2.max_acres
ORDER BY w1.acres DESC;
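Another common way to tackle greatest-n-per-group is a window function: number the rows within each cause by acreage and keep the top one. Here's a runnable sketch against a tiny made-up wildfire table, using SQLite through Python's sqlite3 (the table and data are invented for illustration; the same query works in PostgreSQL):

```python
import sqlite3

# Made-up miniature table; column names mirror the challenge schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wildfire (fire_name TEXT, cause TEXT, acres REAL);
INSERT INTO wildfire VALUES
  ('Fire A', 'Lightning', 500),
  ('Fire B', 'Lightning', 1200),
  ('Fire C', 'Campfire',  300),
  ('Fire D', 'Campfire',  150);
""")

# Rank fires within each cause by acreage, then keep the top row per cause.
rows = conn.execute("""
SELECT fire_name, cause, acres FROM (
  SELECT w.*, ROW_NUMBER() OVER (
    PARTITION BY cause ORDER BY acres DESC
  ) AS rn
  FROM wildfire w
) ranked
WHERE rn = 1
ORDER BY acres DESC;
""").fetchall()

print(rows)  # [('Fire B', 'Lightning', 1200.0), ('Fire C', 'Campfire', 300.0)]
```

One behavioral difference worth noting: on a tie for the largest acreage, ROW_NUMBER picks a single arbitrary winner, while the join-to-subquery version returns every tied row. Swap in RANK() if you want the tie-inclusive behavior.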

There are some interesting exceptions to the rules around GROUP BY. Before MySQL 5.7 made ONLY_FULL_GROUP_BY the default, you could freely select columns that appear in neither an aggregate nor the GROUP BY clause. Which row do the extra columns come from? The MySQL docs only promise that the values are nondeterministic.

SQLite will also allow you to do the same, but when the query contains a single min() or max() aggregate, the bare columns are guaranteed to come from the row holding the min or max value.
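That SQLite behavior is easy to demonstrate. A small sketch with a made-up table (invented names and values), again run through Python's sqlite3:

```python
import sqlite3

# Made-up miniature table; names mirror the challenge schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wildfire (fire_name TEXT, cause TEXT, acres REAL);
INSERT INTO wildfire VALUES
  ('Fire A', 'Lightning', 500),
  ('Fire B', 'Lightning', 1200),
  ('Fire C', 'Campfire',  300);
""")

# With a single max() aggregate, SQLite takes the bare fire_name
# from the row that holds the maximum acres in each group.
rows = conn.execute("""
SELECT fire_name, cause, max(acres) AS acres
FROM wildfire
GROUP BY cause
ORDER BY cause;
""").fetchall()

print(rows)  # [('Fire C', 'Campfire', 300.0), ('Fire B', 'Lightning', 1200.0)]
```

PostgreSQL would reject this query outright with the error shown above; SQLite's "bare columns in an aggregate query" rule is a documented, deliberate extension.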

Good luck!

Have fun, and I can't wait to see what you come up with! I'll be back tomorrow with a solution to this problem and a new challenge.


Discussion


I've been travelling this weekend, so I'm late to the game.

Here's my answer before I go check out day three and see if we got the same thing:

select
  day,
  close,
  lag(close) over (order by day) as last_day,
  close - lag(close) over (order by day) as variance,
  date_trunc('month', day)::date as month,
  avg(close) over (partition by date_trunc('month', day)) as monthly_avg
from day2.stockprice
order by day;
Almost identical to what I ended up doing, with the exception that I did the month extraction inside of a CTE. Using a CTE to avoid repetition of an expression might sometimes be a bad idea, especially with PG, where I believe all CTEs are materialized. So I probably paid dearly for that little convenience!

I hear though that PG 12 has made some improvements in regards to CTE performance.

Yes! I am very excited to be able to use CTEs without fear. The release notes for PG12 are out now - postgresql.org/about/news/1976/

 

I ended up doing the same as Helen and the author. For variety, here's a variant that uses a CTE, though EXPLAIN ANALYSE says it's slightly (10%) slower:

WITH monthly AS (
  SELECT date_trunc('month', day) AS month, avg(close) AS month_close
  FROM day2.stockprice
  GROUP BY 1
)
SELECT sp.day, sp.close,
  LAG(sp.close) OVER (ORDER BY sp.day) AS last_close,
  sp.close - LAG(sp.close) OVER (ORDER BY sp.day) AS diff_close,
  m.month_close
FROM day2.stockprice sp JOIN monthly m ON date_trunc('month', sp.day) = m.month
ORDER BY sp.day;

Being from a stats background, I found your use of the term "variance" somewhat confusing; I thought you were after something more than just subtraction.