zchtodd

Posted on Oct 3, 2019 • Edited on Oct 4, 2019

#SQL30 Day 1: Wildfires

#sql #database #postgres #challenge

Welcome to the SQL showdown series!

What is this and how does it work?

I'm committing to publishing a SQL challenge every day for 30 days. In each post I'll describe my solution to the last day's challenge. I'll follow that up with a full description of the challenge for the next day.

Write your own solution in the comments! Let's see who can come up with the most creative solutions.

I'll add connection details for a PostgreSQL database containing test data for the challenge. Solutions are by no means limited to PostgreSQL, but it's there if you want an easy way to test your query!

Challenge #1: Wildfire Causes

In this challenge we'll use the 1.88 Million Wildfires data-set from Kaggle.

The question is:

What are the details of the largest recorded wildfires (by acreage) for each potential cause of wildfires?

We want the name of the fire, its cause, the state it happened in, the date it was discovered, and the date at which it was considered contained. Each output row should be the largest fire that has ever occurred for that particular cause of wildfire.

The wildfire table has the following columns:

name
cause
acres
disc_date
cont_date
state

Sandbox Connection Details

I have a PostgreSQL database ready for you to play with.

The password is the same as the username! To query the wildfire table:

SELECT * FROM day1.wildfire;

Good luck!

Have fun, and I can't wait to see what you come up with! I'll be back tomorrow with a solution to this problem and a new challenge.

Top comments (8)

Helen Anderson • Oct 4 '19

What a great idea, thanks for setting this up :D

This may be a little convoluted but ....

I would check out how many causes there were first:

select distinct cause
from day1.wildfire
order by cause

-- 13 rows

Then use a Window Function to order the fires by their cause and assign a rank in descending order.

Then use a CTE to isolate the Window Function results and return all rows where the rank = 1.

with rankings as
(select
  fire_name,
  cause,
  disc_date,
  cont_date,
  state,
  acres,
  dense_rank() over(partition by cause order by acres desc) as fire_rank
from 
  day1.wildfire
 order by cause, fire_rank
)

select *
from rankings
where rankings.fire_rank = 1

-- 20 seconds to execute

Then do some checks to make sure I've got the right results

select fire_name, max(acres) as acresmax
from day1.wildfire
where cause  = 'Campfire'
group by fire_name
order by acresmax desc

zchtodd • Oct 4 '19

That's an awesome solution, thanks for writing that. I don't think it's convoluted at all.

In my solution I went without window functions and did a join to a subquery. The interesting thing is that I could potentially join to the wrong row if there were more than 1 fire with the same cause and max acres burned, whereas here I think any ties would end up with the same rank, but it's guaranteed that there's no mixing and matching of data.

I hope you like the rest of the series! I started it because I was one of those developers that thought they knew SQL, when really I knew SQL-92, and modern SQL has so much more to offer.

Sam Mason • Oct 28 '19

bit late to the party! google pointed me to here and it seems interesting, hopefully I can get to the end...

given this is Postgres I'd use the DISTINCT ON operator, as it seems perfectly suited to the task:

SELECT DISTINCT ON (cause) * FROM wildfire ORDER BY cause, acres DESC;

in this case you can also use an IN query:

SELECT *
FROM day1.wildfire
WHERE (cause, acres) IN (
  SELECT cause, MAX(acres) AS acres
  FROM day1.wildfire
  GROUP BY cause
)
ORDER BY acres DESC;

which gives me the same thing. but has different semantics so does different things in the presence of non-unique values. for some reason I thought it could do unexpected things when NULL values were involved, but a quick read suggests that's just NOT IN

Saurabh Sikchi • Oct 26 '19

I think my query is quite a bit easier than WINDOWing and CTE:

SELECT w1.* FROM day1.wildfire w1 INNER JOIN ( SELECT cause, MAX(acres) as max_acres FROM day1.wildfire GROUP BY cause ) w2 ON w1.cause = w2.cause AND w1.acres = w2.max_acres;