loading...
Cover image for #SQL30 Day 1: Wildfires

#SQL30 Day 1: Wildfires

zchtodd profile image zchtodd Updated on ・2 min read

Welcome to the SQL showdown series!

What is this and how does it work?

I'm committing to publishing a SQL challenge every day for 30 days. In each post I'll describe my solution to the last day's challenge. I'll follow that up with a full description of the challenge for the next day.

Write your own solution in the comments! Let's see who can come up with the most creative solutions.

I'll add connection details for a PostgreSQL database containing test data for the challenge. Solutions are by no means limited to PostgreSQL, but it's there if you want an easy way to test your query!

Challenge #1: Wildfire Causes

In this challenge we'll use the 1.88 Million Wildfires data-set from Kaggle.

The question is:

What are the details of the largest recorded wildfires (by acreage) for each potential cause of wildfires?

We want the name of the fire, its cause, the state it happened in, the date it was discovered, and the date at which it was considered contained. Each output row should be the largest fire that has ever occurred for that particular cause of wildfire.

The wildfire table has the following columns:

  • name
  • cause
  • acres
  • disc_date
  • cont_date
  • state

Sandbox Connection Details

I have a PostgreSQL database ready for you to play with.

Alt Text

The password is the same as the username! To query the wildfire table:

SELECT * FROM day1.wildfire;

Good luck!

Have fun, and I can't wait to see what you come up with! I'll be back tomorrow with a solution to this problem and a new challenge.

Posted on by:

zchtodd profile

zchtodd

@zchtodd

I'm a software developer that loves to write, make cool little apps, and help others learn about programming!

Discussion

markdown guide
 

bit late to the party! google pointed me to here and it seems interesting, hopefully I can get to the end...

given this is Postgres I'd use the DISTINCT ON operator, as it seems perfectly suited to the task:

SELECT DISTINCT ON (cause) * FROM wildfire ORDER BY cause, acres DESC;

in this case you can also use an IN query:

SELECT *
FROM day1.wildfire
WHERE (cause, acres) IN (
  SELECT cause, MAX(acres) AS acres
  FROM day1.wildfire
  GROUP BY cause
)
ORDER BY acres DESC;

which gives me the same thing. but has different semantics so does different things in the presence of non-unique values. for some reason I thought it could do unexpected things when NULL values were involved, but a quick read suggests that's just NOT IN

 

What a great idea, thanks for setting this up :D

This may be a little convoluted but ....

I would check out how many causes there were first:

select distinct cause
from day1.wildfire
order by cause

-- 13 rows

Then use a Window Function to order the fires by their cause and assign a rank in descending order.

Then use a CTE to isolate the Window Function results and return all rows where the rank = 1.

with rankings as
(select
  fire_name,
  cause,
  disc_date,
  cont_date,
  state,
  acres,
  dense_rank() over(partition by cause order by acres desc) as fire_rank
from 
  day1.wildfire
 order by cause, fire_rank
)

select *
from rankings
where rankings.fire_rank = 1

-- 20 seconds to execute

Then do some checks to make sure I've got the right results

select fire_name, max(acres) as acresmax
from day1.wildfire
where cause  = 'Campfire'
group by fire_name
order by acresmax desc
 

That's an awesome solution, thanks for writing that. I don't think it's convoluted at all.

In my solution I went without window functions and did a join to a subquery. The interesting thing is that I could potentially join to the wrong row if there were more than 1 fire with the same cause and max acres burned, whereas here I think any ties would end up with the same rank, but it's guaranteed that there's no mixing and matching of data.

I hope you like the rest of the series! I started it because I was one of those developers that thought they knew SQL, when really I knew SQL-92, and modern SQL has so much more to offer.

 

I think my query is quite a bit easier than WINDOWing and CTE:

SELECT w1.*
FROM day1.wildfire w1
INNER JOIN ( SELECT cause, MAX(acres) as max_acres FROM day1.wildfire GROUP BY cause ) w2
ON w1.cause = w2.cause AND w1.acres = w2.max_acres;

 

Nice! I think of window functions as the heavy artillery... not always necessary when a group by might do the job.