A Summary of Resources for Wildfires for Call for Code

#callforcode #github #wildfires #resource

Introduction

Imagine turning on your television and hearing that Australia is on fire. Sound familiar? Well, yes... it did happen. The ongoing problem of wildfires sparked the Wildfires Spot Challenge, where data scientists came together to develop models for forecasting wildfires in Australia for the upcoming wildfire season. While wildfires can be dangerous and deadly if they grow out of control, they are essential for the survival of some species. Recent climate change has made the global ecosystem more susceptible to wildfires, especially in areas with warmer and drier weather. Therefore, it is vital to forecast wildfires correctly before they happen so firefighters can prepare and respond accordingly.

In response, IBM initiated the Call for Code Spot Challenge for Wildfires in November 2020 to invite data scientists around the globe to work on wildfires (bushfires) in Australia. The goal of the challenge was to predict the size of the fire area in square kilometers (km²) by region in Australia, for every day in January and February 2021. While the spot challenge for wildfires ended at the beginning of March 2021, wildfires remain an important issue for many of us globally.

In March 2021, the Call for Code Global initiative launched with topics around climate change and green consumption. With that in mind, in this post I will focus on sharing some useful resources on wildfires.

Summary of wildfire resources

There are already numerous resources on wildfires out there, but I want to highlight the following comprehensive blog posts and webinars for those who decide to contribute more to this topic.

Blog Posts and Resources

This blog - Call for Code Spot Challenge for Wildfires - goes over the details of the spot challenge for wildfires thoroughly, including the summary, timeline, and materials to help contestants get started. Susan Malaika, the author, provides numerous resources in this post. Please be sure to check out the webinars and other blog posts!
This blog - Call for Code Spot Challenge for Wildfires Predictions: Comparing approaches - compares the approaches of the top 3 winners through the CRISP-DM framework, covering how they prepared data, which machine learning algorithms they picked for their predictions, their results, and so on. Wiktor Mazin, the author, has abundant experience in data science and machine learning and is now a chief data scientist at IBM. He analyzes the challenge and brings readers a unique perspective on it. You can see that, despite very different analysis approaches and time invested in the challenge, the winners stayed very close to each other on the leaderboard throughout the February phases.
This blog - Education materials to get started with the Call for Code Spot Challenge for Wildfires - shares numerous materials to help emerging data scientists get started with the Spot Challenge for Wildfires. The author provides a list of self-study materials for those who are not yet very familiar with data science, covering data science methodology, Python, machine learning packages like scikit-learn, tips for handling time-series data, domain knowledge about wildfires, and other ways to build models. Those resources should come in handy when you want to refresh your memory on data science knowledge.
This blog - Predicting Australian Wildfires with Weather Forecast Data - introduces the Spot Challenge and shares possible methods to predict the wildfires in Australia. The best thing is that the author shares a few insights about wildfires and explains the land surface model to help those who are new to how wildfires work and how to properly analyze them.
This blog - The Finale: Call for Code Spot Challenge for Wildfires - announces the winning team for the Call for Code Spot Challenge for Wildfires and includes a screenshot of the final leaderboard.
This GitHub repository includes the 5 datasets that were used in the challenge and 4 Jupyter notebooks that analyze the Australian wildfires with various machine learning models. More details about the repository are covered in the next section.
This paper from UC Irvine, "Machine learning to predict final fire size at the time of ignition", and the associated article, "Fighting Fires With Artificial Intelligence", are also of interest.

Summary of GitHub repository

Here is a summary of what is included in this GitHub repository. Besides the README file, it is helpful to review the resources and notebooks folders. The actual datasets are stored in the data folder.

Within that data folder, there are datasets that were extracted from PAIRS Geoscope and processed to prepare them for the challenge. You can review the following 5 datasets (in CSV format) that were provided (and refreshed) as daily time-series data for the 7 regions in Australia:

Historical wildfires: The dataset contains 10 variables and over 26k data points. It includes fire activity in Australia since 2005, as a daily aggregation of recorded wildfires.
Historical weather: The weather dataset contains 8 variables and over 24k records. It includes the daily aggregation of recorded weather phenomena in Australia, computed from the hourly ERA5 climate reanalysis from the European Centre for Medium-Range Weather Forecasts (ECMWF).
Historical weather forecasts: This dataset has 9 variables and over 217k records. This file contains the same variables as the weather data above, but these are forecasts, not observations. There is an extra column, Lead time, that gives the number of days the forecast is valid for. There are 3 different lead times provided, which can be used in building prediction models and in validating forecasts against actual numbers. This historical weather forecast data will be key to building a model that predicts wildfire areas before they happen!
Historical vegetation index: This dataset has 7 variables and over 1.3k records. It measures the greenness of the vegetation, derived from satellite images over the 7 regions.
Land classes (static throughout the contest): The dataset has 15 variables and 7 records. It shows the classification of Australian land classes (static data), that is, the percentage of each type of land cover per region.
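
To make the "validation between forecast and actual numbers" idea a bit more concrete, here is a minimal sketch of how forecasts could be compared with observations for each lead time. It is not taken from the repository: the tuple layout, the region code, the lead-time values, and all the numbers are made up for illustration, and I am assuming each forecast row carries the date it is valid for.

```python
from collections import defaultdict
from datetime import date


def forecast_error_by_lead_time(forecasts, observations):
    """Mean absolute error of forecasts against observations, per lead time.

    forecasts:    iterable of (region, day, lead_time, value) tuples
    observations: iterable of (region, day, value) tuples
    Both layouts are hypothetical, not the actual CSV schema.
    """
    # Index the observed values by (region, day) for quick lookup.
    actual = {(region, day): value for region, day, value in observations}

    errors = defaultdict(list)
    for region, day, lead_time, value in forecasts:
        key = (region, day)
        if key in actual:  # skip forecasts with no matching observation
            errors[lead_time].append(abs(value - actual[key]))

    return {lead: sum(errs) / len(errs) for lead, errs in sorted(errors.items())}


# Made-up temperatures for one region and two days.
observations = [
    ("NSW", date(2021, 1, 10), 30.0),
    ("NSW", date(2021, 1, 11), 32.0),
]
forecasts = [
    ("NSW", date(2021, 1, 10), 5, 29.0),
    ("NSW", date(2021, 1, 11), 5, 33.0),
    ("NSW", date(2021, 1, 10), 10, 27.0),
    ("NSW", date(2021, 1, 11), 10, 36.0),
]

mae = forecast_error_by_lead_time(forecasts, observations)
print(mae)  # {5: 1.0, 10: 3.5}
```

In this toy data the longer lead time has the larger error, which is the kind of check you may want to run on the real files before deciding how much to trust each forecast horizon.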

Within the notebooks folder, the following 4 Jupyter notebooks are provided, and I would suggest reading them in the order below:

wildfire-data-introduction.ipynb
This notebook analyzes the wildfires with a complete data science workflow on the given datasets, and it can be broken down into 4 main sections - (1) Data Introduction (2) Data Exploration (3) Models (4) Next steps.
It loads and explains each of the 5 datasets, then explores them with various descriptive analyses. In this notebook, you can compare fire and weather conditions region by region over the last decade, along with the vegetation index across the 7 regions. Since most of the datasets are daily time series, it is essential to aggregate them properly. Before predictive modeling, it plots the data distributions to spot possible outliers and computes the correlations among variables. What I like the most about this notebook is that it incorporates numerous visualizations to help readers understand the datasets better.
As the goal of the challenge is to predict future wildfires, linear regression and decision tree regression models are used for prediction, and RMSE and R² are used to evaluate the models. I think this notebook is a great first step for anyone interested in this topic but not sure where to start.
Wildfire and Weather Data.ipynb
This notebook demonstrates a few great data preprocessing steps, such as dropping duplicated values, handling missing values, cross-verifying data accuracy, and various aggregations. After cleaning and preprocessing all 5 datasets, the weather datasets and the wildfires dataset are merged for further analysis. Like I said before, appropriate aggregation of the time-series data can make later analysis more meaningful. The 2nd part of the notebook visualizes the merged dataset to gain more insights.
This notebook shows in detail how the time-series datasets are processed and merged for the modeling shown in the first notebook. I believe it will be helpful for people who are unsure about how to properly handle time-series data.
EDA_Wildfire Prediction_22112020.ipynb
This notebook uses the cleaned and merged dataset to visualize the area of the wildfires on a map and over time, among other visualizations. It compares different parameters by time and region using multivariate analysis graphs. There are also scatter plots showing correlation at different lags.
This notebook offers a comprehensive, visual exploratory data analysis, which is useful for understanding the data better and finding relationships among variables.
Hypothesis Testing (Time Series).ipynb
This notebook checks the assumption that time-series data is stationary, and explains why it needs to be. There are a few techniques to validate that assumption; in this notebook, hypothesis testing is used. The author used the appliance energy prediction dataset from the UCI machine learning repository, which contains house temperature and humidity conditions monitored with wireless sensors.
This notebook serves as an example of how the datasets in this repo can be processed, and it should be helpful for anyone who wants to learn more about the statistical methods behind it.
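
Since the first notebook evaluates its regression models with RMSE and R², here is a minimal sketch of what those two metrics measure, using a one-variable least-squares line in plain Python. This is not the notebook's code: the function names are my own, and the temperature and fire-area numbers are made up for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up data: daily max temperature (°C) vs. fire area (km²).
temperature = [20.0, 25.0, 30.0, 35.0]
fire_area = [10.0, 20.0, 30.0, 45.0]

slope, intercept = fit_line(temperature, fire_area)
predicted = [slope * x + intercept for x in temperature]

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"RMSE={rmse(fire_area, predicted):.3f}")
print(f"R2={r2_score(fire_area, predicted):.3f}")
```

RMSE is in the same unit as the target (km² here), so lower is better, while R² closer to 1 means the model explains more of the variation. In practice you would compute both on held-out data rather than on the data used for fitting, as done here only to keep the sketch short.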

Closing comments

Now that you have an idea of what the GitHub repository contains, if you are looking to contribute to predicting future wildfires and make an impact on global climate change, here are a few important webinars you can't miss.

The webinar on Predicting Australian Wildfires with Weather Data, held on April 5th, is another great resource. The session shared abundant information you can take advantage of, including an introduction to the wildfire datasets, 3 different data science approaches from the winners of the Call for Code Spot Challenge for Wildfires, university students sharing their experiences with the Spot Challenge, and so on. The 3 winners were invited to the webinar to share their methods and thought processes. I enjoyed hearing how each team handled the problem creatively. They each have unique ways of processing the data, yet they all improved their predictions. You can watch the webinar replay here!
Or you can register for the upcoming webinar on "Firefighter health platform, a Call for Code Open Source Project" on April 26th. Prometeo, the project featured in the session, created and continues to develop a prototype sensor that sends environmental telemetry processed by AI to monitor firefighter health risk. IBM continues to create new webinars on the crowdcast site to help you and provide developer education. If you have any questions, here is the community where you can ask questions or discuss any issues: Slack Workspace Channel #cfcsc-wildfires

Finally, if you're looking for more inspiration, I encourage you to participate in this year's Call for Code challenge! This year's topics are Zero Hunger, Clean water and sanitation, and Responsible production and green consumption. Click on the link to learn more and participate!
Commit to the cause. Push for change. Answer the call!
Good luck :)

References that made this blog post happen

National Geographic: https://www.nationalgeographic.com/environment/article/wildfires
The Contest GitHub: https://github.com/Call-for-Code/Spot-Challenge-Wildfires
Crowdcast webinar link: https://www.crowdcast.io/e/predicting-australian/register
A great resource for wildfire topics: http://ibm.biz/cfcsc-wildfires
PAIRS GeoScope: https://pairs.res.ibm.com/tutorial/