Predicting the price of Heating Oil using PyCaret

Tony Hung

In this post, we'll go over how to perform time series forecasting on the price of heating oil.

Background

In upstate New York, we are unable to get natural gas service to our home for heating. The bedrock here is mostly shale, which makes it tough to extract natural gas, so we have to rely on oil for heat.

As the price of gas goes up and down, the price of heating oil moves with it. Many companies distribute oil, each with different prices, and those prices change between seasons. So I wanted to build an application that predicts the price of oil, so that I know when the best time to buy is.

Data

Before building the models, I needed data. I was unable to find a dataset with current heating oil prices, so I had to build my own. The website cheapestoil.com shows oil prices for many companies in the northeastern United States, but it only shows the latest prices, not historical ones.

To collect the prices, I built an AWS Lambda function that scrapes the site. AWS CloudWatch Events runs the function every 12 hours to fetch the prices at that time. The function extracts the last-updated date, price, and supplier, and saves the results as JSON to an S3 bucket. A second Lambda function, attached as an S3 trigger, reads each JSON file and saves the records into DynamoDB. Please see this GitHub repo for more detail on how I built these Lambda functions.
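To give a sense of the shape of that scraper, here is a minimal sketch of the first Lambda. Everything specific in it (the URL, the CSS selectors, the bucket name) is a hypothetical stand-in, not the actual code from the repo:

```python
import json
from datetime import datetime

import boto3
import requests
from bs4 import BeautifulSoup

s3 = boto3.client("s3")

def handler(event, context):
    # Hypothetical URL and page structure; the real values live in the repo above.
    page = requests.get("https://www.cheapestoil.com/heating-oil-prices/NY", timeout=10)
    soup = BeautifulSoup(page.text, "html.parser")

    prices = []
    for row in soup.select("table tr"):  # assumed table layout
        cells = [c.get_text(strip=True) for c in row.select("td")]
        if len(cells) >= 3:
            prices.append({
                "supplier": cells[0],
                "price150": cells[1],
                "last_updated": cells[2],
            })

    # One JSON file per scrape; the second Lambda picks these up via an S3 trigger.
    key = f"prices/{datetime.utcnow().isoformat()}.json"
    s3.put_object(Bucket="my-oil-prices-bucket", Key=key, Body=json.dumps(prices))
```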

I started this project back in Dec 2020 in order to build up my dataset. The Lambda function has been running for about 6 months, so I have a decent amount of data to work with. To expand the dataset further, I pulled older data using The Wayback Machine on web.archive.org, which stores snapshots of many pages on the internet. It doesn't have every site, but it did have some snapshots of cheapestoil.com. To download the archived data, I used https://github.com/hartator/wayback-machine-downloader. The archive only had 7 snapshots, between the dates of Aug 2020 and Oct 2021.

In all, I have about 5k records of oil prices from the website.

Fetching Saved Data

I used DynamoDB to store the oil price data and Boto3 to fetch it, then saved the results to a CSV file.

```python
import boto3
from boto3.dynamodb.conditions import Key
import pandas as pd
dynamodb = boto3.resource('dynamodb')

table = dynamodb.Table('heating_oil_prices')
response = table.scan()
df = pd.DataFrame(response["Items"])
df.to_csv("data.csv")
```
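One caveat worth noting: a single scan call returns at most 1 MB of data per request, so as the table grows you would want to follow LastEvaluatedKey to page through all the items. A sketch:

```python
# Page through the full table, since scan() returns at most 1 MB per call
items = []
response = table.scan()
items.extend(response["Items"])

while "LastEvaluatedKey" in response:
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])

df = pd.DataFrame(items)
```

With the CSV saved, a small helper reloads and tidies the data for analysis: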
```python
def get_data():
    df = pd.read_csv("data.csv", usecols=["last_updated", "price150", "price500", "price300", "supplier", "state"])
    df["last_updated"] = pd.to_datetime(df["last_updated"])
    df = df.set_index("last_updated")
    # rows with a missing state default to New York
    df["state"] = df["state"].apply(lambda x: "NewYork" if str(x) == "nan" else x)
    df = df.sort_index()
    return df
df = get_data()
```

On cheapestoil.com, prices are quoted per gallon, but the per-gallon price differs slightly depending on how many gallons you buy. If we look at the suppliers in the state of New York, we'll see the following:

```python
state = "NewYork"
suppliers_by_state = df[df["state"] == state].dropna()
suppliers_by_state.iloc[1]
```
```
price500          1.449
price300          1.469
price150          1.549
supplier    Suffolk Oil
state           NewYork
Name: 2020-08-03 15:11:16, dtype: object
```

The row above shows that the price for 500 gallons (price500) is $1.449 per gallon, for 300 gallons (price300) is $1.469, and for 150 gallons (price150) is $1.549.

Let's see how many price records we have for New York:

```python
suppliers_by_state["supplier"].value_counts().sum()
```

```
2708
```
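As an aside, value_counts().sum() counts the total number of price records across all suppliers. To count the distinct suppliers themselves, pandas' nunique() would do it (output not shown here):

```python
suppliers_by_state["supplier"].nunique()
```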

Since we have so many records spread across many suppliers, forecasting the average price of oil over all suppliers might be a better way to go, since prices are similar from company to company.

```python
df = df.reset_index()
data = df[df["state"] == state].resample('d', on='last_updated').mean().dropna()
data = data.reset_index()
```

Here is the mean price of oil over all the suppliers in New York.

```python
data.head()
```
```
  last_updated  price500  price300  price150
0   2020-08-03  1.672000  1.719400   1.76140
1   2020-08-04  1.632429  1.625750   1.68075
2   2020-10-29  1.852500  1.819455   1.85300
3   2020-11-11  1.786250  1.759000   1.76900
4   2020-11-12  1.420000  1.460000   1.54000
```

For a quick check on our data, let's plot the price for 500 gallons

data["price500"].plot()

Enter fullscreen mode Exit fullscreen mode
<AxesSubplot:>
Enter fullscreen mode Exit fullscreen mode

(Plot: mean price500 over time.)

Model

Now, we can start building our model. We'll be using PyCaret to build our time series forecast.
Before modeling, we need to replace the date column with numeric values. To do this, I've used fastai's add_datepart function, which expands the date into a series of features: year, month, day, day of week, and more.

```python
from fastai.tabular.all import *
add_datepart(data, field_name="last_updated")
```
```
price500 price300 price150 last_updatedYear last_updatedMonth last_updatedWeek last_updatedDay last_updatedDayofweek last_updatedDayofyear last_updatedIs_month_end last_updatedIs_month_start last_updatedIs_quarter_end last_updatedIs_quarter_start last_updatedIs_year_end last_updatedIs_year_start last_updatedElapsed
0 1.672000 1.719400 1.761400 2020 8 32 3 0 216 False False False False False False 1.596413e+09
1 1.632429 1.625750 1.680750 2020 8 32 4 1 217 False False False False False False 1.596499e+09
2 1.852500 1.819455 1.853000 2020 10 44 29 3 303 False False False False False False 1.603930e+09
3 1.786250 1.759000 1.769000 2020 11 46 11 2 316 False False False False False False 1.605053e+09
4 1.420000 1.460000 1.540000 2020 11 46 12 3 317 False False False False False False 1.605139e+09
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
56 2.220000 2.260000 2.340000 2021 6 26 28 0 179 False False False False False False 1.624838e+09
57 2.632621 2.626935 2.646290 2021 7 26 2 4 183 False False False False False False 1.625184e+09
58 2.606429 2.582633 2.611437 2021 7 27 5 0 186 False False False False False False 1.625443e+09
59 2.619422 2.609042 2.632478 2021 7 27 6 1 187 False False False False False False 1.625530e+09
60 2.571294 2.545455 2.575238 2021 7 27 7 2 188 False False False False False False 1.625616e+09

61 rows × 16 columns
```

The add_datepart function generates lots of features for the date, but we don't need all of them. For this model, we'll use last_updatedYear, last_updatedMonth, and last_updatedDay. In future models, we can try the other features.

```python
data = data[['price500', 'last_updatedYear', 'last_updatedMonth', 'last_updatedDay']]
```
```python
mask = round(len(data) * 0.7)
train = data[:mask]
test = data[mask:]
train.shape, test.shape
```

```
((43, 4), (18, 4))
```

After we split the data into train and test sets, we must initialize PyCaret's regression setup. This includes supplying the dataset, the target we are predicting (price500), and the other features to use (last_updatedYear, last_updatedMonth, and last_updatedDay).

```python
from pycaret.regression import *
s = setup(data = train, test_data = test, target = 'price500', fold_strategy = 'timeseries', numeric_features = ['last_updatedYear','last_updatedMonth', 'last_updatedDay'], fold = 3, transform_target = True, session_id = 42)
```

Next, we call the compare_models function to find the best model by Mean Absolute Error (MAE), which is the mean of the absolute differences between the model's predictions and the actual values.
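For reference, here is what that metric computes, as a tiny numpy sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([1.50, 1.55, 1.60])  # actual prices (made-up)
y_pred = np.array([1.45, 1.58, 1.53])  # model predictions (made-up)

mae = np.mean(np.abs(y_true - y_pred))  # 0.05
```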

```python
best = compare_models(sort = 'MAE')
```

From the results of compare_models, it looks like the Passive Aggressive Regressor has the lowest MAE (0.0605), so we'll use that model on the test set.

```python
preds = predict_model(best)
```

Next, we'll use this model to generate a forecast for the coming week of prices, building a frame of eight dates starting 2021-07-08 and deriving the same date features used in training:

```python
forecast_df = pd.DataFrame()
forecast_df["last_updated"] = pd.date_range(start='2021-07-08', periods=8)
add_datepart(forecast_df, 'last_updated')
forecast_df = forecast_df[['last_updatedYear', 'last_updatedMonth', 'last_updatedDay']]
forecast_df
```
```python
predictions = predict_model(best, data=forecast_df)
predictions
```

Once we have our predictions, we'll merge them back into the original data and plot the last 15 days in the dataframe.

```python
def pad_value(day):
    # zero-pad single-digit months and days so the date string parses cleanly
    value = str(int(day))
    if len(value) == 1:
        return f'0{value}'
    return value

results_df = pd.concat([data, predictions], axis=0)
dates = []
for idx, x in results_df.iterrows():
    date_str = f'{pad_value(x["last_updatedYear"])}-{pad_value(x["last_updatedMonth"])}-{pad_value(x["last_updatedDay"])}'
    dates.append(pd.to_datetime(date_str, format='%Y-%m-%d'))
results_df["date"] = dates

# drop the date-part columns now that we have a proper datetime index
results_df.drop(["last_updatedYear", "last_updatedMonth", "last_updatedDay"], axis=1, inplace=True)
results_df = results_df.set_index('date')
```
```python
results_df[-15:].plot()
```

The blue line above shows the actual prices and the orange line shows the predictions, which do not look great. If we zoom in on the predictions, we'll see an increase in price per day:

```python
results_df[-8:].plot()
```

Saving The Model

After training our model, we can save it and use it for forecasting on other data:

```python
save_model(best, "model")
```
```
Transformation Pipeline and Model Succesfully Saved





(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[], ml_usecase='regression',
                                       numerical_features=['last_updatedYear',
                                                           'last_updatedMonth',
                                                           'last_updatedDay'],
                                       target='price500', time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_ca...
                                                  regressor=PassiveAggressiveRegressor(C=1.0,
                                                                                       average=False,
                                                                                       early_stopping=False,
                                                                                       epsilon=0.1,
                                                                                       fit_intercept=True,
                                                                                       loss='epsilon_insensitive',
                                                                                       max_iter=1000,
                                                                                       n_iter_no_change=5,
                                                                                       random_state=42,
                                                                                       shuffle=True,
                                                                                       tol=0.001,
                                                                                       validation_fraction=0.1,
                                                                                       verbose=0,
                                                                                       warm_start=False),
                                                  shuffle=True, tol=0.001,
                                                  validation_fraction=0.1,
                                                  verbose=0,
                                                  warm_start=False)]],
          verbose=False),
 'model.pkl')
```
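Later (for example, in the Streamlit app teased below), the saved pipeline can be reloaded with PyCaret's load_model and used for new predictions. A quick sketch, reusing the forecast_df built above:

```python
from pycaret.regression import load_model, predict_model

model = load_model("model")  # reads model.pkl from disk
predictions = predict_model(model, data=forecast_df)
```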

Training Model for all Prices

Now that we have an initial model, we can train three separate models, one for each price tier (price150, price300, and price500).
We'll refactor the model training code into a function that prepares the data for each price, finds the model with the best score, and saves it to a separate file.

```python
def train(data, price, state="NewYork"):
    data = data.reset_index()
    data = data[ data["state"] == state].resample('d', on='last_updated').mean().dropna()
    data = data.reset_index()

    add_datepart(data, field_name="last_updated")
    data = data[[price, 'last_updatedYear', 'last_updatedMonth', 'last_updatedDay']]   
    mask = round(len(data) * 0.7)
    train = data[:mask]
    test = data[mask:]
    print("train",train.shape)
    print("test",test.shape)

    s = setup(data = train, test_data = test, target = price, fold_strategy = 'timeseries', numeric_features = ['last_updatedYear','last_updatedMonth', 'last_updatedDay'], fold=2, transform_target = True, session_id = 42)
    best = compare_models(sort = 'MAE')
    save_model(best, f'{price}_model')
```
```python
for feature in ["price150", "price300", "price500"]:
    df = get_data()
    train(df, feature)
```
```
Transformation Pipeline and Model Succesfully Saved
```




Conclusion and What's Next

In this post, we've seen how to build a time series model for forecasting the price of heating oil. In the next post, we'll go over how to deploy these models in a Streamlit application, and walk through the data collection process in more detail.

Top comments (1)

Collapse
 
lutharblunks profile image
lutharblunks

Can we integrate that price updation feature in heat pump austin texas website so it extract all prices from all ac repairs and get an avg of all price and suggest me best one. Or is there any software which can do this?