What is your routine in a data science project? - fastai lesson1

Hi there,

I am currently studying machine learning with fast.ai.

I organized my notes for lesson 1 below into three categories: a process summary, the code used, and the concepts covered in the video. The best way to learn is to be able to re-explain all of this and to apply the new knowledge to Kaggle competitions.

1. Process summary

  1. set up the Jupyter notebook and the environment
  2. download the data from Kaggle
  3. convert all data into numbers or booleans: new features are extracted from dates (year, month) and string categorical data are mapped to numbers
  4. take care of missing data: missing continuous values are replaced with the median and a new boolean feature column _na is created; missing categorical values are handled by pandas and automatically set to -1 (see the sketch after this list)
  5. separate the training and validation sets: the model is trained on the training set, then evaluated on the validation set to check that it generalizes
  6. train the model
  7. print the accuracy scores
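
To make step 4 concrete, here is a minimal pandas sketch (the MachineHours column is hypothetical) of the median imputation and _na flag described above, and of how pandas encodes a missing category as -1:

import pandas as pd

# continuous column: flag the missing values, then fill them with the median
df = pd.DataFrame({'MachineHours': [120.0, None, 300.0]})
df['MachineHours_na'] = df['MachineHours'].isnull()
df['MachineHours'] = df['MachineHours'].fillna(df['MachineHours'].median())

# categorical column: pandas assigns the code -1 to missing values
usage = pd.Series(['High', None, 'Low'], dtype='category')
print(usage.cat.codes.tolist())  # [0, -1, 1]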

2. Code

# Jupyter magics: auto-reload edited modules and show plots inline
%load_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.imports import *
from fastai.structured import *
from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics

PATH = "data/bulldozers/"

df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=["saledate"])

# transpose so each feature appears as a row (display_all is defined in the helpers below)
display_all(df_raw.tail().T)

# the competition is scored on RMSLE, so taking the log lets us optimize plain RMSE
df_raw.SalePrice = np.log(df_raw.SalePrice)

# extract numeric date features (saleYear, saleMonth, ...) and drop the raw saledate
add_datepart(df_raw, 'saledate')
df_raw.saleYear.head()

# convert string columns to categories, order UsageBand, and encode it as integer codes
train_cats(df_raw)
df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True, inplace=True)
df_raw.UsageBand = df_raw.UsageBand.cat.codes

# fraction of missing values in each column
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))

os.makedirs('tmp', exist_ok=True)
# save the processed DataFrame in feather format for fast reloading
df_raw.to_feather('tmp/bulldozers-raw')
df_raw = pd.read_feather('tmp/bulldozers-raw')

# replace categories with their codes, fix missing values, and split off the target
df, y, nas = proc_df(df_raw, 'SalePrice')

m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
m.score(df, y)  # R² on the training set

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape

There are a few helper functions:

  • display_all() raises pandas' display limits so that all rows and columns are shown (combined with .T, every feature appears as a row)
  • split_vals() splits a dataset into a training part and a validation part
  • print_score() prints the RMSE and the R² score on the training and validation sets, plus the OOB score when available

def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)

def split_vals(a,n): return a[:n].copy(), a[n:].copy()

def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    # [train RMSE, valid RMSE, train R², valid R², (OOB score if available)]
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

The following functions are included in the fast.ai library (see the sketch after this list):

  • add_datepart() generates new numerical features from a date (year, month, etc.) and removes the original date column
  • train_cats() converts string columns to pandas categories, which are backed by integer codes (ex: red: 1, blue: 2, etc.)
  • proc_df() replaces categories with their numeric codes, handles missing continuous values (replaces them with the median and creates a new _na feature column) and splits the dependent variable off into a separate variable
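
Here is a minimal sketch of these three helpers on a hypothetical toy DataFrame (assuming fastai 0.7, where they live in fastai.structured):

import pandas as pd
from fastai.structured import add_datepart, train_cats, proc_df

toy = pd.DataFrame({'saledate': pd.to_datetime(['2011-11-16', '2012-03-26']),
                    'UsageBand': ['High', None],
                    'SalePrice': [9.2, 10.1]})

add_datepart(toy, 'saledate')          # adds saleYear, saleMonth, ... and drops saledate
train_cats(toy)                        # UsageBand becomes a pandas category
X, y, nas = proc_df(toy, 'SalePrice')  # X is all-numeric, y holds the target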

3. Concepts

  • structured data/unstructured data: structured data are tabular data; an example of unstructured data is images
  • curse of dimensionality: the idea that the more dimensions you have, the more all of the points sit on the edge of that space (see the simulation after this list)
  • no free lunch theorem: in theory, there is no single type of model that works well on every possible (random) data set
  • regression/classification: regression is the prediction of a continuous variable (ex: price prediction) and classification is true/false categorization or the identification of multiple categories (ex: categorization of fruits)
  • overfitting: when a model is too specific to a dataset, it will not generalize well to a new dataset; a validation set helps diagnose this problem (see the snippet below)
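
To see the curse of dimensionality numerically, here is a small simulation (entirely illustrative, not from the lesson): the fraction of random points lying within 1% of an edge of the unit cube grows quickly with the number of dimensions.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100):
    pts = rng.random((10_000, d))  # 10,000 uniform points in the d-dimensional unit cube
    near_edge = ((pts < 0.01) | (pts > 0.99)).any(axis=1).mean()
    print(d, round(near_edge, 3))  # roughly 1 - 0.98**d: ~0.04, ~0.18, ~0.87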
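
And to diagnose overfitting with the helpers from section 2: a training score far above the validation score means the model has memorized the training set rather than learned to generalize.

m = RandomForestRegressor(n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)  # a large gap between the train and valid RMSE/R² indicates overfitting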

[Figure: an illustration of overfitting]


That is it for the first lesson!

Don't forget to recall: are you able to explain the basic process to begin a data science notebook? How do you handle missing values? What are regression and classification? What is overfitting?

And practice with Kaggle to make this new knowledge second nature!

Note: I think that the easiest way to begin a data science notebook is to use Google Colab.

# fastai 0.7 is the version used in this course
!pip install fastai==0.7.0

import io
import pandas as pd
from google.colab import files

# upload train.csv from your machine, then read it into a DataFrame
uploaded = files.upload()
df_raw = pd.read_csv(io.BytesIO(uploaded['train.csv']))

You can follow me on Instagram @oyane806!
