Classification from scratch — Mammographic Mass Classification

#machinelearning #beginners #python #datascience

In our previous article, we discussed classification in theory. Now it’s time to play with the code 😉 Before we start coding, install the following libraries:

Pandas: pip install pandas
Numpy: pip install numpy
scikit-learn: pip install scikit-learn

The task here is to classify mammographic masses as benign or malignant using different classification algorithms, including SVM, Logistic Regression and Decision Trees. A benign tumor does not invade other tissues, whereas a malignant tumor can spread. Mammography is the most effective method for breast cancer screening available today.

Dataset

The dataset used in this project is “Mammographic masses”, a public dataset from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass).

It can be used to predict the severity (benign or malignant) of a mammographic mass from BI-RADS attributes and the patient’s age. Number of Attributes: 6 (1 goal field: severity, 1 non-predictive: BI-RADS, 4 predictive attributes)

Attribute Information:

BI-RADS assessment: 1 to 5 (ordinal)
Age: patient’s age in years (integer)
Shape (mass shape): round=1, oval=2, lobular=3, irregular=4 (nominal)
Margin (mass margin): circumscribed=1, microlobulated=2, obscured=3, ill-defined=4, spiculated=5 (nominal)
Density (mass density): high=1, iso=2, low=3, fat-containing=4 (ordinal)
Severity: benign=0 or malignant=1 (binomial)
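Shape and Margin are nominal, so their integer codes are just labels and carry no order. As a small sketch (the three rows below are made up, and the names demo, decoded and one_hot are only for illustration), the codes can be mapped back to their names and, if you want to try it, turned into one-hot columns with pandas. The rest of this article keeps the raw integer codes.

```python
import pandas as pd

# Code-to-name mappings taken from the attribute list above.
shape_names = {1: 'round', 2: 'oval', 3: 'lobular', 4: 'irregular'}
margin_names = {1: 'circumscribed', 2: 'microlobulated', 3: 'obscured',
                4: 'ill-defined', 5: 'spiculated'}

# Three made-up rows, only to illustrate the mapping.
demo = pd.DataFrame({'Shape': [3, 1, 4], 'Margin': [5, 1, 4]})

# Replace the integer codes with readable names.
decoded = demo.assign(Shape=demo['Shape'].map(shape_names),
                      Margin=demo['Margin'].map(margin_names))

# One indicator column per category that occurs in the data.
one_hot = pd.get_dummies(decoded, columns=['Shape', 'Margin'])

print(decoded)
print(one_hot.columns.tolist())
```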

Screenshot of top 10 rows of the dataset

We have talked a lot about the theory, and building a classification model is fairly simple in practice. Follow the steps below and get your own model in an hour 😃 So let’s get started!

Approach

Create a new IPython Notebook and insert the code below to import the necessary modules. If you get an import error, install the missing package using pip.

import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn import svm
from sklearn import linear_model

Read the data into a dataframe using pandas. head() shows the top 5 rows by default; you can pass the number of rows as an argument to see a different number, as we do here with masses_data.head(10). The BI-RADS attribute is marked as non-predictive in the dataset, so it is left out via usecols. Missing values are coded as '?' in the file, and na_values='?' turns them into NaN.

input_file = 'mammographic_masses.data.txt'
masses_data = pd.read_csv(input_file, na_values='?',
                          names=['BI-RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity'],
                          usecols=['Age', 'Shape', 'Margin', 'Density', 'Severity'])
masses_data.head(10)
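If you want to see what names, usecols and na_values do without downloading the file, here is a minimal sketch. The three rows are made up and only mimic the comma-separated layout of the data file; sample and demo are illustrative names.

```python
import io
import pandas as pd

# Three made-up rows in the same layout as the data file; '?' marks a missing value.
sample = io.StringIO(
    "5,67,3,5,3,1\n"
    "4,43,1,1,?,1\n"
    "5,58,4,5,3,1\n"
)

columns = ['BI-RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity']

# names labels the columns, usecols drops BI-RADS, na_values turns '?' into NaN.
demo = pd.read_csv(sample, names=columns, usecols=columns[1:], na_values='?')

print(demo)
print(demo.isna().sum())  # number of missing values per column
```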

You can get a summary of the data, such as the count, mean and standard deviation of each column, with masses_data.describe().

As you might have observed, there are missing values in the dataset. Handling missing data is an important part of data preprocessing. One option is to fill the empty values with the mean or mode of the column, depending on what the data analysis suggests. For simplicity, for now you can drop the rows with null values from the data.

masses_data = masses_data.dropna()
features = list(masses_data.columns[:4])
X = masses_data[features].values
print(X)
labels = list(masses_data.columns[4:])
y = masses_data[labels].values
y = y.ravel()
print(y)

The array X contains the four input features (Age, Shape, Margin and Density), one column each, without the target variable. Their values will be used for training. The target variable, i.e. Severity, is stored in the vector y.
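As an alternative to dropping rows, the mean or mode filling mentioned earlier could look roughly like this. It is only a sketch on made-up rows: using the mean for Age and the mode for the coded columns is my assumption here, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps, using the same column names as the article.
demo = pd.DataFrame({
    'Age':      [67.0, 43.0, np.nan, 28.0],
    'Shape':    [4.0, 1.0, 4.0, np.nan],
    'Margin':   [5.0, 1.0, 5.0, 5.0],
    'Density':  [3.0, np.nan, 3.0, 3.0],
    'Severity': [1, 1, 1, 0],
})

filled = demo.copy()

# Numeric column: replace gaps with the column mean.
filled['Age'] = filled['Age'].fillna(filled['Age'].mean())

# Coded categorical columns: replace gaps with the most frequent code.
for col in ['Shape', 'Margin', 'Density']:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

Whether filling or dropping works better depends on how many rows are affected, so it is worth comparing both on the real data.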

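Scale the input features so that they are on a comparable scale. Here we are using StandardScaler(), which transforms each feature to have a mean of 0 and a standard deviation of 1. Note that the scaler is fitted on the whole dataset here, before the train/test split, so statistics from the test rows influence the scaling; a stricter workflow fits the scaler on the training rows only.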

scaler  = StandardScaler()
X = scaler.fit_transform(X)
print(X)
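To make the transformation concrete, here is a small sketch that compares StandardScaler with the same computation done by hand. The Age and Shape values in X_demo are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up Age and Shape values for four patients.
X_demo = np.array([[67.0, 3.0],
                   [43.0, 1.0],
                   [58.0, 4.0],
                   [28.0, 1.0]])

scaled = StandardScaler().fit_transform(X_demo)

# Same computation by hand: subtract the column mean and divide by the column
# standard deviation (population version, ddof=0, which is NumPy's default).
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```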

Create training and testing sets using train_test_split. 25% of the data is used for testing and 75% for training.

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=0)
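A slightly stricter variant of this step, shown only as a sketch on synthetic data, splits the unscaled features first and then fits the scaler on the training rows alone. The stratify argument is an optional extra I added to keep the class ratio similar in both parts; the names X_raw, X_tr and X_te are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four features and the binary target.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50, scale=10, size=(200, 4))
y_demo = rng.integers(0, 2, size=200)

# Split the unscaled data first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

# Learn mean and standard deviation from the training rows only,
# then apply the same transformation to the test rows.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr.shape, X_te.shape)
```

This way the test rows stay unseen until evaluation, at the cost of the test features no longer having exactly zero mean.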

To build a Decision Tree classifier from the training set, we just need the DecisionTreeClassifier class. It has a number of parameters, which are described in the scikit-learn documentation. For now, we will use the default value of each parameter (apart from random_state, which is fixed for reproducibility). Call predict() on the test input features X_test to get the predicted values y_pred. The score() method computes the accuracy of the predictions on the test samples directly.

clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print(y_pred)
clf.score(X_test, y_test)

The DecisionTreeClassifier() without any tuning gives an accuracy of around 77%, which is not bad for a first attempt.
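To see what score() actually reports, here is a sketch on synthetic data from make_classification (not the mammographic data, so the numbers will differ). It computes the accuracy by hand and also prints a confusion matrix, which shows how the errors are split between the two classes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem with four features.
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

tree_clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
y_hat = tree_clf.predict(X_te)

# Accuracy is the fraction of test samples predicted correctly.
accuracy_by_hand = np.mean(y_hat == y_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat, labels=[0, 1])

print(accuracy_by_hand, tree_clf.score(X_te, y_te))
print(cm)
```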

scikit-learn provides several classes for building an SVM classifier, including SVC, NuSVC and LinearSVC. We will build a classifier using the SVC class with a linear kernel. (For the difference between SVC with a linear kernel and LinearSVC, see https://stackoverflow.com/questions/45384185/what-is-the-difference-between-linearsvc-and-svckernel-linear/45390526)

svc = svm.SVC(kernel='linear', C=1)
scores = model_selection.cross_val_score(svc,X,y,cv=10)
print(scores)
print(scores.mean())

In this section, I am showing a different approach to evaluating a classifier. The svc object is created from the SVC class, and cross_val_score() fits and scores it on the full X and y using cross-validation rather than on a single train/test split. In k-fold cross-validation the data is divided into k folds; each fold serves once as the test set while the remaining k-1 folds are used for training, so every sample is used for testing exactly once. Averaging over the folds gives a more reliable estimate of how the model generalizes than a single split, which helps to detect overfitting. The mean score obtained here is around 79.5%.

Cross-Validation (source:https://scikit-learn.org/stable/modules/cross_validation.html)
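To show what happens inside cross_val_score(), here is a sketch that runs the folds by hand on synthetic data. It assumes the documented scikit-learn behavior that an integer cv with a classifier uses stratified folds without shuffling, which is why StratifiedKFold is used in the loop; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic binary classification problem with four features.
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=0)

folds = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    # Train on 9 folds, score on the fold that was held out.
    model = SVC(kernel='linear', C=1)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# The library call that the loop above imitates.
library_scores = cross_val_score(SVC(kernel='linear', C=1), X_demo, y_demo, cv=10)

print(np.mean(manual_scores), library_scores.mean())
```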

Similar to the Decision Tree Classifier, we can also create a Logistic Regression classifier using the LogisticRegression class. The classifier is fitted on the training set and used to predict target values for the test set in the same way. It is then evaluated with 10-fold cross-validation on the full data, which gives a mean score of 80.5%.

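# C is the inverse of regularization strength, so C=1e5 means very weak regularization
clf = linear_model.LogisticRegression(C=1e5)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# cross_val_score refits a fresh copy of clf on each of the 10 folds of the full data
scores = model_selection.cross_val_score(clf,X,y,cv=10)
print(scores)
print(scores.mean())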
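If you would like to compare the three classifiers under identical conditions, one possible sketch is to wrap each in a pipeline with the scaler and run the same 10-fold cross-validation on all of them. It uses synthetic data, so the scores will not match the ones quoted in this article, and the pipeline and max_iter=1000 are my additions rather than part of the original code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem with four features.
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=0)

models = {
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'SVM (linear)': SVC(kernel='linear', C=1),
    'Logistic Regression': LogisticRegression(C=1e5, max_iter=1000),
}

results = {}
for name, model in models.items():
    # The scaler is refitted on the training folds of every split.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=10)
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")
```

Putting the scaler inside the pipeline also means it never sees the held-out fold during fitting.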

Thus, a single classifier takes only about 10 lines of code 😄, and with very little effort we achieved an accuracy of about 80%. You can create your own classification models (there are plenty of options) or fine-tune any of these. If you are interested, you can also give Artificial Neural Networks a shot 😍. I got my best accuracy, 84%, with ANNs. To get the entire code, please use this link.

If you liked the article do show some ❤ Stay tuned for more! Till then happy learning 😸