Beginning with Machine Learning (6 Part Series)
In our previous article, we discussed the classification technique in theory. It’s time to play with the code 😉 Before we can start coding, the following libraries need to be installed in our system:
- Pandas: pip install pandas
- Numpy: pip install numpy
- scikit-learn: pip install scikit-learn
The task here is to classify mammographic masses as benign or malignant using different classification algorithms, including SVM, Logistic Regression, and Decision Trees. A benign tumor does not invade other tissues, whereas a malignant one spreads. Mammography is the most effective method for breast cancer screening available today.
The dataset used in this project is “Mammographic Masses”, a public dataset from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass).
It can be used to predict the severity (benign or malignant) of a mammographic mass from BI-RADS attributes and the patient’s age. Number of Attributes: 6 (1 goal field: severity, 1 non-predictive: BI-RADS, 4 predictive attributes)
- BI-RADS assessment: 1 to 5 (ordinal)
- Age: patient’s age in years (integer)
- Shape (mass shape): round=1, oval=2, lobular=3, irregular=4 (nominal)
- Margin (mass margin): circumscribed=1, microlobulated=2, obscured=3, ill-defined=4, spiculated=5 (nominal)
- Density (mass density): high=1, iso=2, low=3, fat-containing=4 (ordinal)
- Severity: benign=0 or malignant=1 (binomial)
So we talked a lot about the theory behind it. It’s fairly simple to build a classification model. Follow the below steps and get your own model in an hour 😃 So let’s get started!
Create a new IPython Notebook and insert the below code to import the necessary modules. In case you get any error, do install the necessary packages using pip.
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn import svm
from sklearn import linear_model
Read the data into a dataframe using pandas. To check the top 5 rows of a dataframe, use df.head(). You can pass the number of rows as an argument to this function if you want to see a different number of rows; the code below shows the first 10. The BI-RADS attribute is marked as non-predictive in the dataset, so it is not loaded.
input_file = 'mammographic_masses.data.txt'
masses_data = pd.read_csv(
    input_file,
    names=['BI-RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity'],
    usecols=['Age', 'Shape', 'Margin', 'Density', 'Severity'],
    na_values='?',
)
masses_data.head(10)
You can get summary statistics of the data, such as count, mean, and standard deviation, by calling describe() on the dataframe.
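Here is a minimal sketch of inspecting summary statistics and missing values. It does not read the real file; the few rows below are illustrative stand-ins in the same comma-separated format, and the names sample, sample_df and missing_counts are made up for this example.

```python
import io

import pandas as pd

# Illustrative rows in the same format as the data file ('?' marks a missing value).
sample = io.StringIO(
    "5,67,3,5,3,1\n"
    "4,43,1,1,?,1\n"
    "5,58,4,5,3,1\n"
    "4,28,1,1,3,0\n"
    "5,74,1,5,?,1\n"
)
columns = ['BI-RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity']
sample_df = pd.read_csv(sample, names=columns, usecols=columns[1:], na_values='?')

# Summary statistics (count, mean, std, min, quartiles, max) for each column.
print(sample_df.describe())

# Number of missing values in each column.
missing_counts = sample_df.isna().sum()
print(missing_counts)
```

The count row of describe() only counts non-missing values, so a column whose count is lower than the number of rows contains gaps.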
As you might have observed, there are missing values in the dataset. Handling missing data is an important part of data preprocessing. A common approach is to fill the empty values with the mean or mode of the column, depending on what the data analysis suggests. For simplicity, we just drop the rows with null values here.
masses_data = masses_data.dropna()

features = list(masses_data.columns[:4])
X = masses_data[features].values
print(X)

labels = list(masses_data.columns[4:])
y = masses_data[labels].values
y = y.ravel()
print(y)
X contains the four input features (Age, Shape, Margin, and Density), excluding the target variable. Their values will be used for training. The target variable, i.e. Severity, is stored in the vector y.
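If you would rather fill the missing values than drop rows, a rough sketch could look like the following. It assumes the mean is a sensible fill for Age and the mode for the coded columns, which is a choice you should check against your own analysis. The toy dataframe and the names toy and filled are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with a few gaps, using the same column names as the article.
toy = pd.DataFrame({
    'Age': [67.0, 43.0, np.nan, 28.0],
    'Shape': [3.0, 1.0, 4.0, np.nan],
    'Margin': [5.0, 1.0, 5.0, 1.0],
    'Density': [3.0, np.nan, 3.0, 3.0],
})

filled = toy.copy()

# Numeric column: fill gaps with the column mean.
filled['Age'] = filled['Age'].fillna(filled['Age'].mean())

# Coded columns: fill gaps with the most frequent value (mode).
# mode() can return several values on a tie; iloc[0] picks the smallest.
for col in ['Shape', 'Margin', 'Density']:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```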
Scale the input features so that they are all on a comparable scale. Here we are using StandardScaler(), which transforms each feature to have a mean of 0 and a standard deviation of 1.
scaler = StandardScaler()
X = scaler.fit_transform(X)
print(X)
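To see what the scaler does, here is a small sketch on made-up numbers that compares StandardScaler() with the formula (x - mean) / std computed by hand. The arrays toy_X, toy_scaled and manual are illustrative names only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up feature columns on very different scales.
toy_X = np.array([[67.0, 3.0], [43.0, 1.0], [58.0, 4.0], [28.0, 1.0]])

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy_X)

# The same standardization by hand: subtract the column mean, divide by the column std.
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

print(toy_scaled.mean(axis=0))  # each column mean is (numerically) 0
print(toy_scaled.std(axis=0))   # each column standard deviation is 1
```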
Create the training and testing sets using train_test_split(). 25% of the data is used for testing and 75% for training.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=0)
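As a quick sanity check of the 75/25 split, here is a sketch on random stand-in data. The stratify argument is an optional extra that the article's code does not use; it keeps the class proportions the same in both sets. All variable names below are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 samples, 4 features, 60 samples of class 0 and 40 of class 1.
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(100, 4))
demo_y = np.array([0] * 60 + [1] * 40)

Xtr, Xte, ytr, yte = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=0, stratify=demo_y
)

print(Xtr.shape, Xte.shape)  # 75 training rows and 25 test rows
print(ytr.mean(), yte.mean())  # share of class 1 in each set
```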
To build a Decision Tree Classifier from the training set, we just need to use DecisionTreeClassifier(). It has a number of parameters, which you can read about in the scikit-learn documentation. For now, we will just use the default value of each parameter. Use predict() on the test input features X_test to get the predicted values y_pred. The function score() can be used directly to compute the accuracy of the predictions on the test samples.
clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_pred)
clf.score(X_test, y_test)
DecisionTreeClassifier() without any tuning gives an accuracy of around 77% on the test set, which is not bad for a first attempt.
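The sketch below shows that score() on a classifier is simply the accuracy of predict(), and how one parameter (max_depth) can be set as a first tuning step. It runs on synthetic data from make_classification rather than the mammographic dataset, so the numbers it prints say nothing about the 77% above. The value max_depth=3 is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 4 features, like the article's X.
syn_X, syn_y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
syn_X_train, syn_X_test, syn_y_train, syn_y_test = train_test_split(
    syn_X, syn_y, test_size=0.25, random_state=0
)

# Limit the depth of the tree; 3 is an arbitrary choice for illustration.
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_clf.fit(syn_X_train, syn_y_train)

pred = tree_clf.predict(syn_X_test)
acc_score = tree_clf.score(syn_X_test, syn_y_test)  # accuracy via score()
acc_manual = accuracy_score(syn_y_test, pred)       # accuracy via predictions

print(acc_score, acc_manual)
```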
To build an SVM classifier, the classes provided by scikit-learn include SVC, NuSVC, and LinearSVC. We will build a classifier using SVC class and linear kernel. (To know the difference between SVC with linear kernel and LinearSVC you can go to the link — https://stackoverflow.com/questions/45384185/what-is-the-difference-between-linearsvc-and-svckernel-linear/45390526)
svc = svm.SVC(kernel='linear', C=1)
scores = model_selection.cross_val_score(svc, X, y, cv=10)
print(scores)
print(scores.mean())
In this section, I want to show you a different approach for evaluating a classifier. The svc classifier object is created using the SVC class, and instead of fitting it once on the training set, we pass it to cross_val_score() together with the full X and y. This function evaluates the score using cross-validation, which gives a more reliable estimate of how the model performs on unseen data than a single train/test split and so helps detect overfitting. In k-fold cross-validation, the data is split into k folds; in each round, k-1 folds are used for training and the remaining fold for testing. The mean score obtained here is around 79.5%.
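To make the k-fold idea concrete, here is a sketch that runs the folds by hand and compares the result with cross_val_score(). It relies on the fact that, for a classifier and an integer cv, scikit-learn uses stratified folds without shuffling. The data is synthetic and the variable names are made up for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data with 4 features.
cv_X, cv_y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
model = SVC(kernel='linear', C=1)

# Manual k-fold: train on k-1 folds, score on the held-out fold, repeat k times.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(cv_X, cv_y):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(cv_X[train_idx], cv_y[train_idx])
    manual_scores.append(fold_model.score(cv_X[test_idx], cv_y[test_idx]))

# The same evaluation in one call.
library_scores = cross_val_score(model, cv_X, cv_y, cv=5)

print(np.array(manual_scores))
print(library_scores)
```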
Similar to the Decision Tree Classifier, we can also create a Logistic Regression classifier using LogisticRegression(). The classifier is fitted on the training set and used to predict target values for the test set. It is then evaluated with 10-fold cross-validation on the full data, which gives a mean score of 80.5%.
clf = linear_model.LogisticRegression(C=1e5)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
scores = model_selection.cross_val_score(clf, X, y, cv=10)
print(scores)
print(scores.mean())
Thus, if we want to build a single classifier, we can do it in just 10 lines of code 😄. And with very little effort, we achieved an accuracy of around 80%. You can create your own classification models (there are plenty of options) or fine-tune any of these. Also, if you are interested, you can give Artificial Neural Networks a shot as well 😍. For me, the best accuracy I got was 84% with ANNs. To get the entire code, please use this link
If you liked the article do show some ❤ Stay tuned for more! Till then happy learning 😸