Machine learning has many promising applications in healthcare, and it can make patient diagnosis easier and more accurate, provided the data set is large enough and meaningfully related to the problem.
The problem this article covers is a classification problem: deciding whether a person is diabetic based on attributes such as insulin, blood pressure, skin thickness, BMI, age, glucose, number of pregnancies, and diabetes pedigree function. The output is whether or not the person has diabetes.
We will use a K-Nearest Neighbours (KNN) classifier and Logistic Regression, compare the accuracy of the two methods, and see which one better fits the requirements of the problem. But first, let's explain what the K-Nearest Neighbours classifier and Logistic Regression are.
K-Nearest Neighbours is a distance-based algorithm, which means it uses the distance between data points when learning from a data set. It tries to decide which class each data point belongs to. Say we have a finite number of data points on a graph, and five of them lie close to each other. That closeness implies they have a lot in common, so it is reasonable to assume they belong to the same class. This is the idea behind K-Nearest Neighbours: a new point is assigned to the class of the points most similar to it. Unlike clustering, the training points already carry class labels, and the algorithm uses those labels to classify new points.
You may wonder what K is. K is a parameter that we choose, and its best value depends on the problem; values between 5 and 10 are commonly used. K is the number of neighbouring data points we take into consideration when classifying a data point. For example, if you choose K to be 3, the point you are trying to classify looks at its 3 nearest points and the classes they belong to, and it is assigned the class that most of those neighbours share.
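
To make the role of K concrete, here is a minimal from-scratch sketch of the idea, assuming Euclidean distance and a simple majority vote among the K nearest labelled points. The function names and the tiny (glucose, BMI) data set are hypothetical and only serve as an illustration; they are not taken from the article's notebook.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training indices by distance to the query point.
    order = sorted(
        range(len(train_points)),
        key=lambda i: euclidean(train_points[i], query),
    )
    # Keep the labels of the k closest points and pick the most frequent one.
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: (glucose, BMI) pairs; label 1 = diabetic, 0 = not.
points = [(85, 22.0), (90, 24.5), (95, 23.0), (160, 33.0), (170, 35.5), (155, 31.0)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (150, 30.0), k=3))  # three nearest are all class 1
print(knn_predict(points, labels, (92, 23.5), k=3))   # three nearest are all class 0
```

An odd K is a common choice for two-class problems because it avoids tied votes. Note that glucose dominates the distance here because its values are much larger than BMI, which is one reason features are usually scaled before using a distance-based classifier.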
Regression sounds like a whole different problem: it suggests a continuous output rather than a discrete one. That is not the case here. Logistic Regression is a classification algorithm. So why does its name contain the word regression?
Because the algorithm is based on the logistic function, better known as the Sigmoid function, so the name is largely a naming convention. To have a good grasp of Logistic Regression, one should first understand the Sigmoid function. It is a function whose output always lies between 0 and 1, which is why it is widely used in models that predict probabilities. When we classify, for example, whether a person is diabetic or not, Logistic Regression uses the Sigmoid function to output the probability that a case (the input data) belongs to a certain class. For example, if the output for a given input is less than 0.5, the person is classified as not diabetic; otherwise, the person is classified as diabetic.
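
The Sigmoid function and the 0.5 cut-off can be sketched in a few lines. This is only an illustration: the weights and bias below are hypothetical values picked by hand, whereas a real Logistic Regression model learns them from the training data.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    # Fine for moderate z; very large negative z would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class: sigmoid of a weighted sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(probability, threshold=0.5):
    """Return 1 (diabetic) if the probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0


# Hypothetical, hand-picked parameters for (glucose, BMI); not learned values.
weights = [0.05, 0.1]
bias = -9.0

for person in [(150, 30.0), (90, 22.0)]:
    p = predict_proba(person, weights, bias)
    print(person, round(p, 3), classify(p))
```

With these made-up parameters, the first person gets a probability above 0.5 and is classified as diabetic, while the second falls below 0.5 and is classified as not diabetic.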
First, we have to do some feature engineering on the data set. KNN is a distance-based algorithm that uses a distance function such as Manhattan or Euclidean to calculate the distance between two points. Our features have very different ranges, so we normalize the data set to keep the attribute values in a small, comparable range; otherwise, the features with the largest values would dominate the distance calculation.
After that, we search for missing values. These may appear as 0 or NaN, and sometimes an empty value is denoted as '?'. Be careful: some attributes can legitimately be 0, such as the number of brothers or sisters. But where 0 does not make sense, as with blood pressure in our case, we have to do what is called imputation, where we substitute the missing values with relevant ones. Common strategies are using the mean or mode of the column that contains the missing values, or using regression to predict the missing value.
Then we use Sklearn to train the KNN and Logistic Regression classifiers on the data, and finally we present our results using evaluation metrics. And that's it.
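
The steps above can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions, not the notebook's actual code: the data is randomly generated toy data with three hypothetical columns, zeros are treated as missing only in the glucose and blood pressure columns, missing values are filled with the column mean, and `MinMaxScaler` stands in for the normalization step. The sketch imputes before scaling so that placeholder zeros do not distort the feature ranges.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def zeros_to_nan(X, cols):
    """Return a float copy of X where zeros in the given columns become NaN."""
    X = X.astype(float)
    for c in cols:
        X[X[:, c] == 0, c] = np.nan
    return X


# Hypothetical toy data, NOT the article's data set.
# Columns: [pregnancies, glucose, blood_pressure]; label 1 = diabetic.
rng = np.random.default_rng(0)
n = 200
pregnancies = rng.integers(0, 10, size=n).astype(float)
glucose = rng.normal(120, 30, size=n)
blood_pressure = rng.normal(70, 12, size=n)
y = (glucose + rng.normal(0, 10, size=n) > 125).astype(int)
X = np.column_stack([pregnancies, glucose, blood_pressure])
# Simulate missing blood pressure readings recorded as 0.
X[rng.choice(n, size=20, replace=False), 2] = 0.0

# A zero is a valid pregnancy count, so only columns 1 and 2 are cleaned.
X_clean = zeros_to_nan(X, cols=[1, 2])

X_train, X_test, y_train, y_test = train_test_split(
    X_clean, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logreg": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    # Impute, scale to [0, 1], then classify; all steps are fit on training data only.
    pipe = make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler(), model)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion": confusion_matrix(y_test, y_pred),
    }
    print(name, results[name]["accuracy"])
    print(results[name]["confusion"])
```

Wrapping the preprocessing and the classifier in one pipeline means the imputation mean and the scaling range are computed from the training split only, so nothing from the test split leaks into training. Accuracy on this toy data says nothing about the real data set.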
The full notebook can be found here: