Hope you are all doing well in this lockdown. The title might leave you wondering which note you should remember. I recently started exploring the data field (ML, DS, DL, etc.), and it is pretty cool when your code outputs a prediction about the future.
At first we think it's all about learning a number of algorithms, then two to four Python libraries for cleaning the data, and then it's done!
But the most important thing, and what I am going to discuss here, is the backbone of this field: "Data".
The algorithms fall into three broad categories:
- supervised (you know the past relation, i.e. the data is labeled)
- unsupervised (no past relation is known to you; you form groups out of the data yourself)
- reinforcement (you get rewarded for success and penalized for failure)
After getting familiar with these, you try to learn how to apply them to data to predict future outcomes.
Basically, we have two types of data:
- structured data
- unstructured data
Structured data, as I use the term here, means data with no cleaning part left to do (the terms you have heard, like visualization and wrangling). You just import it, run train_test_split, and fit the model.
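To make the "import, split, fit" flow concrete, here is a minimal sketch. The dataset, the column names and the choice of LogisticRegression are all assumptions made up for illustration; any scikit-learn model would follow the same pattern.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical, already-clean data: two numeric features and a binary label.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "feature_1": rng.normal(size=200),
    "feature_2": rng.normal(size=200),
})
data["label"] = (data["feature_1"] + data["feature_2"] > 0).astype(int)

X = data[["feature_1", "feature_2"]]
y = data["label"]

# Hold out 25% of the rows to check the model on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Mean accuracy on the held-out rows.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

With real data, the only part that changes is how `data` is loaded, for example with `pd.read_csv`.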
Now let's get our hands dirty with unstructured data, because what I learned in these months is that we will always face unstructured data.
I am going to use the following libraries for this purpose.
Step 1: importing the libraries:
- NumPy -
import numpy as np  # for data preprocessing
- pandas -
import pandas as pd  # for data cleaning
- matplotlib -
import matplotlib.pyplot as plt  # for data visualization
- seaborn -
import seaborn as sns  # for data visualization
Matplotlib is a Python library for creating 2D graphs and plots from Python scripts. But I think that if you're handling a larger dataset with a lot of non-linearity, seaborn should be your main weapon.
Step 2: where and how to use the different seaborn plots:
| plot name | where to use | how to use |
|---|---|---|
| heatmap | to get an overall picture of the data and the relations within it (data must be a 2D numeric table, such as a correlation matrix) | sns.heatmap(data) |
| barplot | when comparing a numeric value between categories | sns.barplot(x=value1, y=value2) |
| countplot | like barplot, but shows how often each label occurs | sns.countplot(x=value, data=data) |
| distplot | to see the distribution of the data | sns.distplot(data) (deprecated since seaborn 0.11; use sns.histplot(data) or sns.displot(data)) |
| boxplot | shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable | sns.boxplot(data=data) |
Once you have fully visualized the data and understand the relations between the different parameters, you're ready to clean your data.
However, we run into these problems while handling unstructured data:
- categorical columns
- null values (NaN)
- biased columns
- no values (blank)
- outliers
Starting from the end: outliers are like the cardamom in your biryani. They are the ones that will lower the accuracy of your model.
How do we solve this?
1) Univariate method: looks for data points with extreme values on a single variable.
2) Multivariate method: looks for unusual combinations of values across all the variables.
3) Minkowski error: reduces the contribution of potential outliers in the training process, by raising each error to a power smaller than the 2 used in the mean squared error.
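As one possible take on the univariate method, here is a small sketch using the interquartile range (IQR) rule. The `ages` values are made up, and the 1.5 multiplier is the common convention rather than anything fixed.

```python
import pandas as pd

# Hypothetical column of ages; 120 and 3 are planted as extreme values.
ages = pd.Series([22, 25, 27, 24, 26, 23, 120, 28, 25, 3])

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Anything further than 1.5 * IQR outside the quartiles is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (ages < lower) | (ages > upper)
outliers = ages[is_outlier]
cleaned = ages[~is_outlier]

print(sorted(outliers.tolist()))  # [3, 120]
```

Whether you drop the flagged rows or cap them at `lower` and `upper` depends on the problem; this sketch only shows how to find them.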
Blank values, i.e. missing values: sometimes data is missing in some columns, but the output depends on that data, so you have to fill the gaps sensibly, sometimes with the most frequent value (the mode) and sometimes with the average of the column.
data.fillna(value) - when you fill with a fixed value
data.bfill() / data.ffill() - backward/forward filling (older code wrote data.fillna(method='bfill') or data.fillna(method='ffill'), which newer pandas versions deprecate)
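A minimal sketch of filling with the most frequent value and with the average. The DataFrame and its `city` and `salary` columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both a categorical and a numeric column.
data = pd.DataFrame({
    "city": ["Pune", "Delhi", None, "Pune", "Pune"],
    "salary": [30.0, np.nan, 50.0, 40.0, np.nan],
})

# Categorical column: fill with the most frequent value (the mode).
data["city"] = data["city"].fillna(data["city"].mode()[0])

# Numeric column: fill with the column average (computed from the known values).
data["salary"] = data["salary"].fillna(data["salary"].mean())

print(data)
```

If the column has outliers, the median (`data["salary"].median()`) is often a safer fill value than the mean.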
Now, what is a biased column? Suppose your data has a gender column that is required for prediction, but the male:female ratio is 95:5. This is called a biased column. Try to keep the values reasonably balanced, or else the model will predict according to a single value.
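A quick way to spot such a column is to look at the share of each value. This is only a sketch: the `gender` series reproduces the 95:5 example, and the 0.9 cut-off is an arbitrary threshold I picked for illustration.

```python
import pandas as pd

# Hypothetical column with the 95:5 split described above.
gender = pd.Series(["male"] * 95 + ["female"] * 5, name="gender")

# Share of each value in the column (fractions that sum to 1).
shares = gender.value_counts(normalize=True)

# Flag the column if one value dominates; 0.9 is an arbitrary example threshold.
is_biased = shares.max() > 0.9

print(shares)
print("biased:", is_biased)
```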
The traditional way to deal with null values is to drop them with data.dropna(). But if the column is required for your prediction, instead of dropping it try to fill the gaps with another value, as described in the missing values case above.
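Before dropping anything, it helps to count how much is actually missing. A small sketch with invented column names (`age`, `notes`, `target`):

```python
import numpy as np
import pandas as pd

# Hypothetical data: "notes" is mostly empty, "age" has a single gap.
data = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "notes": [None, None, "vip", None],
    "target": [1, 0, 1, 0],
})

# Number of missing values in each column.
missing_per_column = data.isna().sum()

# Drop a column that is mostly empty and not needed for the prediction.
data = data.drop(columns=["notes"])

# Drop only the rows where a required column is missing.
cleaned = data.dropna(subset=["age"])

print(missing_per_column)
print(cleaned)
```

Calling `data.dropna()` with no arguments would have removed every row that has any gap at all, which here would leave a single row.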
Last but not least, how to handle categorical columns:
a) An easy and fast way to handle categorical column values (PS: not useful when there are many categories).
b) When the categorical variables are ordinal (the labels have a natural order), the easiest approach is to replace each label with a number (not useful for nominal data).
c) One-hot encoding: suitable for a small number of categories; it converts each category into its own column of 1s and 0s.
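from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# categorical_columns is a placeholder: the list of column indices to encode, e.g. [0]
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), categorical_columns)], remainder='passthrough', sparse_threshold=0)
data = np.array(columnTransformer.fit_transform(data), dtype=str)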
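To see (b) and (c) on actual values, here is a small sketch. The DataFrame, its columns and the `size_order` mapping are all made up; `sparse_threshold=0` is set so the transformer always returns a plain dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "size" is ordinal, "city" is nominal.
data = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "price": [10.0, 30.0, 20.0, 12.0],
})

# (b) Ordinal column: replace each label with a number that keeps the order.
size_order = {"small": 0, "medium": 1, "large": 2}
data["size"] = data["size"].map(size_order)

# (c) Nominal column: one-hot encode it and pass the other columns through unchanged.
transformer = ColumnTransformer(
    [("encoder", OneHotEncoder(), ["city"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = transformer.fit_transform(data)

# Columns are: Delhi, Mumbai, Pune (alphabetical), then size, then price.
print(encoded)
```

The one-hot columns come first in the output, followed by the passed-through columns in their original order.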
LabelEncoder: the most useful tool for converting any number of categories into distinct numerical values. Keep in mind that the numbers follow the sorted order of the labels, not any real ranking.
from sklearn.preprocessing import LabelEncoder
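A minimal sketch of how LabelEncoder behaves; the `cities` list is a made-up example.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical values.
cities = ["Pune", "Delhi", "Pune", "Mumbai"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ holds the original labels in sorted order; a label's position is its code.
labels = list(encoder.classes_)

# inverse_transform maps the codes back to the original labels.
restored = list(encoder.inverse_transform(codes))

print(codes)   # [2 0 2 1]
print(labels)
```

Keeping the fitted `encoder` around lets you decode the model's predictions later with `inverse_transform`.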
That's all. I hope this helps you a lot with data preprocessing, which in ML terms we call feature engineering.
For examples, you can check my GitHub repo: https://github.com/Ashishkumarpanda
I'm just a beginner, so do comment with any other methods if I missed something. Thank you :)