
A NOTE TO REMEMBER
Spectrum Club


Ashish kumar panda ・4 min read

Hello Everyone

Hope you are all doing well in this lockdown. The title might leave you wondering which note you should remember. I recently started exploring the data field (ML, DS, DL, etc.), and it is pretty cool when your code outputs a prediction about the future.

So what is the first thing that comes to mind when we hear about this field?


We think it's all about learning a number of algorithms, then two to four Python libraries for cleaning the data, and then it's done!

But the most important thing, and what I am going to discuss here, is the backbone of this field, i.e. "Data".


The algorithms fall into three categories:

  • supervised (you know the past relation, i.e. you have labeled data)
  • unsupervised (no past relation is known to you, so you form groups out of the data)
  • reinforcement (you get rewarded for success and penalized for failure)

After getting familiar with these, you try to learn how to apply them to data to predict future outcomes.
Basically, we have two types of data:

  • structured data
  • unstructured data

Structured data, as I use the term here, means there is no data cleaning part (the terms you have heard, like visualization and wrangling): just import it, then train_test_split and fit the model.
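As a rough sketch of that "import, split, fit" flow, assuming scikit-learn is installed: the data below is synthetic and the choice of LinearRegression is only for illustration, neither comes from a real project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical, already-clean data: y is roughly 3*x + 2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.5, size=100)

# Hold out 20% of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

# R^2 on the held-out rows.
test_score = model.score(X_test, y_test)
print(f"R^2 on test data: {test_score:.3f}")
```

With real data, X and y would come from your own DataFrame instead of the random generator.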


Now let's get our hands dirty with unstructured data, because what I learned in these months is that we will almost always face unstructured data.

I am going to use the following libraries for this purpose:
Step 1: importing the libraries:

  • NumPy - import numpy as np (for data preprocessing)
  • pandas - import pandas as pd (for data cleaning)
  • matplotlib - import matplotlib.pyplot as plt (for data visualization)
  • seaborn - import seaborn as sns (for data visualization)

Matplotlib is a Python library for creating 2D graphs and plots from Python scripts. But I think that if you're handling a larger dataset with a lot of non-linearity, seaborn should be your main weapon.

Step 2: where and how to use the different seaborn plots:

| plot name | where to use | how to use |
|---|---|---|
| heatmap | to get an overall picture of the data and the relations within it (often applied to data.corr()) | sns.heatmap(data) |
| barplot | when comparing a value across categories | sns.barplot(x=value1, y=value2) |
| countplot | like barplot, but shows how often each label occurs | sns.countplot(x=value, data=data) |
| distplot | to see the distribution of the data (distplot is deprecated since seaborn 0.11; histplot replaces it) | sns.histplot(data) |
| boxplot | shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable | sns.boxplot(data=data) |

Once you have visualized the data and understood the relations between the different parameters, you're ready to clean it.

These are the problems we found while handling unstructured data:

  • categorical columns
  • null values (NaN)
  • biased columns
  • no values (blank)
  • outliers

  • Starting from the end: outliers are like the cardamom in your biryani. They're the ones that will lower the accuracy of your model.
    How to solve this?

    1) Univariate method: looks for data points with extreme values on a single variable.
    2) Multivariate method: looks for unusual combinations of values across all the variables.
    3) Minkowski error: reduces the contribution of potential outliers during training.
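One common way to apply the univariate method is the 1.5 × IQR rule. The rule and the sample numbers below are my assumption for illustration; other cutoffs (for example z-scores) work too.

```python
import pandas as pd

# Hypothetical single column with one obviously extreme value.
values = pd.Series([12, 14, 13, 15, 14, 13, 98, 12, 15, 14])

# Interquartile range: the spread of the middle 50% of the data.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything further than 1.5 * IQR outside the quartiles is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

outliers = values[is_outlier]
cleaned = values[~is_outlier]
print(outliers.tolist())
```

Whether you drop the flagged rows or cap them depends on the data, so treat the flag as a hint and not a verdict.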

  • Blank values, i.e. missing values: sometimes data is missing in some columns but the output depends on it, so you have to fill the gaps, sometimes with the most frequent value in the column and sometimes with the average.
    data.fillna(value) - fill with a fixed value
    data.bfill() / data.ffill() - backward/forward filling (older pandas: data.fillna(method='bfill') / data.fillna(method='ffill'))
    data.fillna(data.mean(numeric_only=True)) - fill with the column average
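A small sketch of these filling strategies on a toy DataFrame. The columns (age, city) and their values are invented for the example, and it uses the ffill()/bfill() methods, which recent pandas prefers over fillna(method=...).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing number and one missing label.
data = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Fixed value per column.
fixed = data.fillna({"age": 0, "city": "unknown"})

# Forward fill copies the previous row's value; backward fill copies the next one.
forward = data.ffill()
backward = data.bfill()

# Average for a numeric column.
mean_filled = data.fillna({"age": data["age"].mean()})

# Most frequent value for a categorical column.
mode_filled = data.fillna({"city": data["city"].mode()[0]})
```

Each call returns a new DataFrame, so the original data keeps its gaps until you reassign it.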

  • Now, what is a biased column? Suppose your data has a gender column that is required for prediction, but the male:female ratio is 95:5. This is called a biased column, so try to keep the values balanced, otherwise the model will tend to predict according to the dominant value.
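A quick way to spot such a column is to look at the share of each value. The sketch below rebuilds the 95:5 example; the 0.9 threshold is an arbitrary cutoff I picked for illustration.

```python
import pandas as pd

# The 95:5 gender example from above, as a hypothetical column.
gender = pd.Series(["male"] * 95 + ["female"] * 5, name="gender")

# Share of each value, as fractions that sum to 1.
shares = gender.value_counts(normalize=True)
print(shares)

# Flag the column when one value dominates (threshold is an assumption).
threshold = 0.9
is_biased = bool(shares.max() > threshold)
print("biased column:", is_biased)
```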

  • The traditional way to deal with null values is to drop them with
    data.dropna(), but if the column is required for your prediction, instead of dropping the rows try filling the gaps with another value, as described in the blank values case.

  • Last but not least, how to handle categorical columns:

  • a) Creating dummies:
    An easy and fast way to handle categorical column values (PS: not useful when there are many categories).
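In pandas, dummies are usually created with pd.get_dummies. A minimal sketch, with a made-up city/price table:

```python
import pandas as pd

# Hypothetical data: one categorical column, one numeric column.
data = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune"],
    "price": [10, 20, 15],
})

# Each distinct city becomes its own 0/1 column; other columns are kept as they are.
dummies = pd.get_dummies(data, columns=["city"], dtype=int)
print(dummies)
```

Every distinct category adds a column, which is why this gets unwieldy when a column has many categories.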

  • b) Label replacement: when the categorical variables are ordinal (the labels have a natural order), the easiest approach is to replace each label with a number (not useful for nominal variables).
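For ordinal labels, a plain dictionary plus Series.map is often enough. The size column and the chosen order below are assumptions for the example:

```python
import pandas as pd

# Hypothetical ordinal column: small < medium < large.
sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# The mapping encodes the order explicitly, so the numbers carry meaning.
size_order = {"small": 0, "medium": 1, "large": 2}
sizes["size_encoded"] = sizes["size"].map(size_order)
print(sizes)
```

Labels missing from the dictionary come out as NaN, which makes typos in the data easy to notice.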

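  • c) One-hot encoding: suitable when there are only a few categories; it converts each category into a column of 1s and 0s.
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    # categorical_columns: list of the indices (or names) of the columns to encode
    # sparse_threshold=0 keeps the output dense so np.array() works on it
    columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), categorical_columns)], remainder='passthrough', sparse_threshold=0)
    data = np.array(columnTransformer.fit_transform(data), dtype=str)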

  • d) LabelEncoder: useful for converting any number of categories into different numerical values
    from sklearn.preprocessing import LabelEncoder
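A short usage sketch for LabelEncoder, with a made-up list of colors. Note that scikit-learn documents LabelEncoder as intended for target labels, and the integers follow the sorted order of the labels, so they do not carry any real ranking.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical values.
colors = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)

# classes_ holds the labels in sorted order: blue=0, green=1, red=2.
print(list(encoder.classes_))
print(encoded.tolist())

# The mapping can be reversed when you need the original labels back.
decoded = encoder.inverse_transform(encoded)
```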

That's all. Hope this helps you a lot with data preprocessing, which in ML terms we call feature engineering.

For examples, you can check my GitHub repo:

I'm just a beginner, so do comment with any other methods if I missed something. Thank you :)

