daud99

Posted on Apr 26, 2022 • Edited on Jan 28, 2024

Cleaning Data for Model Training

#machinelearning #datascience #python #preprocessing

Dimensionality Reduction

Before moving on to dimensionality reduction, our first question should be: what exactly is dimensionality?

The dimensionality of a dataset is simply the number of features it contains, that is, every column except the Label/Target column. For instance, our dataset has a total of 84 columns, so its dimensionality is 84 - 1 = 83, where the 1 accounts for the Label column.
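To make the definition concrete, here is a minimal sketch on a toy DataFrame. The frame and its values are made up for illustration; only the idea of counting every column except the label comes from the text above.

```python
import pandas as pd

# Toy frame standing in for the real dataset: three features plus a label column.
toy_df = pd.DataFrame({
    "Flow Duration": [10, 20, 30],
    "Total Fwd Packets": [1, 2, 3],
    "Flow Bytes/s": [0.5, 1.5, 2.5],
    " Label": ["BENIGN", "DDoS", "BENIGN"],
})

label_col = " Label"

# Dimensionality = number of columns that remain once the label is set aside.
dimensionality = toy_df.drop(columns=[label_col]).shape[1]
print(dimensionality)
```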

Why do we need Dimensionality Reduction?

More features give the model more to learn from, but they also mean longer training times. So we should keep the features that contribute positively to the learning of our machine learning model. It is also our responsibility to remove noisy features that negatively impact the model's learning.

Removing columns/features with Zero Standard Deviation

We looked into which columns have zero standard deviation (STD) in the previous blog.

stats = merge_df.describe()
std = stats.loc['std']
features_no_variance = std[std == 0.0].index
print(features_no_variance)

print("Datashape before removal")
print(merge_df.shape)
merge_df = merge_df.drop(columns=features_no_variance)
print("Datashape after removal")
print(merge_df.shape)
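The same idea can be tried on a small, self-contained example. This is only a sketch on made-up data: the column names mimic the dataset, and `toy_df`, `zero_std_cols` and `reduced_df` are hypothetical names used for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Bwd PSH Flags": [0, 0, 0, 0],       # constant column: standard deviation is 0
    "Flow Duration": [10, 25, 3, 40],    # varies: standard deviation is > 0
    " Label": ["BENIGN", "DDoS", "BENIGN", "DDoS"],
})

# Standard deviation of the numeric columns only (the string label is skipped).
std = toy_df.std(numeric_only=True)

# Columns whose values never change carry no information for the model.
zero_std_cols = std[std == 0.0].index.tolist()
reduced_df = toy_df.drop(columns=zero_std_cols)

print(zero_std_cols)
print(reduced_df.shape)
```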

Removing unwanted string features

Most machine learning models don't take string features directly. You need to map them to numbers through variable encoding, or you can use NLP techniques. However, that is outside the scope right now.

We know from the previous blog that we have the following string columns/features.
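For reference, here is a rough sketch of what variable encoding can look like in pandas. The `Protocol` column is a hypothetical example and is not part of the dataset's string columns; `pd.factorize` and `pd.get_dummies` are two common options, not necessarily what you would pick for a real model.

```python
import pandas as pd

# Hypothetical string feature used only for illustration.
toy_df = pd.DataFrame({"Protocol": ["TCP", "UDP", "TCP", "ICMP"]})

# Option 1: integer codes, assigned in order of first appearance.
codes, categories = pd.factorize(toy_df["Protocol"])
toy_df["Protocol_code"] = codes

# Option 2: one-hot encoding, one indicator column per category.
one_hot = pd.get_dummies(toy_df["Protocol"])

print(toy_df)
print(one_hot.columns.tolist())
```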

The Categorical column are below
[' Timestamp', ' Destination IP', 'Flow ID', ' Label', ' Source IP']

Other than the Label column, none of these columns are needed for our machine learning model. The Flow ID and Timestamp are unique for each record, and a feature with a different value in every record doesn't contribute much to the model's learning. Like a zero-STD column, it gives the model nothing to generalize from, except that here the column is a string rather than a numerical feature (the .describe method skips string features by default, which is why they don't show up there). We also cannot have a useful numerical mapping for Source IP and Destination IP, because we do not want our model to predict on the basis of IP addresses. A malicious host may have one IP in one local network and a different one in another, and the IP assignment within a network may change over time. We want our prediction to be independent of the IP, so we will remove these features/columns as well.
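One way to spot identifier-like columns is to compare the number of distinct values with the number of rows. The sketch below does this on made-up data; the values and the `unique_ratio` threshold of 1.0 are assumptions for illustration, not something measured on the real dataset.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Flow ID": ["a-1", "a-2", "a-3", "a-4"],
    " Timestamp": ["08:55:58", "08:55:59", "08:56:00", "08:56:01"],
    " Source IP": ["192.168.10.5", "192.168.10.5", "192.168.10.8", "192.168.10.8"],
    "Flow Duration": [10, 20, 30, 40],
    " Label": ["BENIGN", "DDoS", "BENIGN", "DDoS"],
})

# Look only at the non-numeric columns.
non_numeric = toy_df.select_dtypes(exclude="number")

# Share of distinct values per column; 1.0 means every row has its own value.
unique_ratio = non_numeric.nunique() / len(toy_df)
id_like = unique_ratio[unique_ratio == 1.0].index.tolist()

print(id_like)
```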

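# stats is the describe() output computed above; it covers numeric columns only,
# so the set difference leaves the string columns.
str_cols = list(set(merge_df.columns) - set(stats.columns))
str_cols.remove(' Label')
print(str_cols)

The above snippet gives us the list of features we need to drop.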

merge_df = merge_df.drop(columns=str_cols)
print("Datashape after removal")
print(merge_df.shape)

Datashape after removal (2830434, 73)

So, we end up with 73 columns at this point: 72 features plus the Label column.

Removing Null and Infinite values

As mentioned before, rows/records/instances with Null or infinite values also do not contribute to the machine learning model. They are one of the sources of noise in the model, so let's deal with them.

Removing Infinite Values

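Checking which columns contain non-finite values.

import numpy as np

# Keep the label aside and cast every feature column to float64
labl = merge_df[' Label']
dataset = merge_df.loc[:, merge_df.columns != ' Label'].astype('float64')
# A column is listed if at least one of its values is NaN or +/- infinity
nonfinite = [col for col in dataset if not np.all(np.isfinite(dataset[col]))]
print(nonfinite)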

Checking how many non-finite values each column contains.

finite = np.isfinite(merge_df['Flow Bytes/s']).sum()
print("Flow Bytes/s: "+str(merge_df.shape[0] - finite))
finite = np.isfinite(merge_df[' Flow Packets/s']).sum()
print("Flow Packets/s: "+str(merge_df.shape[0] - finite))

There are multiple ways of dealing with infinite values. For instance, we may want to replace them with the mean of each column. Here, since they are very few in number, we can safely remove them from the dataset. We first replace the infinite values with NaN so that they can be handled together with the Null values.

merge_df = merge_df.replace([np.inf, -np.inf], np.nan)
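The text mentions replacing infinite values with the column mean as an alternative to removing them. A minimal sketch of that alternative on a toy column is shown here; it is not what this post does, and the values are made up.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"Flow Bytes/s": [1.0, np.inf, 3.0, -np.inf]})

# Turn infinities into NaN first, so they are ignored when the mean is computed.
cleaned = toy_df.replace([np.inf, -np.inf], np.nan)

# Fill each missing value with the mean of the finite values in its column.
imputed = cleaned.fillna(cleaned.mean())

print(imputed)
```

Mean imputation keeps every row, but it also invents values, so dropping the rows is the simpler choice when only a handful are affected.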

Removing Null Values

Checking if there are any Null values.

# merge_df now holds NaN in place of the infinite values
merge_df.isnull().values.any()

Checking which columns contain Null values.

[col for col in merge_df if merge_df[col].isnull().values.any()]

Checking how many Null values a column contains.

merge_df['Flow Bytes/s'].isnull().sum()

We can safely remove all rows with Null values without spoiling the data, as they make up less than 0.1% of the entire dataset.

before = merge_df.shape
merge_df.dropna(inplace=True)
after = merge_df.shape
print("No of removed rows: "+str(before[0] - after[0]))
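Before dropping rows, it can help to confirm that the share of affected rows really is small. The sketch below does this on a synthetic column; the 0.1% threshold comes from the text, while the data and the names `null_pct` and `cleaned` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic column with 2000 rows, one of which is missing.
values = np.arange(2000, dtype="float64")
values[7] = np.nan
toy_df = pd.DataFrame({"Flow Bytes/s": values})

# Count rows that contain at least one Null value and express it as a percentage.
null_rows = int(toy_df.isnull().any(axis=1).sum())
null_pct = 100 * null_rows / len(toy_df)
print("Rows with Null values: " + str(null_rows) + " (" + str(null_pct) + "%)")

# Only drop when the loss stays under the 0.1% threshold.
if null_pct < 0.1:
    cleaned = toy_df.dropna()
else:
    cleaned = toy_df
print(cleaned.shape)
```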

Label Encoding

Finally, we are going to encode our Label, i.e. each label will be represented by a unique number. As discussed before, machine learning models don't work well with string/categorical features because training involves a lot of numerical computation. That's why we are going to do Label Encoding.

from sklearn.preprocessing import LabelEncoder

# Separating labels from features
labels = merge_df[' Label']
print(labels)

# Fitting the encoder and previewing the integer codes
LE = LabelEncoder()
LE.fit(labels)
encodedLabels = LE.transform(labels)
print(encodedLabels)

# Replacing string labels with the equivalent number labels
merge_df[' Label'] = LE.fit_transform(merge_df[' Label'])
merge_df.head()
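Once the labels are numbers, it is useful to know which number stands for which class. Here is a small sketch using scikit-learn's `LabelEncoder` on a made-up list of labels; `classes_` and `inverse_transform` are part of the encoder, while the label values and the `mapping` dictionary are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration.
labels = ["BENIGN", "DDoS", "PortScan", "BENIGN"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order; the position is the code.
mapping = {str(name): code for code, name in enumerate(encoder.classes_)}

# inverse_transform turns the codes back into the original strings.
decoded = encoder.inverse_transform(encoded)

print(mapping)
print(decoded)
```

Keeping the fitted encoder (or the mapping) around lets you translate model predictions back into readable labels later.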

Renaming Columns

# Removing whitespaces in column names.
col_names = [col.replace(' ', '') for col in merge_df.columns]
merge_df.columns = col_names
merge_df.head(2)
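Note that `replace(' ', '')` removes every space, including the ones inside a name, so `Flow Bytes/s` becomes `FlowBytes/s`. If you would rather keep the inner spaces and only trim the edges, `str.strip()` is an alternative. The sketch below compares the two on a few made-up column names.

```python
import pandas as pd

# Empty frame with column names that mimic the dataset's leading spaces.
toy_df = pd.DataFrame(columns=[" Flow Duration", "Flow Bytes/s", " Label"])

# Removes all spaces, including those inside the names.
no_spaces = [col.replace(" ", "") for col in toy_df.columns]

# Removes only leading and trailing whitespace.
trimmed = toy_df.columns.str.strip().tolist()

print(no_spaces)
print(trimmed)
```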

Saving Clean dataset for future Model Training

# Saving the cleaned dataset, including the encoded labels
merge_df.to_csv(dataset_path+'processed/encodedDataset.csv', index=False)

DEV Community