Before moving on to Dimensionality Reduction, our first question should be: what exactly is Dimensionality?
The Dimensionality of our dataset is simply the number of features in it, i.e. every column except the Label/Target column. For instance, we have a total of 84 columns in our dataset, so our dimensionality is 84-1=83. Here, the 1 represents the Label column.
More features give the model more to learn from, but they also mean more training time. So, we should pick the features that contribute positively to the learning of our machine learning model. Moreover, it is also our responsibility to remove noisy features that negatively impact the model's learning.
In the previous blog, we looked into which columns have a standard deviation (STD) of zero.
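As a small illustration of that count, here is a sketch on a made-up three-column frame. The name toy_df and its values are assumptions for demonstration only; they are not the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy_df = pd.DataFrame({
    " Flow Duration": [10, 20, 30],
    " Total Fwd Packets": [1, 2, 3],
    " Label": ["BENIGN", "DDoS", "BENIGN"],
})

label_col = " Label"
# Dimensionality = every column except the label/target column.
dimensionality = toy_df.drop(columns=[label_col]).shape[1]
print(dimensionality)
```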
# Find the numeric columns whose standard deviation is zero
stats = merge_df.describe()
std = stats.loc['std']
features_no_variance = std[std == 0.0].index
print(features_no_variance)

# Drop those columns and compare the shape before and after
print("Datashape before removal")
print(merge_df.shape)
merge_df = merge_df.drop(columns=features_no_variance)
print("Datashape after removal")
print(merge_df.shape)
Most machine learning models don't take string features directly. So, you need to map them to numbers with variable encoding, or you can use NLP techniques. However, that is outside the scope right now.
From the previous blog, we know that we have the following string (categorical) columns/features:
[' Timestamp', ' Destination IP', 'Flow ID', ' Label', ' Source IP']
Other than the Label column, none of these columns are needed for our machine learning model. The Flow ID and Timestamp will be unique for each record, and a feature with a unique value for each record doesn't contribute much to the learning of the model. Such identifier-like columns are as uninformative as zero-STD columns; the difference is that they are string columns, not numerical ones (the .describe method doesn't deal with string features by default, which is why they don't show up there). We also cannot have a useful numerical mapping for Source IP and Destination IP, because we do not want our model to predict on the basis of IP addresses. IPs can change in a network: we may have one malicious IP in one local network and a different one in another, and the IP assignment within a network may change as well. We want our prediction to be independent of the IP, so we will also remove these features/columns.
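One way to see how identifier-like a string column is, sketched on a made-up frame (toy_df and its values are assumptions, not the real data), is to compare the number of distinct values with the number of rows:

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
toy_df = pd.DataFrame({
    "Flow ID": ["a-1", "b-2", "c-3", "d-4"],
    " Source IP": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2"],
    " Label": ["BENIGN", "BENIGN", "DDoS", "DDoS"],
})

# Fraction of distinct values per column; 1.0 means every row is unique.
candidate_cols = ["Flow ID", " Source IP"]
unique_ratio = toy_df[candidate_cols].nunique() / len(toy_df)
print(unique_ratio)
```

A ratio close to 1.0 suggests the column behaves like a record identifier rather than a feature.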
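# discriptive_stat is assumed to be the merge_df.describe() output from the
# previous blog; it only has numeric columns, so the difference is the string columns
str_cols = list(set(merge_df.columns) - set(discriptive_stat.columns))
str_cols.remove(' Label')
print(str_cols)
The above snippet gives us the list of features we need to drop.
merge_df = merge_df.drop(columns=str_cols)
print("Datashape after removal")
print(merge_df.shape)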
Datashape after removal (2830434, 73)
So, we end up with 73 columns at this point (72 features plus the Label column).
As I mentioned before, rows/records/instances with Null or infinite values will not contribute to the machine learning model either. They are one of the sources of noise in the model. So, let's deal with them.
Checking which columns contain non-finite values.
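import numpy as np

# Split off the label and cast the features to float so np.isfinite works on every column
labl = merge_df[' Label']
dataset = merge_df.loc[:, merge_df.columns != ' Label'].astype('float64')
nonfinite = [col for col in dataset if not np.all(np.isfinite(dataset[col]))]
print(nonfinite)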
Checking how many non-finite values each column contains.
# shape[0] is the row count; subtracting the finite count gives the non-finite count
finite = np.isfinite(merge_df['Flow Bytes/s']).sum()
print("FlowBytes/s: " + str(merge_df.shape[0] - finite))
finite = np.isfinite(merge_df[' Flow Packets/s']).sum()
print("Flow Packets/s " + str(merge_df.shape[0] - finite))
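The same count can be taken for every column in one pass instead of column by column. The following is only a sketch on a made-up frame; toy_df and its values are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up values with a few NaN/inf entries sprinkled in.
toy_df = pd.DataFrame({
    "Flow Bytes/s": [1.0, np.inf, np.nan, 4.0],
    " Flow Packets/s": [0.5, 2.0, -np.inf, 1.0],
    " Flow Duration": [1.0, 2.0, 3.0, 4.0],
})

# ~np.isfinite flags NaN, +inf and -inf; summing down the rows counts them per column.
mask = ~np.isfinite(toy_df.to_numpy())
nonfinite_counts = pd.Series(mask.sum(axis=0), index=toy_df.columns)
print(nonfinite_counts[nonfinite_counts > 0])
```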
There are multiple ways of dealing with infinite values. For instance, we may want to replace them with the mean of each column. Here, we are simply going to remove them: since there are very few of them, we can safely drop them from the dataset. We start by replacing the infinite values with NaN values.
merge_df = merge_df.replace([np.inf, -np.inf], np.nan)
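For comparison, the mean-replacement alternative mentioned above could look roughly like this. It is a sketch on a made-up column (toy_df is an assumption), not the approach used in this blog.

```python
import numpy as np
import pandas as pd

# One made-up column with an infinite and a missing value.
toy_df = pd.DataFrame({"Flow Bytes/s": [1.0, np.inf, 3.0, np.nan]})

# Turn +/-inf into NaN first so they cannot distort the mean.
cleaned = toy_df.replace([np.inf, -np.inf], np.nan)
# mean() skips NaN by default, so this is the mean of the finite values only.
imputed = cleaned.fillna(cleaned.mean())
print(imputed)
```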
Checking whether there are any Null values.
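A yes/no check of this kind can be sketched as follows, on a made-up frame (toy_df is an assumption):

```python
import numpy as np
import pandas as pd

# Made-up frame with a single missing value.
toy_df = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

# isnull() gives a boolean frame; any() over its values answers "is anything missing?"
has_nulls = bool(toy_df.isnull().values.any())
print(has_nulls)
```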
Checking which columns contain Null values.
[col for col in dataset if dataset[col].isnull().values.any()]
Checking how many Null values each column contains.
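A per-column count could be obtained along these lines. This is a sketch on a made-up frame; toy_df and its column names are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up frame: column "a" has one missing value, "b" has two, "c" has none.
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 1.0],
    "c": [1.0, 2.0, 3.0],
})

# isnull().sum() counts missing values per column; keep only the affected columns.
null_counts = toy_df.isnull().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)
```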
We can safely remove all rows with Null values without spoiling the data, as they make up less than 0.1% of the entire dataset.
# shape[0] is the row count, so the difference is the number of dropped rows
before = dataset.shape[0]
dataset.dropna(inplace=True)
after = dataset.shape[0]
print("No of removed rows: " + str(before - after))
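The share of rows affected can be checked before dropping anything. The sketch below uses a made-up frame (toy_df is an assumption), so the resulting number is illustrative and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame: two of the four rows contain at least one missing value.
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
})

# any(axis=1) flags rows with at least one NaN; the mean of that mask is their share.
null_row_share = toy_df.isnull().any(axis=1).mean()
print(f"{null_row_share:.1%} of rows contain a Null value")
```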
Finally, we are going to encode our Label, i.e. each Label will be represented by a unique number. As discussed before, machine learning models don't work well with string/categorical features because they rely on numerical computation. That's why we are going to do Label Encoding.
from sklearn.preprocessing import LabelEncoder

# Separating labels from features
labels = merge_df[' Label']
print(labels)

# Fit the encoder and replace each label with an integer
LE = LabelEncoder()
LE.fit(labels)
encodedLabels = LE.transform(labels)
print(encodedLabels)

# Replacing string labels with their equivalent numeric labels
merge_df[' Label'] = LE.fit_transform(merge_df[' Label'])
merge_df.head()
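To see which integer stands for which label, and to get back from integers to names, something like the following may help. It is a sketch on made-up label values; toy_labels, encoder and mapping are illustrative names, not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up label values for illustration.
toy_labels = ["BENIGN", "DDoS", "PortScan", "BENIGN"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ is sorted, so each label's integer code is its position in that array.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integers back into the original strings.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Keeping the fitted encoder (or the mapping) around makes it possible to translate model predictions back into readable label names later.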
# Removing whitespace in column names.
col_names = [col.replace(' ', '') for col in merge_df.columns]
merge_df.columns = col_names
merge_df.head(2)
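Note that replace(' ', '') removes every space, including the ones inside a name. The short sketch below contrasts it with str.strip(), which only trims the ends; the two column names are just examples.

```python
# Two example column names: one with a leading space, both with inner spaces.
cols = [" Flow Duration", "Flow Bytes/s"]

# replace() removes all spaces, so inner spaces disappear as well.
no_spaces = [c.replace(" ", "") for c in cols]
# strip() only trims leading and trailing whitespace.
stripped = [c.strip() for c in cols]

print(no_spaces)
print(stripped)
```

Either choice works as long as later code refers to the columns by the same convention.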
# Now we are saving the dataset with the labels encoded
dataset.to_csv(dataset_path + 'processed/encodedDataset.csv', index=False)