DEV Community

Muhammad Usman

Train URL prediction

As more of daily life moves online, cyber attacks become a bigger threat, and robust protective measures matter more. One such measure is a machine learning model that predicts whether a URL is malicious or benign, and that is what this post builds.
The success of any machine learning model depends on the quality of the data it is trained on. To predict whether a URL is malicious or benign, we need a dataset that contains labeled examples of both kinds of URL.
I found a dataset that fits this need well.

Preprocessing

Preprocessing prepares the data for the model by transforming it into a suitable format. It can involve a range of techniques, such as handling missing values, scaling features, removing outliers, and encoding categorical variables. It matters because a model can only be as good as the data it learns from: removing noise and inconsistencies and standardizing the features makes the data compatible with the learning algorithm. Preprocessing can also reduce the computational cost of training and make the model less sensitive to irrelevant or redundant features. In this post, the preprocessing consists of renaming the labels, checking the class balance, and removing duplicate rows.
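For a URL dataset like this one, most of that cleanup comes down to a few pandas calls. The following is a minimal sketch on a made-up four-row frame (the rows and the names `raw` and `clean` are illustrative assumptions, not taken from the real dataset). It drops missing URLs, trims whitespace, and removes exact duplicates.

```python
import pandas as pd

# Toy data standing in for the real CSV (hypothetical rows)
raw = pd.DataFrame({
    "url": [" example.com/login ", "example.com/login", None, "test.org/a"],
    "label": ["bad", "bad", "good", "good"],
})

# Drop rows with a missing URL, then trim stray whitespace
clean = raw.dropna(subset=["url"]).copy()
clean["url"] = clean["url"].str.strip()

# Remove rows that are now exact duplicates
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean)
```

Trimming before deduplicating matters here: the first two rows only become identical once the surrounding spaces are gone.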
Imports

# Imports
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pickle

Reading the dataset

# Read the CSV file
df = pd.read_csv('data.csv',index_col=False)

# Print the first 5 rows of the data
df.head()

Columns

# Print the column labels of the data
df.columns

Unique labels

# Print all unique values in the 'label' column
print(df['label'].unique())

We want the labels to read malicious or benign rather than bad or good, so we map them:

label_map = {'bad': "malicious", 'good': "benign"}

# Use the map() method to replace the old labels with the new names
df['label'] = df['label'].map(label_map)
# Print all unique values in the 'label' column
print(df['label'].unique())
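One thing to watch with `map()` and a dictionary: any label missing from the dictionary becomes `NaN` instead of raising an error. Here is a small sketch of a check for that, using a made-up series with an extra `'unknown'` label (an assumption for illustration; the real dataset may contain only `good` and `bad`):

```python
import pandas as pd

label_map = {"bad": "malicious", "good": "benign"}

# Hypothetical labels, including one the map does not cover
labels = pd.Series(["good", "bad", "unknown", "good"])
mapped = labels.map(label_map)

# Values absent from label_map come back as NaN
unmapped = labels[mapped.isna()]
print("Unmapped labels:", unmapped.unique())
```

If this check finds anything, those rows need either a dictionary entry or removal before training.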

Getting the percentage of malicious and benign URLs

# calculate the percentage of 'benign' and 'malicious' labels
label_counts = df['label'].value_counts(normalize=True)
benign_percent = label_counts['benign'] * 100
malicious_percent = label_counts['malicious'] * 100

# create a pie chart
labels = ['Benign', 'Malicious']
sizes = [benign_percent, malicious_percent]
colors = ['green', 'red']
explode = (0, 0.1)

plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')

plt.show()

Removing Duplicates:

# remove the duplicate rows based on all columns
df = df.drop_duplicates()
df.head()

Training

Training is the process of teaching a machine learning model to make accurate predictions by fitting its internal parameters to the input data. During training, the model sees a set of labeled examples and adjusts itself to reduce the difference between its predicted output and the actual output. The objective is a model that generalizes to new, unseen data by learning the underlying patterns in the training data. How well this works depends on factors such as the size and diversity of the training data, the complexity of the model, and the choice of learning algorithm. The goal is to avoid both overfitting, where the model is so complex that it memorizes the training data, and underfitting, where it is too simple to capture the patterns. Once training is complete, the model is evaluated on a separate test set to assess how well it generalizes.

Split Dataset

X = df['url']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
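If the classes are imbalanced, a plain random split can leave the test set with a slightly different class ratio than the training set. `train_test_split` accepts a `stratify` argument that keeps the ratio the same in both parts. Here is a sketch on toy data (the URLs and the 80/20 class ratio are invented for illustration):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 80 benign and 20 malicious URLs (hypothetical)
urls = [f"site{i}.com" for i in range(100)]
labels = ["benign"] * 80 + ["malicious"] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    urls, labels, test_size=0.2, random_state=42, stratify=labels
)

# Both parts keep the 80/20 class ratio
print(Counter(y_tr), Counter(y_te))
```

Applied to the split above, this would mean passing `stratify=y`.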

Vectorizing the URLs

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
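It helps to know what the vectorizer actually sees. With default settings, `TfidfVectorizer` lowercases the text and keeps only runs of two or more word characters, so punctuation such as `/`, `.` and `?` is discarded. The sketch below imitates that default tokenization with the standard `re` module (the helper name `default_tokens` is mine, and it mirrors only the tokenizing step, not the TF-IDF weighting):

```python
import re

# Default token pattern used by scikit-learn's text vectorizers
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(url):
    """Approximate TfidfVectorizer's default lowercasing and tokenizing."""
    return TOKEN_PATTERN.findall(url.lower())

print(default_tokens("http://paypal-login.example.com/verify?id=1"))
print(default_tokens("google.com/../../etc/pwd"))
```

Because the `../..` segments vanish, path-traversal patterns are invisible to a model built on these tokens. Character n-grams (`analyzer='char'` together with an `ngram_range`) are one option worth trying if such patterns matter.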

Train the model

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

Getting accuracy

from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
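Accuracy alone can look good when one class dominates, so it is worth checking precision and recall for the malicious class as well. Here is a plain-Python sketch with made-up labels (the lists and the `precision_recall` helper are illustrative; for real predictions, `sklearn.metrics.classification_report` reports these values per class):

```python
def precision_recall(y_true, y_pred, positive="malicious"):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical ground truth and predictions
y_true_demo = ["benign"] * 8 + ["malicious"] * 2
y_pred_demo = ["benign"] * 8 + ["benign", "malicious"]

precision, recall = precision_recall(y_true_demo, y_pred_demo)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case nine of ten predictions are correct, yet half of the malicious URLs are missed, which accuracy by itself would hide.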

Storing the model as pkl file:

with open('model.pkl', 'wb') as f:
    pickle.dump(rf, f)

How to use model.pkl:

# Load the trained model back from disk
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

# Reuse the fitted vectorizer from training (model.pkl stores only the classifier)
new_url = ['google.com/../../etc/pwd']
new_url_transformed = vectorizer.transform(new_url)
prediction = model.predict(new_url_transformed)
print(prediction)

This prints a one-element array containing either 'malicious' or 'benign', depending on the model's prediction.
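Since `model.pkl` stores only the classifier, loading it in a fresh session still requires the fitted vectorizer. One way around that, sketched here on toy data instead of the real dataset, is to wrap both steps in a scikit-learn `Pipeline` and pickle the pipeline as a single object (the URLs, labels, and smaller `n_estimators` below are assumptions for illustration):

```python
import pickle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy training data (hypothetical URLs and labels)
train_urls = ["google.com/search", "github.com/login", "wikipedia.org/wiki",
              "free-prizes.biz/win.exe", "secure-update.biz/verify.exe",
              "login-bank.biz/confirm.exe"]
train_labels = ["benign"] * 3 + ["malicious"] * 3

# Vectorizer and classifier travel together as one object
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipeline.fit(train_urls, train_labels)

# Round-trip through pickle, as would happen with a file on disk
restored = pickle.loads(pickle.dumps(pipeline))

new_url = ["google.com/../../etc/pwd"]
print(restored.predict(new_url))  # raw strings go straight in
```

With this layout, whoever loads the pickle only needs to call `predict` on raw URL strings.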

Summary

In this post, we covered three key parts of a machine learning workflow: finding a dataset, preprocessing, and training. A good dataset is representative of the real-world inputs the model is expected to encounter. Preprocessing prepares that data by transforming it into a format suitable for the model. Training then fits the model to the labeled examples so that it can make accurate predictions on new URLs. Each of these steps can significantly affect the performance of the final model.

The code can be found at Kaggle
