So, I built this project called RPAD-ML in my final year. It is essentially an Android app coupled with a machine learning backend server that detects 🕵️ any link that may be a phishing site in REALTIME ⚡. It can detect malicious/phishing links from any app. Open any app that has external links 🔗, and RPAD-ML will detect them in no time and give you a warning message ⚠️ right away.
I know there are lots of tools available, like Google Safe Browsing, but those are limited to the Chrome web browser. So what I've done is combine a machine learning model trained on phishing sites with Google Safe Browsing; given a URL, it predicts whether it is a phishing website or not.
I've got a machine learning model built using a dataset of phishing sites.
The dataset is downloaded from the UCI Machine Learning Repository. It contains 31 columns (30 features and 1 target) and 2456 observations.
To fit the models, the dataset is split into training and testing sets with a 75-25 ratio, where 75% goes to the training set.
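A minimal sketch of that 75-25 split, assuming scikit-learn and using synthetic data with the same shape as the dataset described above (the real UCI file, its column names, and the `random_state` value are not part of this sketch's source and are only placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 2456 rows, 30 features, 1 binary target.
X, y = make_classification(n_samples=2456, n_features=30, random_state=0)

# test_size=0.25 holds out 25% for testing and leaves 75% for training.
# random_state is an arbitrary choice so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)  # (1842, 30) (614, 30)
```

With 2456 rows, the split works out to 1842 training rows and 614 testing rows.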
Now the training set is used to train the classifiers. The classifiers chosen are logistic regression, a support vector machine (SVM), and a random forest.
We will see which one fits our dataset best.
Fitting logistic regression and building a confusion matrix of predicted versus actual values, I was able to get 92.3% accuracy, which is good for a logistic regression model.
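To make the link between a confusion matrix and an accuracy figure concrete, here is a small worked example in plain Python. The labels are made up for illustration (1 = phishing, 0 = legitimate is an assumed encoding) and are not the project's predictions:

```python
# Illustrative labels only; not the project's data.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]


def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# Accuracy is the share of predictions on the matrix diagonal (correct ones).
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(tp, fp, fn, tn)  # 3 1 1 3
print(accuracy)        # 0.75
```

The same arithmetic applies to any of the models here: correct predictions divided by all predictions on the test set.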
A support vector machine with an RBF kernel, using GridSearchCV to find the best parameters, was a really good choice. Fitting the model with the best parameters found, I was able to get 96.47% accuracy, which is pretty good.
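A rough sketch of what such a parameter search could look like with scikit-learn's `GridSearchCV`. The parameter grid, the 3-fold cross-validation, and the synthetic data are all assumptions made for illustration; the grid actually used in the project is not stated:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Small synthetic stand-in with 30 features, split 75-25.
X, y = make_classification(n_samples=400, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid: every C/gamma combination is tried with cross-validation.
param_grid = {"C": [1, 10], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training set with the winning parameters.
test_accuracy = search.best_estimator_.score(X_test, y_test)

print(search.best_params_)
print(test_accuracy)
```

The held-out test set is only scored once at the end, so the search itself never sees it.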
The next model I wanted to try was random forest, which also gives me feature importances. Again using GridSearchCV to find the best parameters and fitting the model with them, I got a very good accuracy of 97.26%.
Random forest gave very good accuracy. We could also try an artificial neural network to see if it improves the accuracy further.
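For the feature importances part, a minimal sketch assuming scikit-learn's `RandomForestClassifier`. It skips the parameter search and uses synthetic data, so the feature indices printed here mean nothing for the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with 30 features; only a few carry real signal.
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=0
)

# n_estimators and random_state are arbitrary choices for this sketch.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# feature_importances_ has one non-negative value per feature, summing to 1.
importances = forest.feature_importances_

# Pair each feature index with its importance and show the five largest.
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
for index, score in ranked[:5]:
    print(f"feature {index}: {score:.3f}")
```

On the real data, mapping the indices back to column names would show which URL properties the forest relies on most.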
ML Model: Phishcoop
I've used the Heroku platform (Hobby plan provided by GitHub Education) to host this machine learning model online. I used pickle to save and load the model and served it using Flask.
The idea was to expose this as a service and then call it from the Android app.
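Here is a minimal sketch of a pickle plus Flask service along those lines. Several things are assumptions: the `/predict` route, the JSON shape, the label encoding (1 = phishing), and the toy decision tree standing in for the trained model. It also takes an already-computed feature vector, since how the real service turns a URL into 30 features is not described here:

```python
import pickle

from flask import Flask, jsonify, request
from sklearn.tree import DecisionTreeClassifier

N_FEATURES = 30  # matches the 30 feature columns of the dataset

# Toy stand-in model trained on two rows so the sketch runs end to end.
toy_model = DecisionTreeClassifier(random_state=0).fit(
    [[1] * N_FEATURES, [0] * N_FEATURES], [1, 0]
)

# Round trip through pickle. A real deployment would dump the trained model
# to a file once and load that file when the server starts.
model = pickle.loads(pickle.dumps(toy_model))

app = Flask(__name__)


@app.route("/predict", methods=["POST"])  # hypothetical endpoint name
def predict():
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    if not isinstance(features, list) or len(features) != N_FEATURES:
        return jsonify(error=f"expected {N_FEATURES} feature values"), 400
    label = int(model.predict([features])[0])
    return jsonify(phishing=(label == 1))


# To serve it locally, something like app.run(port=5000) would work.
```

A client such as the Android app would then POST a JSON body like `{"features": [...]}` and read the `phishing` flag from the response.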
Essentially, the Android app is the front end that calls this service. I've used Android's accessibility API to access and intercept network activity, which is how I get the URLs being opened in any app.
Now, after getting the URL, I first call the Google Safe Browsing API to check whether it is a phishing site. If it is, I show a warning dialog. Otherwise, I call the machine learning backend server, and if its result says phishing, I again show a warning dialog.
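The two-step check can be sketched as a small function. This is Python rather than the app's Android code, and `safe_browsing_flags` and `model_flags` are hypothetical stand-ins for the Google Safe Browsing call and the ML backend call:

```python
from typing import Callable


def should_warn(
    url: str,
    safe_browsing_flags: Callable[[str], bool],  # hypothetical lookup wrapper
    model_flags: Callable[[str], bool],          # hypothetical ML backend wrapper
) -> bool:
    """Return True when a warning dialog should be shown for this URL."""
    # Step 1: Safe Browsing is asked first; a hit is enough to warn.
    if safe_browsing_flags(url):
        return True
    # Step 2: only when Safe Browsing does not flag it, ask the ML backend.
    return model_flags(url)


# Usage with fake checkers that record whether they were called.
calls = []


def fake_safe_browsing(url: str) -> bool:
    calls.append("safe_browsing")
    return "known-bad" in url


def fake_model(url: str) -> bool:
    calls.append("model")
    return "login-verify" in url


print(should_warn("http://known-bad.example", fake_safe_browsing, fake_model))     # True
print(should_warn("http://login-verify.example", fake_safe_browsing, fake_model))  # True
print(should_warn("http://example.org", fake_safe_browsing, fake_model))           # False
```

Ordering the checks this way means the ML backend is only contacted for URLs that Safe Browsing does not already flag.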
This was more of a prototype. It's not perfect, but hey, it works 🙌🏻. And the best thing is that I've learnt so much by working on this project 🤓