I'm a software engineering student and this is my first blog post! I'm writing this to seek feedback, improve my technical writing skills and, hopefully, provide insights on text processing with Machine Learning.
I'm currently tasked with doing machine learning for Kormos, the startup I'm working with.
We are trying to find all the websites a user is subscribed to by looking at their emails. For that we have a database of emails, four thousand of which are human-tagged. This tag is called 'isAccount' and is true when the email was sent by a website the user is subscribed to.
The tagged emails were selected based on keywords in their body field. Such keywords are related to "account creation" or "email verification".
This results in an imbalanced dataset.
For this project we're focusing on these fields:
- senderDomain: the domain of the sender (e.g. "kormos.com")
- langCode: the predicted language of the email
We mostly have French emails, so we're only considering French emails from now on.
I'm using Python on Google Colab.
I'm doing Machine Learning using scikit-learn.
I experimented with spaCy and am considering using it to extract features from the body of emails, such as usernames or names of organizations.
I started training my model only on the subject field.
I'm using scikit-learn's TfidfVectorizer, which is equivalent to a CountVectorizer followed by a TfidfTransformer.
```python
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfV = TfidfVectorizer(
    stop_words=set(stopwords.words('french')),
    max_features=15
)
corpus_bow = tfidfV.fit_transform(data["subject"])
```
What it does is build a vocabulary of the most common words, ignoring the stop words (frequent words of little value, like "I").
The size of the vocabulary is at most max_features.
Based on this vocabulary, each text input is transformed into a vector of dimension max_features.
Basically, if "confirm" is the n-th word of the vocabulary, then the n-th dimension of the output vector is the number of occurrences of the word "confirm".
Hence we have a count matrix: numerical values instead of text.
This step then transforms the count matrix into a normalized tf-idf representation, which scales down the impact of words that appear very frequently.
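To make the vectorization step concrete, here is a toy run on three made-up subject lines (the subjects and variable names are illustrative, not from our real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus standing in for real email subjects
subjects = [
    "confirm your account",
    "please confirm your email",
    "weekly newsletter",
]

vec = TfidfVectorizer(max_features=15)
X = vec.fit_transform(subjects)

# 3 subjects, 7 distinct words -> a 3x7 sparse matrix
print(X.shape)             # (3, 7)
print(sorted(vec.vocabulary_))
```

Each row is one subject; each column is one vocabulary word, weighted by tf-idf instead of a raw count.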
I use scikit to divide my data into two groups: one to train my model and the other to test it.
```python
from sklearn.model_selection import train_test_split

y = data_fr["isAccount"]
train_X, val_X, train_y, val_y = train_test_split(corpus_bow, y, random_state=1)
```
I set random_state to an arbitrary number to fix the seed of the random number generator, making my results stable across different executions of my code.
The model I'm using is scikit-learn's RandomForestClassifier because I understand it. It trains a number of decision tree classifiers and aggregates their predictions.
There are just so many models you can choose from.
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(train_X, train_y)
y_pred = model.predict(val_X)
```
```
confusion matrix:
[[ 18  29]
 [ 10 716]]
accuracy  = 94.955%
precision = 96.107%
```
We get good results, however this is partly due to the imbalance in the distribution of the classes.
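To see how much the imbalance flatters the score, we can recompute the metrics directly from the confusion matrix above. The minority-class recall (how many non-account emails we actually identify) tells a much less rosy story:

```python
import numpy as np

# confusion matrix from the run above: rows = true class, cols = predicted class
cm = np.array([[ 18,  29],
               [ 10, 716]])

accuracy = cm.trace() / cm.sum()          # correct predictions / all predictions
minority_recall = cm[0, 0] / cm[0].sum()  # non-account emails correctly caught

print(round(accuracy * 100, 3))         # 94.955
print(round(minority_recall * 100, 1))  # 38.3
```

So while overall accuracy is near 95%, we only catch about 38% of the minority class.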
With these predictions, we can easily create a list of unique sender domains the user is predicted to be subscribed to.
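A minimal sketch of that step, on toy data (the dataframe and values here are made up for illustration):

```python
import pandas as pd

# stand-in for the validation slice of our email dataframe
val_emails = pd.DataFrame({
    "senderDomain": ["kormos.com", "spam.biz", "kormos.com", "shop.fr"],
})
y_pred = [True, False, True, True]  # model predictions for these rows

# keep rows predicted as subscriptions, then deduplicate the domains
subscribed_domains = sorted(val_emails.loc[y_pred, "senderDomain"].unique())
print(subscribed_domains)  # ['kormos.com', 'shop.fr']
```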
I filter the list of domains by removing the ones not present in Alexa's top 1 million domains, hopefully filtering out scams.
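The filtering itself boils down to a set-membership check. Here is a sketch with an inline snippet standing in for Alexa's "rank,domain" CSV (the snippet and domain names are made up):

```python
import csv
import io

# hypothetical excerpt of Alexa's top-1m.csv, one "rank,domain" row per line
alexa_csv = "1,google.com\n2,youtube.com\n3,kormos.com\n"
top_domains = {row[1] for row in csv.reader(io.StringIO(alexa_csv))}

predicted = ["kormos.com", "totally-a-scam.xyz"]
kept = [d for d in predicted if d in top_domains]
print(kept)  # ['kormos.com']
```

In the real pipeline the set would be built once from the full downloaded CSV file.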
Assuming that the tagged data is correct and representative of future users, I believe that the model is good enough to be used.
However, I wonder whether removing some of the data where isAccount is True (undersampling the majority class) would be an effective way to improve the model. The cost of that strategy would be training the model on a much smaller dataset.
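For reference, the undersampling idea is simple with pandas: sample the majority class down to the size of the minority class. A toy sketch (the labels below are made up):

```python
import pandas as pd

# toy imbalanced labels: 8 True vs 2 False
df = pd.DataFrame({"isAccount": [True] * 8 + [False] * 2})

minority = df[~df["isAccount"]]
# randomly drop majority rows until both classes have the same size
majority = df[df["isAccount"]].sample(n=len(minority), random_state=1)
balanced = pd.concat([minority, majority])

print(balanced["isAccount"].value_counts())  # 2 of each class
```

The trade-off is visible right away: the balanced set has 4 rows instead of 10.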
I have also been informed that data augmentation could be useful in this situation.
Please feel free to give feedback!
I can give additional information about any step of the process.
Thanks to scikit-learn and pandas for their documentation.
Thanks to Tancrède Suard, and Kormos, for their work on the dataset.