I'm a software engineering student and this is my first blog post! I'm writing this to seek feedback, improve my technical writing skills and, hopefully, provide insights on text processing with Machine Learning.
I'm currently tasked with doing machine learning for Kormos, the startup I'm working with.
Our project
We are trying to find all the websites a user is subscribed to by looking at their emails. For that we have a database of emails, four thousand of which are human-tagged. The tag is called 'isAccount' and is true when the email was sent by a website the user is subscribed to.
The tagged emails were selected based on keywords in their body field; such keywords relate to "account creation" or "email verification".
This results in an imbalanced dataset.
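For illustration, here is a hypothetical sketch of what that keyword-based pre-selection could look like with pandas; the keyword list and the DataFrame name data are my own assumptions, only the body field comes from our data.

import pandas as pd

# Illustrative keywords only; the real selection used its own list
keywords = ["création de compte", "vérification d'email", "confirmer votre adresse"]
pattern = "|".join(keywords)

# Keep the emails whose body contains at least one keyword (case-insensitive)
candidates = data[data["body"].str.contains(pattern, case=False, na=False)]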
For this project we're focusing on these fields:
- Subject
- Body
- senderDomain: the domain of the sender (e.g. "kormos.com")
- langCode: the predicted language of the email
Most of our emails are in French, so we're only considering French emails from now on.
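As a minimal sketch, that filtering might look like this with pandas, assuming the emails live in a DataFrame called data and that French is stored as "fr" in langCode (the exact value is an assumption):

# Keep only the emails whose predicted language is French
data_fr = data[data["langCode"] == "fr"]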
Technical decisions
I'm using Python on Google Colab.
I'm doing Machine Learning using scikit-learn.
I experimented with spaCy and am considering using it to extract features from the body of emails, such as usernames or names of organizations.
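Here is a rough sketch of that idea; the model name fr_core_news_sm and the sample sentence are my own assumptions, and the model has to be downloaded separately:

import spacy

# Load spaCy's small French pipeline (hypothetical choice of model)
nlp = spacy.load("fr_core_news_sm")

# Run named-entity recognition on a made-up email sentence
doc = nlp("Merci de confirmer la création de votre compte Kormos.")
organizations = [ent.text for ent in doc.ents if ent.label_ == "ORG"]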
Processing text
I started training my model only on the subject field.
Vectorizing our text data
I'm using scikit-learn's TfidfVectorizer, which is equivalent to a CountVectorizer followed by a TfidfTransformer.
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore French stop words (requires nltk.download('stopwords') once)
# and keep only the 15 most frequent terms
tfidfV = TfidfVectorizer(
    stop_words=stopwords.words('french'),
    max_features=15
)
# Learn the vocabulary from the subjects and turn them into a TF-IDF matrix
corpus_bow = tfidfV.fit_transform(data["subject"])
What this does is build a vocabulary of the most common words, ignoring the stop words (frequent words of little value, like "I").
The size of the vocabulary is, at most, equal to max_features.
Based on this vocabulary, each text input is transformed into a vector of dimension max_features.
Basically, if "confirm" is the n-th word of the vocabulary, then the n-th dimension of the output vector is the number of occurrences of the word "confirm".
Hence we have a count matrix: numerical values instead of text.
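As a minimal sketch of the counting step alone (the toy subjects below are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

toy_subjects = [
    "confirmez votre compte",
    "confirmez votre adresse email",
]

cv = CountVectorizer()
counts = cv.fit_transform(toy_subjects)

print(cv.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())            # one row of word counts per subject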
TF-IDF weighting
This step transforms the count matrix into a normalized TF-IDF (term frequency-inverse document frequency) representation.
It scales down the impact of words that appear very frequently across the corpus and therefore carry little information to discriminate between emails.
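For reference, here is a sketch of the two-step version that TfidfVectorizer wraps, assuming the same data DataFrame as above:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Step 1: raw word counts, same settings as the TfidfVectorizer above
cv = CountVectorizer(stop_words=stopwords.words('french'), max_features=15)
counts = cv.fit_transform(data["subject"])

# Step 2: IDF weighting followed by L2 normalization
corpus_tfidf = TfidfTransformer().fit_transform(counts)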
Splitting our dataset
I use scikit-learn to split my data into two sets: one to train my model and the other to evaluate it.
from sklearn.model_selection import train_test_split
y = data_fr["isAccount"]  # the human tag is our target
train_X, val_X, train_y, val_y = train_test_split(corpus_bow, y, random_state=1)
I set random_state to an arbitrary number to fix the seed of the random number generator, making my results reproducible across executions of my code.
Training our model
The model I'm using is scikit-learn's RandomForestClassifier, because I understand how it works: it trains a number of decision tree classifiers and aggregates their predictions.
There are just so many models you can choose from.
from sklearn.ensemble import RandomForestClassifier

# 100 trees, each limited to a depth of 5; fixed seed for reproducibility
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(train_X, train_y)
y_pred = model.predict(val_X)
Results
Confusion matrix:
[[ 18  29]
 [ 10 716]]
accuracy = 94.955%
precision = 96.107%
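For reference, here is a sketch of how these numbers can be obtained with scikit-learn, using val_y and y_pred from above:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score

print(confusion_matrix(val_y, y_pred))
print("accuracy =", accuracy_score(val_y, y_pred))
print("precision =", precision_score(val_y, y_pred))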
We get good results; however, this is partly due to the class imbalance: the validation set contains 726 positive emails out of 773, so a classifier that always predicts True would already reach about 94% accuracy.
With these predictions, we can easily create a list of the unique sender domains the user is predicted to be subscribed to.
I filter this list by removing the domains not present in Alexa's top 1 million, hopefully filtering out any scams.
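A hypothetical sketch of those two steps, assuming val_df contains the validation rows aligned with y_pred and that alexa_top_1m is a set of domains loaded elsewhere:

# Sender domains of the emails predicted as account-related
predicted_domains = set(val_df.loc[y_pred.astype(bool), "senderDomain"])

# Keep only the domains that appear in Alexa's top 1 million
subscriptions = predicted_domains & alexa_top_1m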
Conclusion - How to make it better?
Assuming that the tagged data is correct and representative of future users, I believe that the model is good enough to be used.
However, I wonder whether removing some of the data where isAccount is True (i.e. undersampling the majority class) is an effective way to improve the model. The cost of that strategy would be training the model on a much smaller dataset.
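A hypothetical sketch of that idea with pandas, assuming data_fr holds the tagged French emails:

import pandas as pd

pos = data_fr[data_fr["isAccount"]]
neg = data_fr[~data_fr["isAccount"]]

# Downsample the majority (positive) class to the size of the minority class
balanced = pd.concat([pos.sample(len(neg), random_state=1), neg])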
I have also been informed that data augmentation could be useful in this situation.
Please feel free to give feedback!
I can give additional information about any step of the process.
Thanks to scikit-learn and pandas for their documentation.
Thanks to Tancrède Suard and Kormos for their work on the dataset.