Authored in connection with the Write With Fauna program
In this article, we will explain the steps required to build an educational app that recommends the best learning materials to its users, based on an algorithm driven by each user's history pattern. In building this application, the Fauna serverless database, Python, and a recommendation model were utilized.
The educational app aims to assist users by suggesting the most suitable materials to aid their study. This article further explains how the user's pattern is used, via a recommendation model, to suggest better sources for these materials.
Some of the broad areas related to this work include the following:
- The Fauna serverless platform.
- The Python platform.
- Machine Learning/Artificial Intelligence.
Here, the first thing to do is to create the database for our educational app in the Fauna dashboard. If you are new to Fauna and are yet to create an account, you can do so via the link attached here:
Fauna Acct Sign Up
In the Fauna dashboard, click on the “NEW DATABASE” button, provide a database name and click on the save button.
In order to connect the database to our educational app, we need to generate a Fauna API key. We do this by navigating to the Security settings on the Fauna sidebar (located at the top left of the screen).
Once this has been done, you are presented with your API key (best kept secret and copied somewhere easy to retrieve).
At this stage, we proceed to get the Python library for Fauna. It is available on pip and can be installed with a single command in the terminal: pip install faunadb.
After installation is completed, we run the sample code provided in the Fauna Python driver docs:
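A minimal sketch of that sample, wrapped in a function so the key can be supplied at call time (assumes the faunadb package is installed; the key string is a placeholder for your own secret):

```python
# Sketch of the driver-docs sample, assuming the faunadb package is
# installed (pip install faunadb); the secret is a placeholder.
def list_indexes(secret):
    """Connect to Fauna with an API key and return all index refs."""
    from faunadb import query as q
    from faunadb.client import FaunaClient

    client = FaunaClient(secret=secret)
    # q.paginate(q.indexes()) pages through every index in the database.
    return client.query(q.paginate(q.indexes()))["data"]

# Usage (needs a real key from the Security tab):
# print(list_indexes("your-fauna-secret"))
```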
The above code shows how the Fauna Python driver connects to a database with its API key and prints all of its indexes.
Now, let us consider the steps to follow to build our educational recommendation application from the Front-End.
The terms Machine Learning and Artificial Intelligence are very often used interchangeably. But in practice and theory, they do not mean the same thing.
Machine Learning is an application of Artificial Intelligence that gives systems the ability to automatically learn and improve from experience without being explicitly programmed. It enables computers to explore data and identify patterns in data sets with very minimal human intervention.
Machine Learning (ML) can be grouped into three broad categories:
- Reinforcement Learning.
- Supervised Learning.
- Unsupervised Learning.
Artificial Intelligence (AI) is an umbrella term that refers to any machine that performs a smart task. AI is a technology that enables machines to simulate human behaviour. In other words, all Machine Learning is Artificial Intelligence, but not all Artificial Intelligence is Machine Learning.
Artificial Intelligence can also be grouped into four types:
- Self Awareness
- Reactive Machines.
- Limited Memory.
- Theory of Mind.
The goal of AI is to produce software that can reason about inputs and explain outputs. AI produces software that provides human-like interaction and offers relevant supportive decisions, but it is not meant to replace humans, at least not anytime soon.
Some of the most common machine learning applications include:
- Medical Sector
- Banking and stock markets
- Speech and image recognition
- Product recommendation
- Online fraud detection
- Social media services.
Facial recognition and detection is an example of AI that we use every day without realizing it. The technology works much like the way humans recognize the faces and voices of other people.
This machine learning process lets the AI learn the facial coordinates of a human face and store them in its program for detection. Government and security sectors use this technology in restricted areas to deter unwanted personnel from entering the premises.
Now that we have successfully integrated our Python script with Fauna, let us list the steps to follow to create our educational recommendation application.
Getting alternatives or close substitutes for certain educational materials is a challenge for many individuals. The difficulty grows when a third party does the purchasing: most parents who buy educational materials for their children rely solely on other people's recommendations. Making such purchases online can be further enhanced using a recommendation model, a machine learning algorithm that can recommend educational movies, educational YouTube videos, and reliable sources for accessing these materials.
Basically, there are three basic types of recommendation model:
- Simple recommender.
- Content-based recommender.
- Collaborative filtering engines.
In this article, I will be using the content-based recommendation model, which suggests similar items based on a particular item. This system uses item metadata, such as genre, movie description, and lead actors, to make these recommendations. The general idea behind these recommender systems is that if a person likes a particular item, he or she will also like an item that is similar to it. And to recommend that, it will make use of the user's past item metadata.
In building this algorithm (model), I am going to use the Python programming language because of its robust libraries and its speed, accuracy, and precision in scientific calculations. This classification model is going to use a cosine similarity matrix, which takes points, say X1, X2, X3, X4, ..., Xn, and checks the distance between pairs of points: if two points are close, it groups them together as related (positive); if they are far apart (negative), it groups them as dissimilar or unrelated.
I am going to be building these models with a movie dataset that I got from UCI, a machine learning repository used as the data source.
Let’s go through the process of building the recommendation model in python, using some specific libraries.
In building this model, the first thing to do is to create your environment using the command python -m venv venv. After that, install the following packages using pip install on Windows (pip3 install on Linux):
A. sklearn
B. pandas
C. matplotlib
D. numpy
E. Jupyter notebook/Jupyter lab
Sklearn is used for scientific calculation, pandas for loading the data into the notebook so that I can work with it, Matplotlib for plotting graphs, and NumPy for mathematical computation, while Jupyter Notebook/JupyterLab is the environment where the Python code will be written.
The dataset files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. This dataset captures feature points like overview, plot, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts, and vote averages.
These feature points could be potentially used to train machine learning models for content and collaborative filtering.
This dataset consists of the following files:
movies_metadata.csv: This file contains information on approximately 45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, genre, revenue, overview, release dates, languages, production countries, and companies. And this will be the dataset I will be using for building the recommendation model.
The Full MovieLens Dataset comprises 26 million ratings and 750,000 tag applications, from 270,000 users on all the 45,000 movies in this dataset. It can be accessed from the GroupLens website.
**Snippet of libraries**
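A sketch of the import cell, following the packages listed above:

```python
import numpy as np               # mathematical computation
import pandas as pd              # loading and manipulating the dataset
import matplotlib.pyplot as plt  # plotting graphs
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF vectors
from sklearn.metrics.pairwise import linear_kernel           # fast cosine similarity
```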
To load the dataset, I will be using the pandas DataFrame. The pandas library is mainly used for data manipulation and analysis; it represents your data in a row-column format. pandas is backed by the NumPy array for the implementation of its data objects, and it offers off-the-shelf data structures and operations for manipulating numerical tables, time series, and natural language processing datasets. Basically, pandas is useful for datasets that can easily be represented in a tabular fashion.
Next, I check for missing values in my dataset in order to avoid bias in our models. From the code snippet below, there are missing values in our dataset. To keep the models from being biased, we deal with the missing values by replacing them with the mean (for numeric columns) or the mode (for categorical columns) of each column.
Fig 1.2: Checking for missing values
Fig 1.3: Fixing missing values with the mean and mode
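A hedged sketch of those two steps on a tiny stand-in frame (the column names are illustrative, not the real dataset schema):

```python
import pandas as pd

# Tiny stand-in frame; column names are illustrative, not the real schema.
df = pd.DataFrame({
    "runtime": [90.0, None, 120.0],                         # numeric column with a gap
    "overview": ["A spy thriller", None, "A space opera"],  # text column with a gap
})

# Fig 1.2 equivalent: count missing values per column.
print(df.isnull().sum())

# Fig 1.3 equivalent: numeric gaps get the column mean, text gaps the mode.
df["runtime"] = df["runtime"].fillna(df["runtime"].mean())
df["overview"] = df["overview"].fillna(df["overview"].mode()[0])

print(df.isnull().sum().sum())  # 0: no missing values remain
```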
In this section, I will be building a system that recommends movies that are similar to a particular movie. To achieve this, I will compute pairwise cosine similarity scores for all movies based on their plot descriptions and recommend movies based on that similarity score threshold.
The plot description is available as the overview feature in my movie dataset. Let's inspect the plots of a few movies:
Fig 1.4: Overview
We are now dealing with a Natural Language Processing problem, and it is not possible to compute the similarity between any two overviews in their raw form. To do this, I have to compute the word vectors of each overview, or document, as it will be called.
As the name suggests, word vectors are vectorized representations of words in a document, and the vectors carry semantic meaning with them. For example, man and king will have vector representations close to each other, while man and woman will have representations far from each other.
I will be computing Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each document. This will give you a matrix where each column represents a word in the overview vocabulary (all the words that appear in at least one document) and each row represents a movie, as before.
The TF-IDF score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This is done to reduce the importance of words that frequently occur in plot overviews and, therefore, their significance in computing the final similarity score.
In this case, I am going to use the scikit-learn library, which has a built-in TfidfVectorizer class that produces the TF-IDF matrix. In order to make the algorithm work properly without bias, we remove words that are not relevant to the topic; examples of such words include the, an, on, etc.
**A Snippet of TF-IDF Matrix**
Fig 1.5: TF-IDF matrix.
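A minimal sketch of that vectorization step on toy overviews (the real code fits the full overview column of the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy overviews standing in for the overview column of the movie dataset.
overviews = [
    "a young wizard attends a school of magic",
    "a wizard and his friends fight a dark lord",
    "a detective investigates a robbery in the city",
]

# stop_words='english' drops words like 'the', 'an', 'on' that carry no
# topical signal and would otherwise dominate the scores.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(overviews)

# Rows are movies, columns are vocabulary words.
print(tfidf_matrix.shape)
```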
From the above output, you observe that there are 22,645 different vocabulary words across the 5,157 movies in the dataset. With the TF-IDF matrix, it is now easier for me to compute the cosine similarity, which calculates a numeric quantity that denotes the similarity between two movies. Mathematically, it can be expressed as:
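For two TF-IDF vectors $x$ and $y$, the cosine similarity is the cosine of the angle between them:

```latex
\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}
           = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}
```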
Since I have used the TF-IDF vectorizer, calculating the dot product between each pair of vectors will directly give me the cosine similarity score. Therefore, I will use sklearn's linear_kernel() instead of cosine_similarity() since it is faster. This returns a matrix of shape 5157x5157, holding each movie overview's cosine similarity score with every other movie overview. Hence, each movie will be a 1x5157 row vector where each column is a similarity score with another movie.
Fig 1.6: Computation of TF-IDF matrix
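A sketch of that computation on the same toy overviews (in the real notebook, tfidf_matrix covers all 5,157 movies):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy overviews standing in for the 5,157 cleaned plot descriptions.
overviews = [
    "a young wizard attends a school of magic",
    "a wizard and his friends fight a dark lord",
    "a detective investigates a robbery in the city",
]

# TfidfVectorizer L2-normalises each row by default, so the plain dot
# product of two rows IS their cosine similarity; linear_kernel computes
# exactly that, skipping the redundant re-normalisation.
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(overviews)
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

print(cosine_sim.shape)  # (n_movies, n_movies); row i holds movie i's scores
```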
I am now going to define a function that takes in a movie title as an input and outputs a list of the 10 most similar movies. In order to do this, I need a reverse mapping of movie titles and DataFrame indices. In other words, I need a mechanism to identify the index of a movie in the Movie DataFrame, given its title.
**A snippet of the Recommendation model**
Fig 1.7: Recommendation model
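A hedged sketch of such a function on a four-movie toy catalogue (titles and overviews are made up; the real version runs over the full DataFrame):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy catalogue standing in for the movies DataFrame (titles are made up).
df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "overview": [
        "a wizard attends a school of magic",
        "a wizard fights a dark lord",
        "a detective investigates a jewel thief",
        "a detective solves a cold case",
    ],
})

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(df["overview"])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Reverse mapping: movie title -> row index in the DataFrame.
indices = pd.Series(df.index, index=df["title"])

def get_recommendations(title, cosine_sim=cosine_sim, top_n=10):
    """Return the titles of the top_n movies most similar to `title`."""
    idx = indices[title]
    # Pair each movie index with its similarity to the query movie, best first.
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:top_n + 1]  # drop position 0: the movie itself
    return df["title"].iloc[[i for i, _ in sim_scores]]

print(get_recommendations("Movie A", top_n=2))
```

Here "Movie A" and "Movie B" share the word "wizard", so B tops A's recommendation list.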
After getting the function, you then save it using joblib and generate the requirements using the command pip freeze > requirements.txt. This helps avoid environment-variable issues while deploying the model on any hosting site, and it also makes it easier for other machine learning engineers to contribute to and improve on the model.
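A sketch of the joblib save/load round trip (the matrix and file name below are illustrative stand-ins for the fitted artifacts):

```python
import os
import tempfile

import joblib    # installed alongside scikit-learn
import numpy as np

# Illustrative stand-in for the fitted cosine similarity matrix.
cosine_sim = np.array([[1.0, 0.42], [0.42, 1.0]])

# Persist the artifact so a deployed app can load it without refitting.
path = os.path.join(tempfile.gettempdir(), "cosine_sim.joblib")
joblib.dump(cosine_sim, path)
restored = joblib.load(path)

print(np.array_equal(cosine_sim, restored))  # True
```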
In this article, we built an educational recommendation application with Python and listed some steps to follow to install Fauna serverless database. We saw how easy it is to integrate Fauna into a Python application. We also gave a brief introduction to Machine Learning and Artificial Intelligence.
If you have any questions, don't hesitate to contact me on Twitter: @PAideloje