Here's an implementation of the IMDB sentiment model from the book with some changes, I'll try and explain what I know and what I did, apologies for my code 😓.
- Machine learning essentially uses math equations which can be tweaked to do shallow things that humans can do very easily.
- Deep learning is machine learning with neural networks, like layers of meshes lying one on top of another. Each layer is, again, a math equation, so multiple layers are, well, multiple math equations stacked on top of one another.
- Each layer has weights associated with it, like coefficients in an equation. When you say y = ax + b, a is the weight associated with x. Don't worry about it too much, as Andrew Ng says.😊
- When you put these layers next to each other and train them using data, they form what everyone in the ML field calls an internal representation.
- These internal representations are what the network sees of the world, based on the data it has been trained on. Think of yourself as a caterpillar crawling along the floor: all your life, all you know is that there is something beneath you on which you crawl. You've never seen it, you don't know what it is, and you obviously don't care about the humans who built it, but you know mentally that if you crawl on it you'll be able to move.
- In the same way this model is trying to learn if sentiments are positive or negative. Remember it cannot actually feel or understand the sentiments but with the help of labelled data this is possible.
- Labelling is the cornerstone of supervised learning (the part of machine learning where models learn by being shown examples of something). It's like showing a kid what a car is, after which they're able to identify cars like an expert; machine learning models, though, are obviously way behind in that general sense.
What's Keras btw? : As far as I understand from the book and online articles, it's an API specification which calls lower level functions of other ML libraries such as Tensorflow, CNTK or Theano. To say simply, it makes it a lot easier for anyone to use lower level libraries without getting bogged down by the details. At a given point of time it can use any one of those and you won't have to change anything in your code on the top! A super naive view would be Keras is your front end and Tensorflow/CNTK/Theano are your backends :)
```python
import numpy as np  # for numerical arrays
from keras import models  # book told me to do it ! :)
from keras import layers
from keras.datasets import imdb
from keras import optimizers
import matplotlib.pyplot as plt  # for plotting
```
Keras allows you to directly import data into your program without hassle.
```python
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words = 10000)
```
And it's so sweet that it splits the data into training and testing.
Well, you might have heard people throwing around words like overfitting and underfitting to sound cool? This is what they're talking about. When you get a bunch of rows, thousands of rows (yes, think of it as Excel rows and columns) of data, the best idea is to divide it into training and testing. The training data trains the model (the neural network layers) and the testing data is used to see if the model is actually any good. You'll see more on this later.
Let's see a simple table below:
So here the labels are the last column, Can Code?. More importantly, when you ask the model to look at these, you need to make them binary, i.e. 0 for No and 1 for Yes.
Let's split this data:
Training Data: train_data would be columns User & Age, train_labels would be Can Code?
Testing Data: test_data would be columns User & Age, test_labels would be Can Code?
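Turning the Yes/No answers into 0s and 1s can be sketched like this. The `answers` list is made up for illustration and is not the actual data from the table:

```python
# hypothetical "Can Code?" answers, invented for this sketch
answers = ['Yes', 'No', 'Yes', 'No']

# 1 for Yes, 0 for No
labels = [1 if answer == 'Yes' else 0 for answer in answers]
print(labels)
```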
In our case of IMDB comment sentiment analysis, the data is like this:
| Comment | Label |
|---|---|
| Row 1 of word indices | 1 |
| Row N-1 of word indices | 0 |
| Row 50,000 of word indices | 1 |
Word indices is an array of integers, one per word in the comment, where each integer is that word's frequency rank in the dataset (the lower the number, the more common the word).
Say for example: "Shawshank Redemption is probably the best movie ever", this would be translated to the following, based on how frequently each word appears in the dataset:
The last value of 1 says this is a positive comment, as labelled by a human, in this case yours truly :). You can try this yourself if you have keras installed as follows:
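```python
from keras.datasets import imdb

# word_index maps each word to its frequency rank
word_index = imdb.get_word_index()

s = "Shawshank Redemption is probably the best movie ever"
for word in s.split(' '):
    # keys in word_index are lowercase; '?' marks words that are not in it
    print(word_index.get(word.lower(), '?'))
```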
Yes, you're right and so we use something called one hot encoding to convert this mess of words into a standard length of an array of 10,000 elements, remember this?:
num_words = 10000
This keeps only the 10,000 most frequent words in the comments, i.e. the ones used most often, so every word index fits into an array of 10,000 elements.
```python
x_train = np.zeros([len(train_data), 10000])
for number, sequence in enumerate(train_data):
    x_train[number, sequence] = 1
```
And you do this with the test data as well, since you'd need to maintain consistency when you test the model which works on arrays of 10,000 elements right?
```python
#in the book it is done as a function...but like right now.....ignore it
x_test = np.zeros([len(test_data), 10000])
for number, sequence in enumerate(test_data):
    x_test[number, sequence] = 1
```
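Since the comment above mentions that the book does this as a function, here is a minimal sketch of what such a helper could look like. The name `vectorize_sequences` and the toy input are my assumptions for illustration; a tiny 6-word vocabulary is used so the result is easy to read:

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    """Turn lists of word indices into rows of 0s and 1s."""
    results = np.zeros((len(sequences), dimension))
    for number, sequence in enumerate(sequences):
        # set every position named in the sequence to 1
        results[number, sequence] = 1
    return results

# toy input: two "comments" over a 6-word vocabulary (made-up indices)
toy = [[1, 3, 3], [0, 5]]
toy_vectors = vectorize_sequences(toy, dimension=6)
print(toy_vectors)
```

Note that a word appearing twice (index 3 in the first toy comment) still ends up as a single 1, so this encoding records which words are present, not how often.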
Vectorize (or make them into numpy arrays) the labels as well:
```python
#vectorize labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
```
Now comes a part where we split the training data as well! Yeah, but why do that?
What's over or underfit?
If a model is overfit, it means it's sticking so close to the data it trained on, that it cannot identify/classify/work on anything outside of what it encountered during training, just like saying "I now know what a Honda Civic is" but then when you see any different car you just say "that's Honda Civic! And that's Honda Civic and that's Honda Civic!"
On the other hand, when in an underfit scenario, it's going to be "I now know what a Honda Civic is" but then whenever you see a new car you say "well what is that? It's not a Honda Civic! Is it even a car? What is it?".
And so let's do some simple math here:
Keras IMDB data gives us 50,000 rows or samples.
25,000 went to training --> 15,000 would go into actually training those neural networks and the remaining 10,000 would go into validation. Validation essentially refers to using data held out from the training set to tune the model, to make it WORK: whenever we make some changes and train the model again on those 15,000 samples, we check the result on the validation set of 10,000 samples.
Remember we never touched the other 25,000, which are in the test set, not touching them at all! And that's the whole point of machine learning: how to use data generously to train a model to generalize while at the same time preventing overfitting and underfitting. Therefore the test data will never have any contribution towards training the model.
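The 15,000 / 10,000 split could be sketched as below. The variable names match the ones passed to `model.fit` further down, but which rows go to validation is my assumption (here, the first 10,000), and the arrays are small stand-ins so the sketch runs quickly:

```python
import numpy as np

# stand-ins for the vectorized training data (25,000 rows);
# only 8 columns instead of 10,000 to keep this sketch light
x_train = np.zeros((25000, 8))
y_train = np.zeros(25000, dtype='float32')

# assumption: the first 10,000 rows become the validation set
validation_x_train = x_train[:10000]
validation_y_train = y_train[:10000]

# the remaining 15,000 rows are what the model actually trains on
x_train_partial = x_train[10000:]
y_train_partial = y_train[10000:]

print(x_train_partial.shape, validation_x_train.shape)
```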
- Here we are, we will define the model and inputs that go into it.
- I have added some for loops into the model creation area of the program to test different types of activation functions, loss functions, hidden layers and hidden units.
Let me show you:
```python
#model iterations
activations = ['relu', 'tanh']
loss_func = ['mse', 'binary_crossentropy']
hidden_layer_units = [16, 32, 64]
hidden_layers = [1, 2, 3]
for activation in activations:
    for function in loss_func:
        for units in hidden_layer_units:
            for lyrs in hidden_layers:
                # the model is built, compiled and trained in here
                pass
```
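As an aside, the same set of combinations can be listed without four levels of nesting. This is just a sketch of an alternative using `itertools.product` from the standard library, not what the original program does:

```python
from itertools import product

activations = ['relu', 'tanh']
loss_func = ['mse', 'binary_crossentropy']
hidden_layer_units = [16, 32, 64]
hidden_layers = [1, 2, 3]

# every (activation, loss, units, layers) combination, in the same order
# as the nested for loops would visit them
combinations = list(product(activations, loss_func, hidden_layer_units, hidden_layers))
print(len(combinations))  # 2 * 2 * 3 * 3 combinations

for activation, function, units, lyrs in combinations[:3]:
    print(activation, function, units, lyrs)
```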
Relu, tanh and sigmoid are activation functions: the rectified linear unit, the hyperbolic tangent and the S-shaped logistic function. They put some non-linear (non-linear is a fancy way of saying not on a straight line) vibe into the math equations we're using. These are nothing but operations on the neural network mesh to make those internal representations and, in the end, learn a good representation of the input data. Again, don't worry about it too much, I just went ahead and used them!
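To make that a bit less mysterious, here is a small sketch of the three functions on single numbers, using only the standard library. Keras applies its own versions to whole arrays; the helper names `relu` and `sigmoid` below are mine:

```python
import math

def relu(x):
    # negative inputs become 0, positive inputs pass through unchanged
    return max(0.0, x)

def sigmoid(x):
    # squashes any number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

for value in [-2.0, 0.0, 3.0]:
    # math.tanh squashes any number into the range (-1, 1)
    print(value, relu(value), math.tanh(value), sigmoid(value))
```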
Imagine your guitar tutor telling you that the riff you just played was not even close to how Slash plays it! You're furious, it sounds pretty good to you, but not to the tutor! So how do you tell if you're improving? Well, one way is to record yourself playing, listen to it, and then use that feedback to improve so you can reduce the difference between you and Slash. That's what a loss function is calculating: the difference between what the model predicts and what the actual label is, which training then tries to minimize.
In this case, it would be:
Comment : "Shawshank redemption is the bomb yo!" Label : 1
Comment : "Shawshank redemption is the bomb yo!" Prediction : 0.85
As you see, this model has probably been trained well and its prediction is about 0.85 or 85%, that is, leaning towards 1.
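For this one comment, the two loss functions tried in this post can be worked out by hand. This is a sketch using the standard formulas for mean squared error and binary crossentropy on a single sample, not Keras's own implementation:

```python
import math

label = 1.0        # the human says: positive
prediction = 0.85  # the model says: 85% positive

# mean squared error: square of the difference
mse = (label - prediction) ** 2

# binary crossentropy: penalizes confident wrong answers heavily
bce = -(label * math.log(prediction) + (1 - label) * math.log(1 - prediction))

print(mse)  # roughly 0.0225
print(bce)  # roughly 0.1625
```

Both numbers shrink towards 0 as the prediction gets closer to 1, which is exactly what we want the training to achieve.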
Hidden layers and hidden units are covered in more depth in the book! 😂 But they form the deep part of deep learning. Say you have 3 layers:
Layer 1 : Input layer which accepts that 10,000 element array
Layer 2 : A middle (hidden) layer with an activation function, takes the 10,000 element input from layer 1 and spits out a 16 element output.
Layer 3 : An output layer with an activation function, takes input from layer 2 of 16 elements and spits out a 1 element output.
In the above case we have 1 hidden layer with 16 units.
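The shapes flowing through those three layers can be followed with plain numpy. This is only a sketch: the weights are random rather than trained, so the final number means nothing, and the names `W1`, `b1`, `W2`, `b2` are mine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer 1: one comment as a 10,000 element array (a few made-up words set to 1)
x = np.zeros((1, 10000))
x[0, [1, 5, 42]] = 1

# Layer 2: hidden layer, 10,000 inputs -> 16 outputs, relu activation
W1 = rng.normal(scale=0.01, size=(10000, 16))
b1 = np.zeros(16)
hidden = np.maximum(0, x @ W1 + b1)

# Layer 3: output layer, 16 inputs -> 1 output, sigmoid activation
W2 = rng.normal(scale=0.01, size=(16, 1))
b2 = np.zeros(1)
output = 1 / (1 + np.exp(-(hidden @ W2 + b2)))

print(hidden.shape, output.shape)
```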
The only reason I added those for loops was to try all the activations, loss functions etc., looping over all of them to find the best fit, as you see below.
We define the model, if you see the three layers above, the following lines are doing just that. The for loop in between just adds layers on different iterations which may return a better prediction.
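```python
#print(activation, function, units, lyrs)
#write model
model = models.Sequential()
# first hidden layer, always 16 units with relu
model.add(layers.Dense(16, activation = 'relu', input_shape = (10000,)))
# then `lyrs` more hidden layers with the settings of this iteration
for i in range(0, lyrs):
    model.add(layers.Dense(units, activation = activation))
model.add(layers.Dense(1, activation = 'sigmoid'))
#optimizers, loss etc.
# note: newer Keras versions name this argument learning_rate instead of lr
model.compile(optimizer = optimizers.RMSprop(lr = 0.001),
              loss = function,
              metrics = ['accuracy'])
```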
The last line is adding another beast, the optimizer, which if you take the guitar example is like saying how will you reduce the difference between yourself and Slash, by:
- Practicing 24 hours a day?
- Or, selling your soul to the devil?
- Or, asking Slash to play in your place?
To put it in plain English, the optimizer uses the loss function's output to adjust the weights, reducing the error rate of the model and hence increasing the accuracy.
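Here is a toy sketch of that idea on the y = ax + b equation from the start of the post. It uses plain gradient descent, which is simpler than the RMSprop optimizer used in the model, and the data points are made up so that the right answer is a = 2, b = 1:

```python
# toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a, b = 0.0, 0.0        # start with bad weights
learning_rate = 0.05   # how big a step to take each time

def mse_loss(a, b):
    # average squared difference between predictions and labels
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

initial_loss = mse_loss(a, b)
for step in range(2000):
    # how the loss changes when a or b changes a little
    grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (a * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # nudge the weights in the direction that lowers the loss
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
final_loss = mse_loss(a, b)

print(a, b, initial_loss, final_loss)
```

The learning rate here plays the same role as the `lr` value given to the optimizer: too small and the practice takes forever, too big and you overshoot.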
```python
history = model.fit(x_train_partial,
                    y_train_partial,
                    epochs = 10,
                    batch_size = 512,
                    validation_data=(validation_x_train, validation_y_train))
```
This line above is where you see the magic happen on the terminal! THE ACTUAL TRAINING, try it out, it's oddly satisfying! 😌
Well, the model training part is half the stuff done. The other half is looking at the results, the accuracy, whether the model overfits or underfits etc., and then changing your parameters. Any of the following:
- Learning rate
- Activation function
- Epochs - how many passes you make over the training data
- Hidden layers
- Hidden units
- Loss function
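One simple way to look at the results is to find the epoch where the validation loss was lowest. After training, Keras keeps the per-epoch numbers in `history.history`, a dict with keys such as `'loss'` and `'val_loss'`. The sketch below uses a hand-written dict with made-up numbers in that shape, not real output from this model:

```python
# made-up numbers, shaped like history.history after 5 epochs
history_dict = {
    'loss':     [0.52, 0.31, 0.22, 0.17, 0.13],
    'val_loss': [0.40, 0.31, 0.28, 0.29, 0.33],
}

val_loss = history_dict['val_loss']
# epochs are counted from 1, list positions from 0
best_epoch = val_loss.index(min(val_loss)) + 1
print(best_epoch)
```

In these made-up numbers the training loss keeps dropping after the best epoch while the validation loss climbs back up, which is the classic sign of overfitting.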
And that's where this article ends, I still haven't found the right combination, although I think it's a simpler problem, considering it's the first in the book!
Here is one of the outputs from the model, one of the 36 combinations:
Here is the code:
You might have to change the paths to save the images and the models.