Petros Demetrakopoulos

Posted on Apr 3, 2020

Generating Beatles-like lyrics with RNNs

#machinelearning #python #ai #tensorflow

The project is implemented in Python using Tensorflow.
It is a word-based Recurrent neural network (RNN) inspired by this Tensorflow tutorial.
The model is trained in a corpus containing the song titles and the lyrics of many famous songs of Beatles.
So, given a sequence of words from Beatles' lyrics, it can predict the next words.
We could say that it is a model generating Beatle-like lyrics.

The corpus (aka training dataset)

As we said the corpus (beatles_lyrics.txt) contains the song titles and the lyrics of many Beatles songs in the following format:

Song title 1
-----------------
Lyric line 1
Lyric line 2
Lyric line 3
Lyric line 4
etc...

Song title 2
-----------------
Lyric line 1
Lyric line 2
Lyric line 3
Lyric line 4
etc...

etc...

I have manually created the corpus with the lyrics I found in this amazing site

Data preprocessing

As almost every machine learning model, training data need a bit of preprocessing before they are ready to be used as a training input for our RNN.
We preprocess the initial corpus by doing the following tasks:

Convert all letters to lowercase
Remove blank lines
Remove special characters (such as ',' , '(' , ')' , '[' , ']' etc)

The following function performs the preprocessing:

stopChars = [',','(',')','.','-','[',']','"']
# preprocessing the corpus by converting all letters to lowercase, 
# replacing blank lines with blank string and removing special characters
def preprocessText(text):
  text = text.replace('\n', ' ').replace('\t','')
  processedText = text.lower()
  for char in stopChars:
    processedText = processedText.replace(char,'')
  return processedText

After the text preprocessing step, we need to convert it to a list of words.
This procedure is also known as "Tokenization".

The following function performs the tokenization:

 def corpusToList(corpus):
  corpusList = [w for w in corpus.split(' ')] 
  corpusList = [i for i in corpusList if i] #removing empty strings from list
  return corpusList

Then, we trim each word for leading or trailing spaces / tabs.

map(str.strip, corpus_words) # trim words

Now, it is time to find the unique words (aka vocabulary) from which our dataset is composed of.

vocab = sorted(set(corpus_words))

In order to train our model, we need to represent words with numbers. So we map a specific number to each unique word of our corpus and vice versa by creating the following lookup tables. Then we represent the whole corpus as a list of numbers (word_as_int).

print('Corpus length (in words):', len(corpus_words))
print('Unique words in corpus: {}'.format(len(vocab)))
word2idx = {u: i for i, u in enumerate(vocab)}
idx2words = np.array(vocab)
word_as_int = np.array([word2idx[c] for c in corpus_words])

The prediction process

Our goal is to predict the next words that will follow in a sequence, given some starting words (a start sequence).
In layman's terms, RNNs are able to maintain an internal state that depends on the elements (in our case elements = sequences of words) that the RNN has previously "seen".
So, we train the RNN to take as an input a sequence of words and predict the output, which is the following word at each time step. As you can easily understand, if we run the model for many time steps we generate sequences of words!

In order to train it, we have to split our train dataset (aka corpus) in "batches" of sequences of words (as this is what we also want to predict). Then, we need to shuffle them, because we want to make the order with which the songs have been placed in the dataset indifferent for the RNN (and thus for the prediction it will do). If we do not shuffle them, RNN may learn the order of the songs in the corpus to and that may lead it to overfitting

Creating training batches

Now it is time to slice the corpus into training batches. Each batch should contain seqLength words from the corpus.
For each split sequence of words, there is also a target sequence which has the same length as the training one and it is the same but one word shifted to the right. So, we slice the text into seqLength+1 words slices and we use the first seqLength words as training sequence and we extract the target sequence as mentioned.

Example:
Let's say our corpus contains the following verse:

I read the news today oh boy
About a lucky man who made the grade

Now, if the seqLength is 14, the training sequence will be :

I read the news today oh boy
About a lucky man who made the

and the target sequence will be:

read the news today oh boy
About a lucky man who made the grade.

We do so with the following lines:

# The maximum length sentence we want for a single input in words
seqLength = 10
examples_per_epoch = len(corpus_words)//(seqLength + 1) # number of seqLength+1 sequences in the corpus

# Create training / targets batches
wordDataset = tf.data.Dataset.from_tensor_slices(word_as_int)
sequencesOfWords = wordDataset.batch(seqLength + 1, drop_remainder=True) # generating batches of 10 words each, typically converting list of words (sequence) to string

def split_input_target(chunk): # This is where right shift happens
  input_text = chunk[:-1]
  target_text = chunk[1:]
  return input_text, target_text # returns training and target sequence for each batch

dataset = sequencesOfWords.map(split_input_target) # dataset now contains a training and a target sequence for each 10 word slice of the corpus

Shuffling the batches

As we mentioned earlier, before we feed our training batches in our RNN, we have to shuffle them to prevent the RNN from learning the order of the songs in the corpus which may lead it to overfitting.

BATCH_SIZE = 64 # each batch contains 64 sequences. Each sequence contains 10 words (seqLength)
BUFFER_SIZE = 100 # Number of batches that will be processed concurrently

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset now contains batches of 64 word sequence each, each sequence is filled in the previous step with 10 words.

The model

Our RNN is composed of 3 layers:

Input layer. It maps the number representing each word to a vector with known dimensions (that are explicitly set)
GRU (middle) layer: GRU stands for Gated Recurrent Units. The number of units that this layer contains is also explicitly set. This layer could also be replaced by a Long Short-Term Memory (LSTM) layer. More on LSTMs and GRUs in this useful link
Output layer: It has as many units as the size of the vocabulary

The model definition code:

# Length of the vocabulary in words
vocab_size = len(vocab)
# The embedding dimension
embedding_dim = 256
# Number of GRU units
rnn_units = 1024

def createModel(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
 return model

model = createModel(vocab_size = len(vocab), embedding_dim=embedding_dim, rnn_units=rnn_units, batch_size=BATCH_SIZE)

How the RNN works

For each word in the input layer, the model passes its embedding to the GRU layer for one step of time.
The output of the GRU is then passed to the dense layer which predicts the log-likelihood of the next word.

The schematic below is a bit more descriptive.

Training the model

From now on, we can consider the problem as a simple classification problem.
If you think about it, our model predicts the "class" of the next word based on its current state and the input words during a time step.

So, as in every classification model, in order to train it we need to calculate the loss in each time step.

We do so by defining the following function:

def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

Then we compile the model using 'adam' oprimizer.

model.compile(optimizer='adam', loss=loss)

During the training process, we should save checkpoints of the training in a directory that we have manually created

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

Now it is time to execute training.
We explicitly set the number of epochs.
At this point I would like to remind you that an epoch is one forward pass and one backward pass of all the training examples.

We choose to train for 20 epochs. Note that as you increase number of epochs, the training time will increase too.
You should experiment with this number in order to fine tune the model.
But be careful, training for too many epochs may lead to overfitting.

EPOCHS = 20
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

After the training process is over, we restore the trained model form the latest checkpoint.
It is time to generate the lyrics!

tf.train.latest_checkpoint(checkpoint_dir)
model = createModel(len(vocab), embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))
model.summary()

Generating the lyrics

RNNs (as most of Neural network types in general) need an initial state to start predicting.
In our case, this initialization is represented by a starting string with which we want the generated lyrics to start.
The model generates the probability distribution of the next word using the start string and the RNN state.

Then, with the help of categorical distribution, the index of the predicted word is calculated and the predicted word is used as the input for the next time step of the model

The state that the RNN returns is then fed back to the input of the RNN, in order to help it by providing more context (not just one word). This process continues as it generates predictions and this is why it learns better while it gets more context from the predicted words.

The following function performs the task mentioned above:

def generateLyrics(model, startString, temp):
  print("---- Generating lyrics starting with '" + startString + "' ----")
  # Number of words to generate
  num_generate = 30

  # Converting our start string to numbers (vectorizing)
  start_string_list =  [w for w in startString.split(' ')]
  input_eval = [word2idx[s] for s in start_string_list]
  input_eval = tf.expand_dims(input_eval, 0)

  text_generated = []

  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # temp represent how 'conservative' the predictions are. 
      # Lower temp leads to more predictable (or correct) lyrics
      predictions = predictions / temp 
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted word as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)
      text_generated.append(' ' + idx2words[predicted_id])
  return (startString + ''.join(text_generated))

As you can see, many factors may influence the accuracy of the predictions.
temp parameter for example represents how 'conservative' the predictions are.
This means that lower temp values lead to more predictable (or correct) lyrics.

Running the model

model.py also contains a "demo part".
After the training process is finished, it saves the model in a binary file (you can then restore it in one line of code and use it instantly to predict values) for future use so we do not have to train it every time we want to generate lyrics.

#save trained model for future use (so we do not have to train it every time we want to generate text)
model.save('saved_model.h5')

Then it calls the generateLyrics function with the start string "love" (We all know how much Beatles used) this word in their songs.
Then it prompts the user to enter a start string and a temp value to generate lyrics.

Some examples that the model gave me:

Start string: "love"
Generated lyrics: "love youyouyouyou as i write this letter send my love to give you all my loving i will send to you all my loving i will send to you don't bother"

Start string: "boys and girls"
temp: 0.4
Generated lyrics: "boys and girls make me sing and shout that georgia's always be blind love is here to stay and that's enough to make you my girl be the only one love me hold"

Start string: "day"
temp: 0.8
Generated lyrics: "day tripper night at my own it will take a walk on home loretta get back get back get back get back get back get back get back get back get"

`
Hm... Not that bad.

Optimizing the model

You can easily fine-tune the model by changing the variables that influence the accuracy of the predictions. Using a stopwords set with the very common words in the corpus and removing them from the training data may help the predictions as it will eliminate the noise and the features that do not offer a significant info gain.

BATCH_SIZE, temp, embedding dimensions, sequence length can all influence the prediction and thus the generated lyrics.
You can also try to use LSTM units in the middle layer.
So you can experiment yourself and comment with the results.

Project is available on this Github Repo