Ryan Thelin for Educative

Originally published at educative.io

Build a Deep Learning Text Generator Project with Markov Chains

Natural language processing (NLP) and deep learning are growing in popularity for their use in ML technologies like self-driving cars and speech recognition software.

As more companies begin to implement deep learning components and other machine learning practices, the demand for software developers and data scientists with proficiency in deep learning is skyrocketing.

Today, we will introduce you to a popular deep learning project, the Text Generator, to familiarize you with important, industry-standard NLP concepts, including Markov chains.

By the end of this article, you'll understand how to build a text generator component for search engine systems and know how to implement Markov chains for faster predictive models.

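Here’s what we’ll cover today:

  • Introduction to the Text Generator Project
  • What are Markov Chains?
  • Text Generation Project Implementation
  • What to learn next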

Introduction to the Text Generator Project

Text generation is popular across industries, especially in mobile, app development, and data science. Even journalism uses text generation to aid writing processes.

You’ve probably encountered text generation technology in your day-to-day life. iMessage text completion, Google search, and Google’s Smart Compose on Gmail are just a few examples. These skills are valuable for any aspiring data scientist.

Today, we are going to build a text generator using Markov chains. This will be a character-based model that takes the previous character of the chain and generates the next letter in the sequence.

By training our program with sample words, our text generator will learn common patterns in character order. The text generator will then apply these patterns to the input, an incomplete word, and output the character with the highest probability to complete that word.

Let’s suppose we have a string, monke. We need to find the character that is best suited after the character e in the word monke based on our training corpus.

Our text generator would determine that y sometimes follows e and that adding it would form a completed word. In other words, we are going to generate the next character for that given string.

The text generator project relies on text generation, a subdivision of natural language processing that predicts and generates the next characters based on previously observed patterns in language.

Without NLP, we'd have to create a table of all words in the English language and match the passed string to an existing word. There are two problems with this approach.

  • It would be very slow to search thousands of words.
  • The generator could only complete words that it had seen before.

NLP allows us to dramatically cut runtime and increase versatility because the generator can complete words it hasn’t even encountered before. NLP can be expanded to predict words, phrases, or sentences if needed!

For this project, we will specifically be using Markov chains to complete our text. Markov processes are the basis for many NLP projects involving written language and simulating samples from complex distributions.

Markov processes are so powerful that they can be used to generate superficially real-looking text with only a sample document.

What are Markov Chains?

A Markov chain is a stochastic process that models a sequence of events in which the probability of each event depends on the state of the previous event. The model requires a finite set of states with fixed conditional probabilities of moving from one state to another.

The probability of each shift depends only on the previous state of the model, not the entire history of events.

For example, imagine you wanted to build a Markov chain model to predict weather conditions.

We have two states in this model, sunny and rainy. There is a higher probability (70%) that it'll be sunny tomorrow if we've been in the sunny state today. The same is true for rainy: if it has been rainy, it will most likely continue to rain.

However, it's possible (30%) that the weather will shift states, so we also include that in our Markov chain model.
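
The weather model can be sketched in a few lines of Python. This is a minimal illustration and not part of the original project: the names TRANSITIONS and simulate_weather are hypothetical, and the 70%/30% split for the rainy state is an assumption that mirrors the sunny state.

```python
import random

# Transition probabilities: TRANSITIONS[today][tomorrow].
# The rainy row is an assumption that mirrors the sunny row.
TRANSITIONS = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"rainy": 0.7, "sunny": 0.3},
}

def simulate_weather(start, days, seed=None):
    """Walk the chain for `days` steps; each step depends only on the current state."""
    rng = random.Random(seed)
    state = start
    history = [state]
    for _ in range(days):
        options = list(TRANSITIONS[state].keys())
        weights = list(TRANSITIONS[state].values())
        # Pick tomorrow's weather using today's row of probabilities
        state = rng.choices(options, weights=weights, k=1)[0]
        history.append(state)
    return history

print(simulate_weather("sunny", days=7, seed=0))
```

Notice that the loop never looks at anything except the current state. That is the same property our text generator will rely on, with character sequences in place of weather.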

The Markov chain is a good fit for our text generator because our model will predict the next character using only the current state, which is the previous K characters. The advantages of using a Markov chain are that it's reasonably accurate, light on memory (it only stores one previous state), and fast to execute.

Text Generation Project Implementation

We'll complete our text generator project in 6 steps:

  1. Generate the lookup table: Create a table to record how often each character sequence occurs
  2. Convert frequency to probability: Convert our findings to a usable form
  3. Load the dataset: Load and utilize a training set
  4. Build the Markov chains: Use probabilities to create chains for each word and character
  5. Sample our data: Create a function to sample individual sections of the corpus
  6. Generate text: Test our model

1. Generate the lookup table

First, we'll create a table that records the occurrences of each character state within our training corpus.
We will take each run of ‘K’ characters together with the character that follows it (the ‘K+1’ character) and store them in a lookup table.

For example, imagine our training corpus contained, "the man was, they, then, the, the".
Then the number of occurrences by word would be:

  • "the" - 3
  • "then" - 1
  • "they" - 1
  • "man" - 1
  • "was" - 1

Here's what that would look like in a lookup table:

[Figure: lookup table listing each X, its following character Y, and the frequency of the pair]

In the example above, we have taken K = 3. Therefore, we'll consider 3 characters at a time and take the next character (K+1) as our output character.

In the above lookup table, we have the word (X) as the and the output character (Y) as a single space (" "). We have also calculated how many times this sequence occurs in our dataset, 3 in this case.

We'll find this data for each word in the corpus to generate all possible pairs of X and Y within the dataset.

Here's how we'd generate a lookup table in code:

def generateTable(data,k=4):

    T = {}
    for i in range(len(data)-k):
        X = data[i:i+k]
        Y = data[i+k]
        #print("X  %s and Y %s  "%(X,Y))

        if T.get(X) is None:
            T[X] = {}
            T[X][Y] = 1
        else:
            if T[X].get(Y) is None:
                T[X][Y] = 1
            else:
                T[X][Y] += 1

    return T

T = generateTable("hello hello helli")
print(T)

Explanation

  • On line 3, we created a dictionary that is going to store our X and its corresponding Y and frequency value. Try running the above code and see the output.

  • From line 9 to line 17, we check whether the X and Y pair is already in our lookup dictionary. If it is, we increment its count by 1; if not, we add it with a count of 1.
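
The same counting logic can also be written more compactly with the standard library. The sketch below is an alternative, not the article's implementation: generate_table_compact is a hypothetical name, and it assumes the same K = 4 default as generateTable().

```python
from collections import Counter, defaultdict

def generate_table_compact(data, k=4):
    """Map each k-character window X to the counts of the characters Y that follow it."""
    table = defaultdict(Counter)
    for i in range(len(data) - k):
        # Missing keys start at 0, so no explicit "is None" checks are needed
        table[data[i:i + k]][data[i + k]] += 1
    # Convert to plain dicts so the result prints like the original table
    return {x: dict(ys) for x, ys in table.items()}

table = generate_table_compact("hello hello helli")
print(table["hell"])  # {'o': 2, 'i': 1}
```

For the sample string, hell is followed by o twice and by i once, which is the kind of split the probabilities in the next step are built from.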

2. Convert frequencies to probabilities

Once we have this table and the occurrences, we'll generate the probability that a given Y will appear after a given X. Our equation for this will be:

P(Y | X) = (Frequency of Y with X)/(Sum of Total Frequencies for X)

For example, if X = the and Y = n our equation would look like this:

  • Frequency that Y = n when X = the: 2
  • Total frequency of all Y values for X = the: 8
  • Therefore: P = 2/8 = 0.25 = 25%

Here's how we'd apply this equation to convert our lookup table to probabilities usable with Markov chains:

def convertFreqIntoProb(T):     
    for kx in T.keys():
        s = float(sum(T[kx].values()))
        for k in T[kx].keys():
            T[kx][k] = T[kx][k]/s

    return T

T = convertFreqIntoProb(T)
print(T)

Explanation

  • We summed up the frequency values for a particular key and then divided each frequency value of that key by that summed value to get our probabilities. Simple logic!
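
Note that convertFreqIntoProb() overwrites the counts in T as it goes. If you want to keep the raw counts around, a variant that builds a new table is one option. The function name to_probabilities and the small freqs table below are illustrative assumptions, not part of the original code.

```python
def to_probabilities(freq_table):
    """Return a new table where each count is divided by the total for its X."""
    prob_table = {}
    for x, counts in freq_table.items():
        total = sum(counts.values())
        prob_table[x] = {y: n / total for y, n in counts.items()}
    return prob_table

# Hypothetical counts in the same shape as the output of generateTable()
freqs = {"hell": {"o": 2, "i": 1}, "ello": {" ": 2}}
probs = to_probabilities(freqs)
print(probs["hell"])  # o gets 2/3 and i gets 1/3

# Each row should sum to 1 (allowing for floating-point rounding)
for x, row in probs.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9
```

The original counts in freqs are left untouched, which makes it easier to inspect or update them later.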

3. Load the dataset

Next, we'll load our real training corpus. You can use any long text (.txt) document that you want.

We'll use a political speech to provide enough words to teach our model.

text_path = "train_corpus.txt"
def load_text(filename):
    with open(filename,encoding='utf8') as f:
        return f.read().lower()

text = load_text(text_path)
print('Loaded the dataset.')

This data set will give our generator enough occurrences to make reasonably accurate predictions. As with most machine learning, larger training corpora will generally result in more accurate predictions.

4. Build the Markov chains

Now let's construct our Markov chains and associate the probabilities with each character. We'll use the generateTable() and convertFreqIntoProb() functions created in step 1 and step 2 to build the Markov models.

def MarkovChain(text,k=4):
    T = generateTable(text,k)
    T = convertFreqIntoProb(T)
    return T

model = MarkovChain(text)
print('Model Created Successfully!')

Explanation

  • On line 1, we created a method to generate the Markov model. This method accepts the text corpus and the value of K, which is the value telling the Markov model to consider K characters and predict the next character.

  • On line 2, we generated our lookup table by providing the text corpus and K to our method, generateTable(), which we created in step 1.

  • On line 3, we converted the frequencies into probabilities by using the method, convertFreqIntoProb(), which we created in step 2.

5. Sample the text

Now, we'll create a sampling function that takes the unfinished word (ctx), the Markov chains model from step 4 (model), and the number of characters used to form the word's base (k).

We'll use this function to sample the passed context and return a likely next character, chosen at random according to each candidate's probability.

import numpy as np

def sample_next(ctx,model,k):

    ctx = ctx[-k:]  # only the last k characters form the current state
    if model.get(ctx) is None:
        return " "  # unseen context: fall back to a space
    possible_Chars = list(model[ctx].keys())
    possible_values = list(model[ctx].values())

    print(possible_Chars)
    print(possible_values)

    return np.random.choice(possible_Chars,p=possible_values)

sample_next("commo",model,4)

Explanation

  • The function, sample_next(ctx,model,k), accepts three parameters: the context, the model, and the value of K.

  • The ctx is the text that will be used to generate some new text. However, only the last K characters from the context will be used by the model to predict the next character in the sequence.

  • For example, we passed commo as the context with K = 4. The model only looks at the last K characters, so the context it actually uses is ommo. You can confirm this by printing ctx inside the function.

  • The two print statements show the possible next characters and their probabilities, which come straight from our model. With our training corpus, the next predicted character was n with a probability of 1.0. This makes sense because commo is most likely to be completed as common.

  • The final line returns a character sampled according to those probabilities.
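
Sampling at random and always picking the most probable character are two different strategies. The sketch below shows both side by side using only the standard library. The names most_likely_next, sample_next_stdlib, and toy_model are hypothetical, and the toy model simply assumes the same dictionary shape that MarkovChain() returns.

```python
import random

def most_likely_next(ctx, model, k):
    """Greedy choice: return the highest-probability character for the last k characters."""
    row = model.get(ctx[-k:])
    if row is None:
        return " "  # same fallback as sample_next for unseen contexts
    return max(row, key=row.get)

def sample_next_stdlib(ctx, model, k, rng=random):
    """Random choice weighted by probability, without NumPy."""
    row = model.get(ctx[-k:])
    if row is None:
        return " "
    chars = list(row.keys())
    weights = list(row.values())
    return rng.choices(chars, weights=weights, k=1)[0]

# Toy model in the same shape as the output of MarkovChain()
toy_model = {"hell": {"o": 2 / 3, "i": 1 / 3}, "ello": {" ": 1.0}}

print(most_likely_next("hell", toy_model, 4))
print(sample_next_stdlib("hell", toy_model, 4))
```

The greedy version always completes hell with o, which suits word completion. The sampling version will sometimes return i, which keeps longer generated text from repeating the same loop.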

6. Generate text

Finally, we'll combine all the above functions to generate some text.


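def generateText(starting_sent,k=4,maxLen=1000):
    # Relies on the global `model` built in step 4
    sentence = starting_sent
    ctx = starting_sent[-k:]

    for ix in range(maxLen):
        # Sample one character, then slide the context window forward
        next_prediction = sample_next(ctx,model,k)
        sentence += next_prediction
        ctx = sentence[-k:]
    return sentence

print("Function Created Successfully!")

# Use a new name so the corpus stored in `text` is not overwritten
generated = generateText("dear",k=4,maxLen=2000)
print(generated)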

Explanation

  • The above function takes in three parameters: the starting word from which you want to generate the text, the value of K, and the maximum number of characters to generate.
  • If you run the code, you'll get a speech that starts with "dear" followed by 2000 generated characters.

While the speech likely doesn't make much sense, the words are all fully formed and generally mimic familiar patterns in words.
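
If you don't have a training file on hand, the whole pipeline can be tried on a tiny in-memory corpus. The sketch below is a condensed, self-contained variant under stated assumptions: build_model and generate are hypothetical names, the corpus is the short example string from step 1, and a seed is used so that runs are repeatable.

```python
import random
from collections import Counter, defaultdict

def build_model(text, k):
    """Count which character follows each k-character window, then normalize to probabilities."""
    counts = defaultdict(Counter)
    for i in range(len(text) - k):
        counts[text[i:i + k]][text[i + k]] += 1
    return {
        x: {y: n / sum(ys.values()) for y, n in ys.items()}
        for x, ys in counts.items()
    }

def generate(model, start, k, max_len, seed=None):
    """Append max_len characters to start, one sampled character at a time."""
    rng = random.Random(seed)
    out = start
    for _ in range(max_len):
        row = model.get(out[-k:])
        if row is None:
            out += " "  # unseen context: same fallback as sample_next
            continue
        out += rng.choices(list(row), weights=list(row.values()), k=1)[0]
    return out

corpus = "the man was, they, then, the, the"  # tiny stand-in for train_corpus.txt
toy_model = build_model(corpus, k=3)
sample = generate(toy_model, "the", k=3, max_len=40, seed=42)
print(sample)
print(len(sample))  # 3 starting characters + 40 generated = 43
```

With such a small corpus the output is mostly recombined fragments of the training string, which is a quick way to see why a longer document gives more varied text.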

What to learn next

Congratulations on completing this text generation project. You now have hands-on experience with Natural Language Processing and Markov chain models to use as you continue your deep learning journey.

Your next steps are to adapt the project to produce more understandable output or to try some more awesome machine learning projects like:

  • Pokemon classification system
  • Emoji predictor using NLP
  • Text decryption using recurrent neural network

To walk you through these projects and more, Educative has created Building Advanced Deep Learning and NLP Projects. This course gives you the chance to practice advanced deep learning concepts as you complete interesting and unique projects like the one we did today. By the end, you'll have the experience to use any of the top deep learning algorithms on your own projects.

Happy learning!
