Natural language processing (NLP) and deep learning are growing in popularity for their use in ML technologies like self-driving cars and speech recognition software.
As more companies begin to implement deep learning components and other machine learning practices, the demand for software developers and data scientists with proficiency in deep learning is skyrocketing.
Today, we will introduce you to a popular deep learning project, the Text Generator, to familiarize you with important, industry-standard NLP concepts, including Markov chains.
By the end of this article, you'll understand how to build a Text Generator component for search engine systems and know how to implement Markov chains for faster predictive models.
Here’s what we’ll cover today:
- Introduction to the Text Generator Project
- What are Markov chains?
- Text Generation Project Implementation
- What to learn next
Introduction to the Text Generator Project
Text generation is popular across nearly every industry, especially in mobile, app development, and data science. Even journalism uses text generation to aid writing processes.
You’ve probably encountered text generation technology in your day-to-day life. iMessage text completion, Google search, and Google’s Smart Compose on Gmail are just a few examples. These skills are valuable for any aspiring data scientist.
Today, we are going to build a text generator using Markov chains. This will be a character-based model that takes the previous character of the chain and generates the next letter in the sequence.
By training our program with sample words, our text generator will learn common patterns in character order. The text generator will then apply these patterns to the input, an incomplete word, and output the character with the highest probability to complete that word.
Let’s suppose we have a string, "monke". We need to find the character that is best suited after the character "e" in the word "monke" based on our training corpus.
Our text generator would determine that "y" sometimes follows "e" and would complete the word. In other words, we are going to generate the next character for that given string.
The text generator project relies on text generation, a subdivision of natural language processing that predicts and generates the next characters based on previously observed patterns in language.
Without NLP, we'd have to create a table of all words in the English language and match the passed string to an existing word. There are two problems with this approach.
- It would be very slow to search thousands of words.
- The generator could only complete words that it had seen before.
NLP allows us to dramatically cut runtime and increase versatility because the generator can complete words it hasn’t even encountered before. NLP can be expanded to predict words, phrases, or sentences if needed!
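To make the two problems with the table approach concrete, here is a minimal sketch of a lookup-based completer. The word list `KNOWN_WORDS` and the function `complete_by_lookup` are hypothetical names invented for this illustration; they are not part of the project code.

```python
# Hypothetical word list standing in for "a table of all words".
KNOWN_WORDS = ["monkey", "money", "common", "hello"]

def complete_by_lookup(prefix, words=KNOWN_WORDS):
    """Return the first known word that starts with prefix, or None."""
    for word in words:  # linear scan: cost grows with the size of the table
        if word.startswith(prefix):
            return word
    return None  # prefixes of unseen words cannot be completed at all

print(complete_by_lookup("monke"))  # found in the table
print(complete_by_lookup("donke"))  # not in the table, so no completion
```

The scan touches every word in the worst case, and a prefix that matches no stored word gets no answer, which is the gap the character-level model is meant to fill.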
For this project, we will specifically be using Markov chains to complete our text. Markov processes are the basis for many NLP projects involving written language and simulating samples from complex distributions.
Markov processes are so powerful that they can be used to generate superficially real-looking text with only a sample document.
What are Markov Chains?
A Markov chain is a stochastic process that models a sequence of events in which the probability of each event depends on the state of the previous event. The model requires a finite set of states with fixed conditional probabilities of moving from one state to another.
The probability of each shift depends only on the previous state of the model, not the entire history of events.
For example, imagine you wanted to build a Markov chain model to predict weather conditions.
We have two states in this model, sunny or rainy. There is a higher probability (70%) that it'll be sunny tomorrow if we've been in the sunny state today. The same is true for rainy: if it has been rainy, it will most likely continue to rain.
However, it's possible (30%) that the weather will shift states, so we also include that in our Markov chain model.
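The weather example can be sketched in a few lines of Python. This is only an illustration: the names `TRANSITIONS`, `next_state`, and `simulate` are made up here, and the 70% figure for staying rainy is an assumption based on "the same is true for rainy".

```python
import random

# Transition probabilities from the example: stay with 70%, switch with 30%.
# The rainy row mirrors the sunny row, which is an assumption.
TRANSITIONS = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"rainy": 0.7, "sunny": 0.3},
}

def next_state(current, rng=random):
    """Pick tomorrow's weather using only today's state."""
    states = list(TRANSITIONS[current].keys())
    weights = list(TRANSITIONS[current].values())
    return rng.choices(states, weights=weights, k=1)[0]

def simulate(start, days, seed=0):
    """Simulate `days` transitions starting from `start`."""
    rng = random.Random(seed)  # seeded so repeated runs give the same chain
    chain = [start]
    for _ in range(days):
        chain.append(next_state(chain[-1], rng))
    return chain

print(simulate("sunny", 7))
```

Note that `next_state` never looks further back than `chain[-1]`, which is exactly the Markov property described above.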
The Markov chain is a good fit for our text generator because our model will predict the next character using only the current state, which is the previous character (or, as in the code below, the previous K characters). The advantage of using a Markov chain is that it's reasonably accurate for short-range patterns, light on memory (it only stores one previous state), and fast to execute.
Text Generation Project Implementation
We'll complete our text generator project in 6 steps:
- Generate the lookup table: Create a table to record word frequency
- Convert frequency to probability: Convert our findings to a usable form
- Load the dataset: Load and utilize a training set
- Build the Markov chains: Use the probabilities to create chains for each word and character
- Sample our data: Create a function to sample individual sections of the corpus
- Generate text: Test our model
1. Generate the lookup table
First, we'll create a table that records the occurrences of each character state within our training corpus.
We will take each run of K consecutive characters from the training corpus, together with the character that follows it (the K+1 character), and save them in a lookup table.
For example, imagine our training corpus contained, "the man was, they, then, the, the".
Then the number of occurrences by word would be:
- "the" - 3
- "then" - 1
- "they" - 1
- "man" - 1
- "was" - 1
Here's what that would look like in a lookup table:
In the example above, we have taken K = 3. Therefore, we'll consider 3 characters at a time and take the next character (K+1) as our output character.
In the above lookup table, we have the word (X) as "the" and the output character (Y) as a single space (" "). We have also calculated how many times this sequence occurs in our dataset, 3 in this case.
We'll find this data for each word in the corpus to generate all possible pairs of X and Y within the dataset.
Here's how we'd generate a lookup table in code:
def generateTable(data,k=4):

    T = {}  # maps each k-character string X to {next character Y: count}
    for i in range(len(data)-k):
        X = data[i:i+k]  # current window of k characters
        Y = data[i+k]    # the character that follows the window
        #print("X %s and Y %s "%(X,Y))

        if T.get(X) is None:
            T[X] = {}
            T[X][Y] = 1
        else:
            if T[X].get(Y) is None:
                T[X][Y] = 1
            else:
                T[X][Y] += 1

    return T

T = generateTable("hello hello helli")
print(T)
Explanation
- We created a dictionary, T, that stores each X with its corresponding Y and frequency value. Try running the above code and see the output.
- In the if/else block, we checked for the occurrence of X and Y. If we already have the X and Y pair in our lookup dictionary, we just increment its count by 1; otherwise, we add the pair with a count of 1.
2. Convert frequencies to probabilities
Once we have this table and the occurrences, we'll generate the probability that an occurrence of Y will appear after an occurrence of a given X. Our equation for this will be:
P(Y | X) = (frequency of Y after X) / (total frequency of all characters after X)
For example, if X = the and Y = n, our equation would look like this:
- Frequency that Y = n when X = the: 2
- Total frequency of all characters after X = the: 8
- Therefore: P = 2/8 = 0.25 = 25%
Here's how we'd apply this equation to convert our lookup table to probabilities usable with Markov chains:
def convertFreqIntoProb(T):
    for kx in T.keys():
        s = float(sum(T[kx].values()))  # total count of characters seen after kx
        for k in T[kx].keys():
            T[kx][k] = T[kx][k]/s  # replace each count with its share of the total
    return T

T = convertFreqIntoProb(T)
print(T)
Explanation
- We summed up the frequency values for a particular key and then divided each frequency value of that key by that summed value to get our probabilities. Simple logic!
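A quick way to sanity-check this step is to confirm that the probabilities for every X add up to 1. The sketch below is self-contained and uses a small hand-written frequency table taken from the "hello hello helli" example; `to_probabilities` is a hypothetical, non-mutating variant written only for this check.

```python
import math

def to_probabilities(freq_table):
    """Return a new table where each row of counts is divided by its row total."""
    prob_table = {}
    for x, counts in freq_table.items():
        total = sum(counts.values())
        prob_table[x] = {y: c / total for y, c in counts.items()}
    return prob_table

# Two rows of the frequency table for "hello hello helli" with k = 4.
freq = {"hell": {"o": 2, "i": 1}, "ello": {" ": 2}}
probs = to_probabilities(freq)

for x, row in probs.items():
    # Each row is a probability distribution over the next character.
    assert math.isclose(sum(row.values()), 1.0)

print(probs["hell"])
```

Unlike `convertFreqIntoProb`, this version leaves the input counts untouched, which can be convenient if you want to keep the raw frequencies around for inspection.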
3. Load the dataset
Next, we'll load our real training corpus. You can use any long text (.txt) document that you want.
We'll use a political speech to provide enough words to teach our model.
text_path = "train_corpus.txt"

def load_text(filename):
    with open(filename,encoding='utf8') as f:
        return f.read().lower()

text = load_text(text_path)
print('Loaded the dataset.')
This dataset will give our generator enough occurrences to make reasonably accurate predictions. As with most machine learning models, larger training corpora will generally result in more accurate predictions.
4. Build the Markov chains
Now let's construct our Markov chains and associate the probabilities with each character. We'll use the generateTable() and convertFreqIntoProb() functions created in step 1 and step 2 to build the Markov models.
def MarkovChain(text,k=4):
    T = generateTable(text,k)
    T = convertFreqIntoProb(T)
    return T

model = MarkovChain(text)
print('Model Created Successfully!')
Explanation
- On line 1, we created a method to generate the Markov model. This method accepts the text corpus and the value of K, which tells the Markov model to consider K characters and predict the next character.
- On line 2, we generated our lookup table by providing the text corpus and K to our method, generateTable(), which we created in step 1.
- On line 3, we converted the frequencies into probabilities by using the method, convertFreqIntoProb(), which we created in step 2.
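The choice of K controls how many distinct contexts the model has to store. As a rough, self-contained sketch, the hypothetical helper `count_contexts` below counts the distinct K-character windows in a tiny stand-in string (not the real training corpus).

```python
def count_contexts(text, k):
    """Number of distinct k-character contexts that have a following character."""
    return len({text[i:i + k] for i in range(len(text) - k)})

sample = "hello hello helli"  # tiny stand-in corpus, for illustration only
for k in (1, 2, 4):
    print(k, count_contexts(sample, k))
```

On a real corpus, larger values of K typically produce many more contexts with fewer examples each, so K is a trade-off between coherent output and how much training text is needed.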
5. Sample the text
Now, we'll create a sampling function that takes the unfinished word (ctx), the Markov chains model from step 4 (model), and the number of characters used to form the word's base (k).
We'll use this function to look up the passed context and return a likely next character, sampled according to its probability.
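import numpy as np

def sample_next(ctx,model,k):
    ctx = ctx[-k:]  # keep only the last k characters
    if model.get(ctx) is None:
        return " "  # unseen context: fall back to a space
    possible_Chars = list(model[ctx].keys())
    possible_values = list(model[ctx].values())
    print(possible_Chars)   # debug output; remove before generating long text
    print(possible_values)  # debug output; remove before generating long text

    return np.random.choice(possible_Chars,p=possible_values)

sample_next("commo",model,4)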
Explanation
- The function, sample_next(ctx,model,k), accepts three parameters: the context, the model, and the value of K.
- The ctx is the text that will be used to generate some new text. However, only the last K characters from the context will be used by the model to predict the next character in the sequence.
- For example, we passed the value of context as "commo" and the value of K = 4, so the context that the model uses to generate the next character is K characters long; hence, it will be "ommo", because the Markov model only looks at the most recent history. You can see the value of the context variable by printing it too.
- With the two print() calls, we printed the possible characters and their probability values, which are also present in our model. We got the next predicted character as "n", and its probability is 1.0. This makes sense because "commo" is most likely to become "common" once the next character is generated.
- In the final return statement, we returned a character sampled according to these probabilities.
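The function above samples at random according to the probabilities. If you instead want the single most probable character, as described in the project introduction, a greedy variant could look like the sketch below. The name `most_likely_next` and the small `toy_model` are hypothetical and exist only for this illustration.

```python
def most_likely_next(ctx, model, k):
    """Return the highest-probability next character, or None if the context is unseen."""
    ctx = ctx[-k:]  # same truncation as sample_next
    row = model.get(ctx)
    if row is None:
        return None
    return max(row, key=row.get)  # character with the largest probability

# Hypothetical toy model for illustration only.
toy_model = {"ommo": {"n": 1.0}, "hell": {"o": 2 / 3, "i": 1 / 3}}
print(most_likely_next("commo", toy_model, 4))  # n
print(most_likely_next("hell", toy_model, 4))   # o
```

Greedy selection is deterministic, which suits single-word completion, but when generating long text it can fall into repeating loops, which is one reason random sampling is used for that case.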
6. Generate text
Finally, we'll combine all the above functions to generate some text.
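def generateText(starting_sent,k=4,maxLen=1000):
    sentence = starting_sent
    ctx = starting_sent[-k:]  # only the last k characters are used as context
    for ix in range(maxLen):
        next_prediction = sample_next(ctx,model,k)  # uses the global model from step 4
        sentence += next_prediction
        ctx = sentence[-k:]  # slide the context window forward
    return sentence

print("Function Created Successfully!")
# Use a separate name so the training corpus stored in text is not overwritten.
generated_text = generateText("dear",k=4,maxLen=2000)
print(generated_text)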
Explanation
- The above function takes in three parameters: the starting text from which you want to generate, the value of K, and the maximum number of characters to generate.
- If you run the code, you'll get a speech that starts with "dear", followed by 2000 generated characters.
While the speech likely doesn't make much sense, most of the words are fully formed and generally mimic familiar patterns in words.
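If you don't have a training file at hand, the whole pipeline can be tried on a tiny in-memory corpus. The following is a condensed, self-contained sketch rather than the project code: `build_model` and `generate` are hypothetical names, the corpus is the short example string from step 1, and the standard library's `random.choices` stands in for `np.random.choice`.

```python
import random

def build_model(text, k):
    """Map each k-character context to {next_char: probability}."""
    counts = {}
    for i in range(len(text) - k):
        row = counts.setdefault(text[i:i + k], {})
        row[text[i + k]] = row.get(text[i + k], 0) + 1
    return {x: {y: c / sum(r.values()) for y, c in r.items()}
            for x, r in counts.items()}

def generate(seed_text, model, k, max_len, rng):
    """Append max_len sampled characters to seed_text."""
    out = seed_text
    for _ in range(max_len):
        row = model.get(out[-k:])
        if row is None:
            out += " "  # same fallback as sample_next for unseen contexts
        else:
            out += rng.choices(list(row), weights=list(row.values()), k=1)[0]
    return out

corpus = "the man was, they, then, the, the"  # tiny stand-in corpus
model = build_model(corpus, k=3)
result = generate("the", model, k=3, max_len=20, rng=random.Random(42))
print(result)
print(len(result))  # 23: the 3 seed characters plus 20 generated ones
```

With a corpus this small the output is mostly recombined fragments of the input, but it shows that the final length is the seed length plus the number of generated characters.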
What to learn next
Congratulations on completing this text generation project. You now have hands-on experience with Natural Language Processing and Markov chain models to use as you continue your deep learning journey.
Your next steps are to adapt the project to produce more understandable output or to try some more awesome machine learning projects like:
- Pokemon classification system
- Emoji predictor using NLP
- Text decryption using recurrent neural network
To walk you through these projects and more, Educative has created Building Advanced Deep Learning and NLP Projects. This course gives you the chance to practice advanced deep learning concepts as you complete interesting and unique projects like the one we did today. By the end, you'll have the experience to use any of the top deep learning algorithms on your own projects.
Happy learning!