Marco Steinke

Text Prediction using Bigrams and Markov Models

1. Introduction:

For the implementation of text prediction I am using the concept of Markov models, which allows me to calculate the probabilities of consecutive events. I will first explain what a Markov model is.

2. Hidden Markov Model:

To explain a Markov Model it is important to start by understanding what a Markov Chain is.

2.1 Markov-Chain

A Markov chain is a stochastic model which describes a sequence of random variables. It is assumed that future states only depend on a limited number of previous states (Markov assumption).

Let q1, q2, …, qn be a sequence of states. With the Markov assumption you can assume:

P(qn | q1, q2, …, qn-1) = P(qn | qn-1)

So instead of conditioning on all previous states of the sequence, you may only look at the directly preceding state and end up with the same result.
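
For example, for a sequence of four states this means P(q4 | q1, q2, q3) = P(q4 | q3): the fourth state only depends on the state directly before it.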

3. Text prediction:

After introducing the idea of hidden Markov models and Markov chains, I can now proceed and present the application of this concept.

3.1 N-grams:

Let the following sentence be the running example for this section.

"On a rainy day please carry an umbrella."

Following the idea of Markov chains, you are interested in the following probabilities:

  • P( a | On )

  • P( rainy | a )

  • P( day | rainy )

  • P( please | day )

  • P( carry | please )

  • P( an | carry )

  • P( umbrella | an )

Therefore you will have to understand the structure of each probability:

P(day | rainy) defines the probability of the word "day" following the word "rainy" in a given input. You can read it using the following synonyms:

P(day | rainy) → next = day, previous = rainy.

To get this probability, you have to count how often the value of previous occurs in the given input and then count how often the value of next follows the value of previous in the given input.

The quotient of these two counts gives you the probability:

P(next | previous) = count(previous next) / count(previous)

This becomes clear using the following quote:

"To be or not to be"

What is the probability P(be | to)?

Ignoring the casing of the words (you could format all words to lowercase beforehand), you will find "to" occurring 2 times in the given input and "be" following "to" 2 times, so you get the probability P(be | to) = 2/2 = 1.

Following this scheme you can now calculate the probabilities for all pairs of words.
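
As a minimal standalone sketch of this counting scheme (not the implementation used later in this post; the function name and the token list are only chosen for illustration), this could look as follows in JavaScript:

// Minimal sketch: P(next | previous) = count(previous next) / count(previous).
// Assumes `tokens` is the input text already split into lowercase words.
function bigramProbability(tokens, previous, next) {
    let previousCount = 0;
    let pairCount = 0;
    for (let i = 0; i < tokens.length; i++) {
        if (tokens[i] === previous) {
            previousCount++;
            if (tokens[i + 1] === next) {
                pairCount++;
            }
        }
    }
    return previousCount === 0 ? 0 : pairCount / previousCount;
}

// "To be or not to be": P(be | to) = 2 / 2 = 1
console.log(bigramProbability(["to", "be", "or", "not", "to", "be"], "to", "be"));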

If you use this type of probability, where exactly two words are required to calculate it, then you are using bigrams or 2-grams.

You can generalize this idea by defining N-grams, where an N-gram uses the probability which you can again find in the Markov chain:

P(qn | q1, q2, …, qn-1)

In the text example, you would find a probability such as:

P(be | To be or not to) = 1

where the next-value is a single word and the previous-value is a sequence of words of length n-1.
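
As a rough sketch of this generalization (a hypothetical helper, not part of the implementation shown below), you could collect the (previous, next) pairs of an N-gram model like this:

// Sketch: collect (previous, next) pairs for an N-gram model, where
// `previous` is a sequence of n-1 words and `next` is the following word.
// Assumes `tokens` is the already split input and `n` is the N-gram size.
function buildNGrams(tokens, n) {
    const ngrams = [];
    for (let i = 0; i + n <= tokens.length; i++) {
        ngrams.push({
            previous: tokens.slice(i, i + n - 1).join(" "),
            next: tokens[i + n - 1]
        });
    }
    return ngrams;
}

// For ["to", "be", "or", "not", "to", "be"] and n = 3 this yields pairs
// such as { previous: "to be", next: "or" } and { previous: "not to", next: "be" }.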

3.2 Application of N-grams to text prediction:

Now to predict text you will need the following components:

3.2.1 Input:

You will need any sort of textarea or predefined string (for texts it may be a multiline string or a list). This input has to be processed and stored, such that any algorithm can iterate through the result and build bigrams or N-grams from it.

In this implementation I read the input from a textarea and define an array of separators:

static SEPARATORS = ['.', ',', '!', '?', '', "\n"];
Those separators will be used to determine when a sentence or paragraph ends, so that no bigram is instantiated from the input at that location.

Example:
"One apple please. I also want a bag."

Without the separators, this would lead to the construction of the bigram (I, please) with P(I | please), even though "I" does not really follow "please" within a sentence of the given input.

To avoid such wrong bigrams, I check each word to make sure it is not a separator.
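
The check itself is not shown in this post; a minimal version of it (the actual method in the repository may differ) could simply look the word up in the SEPARATORS array:

// Sketch of the separator check used below: a word counts as a separator
// if it is contained in the SEPARATORS array defined above.
static isSeparator(word) {
    return Bigram.SEPARATORS.includes(word);
}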

After checking for separators you can iterate through the text and process the input:

static formatInput(input) {
    // (1) Normalize the separators: remove the spaces after them and
    // surround each separator with "^", so the input can be split at "^"
    // while keeping the separators as their own tokens.
    Bigram.input = [];

    Bigram.input = input
        .replaceAll("^", "$")
        .replaceAll(", ", ",")
        .replaceAll(". ", ".")
        .replaceAll("! ", "!")
        .replaceAll("? ", "?")
        .replaceAll(",", "^,^")
        .replaceAll(".", "^.^")
        .replaceAll("!", "^!^")
        .replaceAll("?", "^?^")
        .replaceAll("\n ", "\n")
        .replaceAll("\n", "^\n^")
        .replaceAll(" ", "^")
        .split("^");

    // (2) Drop trailing separators, so the processed input does not
    // end with a separator that has no following word.
    while (Bigram.input.length > 0 && Bigram.isSeparator(Bigram.input[Bigram.input.length - 1])) {
        Bigram.input.pop();
    }
}

In the first part of the method (1) I replace every separator that is followed by a space with the same separator without the space. Then I surround each separator with "^", so I can later split the input at "^" and still retain the separators in my processed data.

Afterwards I check the end of the data for a separator (2), because I do not want any unexpected behaviour of the algorithm at the end of the input, where a separator could be the last token with no word following it.

This would result in the following:

Input: "To be, or not to be."

Output (1):

["to", "be", ",", "or", "not", "to", "be", "."]

And after (2) has finished, it would return:

Output (2):

["to", "be", ",", "or", "not", "to", "be"]

So now I have the input as a list with the separators retained inside the text, but without the unnecessary separator at the end of the input.

3.2.2 Successors and the input:

To construct bigrams from the input, I first wanted to count each word and store the count in a map. The map would have looked like this:

const wordCountMap = new Map();
wordCountMap.set("One", 3);

This would map each word to its count.

But thinking ahead in the implementation, I realized that I would throw away a very important piece of information when only storing the count of each word.

BECAUSE: I would still have to iterate through the processed input over and over again to check each word for its successor.
This led me to the idea of replacing my original problem with an equivalent algorithmic problem, which only requires numbers and no information about the words.

Problem of successor indices:
Given two sets of integers, count how many integers from the first set have their direct successor (the integer plus one) contained in the second set.

Example: Given A = {4, 11, 19, 27} and B = {5, 20}

You would iterate through A and, for each integer from A, check whether B contains the successor of the current integer. If yes, increment a counter and continue with the next integer from A; if no, directly continue with the next integer from A.

In this case it would lead to a counter of 2, because B[0] = 5 = 4 + 1 = A[0] + 1 and B[1] = 20 = 19 + 1 = A[2] + 1.

If you now divide this counter by the size (cardinality) of A, you get the probability that B contains the successor of an integer from A.

This numerical problem can now be represented by using the location (index) of words in a text (list).
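
Expressed as a small standalone sketch (the function name is only chosen for illustration), the counting from the example above looks like this:

// Index successor problem: count how many integers a in A have their
// successor a + 1 contained in B, then divide by the size of A.
function successorProbability(A, B) {
    let counter = 0;
    for (const a of A) {
        if (B.includes(a + 1)) {
            counter++;
        }
    }
    return counter / A.length;
}

// A = {4, 11, 19, 27}, B = {5, 20}: 5 = 4 + 1 and 20 = 19 + 1,
// so the counter is 2 and the result is 2 / 4 = 0.5.
console.log(successorProbability([4, 11, 19, 27], [5, 20]));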

3.2.3 Constructing Bigrams:

To construct bigrams you have to iterate through the list and, in each iteration step, look at the current value and its successor, or, in code:

static generateBigrams() {
    let existingBigramHashes = [];
    Bigram.bigrams = [];

    if(Bigram.hasWordsCounted()) {
        for(let i = 0; i < Bigram.input.length - 1; i++) {
            // skip pairs that contain a separator
            if(!(Bigram.SEPARATORS.includes(Bigram.input[i+1]) || Bigram.SEPARATORS.includes(Bigram.input[i]))) {
                // skip duplicate bigrams
                if(!existingBigramHashes.includes(Bigram.input[i+1] + Bigram.input[i])) {
                    Bigram.bigrams.push(new Bigram(Bigram.input[i+1], Bigram.input[i]));
                    existingBigramHashes.push(Bigram.input[i+1] + Bigram.input[i]);
                }
            }
        }
    }
}

This piece of code basically looks at Bigram.input[i] and Bigram.input[i+1] and constructs a bigram from it, using the following constructor:

constructor(next, previous) {
    this.next = next;
    this.previous = previous;
    // Will be explained later !
    this.findProbability();
}

It also checks for duplicate bigrams by comparing the current input[i] and input[i+1] to those of all previously constructed bigrams. Lastly, it makes sure that no bigram includes any of the separators.

Now that the bigrams are constructed, the only remaining task is computing the probabilities for all bigrams.

3.2.4 Computing the probabilities (or solving the index successor problem):

At this point you already know that, instead of counting the occurrences of a bigram's previous word and then iterating through the text to count how often the bigram's next word follows it, you can also solve the index successor problem defined above.

Now instead of counting the words I want to map each word to the indices at which the word occurs in the input:

static countWords() {
    Bigram.wordCountMap.clear();
    let i = 0;
    if(Bigram.hasInput() && Bigram.isFormatted()) {
        Bigram.input.forEach(
            (word) => {
                if(!Bigram.wordCountMap.has(word)) {
                    // first occurrence: store a new list with the current index
                    Bigram.wordCountMap.set(word, [i++]);
                } else {
                    // further occurrences: append the current index to the list
                    let tmpArr = Bigram.wordCountMap.get(word);
                    tmpArr.push(i++);
                    Bigram.wordCountMap.set(word, tmpArr);
                }
            }
        )
    }
}

In general, this method iterates through the processed input, checks if a certain word is already stored in the map and then distinguishes two cases:

  • If the word is not in the map yet, put it into the map and store a list with the current position in the text as value.

  • If the word is already in the map, load the list from the map and add the current position to it.

Let the following be the input:

"One Ring to rule them all, One Ring to find them, One Ring to bring them all, and in the darkness bind them"

For the given input you would then receive the following map:

Map {'One' => Array(3), 'Ring' => Array(3), 'to' => Array(3), 'rule' => Array(1), 'them' => Array(4), }
[[Entries]]
0: {"One" => Array(3)}
key: "One"
value: (3) [0, 7, 15]
1: {"Ring" => Array(3)}
key: "Ring"
value: (3) [1, 8, 16]
2: {"to" => Array(3)}
key: "to"
value: (3) [2, 9, 17]
3: {"rule" => Array(1)}
key: "rule"
value: [3]
4: {"them" => Array(4)}
key: "them"
value: (4) [4, 11, 19, 27]
5: {"all" => Array(2)}
key: "all"
value: (2) [5, 20]
6: {"find" => Array(1)}
key: "find"
value: [10]
8: {"bring" => Array(1)}
key: "bring"
value: [18]
9: {"and" => Array(1)}
key: "and"
value: [22]
10: {"in" => Array(1)}
key: "in"
value: [23]
11: {"the" => Array(1)}
key: "the"
value: [24]
12: {"darkness" => Array(1)}
key: "darkness"
value: [25]
13: {"bind" => Array(1)}
key: "bind"
value: [26]

And as you can see there is an array including the positions (indices) for each word of the input.

To calculate the probabilities for all bigrams, we now have to iterate through the list of already constructed bigrams and run the algorithm explained before on each of them.

So for each bigram run:

findProbability() {
    let sum = 0;
    // count how many indices of the previous word have a direct successor
    // among the indices of the next word
    Bigram.wordCountMap.get(this.previous).forEach(
        (index) => {
            if(Bigram.wordCountMap.get(this.next).includes(index + 1)) {
                sum++;
            }
        }
    )

    // divide by the number of occurrences of the previous word
    this.probability = sum / Bigram.wordCountMap.get(this.previous).length;
}

This will, as explained before, take the array of indices of the previous word and count for how many of them the indices array of the next word contains a direct successor.
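
For example, take the bigram (all, them): the previous word "them" occurs at the indices [4, 11, 19, 27] and the next word "all" at [5, 20]. Only 5 = 4 + 1 and 20 = 19 + 1 are direct successors, so sum is 2 and the probability is 2 / 4 = 0.5, which matches P(all | them) = 0.5 in the list below.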

The final result can be seen in the following probabilities:

  • P(Ring | One) = 1

  • P(to | Ring) = 1

  • P(rule | to) = 0.3333333333333333

  • P(them | rule) = 1

  • P(all | them) = 0.5

  • P(find | to) = 0.3333333333333333

  • P(them | find) = 1

  • P(bring | to) = 0.3333333333333333

  • P(them | bring) = 1

  • P(in | and) = 1

  • P(the | in) = 1

  • P(darkness | the) = 1

  • P(bind | darkness) = 1

  • P(them | bind) = 1
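
To tie the pieces together, a possible way to run the methods from this post could look like the following sketch (the exact call order is my assumption; see the linked repository for the real wiring):

// Assumed call order (sketch): format the raw input, collect the word
// indices, build the bigrams and print the probability of each bigram.
Bigram.formatInput("One Ring to rule them all, One Ring to find them, One Ring to bring them all, and in the darkness bind them");
Bigram.countWords();
Bigram.generateBigrams();

Bigram.bigrams.forEach(
    (bigram) => console.log("P(" + bigram.next + " | " + bigram.previous + ") = " + bigram.probability)
);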

You have now learned how to predict the probability of a word following another word from a given input.
In the next part you will see how all of this can be used to generate text by using those probabilities!

If you want to have a look at the actual implementation, feel free to look at the files included in this directory or to visit:

https://bestofcode.net/Applications/text-prediction !

Source-code can be found here: https://github.com/MarcoSteinke/Machine-Learning-Concepts/tree/main/implementation/1.%20text-prediction

Thank you :)
