Rust Keyword Extraction: Creating the Co-Occurrence Matrix algorithm from scratch

Disclosure: The material in this Tutorial has not been reviewed, endorsed, or approved of by the Rust Foundation. For more information on the Rust Foundation Trademark Policy, click here.

Series Intro

You know, like many of you, I was blown away by ChatGPT, especially GPT-4, which reignited my love for machine learning after a two-year break. But once I was back, I hit the same issue that made me quit: slow training times.

I tried unsupervised NLP models and keyword extraction, but even those were too slow for a simple VPS. That is when I realised Rust and Actix-Web could be the answer.

Therefore, let's explore Rust for NLP together and dive into the awesome world of text mining and keyword extraction.

What will I cover in this series

I will mainly cover 3 algorithms:

  1. TF-IDF: the Term Frequency-Inverse Document Frequency algorithm converts raw text into insightful numerical data.
  2. Co-occurrence Matrix: the Co-occurrence Matrix uncovers underlying patterns and associations in a language by examining word pairs and building semantic networks in textual data.
  3. RAKE: the Rapid Automatic Keyword Extraction algorithm identifies crucial words and phrases to enhance text summarization, topic modeling, and information retrieval in any text-based project.

Introduction

Before we dig into implementing the Co-Occurrence Matrix algorithm, it's necessary to grasp its concept. This algorithm generates a matrix detailing the frequency of word pairs occurring together within a specified context size across one or multiple documents. For example, with a window size of 1, in the document "rust makes nlp fast" the word "nlp" co-occurs once with "makes" and once with "fast", so both of those cells in its matrix row are incremented.

Steps to implement Co-Occurrence Matrix

Note: if you're short on time and can't read the entire article, all the code is available in this repository. Also, consider buying me a coffee if you find my content helpful.

The Co-Occurrence Matrix algorithm is fairly straightforward to implement:

  1. Begin by creating a hashmap to index all words;
  2. Calculate the matrix:
    • Split each document into vectors of individual words;
    • Iterate over each word to compute the context;
    • Increment the co-occurrence count for each word index;
    • Normalize the matrix by dividing each value by the maximum value found in the matrix.

These steps outline a basic approach to constructing a Co-Occurrence Matrix, an invaluable tool in text analysis and natural language processing for understanding the relationships between words in a corpus.

Implementation

Start by creating a file called co_occurrence.rs and add the following struct:

use std::collections::HashMap;

pub struct CoOccurrence {
    matrix: Vec<Vec<f32>>,
    words: Vec<String>,
    words_indexes: HashMap<String, usize>,
}

Helper functions

Create a function to build the word index: it takes a slice of words, enumerates through them, and maps each word to its position:

// ...

fn create_words_indexes(words: &[String]) -> HashMap<String, usize> {
    words
        .iter()
        .enumerate()
        .map(|(i, w)| (w.to_string(), i))
        .collect::<HashMap<String, usize>>()
}
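As a quick sanity check, here is a hypothetical snippet (not part of the final module) showing the mapping the helper produces:

fn main() {
    let words = vec!["rust".to_string(), "nlp".to_string(), "fast".to_string()];
    let indexes = create_words_indexes(&words);

    // Each word is mapped to its position in the input slice
    assert_eq!(indexes.get("rust"), Some(&0));
    assert_eq!(indexes.get("nlp"), Some(&1));
    assert_eq!(indexes.get("fast"), Some(&2));
}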

Create a function to calculate the context by determining the window range:

use std::{collections::HashMap, ops::Range};
// ...

fn get_window_range(window_size: usize, index: usize, words_length: usize) -> Range<usize> {
    let window_start = index.saturating_sub(window_size);
    let window_end = (index + window_size + 1).min(words_length);
    window_start..window_end
}
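A few hypothetical calls illustrate how the range is clamped at the document boundaries (assuming a 10-word document and a window size of 2):

fn main() {
    // At the start of the document the window is clipped on the left
    assert_eq!(get_window_range(2, 0, 10), 0..3);
    // In the middle it spans two words on each side of the index
    assert_eq!(get_window_range(2, 5, 10), 3..8);
    // At the end it is clipped on the right
    assert_eq!(get_window_range(2, 9, 10), 7..10);
}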

Creating the Matrix

To create the matrix, we need the following parameters:

  • documents: the documents to be analyzed;
  • words_indexes: the hash map of the word indexes;
  • length: the number of words in the vocabulary, which is the dimension of the matrix;
  • window_size: the size of the context.

Create the function to initialize the matrix with zeros:

// ...

fn get_matrix(
    documents: &[String],
    words_indexes: &HashMap<String, usize>,
    length: usize,
    window_size: usize,
) -> Vec<Vec<f32>> {
    let mut matrix = vec![vec![0.0_f32; length]; length];
    // ...
}

Iterate through each document and split it into words:

// ...

fn get_matrix(
    documents: &[String],
    words_indexes: &HashMap<String, usize>,
    length: usize,
    window_size: usize,
) -> Vec<Vec<f32>> {
    // ...


    documents.iter().for_each(|doc| {
        let doc_words = doc.split_whitespace().collect::<Vec<&str>>();
        // ...
    });
}

Note: we separate the document's words into a distinct variable instead of using flat_map directly because we require access to the entire vector of words. This access is essential for calculating the window range (context) around each word.

Enumerate through the words to get the range:

// ...

fn get_matrix(
    documents: &[String],
    words_indexes: &HashMap<String, usize>,
    length: usize,
    window_size: usize,
) -> Vec<Vec<f32>> {
    // ...

    documents.iter().for_each(|doc| {
        // ...

        doc_words
            .iter()
            .enumerate()
            .filter_map(|(i, w)| words_indexes.get(*w).map(|first_index| (i, *first_index)))
            .for_each(|(i, first_index)| {
                get_window_range(window_size, i, doc_words.len());
            });
    });
}

Filter and map the range to get the other words' indexes:

// ...

fn get_matrix(
    documents: &[String],
    words_indexes: &HashMap<String, usize>,
    length: usize,
    window_size: usize,
) -> Vec<Vec<f32>> {
    // ...

    documents.iter().for_each(|doc| {
        // ...

        doc_words
            // ...
            .for_each(|(i, first_index)| {
                get_window_range(window_size, i, doc_words.len())
                    .filter_map(|j| {
                        // If the indexes are the same, skip: a word does not co-occur with itself
                        if i == j {
                            return None;
                        }

                        doc_words
                            // Tries to get an Option<&&str> of the word by index
                            .get(j)
                            // Checks if the word is part of the matrix; no trailing
                            // semicolon, so the Option is returned from the closure
                            .and_then(|other_word| words_indexes.get(*other_word))
                    });
            });
    });
}

For each remaining index, increment the corresponding value in the matrix:

fn get_matrix(
    documents: &[String],
    words_indexes: &HashMap<String, usize>,
    length: usize,
    window_size: usize,
) -> Vec<Vec<f32>> {
    // ...

    documents.iter().for_each(|doc| {
        // ...

        doc_words
            // ...
            .for_each(|(i, first_index)| {
                get_window_range(window_size, i, doc_words.len())
                    // ...
                    .for_each(|other_index| {
                        matrix[first_index][*other_index] += 1.0;
                    });
            });
    });
}

We are still missing the normalization. Add a max variable at the top:

fn get_matrix(
    documents: &[String],
    words_indexes: &HashMap<String, usize>,
    length: usize,
    window_size: usize,
) -> Vec<Vec<f32>> {
    // ...
    let mut max = 0.0_f32;

   // ...
}

Then update it whenever a matrix cell is incremented:

fn get_matrix(
    documents: &[String],
    words_indexes: &HashMap<String, usize>,
    length: usize,
    window_size: usize,
) -> Vec<Vec<f32>> {
    // ...

    documents.iter().for_each(|doc| {
        // ...

        doc_words
            // ...
            .for_each(|(i, first_index)| {
                get_window_range(window_size, i, doc_words.len())
                    // ...
                    .for_each(|other_index| {
                        // ...
                        let current = matrix[first_index][*other_index];

                        if current > max {
                            max = current;
                        }
                    });
            });
    });
}

Normalize the matrix by dividing each value by the maximum value found (guarding against an all-zero matrix), and return it:

fn get_matrix(
    documents: &[String],
    words_indexes: &HashMap<String, usize>,
    length: usize,
    window_size: usize,
) -> Vec<Vec<f32>> {
    // ...

    // Guard against division by zero when no co-occurrences were found
    if max > 0.0 {
        matrix
            .iter_mut()
            .flat_map(|row| row.iter_mut())
            .for_each(|value| *value /= max);
    }

    matrix
}
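Since the function was built up piece by piece, here is the complete get_matrix with all the snippets above assembled into a single listing (including the division-by-zero guard added during normalization):

fn get_matrix(
    documents: &[String],
    words_indexes: &HashMap<String, usize>,
    length: usize,
    window_size: usize,
) -> Vec<Vec<f32>> {
    // The matrix of zeros and the running maximum used for normalization
    let mut matrix = vec![vec![0.0_f32; length]; length];
    let mut max = 0.0_f32;

    documents.iter().for_each(|doc| {
        let doc_words = doc.split_whitespace().collect::<Vec<&str>>();

        doc_words
            .iter()
            .enumerate()
            .filter_map(|(i, w)| words_indexes.get(*w).map(|first_index| (i, *first_index)))
            .for_each(|(i, first_index)| {
                get_window_range(window_size, i, doc_words.len())
                    .filter_map(|j| {
                        // A word does not co-occur with itself
                        if i == j {
                            return None;
                        }

                        // Keep only context words that are part of the vocabulary
                        doc_words
                            .get(j)
                            .and_then(|other_word| words_indexes.get(*other_word))
                    })
                    .for_each(|other_index| {
                        matrix[first_index][*other_index] += 1.0;
                        let current = matrix[first_index][*other_index];

                        if current > max {
                            max = current;
                        }
                    });
            });
    });

    // Guard against division by zero when no co-occurrences were found
    if max > 0.0 {
        matrix
            .iter_mut()
            .flat_map(|row| row.iter_mut())
            .for_each(|value| *value /= max);
    }

    matrix
}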

Putting it all together

Finally, implement a new method on the struct using the helper functions:

impl CoOccurrence {
    pub fn new(documents: &[String], words: &[String], window_size: usize) -> Self {
        // Create the word indexes
        let words_indexes = create_words_indexes(words);
        let length = words.len();

        Self {
            // Create the matrix
            matrix: get_matrix(documents, &words_indexes, length, window_size),
            words: words.to_vec(),
            words_indexes,
        }
    }
}

And create methods to access the private fields of the struct:

impl CoOccurrence {
    // ...

    /// Get the numeric label of a word.
    pub fn get_label(&self, word: &str) -> Option<usize> {
        self.words_indexes.get(word).map(|w| w.to_owned())
    }

    /// Get the word of a numeric label.
    pub fn get_word(&self, label: usize) -> Option<String> {
        self.words.get(label).map(|w| w.to_owned())
    }

    /// Get the matrix of the co-occurrence.
    pub fn get_matrix(&self) -> &Vec<Vec<f32>> {
        &self.matrix
    }

    /// Get the labels of the co-occurrence.
    pub fn get_labels(&self) -> &HashMap<String, usize> {
        &self.words_indexes
    }

    /// Get all relations of a given word.
    pub fn get_relations(&self, word: &str) -> Option<Vec<(String, f32)>> {
        let label = self.get_label(word)?;

        Some(
            self.matrix[label]
                .iter()
                .enumerate()
                .filter_map(|(i, &v)| {
                    if v > 0.0 {
                        if let Some(w) = self.get_word(i) {
                            return Some((w, v));
                        }
                    }

                    None
                })
                .collect::<Vec<(String, f32)>>(),
        )
    }

    /// Get the row of a given word.
    pub fn get_matrix_row(&self, word: &str) -> Option<Vec<f32>> {
        let label = self.get_label(word)?;
        Some(self.matrix[label].to_owned())
    }

    /// Get the relation between two words.
    pub fn get_relation(&self, word1: &str, word2: &str) -> Option<f32> {
        let label1 = self.get_label(word1)?;
        let label2 = self.get_label(word2)?;
        Some(self.matrix[label1][label2])
    }
}
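Finally, here is a small hypothetical usage example; the documents, vocabulary, and window size are made up for illustration, and it assumes the new signature shown above:

fn main() {
    let documents = vec![
        "rust makes nlp fast".to_string(),
        "fast nlp pipelines in rust".to_string(),
    ];
    let words = vec!["rust".to_string(), "nlp".to_string(), "fast".to_string()];

    // Build the co-occurrence matrix with a context of two words on each side
    let co_occurrence = CoOccurrence::new(&documents, &words, 2);

    // Print every word related to "nlp" with its normalized strength
    if let Some(relations) = co_occurrence.get_relations("nlp") {
        for (word, strength) in relations {
            println!("nlp <-> {}: {:.2}", word, strength);
        }
    }

    // Normalized co-occurrence between two specific words
    println!("{:?}", co_occurrence.get_relation("rust", "nlp"));
}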

Conclusion

In this article, we've successfully created a basic implementation of the co-occurrence matrix algorithm, a powerful tool especially useful for analyzing the relationship between keywords and ranking words within documents. This algorithm plays a crucial role in developing preference algorithms, offering insights into how words coalesce to convey meaning and context.

You can access all the code discussed in this article through this repository. For integration with existing projects, consider using the keyword_extraction crate available on crates.io.

About the Author

Hey there! I am Afonso Barracha, a back-end developer with a soft spot for GraphQL. If you enjoyed reading this article, why not show some love by buying me a coffee?

Lately, I have been diving deep into more advanced subjects. As a result, I have switched from sharing my thoughts every week to posting once or twice a month. This way, I can make sure to bring you the highest quality content possible.

Do not miss out on any of my latest articles – follow me here on dev and LinkedIn to stay updated. I would be thrilled to welcome you to our ever-growing community! See you around!
