Erik Hoffman
Generate tag suggestions from a text

In most scenarios where you publish a text or any other media, you need to set tags for visibility, so the content can be found through search and browsing. Setting these tags isn't always simple - not only for the editor; in many scenarios the text is published and tagged by someone other than the person who wrote it.

What if we could help out with some suggestions for tags?

Theory

Let's assume that the valuable words - the ones we want to tag the text with - are among the most used words in the text, apart from prepositions and other small in-between words.

Let's filter out the most used words from the text!

Get started

Let's say that we have a text - in this case I'll use my latest blog post Light and Safe with git hooks and partial testing and will refer to it in the example below as TEXT_INPUT.

First we want to get all the words out of it, one by one, instead of having them in sentences. Let's split on the RegExp \s metacharacter, which matches any whitespace in the text - whether it's a space, a new line, a tab or any other kind of whitespace.

const wordsArray = splitByWords(TEXT_INPUT);

function splitByWords(text) {
  return text.split(/\s+/);
}

The result will look something like this

[ 'Why?',
  'To',
  'write',
  'tests',
  'for',
  'your',
  'code,',
  'as',
  'well',
  'as',
  'following',
  'a',
  'code',
  'design',
  'pattern,',
  'is',
  'crucial',
  'parts',
  'of',
  'building',
  'a',
  'scalable',
  'and',
  'stable',
  'code',
  'base',
  'and',
  'deliver',
  'on',
  'the',
  'continuous',
  'integration,',
  'deployment',
  ... 500 more items ]

I.e. just a list of words, as promised.

Now let's count how many times each word occurs in the text by iterating the array, adding each word as an object key with its number of occurrences as the value.

// As input we have the array created in the earlier code block
const wordsMap = createWordMap(wordsArray);

function createWordMap(wordsArray) {
  // This object will store the result during, and after, the iteration
  const wordsMap = {};
  // Let's iterate the array, sending in each word into the anonymous function
  wordsArray.forEach(function(key) {
    // If the word is already in the storing object, we'll add up on its presence number.
    // Else we just add it with its first presence, #1
    if (wordsMap.hasOwnProperty(key)) {
      wordsMap[key]++;
    } else {
      wordsMap[key] = 1;
    }
  });
  return wordsMap;
}

Now we have a giant object with all the words, each with a count of occurrences. Something like this

{ 
  'Why?': 1,
  To: 2,
  write: 1,
  tests: 4,
  for: 6,
  your: 4,
  'code,': 1,
  as: 7,
  well: 2,
  following: 1,
  a: 11,
  code: 9,
  design: 1,
  'pattern,': 1,
  is: 8,
  crucial: 1,
  ...and more
}

Better, but you still need to find the ones with the most occurrences. Let's start by filtering out shorter words, which quite often are prepositions and the like - filtering in the same method as before.

// As input we have the array created in the earlier code block
const wordsMap = createWordMap(wordsArray);

function createWordMap(wordsArray) {
  const wordsMap = {};
  wordsArray.forEach(function(key) {
    // Let's start by handling different appearances of the same word, by normalizing
    // them - removing punctuation, lowercasing etc. A regex with the global (g) flag
    // strips every occurrence, whereas .replace(".", "") would only strip the first
    key = key
      .trim()
      .toLowerCase()
      .replace(/[.,!]/g, "");
    // Then filter by length to remove the short words, which often are prepositions and the like
    if (key.length <= 5) return;
    // Then keep on as before
    if (wordsMap.hasOwnProperty(key)) {
      wordsMap[key]++;
    } else {
      wordsMap[key] = 1;
    }
  });
  return wordsMap;
}

The result of this is a better list, like this

{
  safest: 1,
  implement: 1,
  should: 4,
  before: 1,
  commit: 5,
  broken: 2,
  integrated: 1,
  origin: 1,
  process: 1,
  struggling: 1,
  looking: 2,
  documentation: 1,
  fortunately: 1,
  community: 1,
  around: 1,
  javascript: 1,
  packages: 1,
  ...and more
}

Now let's sort them to have the most popular on top

// The unsorted list as input, wordsMap
const sortedWordsArray = sortByCount(wordsMap);

function sortByCount(wordsMap) {
  // This array will store our list as we'll now create an array of sorted objects
  var finalWordsArray = [];
  // Iterate all the keys in the word list object sent in, mapping each key:value pair to an object of its own, to add to our array
  finalWordsArray = Object.keys(wordsMap).map(function(key) {
    return {
      name: key, // the word itself
      total: wordsMap[key] // the value
    };
  });

  // Now let's sort the array so the objects with the most appearances end up on top
  finalWordsArray.sort(function(a, b) {
    return b.total - a.total;
  });

  return finalWordsArray;
}

The result will be something like this

[ 
  { name: 'lint-staged', total: 6 },
  { name: 'commit', total: 5 },
  { name: 'eslint', total: 5 },
  { name: '"hooks":', total: 4 },
  { name: '"pre-commit":', total: 4 },
  { name: '"husky":', total: 4 },
  { name: 'should', total: 4 },
  { name: 'install', total: 4 },
  { name: 'entire', total: 3 },
  { name: 'packagejson', total: 3 },
  ...and more
]

A lot more relevant!
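Putting the steps together: as a minimal sketch (the function name getTagSuggestions and its count parameter are my own illustrative names, not from the code above), the whole pipeline can be collapsed into one helper that returns the top N suggestions:

```javascript
// A minimal end-to-end sketch of the pipeline above.
// getTagSuggestions and its count parameter are illustrative names only.
function getTagSuggestions(text, count = 5) {
  const wordsMap = {};
  text.split(/\s+/).forEach(function(word) {
    // Normalize, then skip the short words as before
    const key = word.trim().toLowerCase().replace(/[.,!]/g, "");
    if (key.length <= 5) return;
    wordsMap[key] = (wordsMap[key] || 0) + 1;
  });
  // Map to { name, total } objects, sort by count and keep only the top ones
  return Object.keys(wordsMap)
    .map(key => ({ name: key, total: wordsMap[key] }))
    .sort((a, b) => b.total - a.total)
    .slice(0, count)
    .map(item => item.name);
}
```

Calling getTagSuggestions(TEXT_INPUT, 10) on the blog post used above would yield the ten most frequent longer words as a plain array of strings.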

What can we improve?

When filtering the array of words, it would of course be a great improvement if we could use a list of words to ignore (a so-called stop-word list), rather than assuming that all short words should be removed. I still haven't found a reliable source for such a list, though.
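As an illustration only - the words below are hand-picked by me, not a vetted stop-word source - such a list could replace, or complement, the length check:

```javascript
// A tiny, hand-picked stop-word list - illustrative only, not a reliable source
const STOP_WORDS = new Set([
  "the", "and", "for", "with", "that", "this", "from",
  "your", "have", "will", "should", "would", "about"
]);

function isStopWord(word) {
  return STOP_WORDS.has(word.toLowerCase());
}

// Inside the iteration, instead of (or in addition to) the length check:
// if (isStopWord(key)) return;
```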

We could possibly use some kind of NLP, Natural Language Processing, to find out, though that would lock us down to using this only on English texts, rather than being language agnostic. It would also add quite a lot of complexity.

This is part of the reason why I would say that we generate suggestions, rather than generating tags:

  • We can't guarantee that the most used words are relevant
  • You might want broader tags, even if the generated ones are relevant to the text content as such (I added continuous integration and deployment, for example).

So what's the use case?

A text is quite readable, and as the publisher it might be just as easy and relevant to read the text itself. But an applicable scenario might be analyzing a subtitle file for a video, or the transcription of a podcast - and out of that generating tags to make the media more searchable, without having to watch or listen through the entire asset while taking notes of tags to set.
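As a rough sketch of that scenario - assuming the common SubRip (.srt) layout, with srtToText being my own illustrative name - the spoken text could be extracted before running the pipeline:

```javascript
// Rough sketch: extract the spoken text from SubRip (.srt) subtitle content,
// dropping cue numbers, timestamp rows and blank separator lines
function srtToText(srt) {
  return srt
    .split(/\r?\n/)
    .filter(line =>
      line.trim() !== "" &&         // skip blank separator lines
      !/^\d+$/.test(line.trim()) && // skip cue numbers
      !line.includes("-->")         // skip timestamp lines
    )
    .join(" ");
}
```

The joined text can then be fed straight into the word counting above.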

What can you see as a use case?
What do you think can be improved?
