DEV Community

Jared Fraser
Jared Fraser

Posted on

ML classification on short strings in Ruby using Smalltext

I have a ruby application that pulls in various blog and forum posts from the internet (like a game dev tracker) and needed a way to classify them. Some simple rules got me 75% of the way there - does it include the word "patch" or "patch notes"? then it's likely a patch notes or announcement post - but that wasn't completely reliable. For example, a post reporting a bug on a patch might be worded as "Empty inventory because of Patch 9.0.0.2076", or a post might be announcing a patch deployment but not include any notes.

Recently I found the gem smalltext which has solved most of my problems. It classifies quickly and is pretty simple to train.

I built a training script that hardcodes training data into categories, then trains and saves a model.

lib/train_classifier.rb

require 'smalltext'

data = {
  patch_notes: [
    "PC Client Update - 12/7/2018",
    "v6.31 Patch Notes",
  ],

  news: [
    "$1,000,000 Winter Royale + Tournament Updates",
    "Pop-Up Cups Update 12/5/2018",
  ],
}


s = Smalltext::Classifier.new
data.each do |classification, titles|
  titles.each do |title|
    s.add_item(classification.to_s, title)
  end
end

s.train
s.save_model("classifications.model")

I update this training script with definite examples and incorrectly classified posts as time goes on to improve the model.

app/services/classifier.rb

class Classifier
  def self.classify(text, threshold = 0.99)

    # Load the classifier and the trained model
    @classifier ||
      ((@classifier = Smalltext::Classifier.new) && @classifier.load_model('classifications.model'))

    classified = @classifier.classify(text).to_h
    # { "patch_notes" => 0.9998900049414214 }

    # Return the first category that meets the threshold
    result = classified.detect {|k,v| v >= threshold }
    result.present? ? result.first : nil
  end
end

This class gives my application an entry point to the classifier. I only care about reasonably certain classifications so I built this to only output a classification if the result is above a certain threshold. In my application, I either get a category or nil, which makes it easier for me to handle.

> Classifier.classify("V7.00 PATCH NOTES")

sentence: V7.00 PATCH NOTES
classification: [["patch_notes", 0.9998900049414214]]

=> "patch_notes"

Top comments (1)

Collapse
 
kokizzu profile image
Kiswono Prayogo

thanks a lot man