
Natural Language Processing... in the Browser???

Charles Landau · Originally published at charlesdlandau.io · 6 min read

Not too long ago I was looking for a way to explore React Hooks and Material UI v4, and just generally brush up on frontend basics as they stand today. I came to JS by way of Python, and I thought to myself, "hey, maybe I can `npm install` some of the data-sciency stuff I'm used to `pip install`-ing." Maybe I could take the boring practice problem of a chat client and spice it up with some natural language processing.

Bad idea

Anyway it turns out you can, even if it's not the best idea. In this post I'm going to:

  1. Briefly introduce core concepts
  2. Show how compromise.js enables us to do some basic NLP in a React app
  3. Cover pros and cons of this approach

You can see a demo using a bare-bones React chat client here: https://chatter-nlp.charlesdlandau.net.

And you can see the source code for the demo here: https://github.com/CharlesDLandau/chatter_nlp

Here's a capture of it in the messaging view:

Message view

And here is the analysis view:

Analysis view

1. Core Concepts

Natural Language Processing (NLP) tries to extract meaning, semantics, sentiment, tags, named entities, and more from text. I'm oversimplifying but I have a good excuse I swear. Chatbots, speech recognition, and search are some of the use cases for NLP.

Tags in NLP represent parts of speech like "verb" or "article", but more specific designations like "WeekDay" also count as tags. Compromise ships with a nice set of tags (https://observablehq.com/@spencermountain/compromise-tags) and extensibility for adding new ones.

A corpus is the body of text being analyzed. For example, if you were doing NLP and analysis on a book (or the complete works of so-and-so), that would be your corpus. Some corpora are purpose-made and might be pre-tagged.

A document is each individual unit of text being analyzed. For example, in the demo chat app, each message constitutes a document.

TF-IDF is a method for weighting the meaning of words in a document. The measure is "highest when the term occurs many times within a small number of documents". To calculate it, you need the corpus, and you need to select a specific term in a specific document.
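As a quick worked example (not from the demo code, just the arithmetic described above), the weight for a single term could be computed like this:

```javascript
// Worked example of the tf-idf weight described above.
// tf  = occurrences of the term in the document / total terms in the document
// idf = ln(total documents / documents containing the term)
function tfIdfWeight(occurrences, docLength, nDocs, nDocsWithTerm) {
  const tf = occurrences / docLength;
  const idf = Math.log(nDocs / nDocsWithTerm);
  return tf * idf;
}

// A term appearing once in a 10-term message, in 2 of 20 messages:
// tf = 0.1, idf = ln(10), weight ≈ 0.23
tfIdfWeight(1, 10, 20, 2);
```

Note how the weight drops to zero when every document contains the term: the idf factor becomes ln(1) = 0, which is exactly the "highest when the term occurs many times within a small number of documents" behavior.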

2. Show and tell

Mostly, the demo app is responsible for passing around a messages array. The array gets initialized at the top of the component hierarchy, used for NLP processes, and parsed for dataviz.

const App = (props) => {

  const [messages, setMessages] = useState(dummyMessages)
  const [user, setUser] = useState("red");

  // Append new messages in a user-aware manner
  const mountMessage = (e, contents) => {
    e.preventDefault()
    // Copy into a new array so React sees a new reference and re-renders;
    // pushing onto the existing state array would mutate it in place
    setMessages([...messages, {
      text: contents,
      time: new Date().toLocaleString(),
      author: user
    }])
    // Alternate between the two hardcoded users
    setUser(user === 'red' ? 'blue' : 'red')
  }

  return (...);

}

In this demo I didn't really care about the logic of multiple users, or named users, or really anything other than having two users, so "red" and "blue" pass around the user state, and messages contain pretty much all the data we care about.

Eventually, all the analysis happens in a class TextAnalysis, which receives the messages array.

import nlp from 'compromise';

class TextAnalysis{
    constructor(docs){
        this.docs = docs
        this.mergedDocs = nlp(
            this.docs.map(obj => obj.text).join()
        )
    }
...
}

Mostly, TextAnalysis is consumed via its .cardData method, which returns hardcoded objects like:

{
  title: "Parts of Speech",
  chartData: {
    labels: ["Noun", "Verb", "Adjective"],
    series:[
    this.mergedDocs.match('#Noun'
      ).out('array').length,
    this.mergedDocs.match('#Verb'
        ).out('array').length,
    this.mergedDocs.match('#Adjective'
        ).out('array').length
    ]},
  chartType: 'Pie',
  chartOpts: {
    chartPadding: 30,
    labelOffset: 30,
    labelDirection: 'explode'
  }
}

What's going on here?

compromise analyzed all the text from all the messages in the constructor and stored the result in this.mergedDocs. Since this.mergedDocs is a compromise object, it exposes that library's methods, including .match() for matching tags.

We can populate the chartData with the number of matches for parts of speech:

[
this.mergedDocs.match('#Noun'
  ).out('array').length,
this.mergedDocs.match('#Verb'
    ).out('array').length,
this.mergedDocs.match('#Adjective'
    ).out('array').length
]

Note the .out method exposed by compromise; this is typically how we extract parsed data from analyzed documents. It supports parsing to text, arrays, html, normalized text, and even csv, among others.

These, along with chartOpts and chartType, get passed on to Chartist, which we're using for dataviz.

// Renders a single object from TextAnalysis.cardData()
function AnalysisCard(props){
  const { data } = props
  const classes = useStyles();

  return (
    <Grid item>
      <Card className={classes.card}>

        <CardHeader className={classes.cardHead} title={
          <Typography style={
            {textOverflow:'ellipsis', whiteSpace:'nowrap'}
          }
           variant='subtitle2'>
          {data.title}</Typography>
        } />

        <ChartistGraph
          data={data.chartData}
          type={data.chartType}
          options={data.chartOpts} />
      </Card>
    </Grid>
  )
}

That's all it took!

...almost. Compromise doesn't seem to ship with a TF-IDF vectorizer (I'm spoiled by Scipy). So, within TextAnalysis we can implement our own...

tf(d, occ){
  // Takes a document and N occurrences of a term
  // Returns the term frequency (tf)
  // tf = (occurrences of search term / N terms)
  return occ / nlp(d.text).terms().out('array').length
}

idf(t){
  // Takes a term
  // Returns the inverse document frequency (idf)
  // idf = log_e(N documents / N documents containing the search term)

  var nDocs = this.docs.length
  var nMatches = this.docs.filter(
    doc => Boolean(doc.text.match(t))
  ).length

  // Guard against division by zero when no documents match
  var result = nDocs / nMatches
  if (!isFinite(result)){
    return 0
  }
  return Math.log(result)
}

tfIdf(doc){
  // Takes a document from this.docs
  // Returns a sorted array of objects in the form:
  // {term:<String>, weight:<Float>}
  // This is a vector of terms and Tf-Idf weights

  var tfIdfVector = nlp(doc.text).terms().out('freq').map((d) => {
    var t = d['normal']
    var tf = this.tf(doc, d['count'])
    var idf = this.idf(t)
    return {term: t, weight: tf * idf}
  })

  // Sort by weight, descending
  return tfIdfVector.sort((obj0, obj1) => obj1.weight - obj0.weight)
}

(This felt more than a little hacky, so if anybody critiques my implementation that would be quite welcome.)

With that, we can also chart the top weighted words for a random message!

Plotted TFIDF
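For charting, all that's left is trimming the sorted vector down to its heaviest terms. Here's a tiny sketch of that step, with made-up data and a hypothetical `topTerms` helper that isn't in the demo:

```javascript
// Hypothetical helper: take the n highest-weighted terms from a
// sorted tf-idf vector of {term, weight} objects
function topTerms(tfIdfVector, n) {
  return tfIdfVector.slice(0, n).map(d => d.term);
}

// Made-up vector, already sorted by weight descending,
// like the output of TextAnalysis.tfIdf:
const vector = [
  { term: "friday", weight: 0.34 },
  { term: "noon",   weight: 0.21 },
  { term: "see",    weight: 0.05 },
];

topTerms(vector, 2); // → ["friday", "noon"]
```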

Pros and cons

I don't know if you should do this; at the very least, if you do, you should think hard about why.

Cons

  1. You're using the user's browser to do the analysis. The same browser that's serving them that beautiful user experience you've been slaving over.
  2. Compromise is ~200kb and the lead author says you probably can't shake that tree.
  3. Is data preprocessing already a goal for the frontend? Is your organization going to make it one? Does this require dropping a bunch of code from your team into a codebase mostly maintained by another team? Have you taken their temperature about that yet?
  4. One of the benefits of doing preprocessing in the backend is that you can operate on your whole dataset. In the browser we can only calculate TF-IDF using the messages on hand; in the backend we could compute a more useful weight using all the messages.

Pros

  1. You're using the user's browser to do the analysis. Maybe that analysis costs a lot to run on the public cloud or elsewhere...
  2. All the insights can be fed back into the client and shared with the user (e.g. the analysis view in our demo).
  3. More analysis in the browser means you could potentially find a way to do more filtering in the browser, ultimately leading to fewer calls to your API.

Further reading:

Compromise: https://github.com/spencermountain/compromise
Chartist: https://gionkunz.github.io/chartist-js
Demo source: https://github.com/CharlesDLandau/chatter_nlp

Feedback welcome!

I took on this mini-project as a way to experiment with something funky. I'm sharing it here because I'm interested in people's reactions, and I always want to learn more. Thanks for reading!
