Introduction
After looking at many of the Java/JVM-based NLP libraries listed on Awesome AI/ML/DL, I decided to pick the Apache OpenNLP library. One reason is that another developer, who had looked at it previously, recommended it. Besides, it’s an Apache project, and the Apache Software Foundation has been a great supporter of F/OSS Java projects for the last two decades or so (see Wikipedia). It is also worth noting that Apache OpenNLP is released under the Apache 2.0 license.
In addition, a tweet from an NLP researcher added some more confidence to the matter.
I’d like to say that my own experience with Apache OpenNLP has been similar so far, and I echo the praise for its simple, user-friendly API and design. You will see this for yourself as we explore it further.
Exploring NLP using Apache OpenNLP
Java bindings
We won’t be covering the Java API of the Apache OpenNLP tool in this post, but you can find a number of examples in its docs. A bit later you will also need some of the resources listed in the Resources section at the bottom of this post in order to progress further.
Command-line Interface
I was drawn to the simplicity of the CLI, which worked out of the box: in cases where a model was needed and provided, it just worked without additional configuration.
To make it easier to use, and to avoid having to remember all the CLI parameters it supports, I have put together some shell scripts. Have a look at the README to get more insight into what they are and how to use them.
Getting started
You will need the following from this point forward:
- Git client 2.x or higher (plus an account on GitHub to fork the repo)
- Java 8 or higher (installing GraalVM CE 19.x or higher is suggested)
- Docker CE 19.x or higher (check that it is running before going further)
- Ability to run shell scripts from the CLI
- Ability to read and write shell scripts (optional)
Note: at the time of writing, version 1.9.1 of Apache OpenNLP was available.
We have put together scripts to make these steps easy for everyone:
$ git clone git@github.com:valohai/nlp-java-jvm-example.git
or
$ git clone https://github.com/valohai/nlp-java-jvm-example.git
$ cd nlp-java-jvm-example
This will lead us to the folder with the following files in it:
LICENSE.txt
README.md
docker-runner.sh <=== only this one concerns us at startup
images
shared <=== created just when you run the container
Note: a Docker image has been provided so that you can run a Docker container with all the tools you need to go further. You can see that the shared folder has been created. It is a volume mounted into your container, but it is actually a directory created on your local machine and mapped to this volume. So anything created or downloaded there will remain available even after you exit the container!
Have a quick read of the main README file to get an idea of how to use the docker-runner.sh shell script, and take a quick glance at the Usage section as well. After that, take a look at the Apache OpenNLP README file to see how to use the scripts provided there.
Run the NLP Java/JVM docker container
At the command prompt on your local machine, from the root of the project, run:
$ ./docker-runner.sh --runContainer
If the image is not yet available locally, you may see this first, before you get the prompt:
Unable to find image 'neomatrix369/nlp-java:0.1' locally
0.1: Pulling from neomatrix369/nlp-java
f476d66f5408: ...
.
.
.
Digest: sha256:53b89b166d42ddfba808575731f0a7a02f06d7c47ee2bd3622e980540233dcff
Status: Downloaded newer image for neomatrix369/nlp-java:0.1
You will then be presented with a prompt inside the container:
Running container neomatrix369/nlp-java:0.1
++ pwd
+ time docker run --rm --interactive --tty --workdir /home/nlp-java --env JDK_TO_USE= --env JAVA_OPTS=<--snipped>
nlp-java@cf9d493f0722:~$
The container is packed with all the Apache OpenNLP scripts/tools you need to get started with exploring various NLP solutions.
Installing Apache OpenNLP inside the container
Once you are inside the container, continue as follows at the container command prompt:
nlp-java@cf9d493f0722:~$ cd opennlp
nlp-java@cf9d493f0722:~$ ./opennlp.sh
You will see the apache-opennlp-1.9.1-bin.tar.gz artifact being downloaded and expanded into the shared folder:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.6M  100 10.6M    0     0  4225k      0  0:00:02  0:00:02 --:--:-- 4225k
apache-opennlp-1.9.1/
apache-opennlp-1.9.1/NOTICE
apache-opennlp-1.9.1/LICENSE
apache-opennlp-1.9.1/README.html
.
.
.
apache-opennlp-1.9.1/lib/jackson-jaxrs-json-provider-2.8.4.jar
apache-opennlp-1.9.1/lib/jackson-module-jaxb-annotations-2.8.4.jar
Viewing and accessing the shared folder
As soon as you run the container, a shared folder is created. It may be empty in the beginning, but as we go along it will fill up with different files and folders.
It’s also where you will find the downloaded models and the Apache OpenNLP binary distribution, extracted into its own directory (named apache-opennlp-1.9.1).
You can access and see the contents of it from the command-prompt (outside the container) as well:
### Open a new command prompt
$ cd nlp-java-jvm-example
$ cd images/java/opennlp
$ ls ..
Dockerfile corenlp.sh opennlp reverb.sh word2vec.sh
cogcomp-nlp.sh mallet.sh openregex.sh shared
common.sh nlp4j.sh rdrposttagger.sh version.txt
$ ls ../shared
apache-opennlp-1.9.1 en-ner-date.bin en-sent.bin
en-chunker.bin en-parser-chunking.bin langdetect-183.bin
### In your case the contents of the shared folder may vary but the way to get to the folder is above.
From inside the container this is what you see:
nlp-java@cf9d493f0722:~$ ls
cogcomp-nlp.sh corenlp.sh nlp4j.sh openregex.sh reverb.sh word2vec.sh
common.sh mallet.sh opennlp rdrposttagger.sh shared
nlp-java@cf9d493f0722:~$ ls shared
MyFirstJavaNotebook.ipynb  en-ner-date.bin         en-pos-maxent.bin      langdetect-183.bin
apache-opennlp-1.9.1       en-ner-time.bin         en-pos-perceptron.bin  notebooks
en-chunker.bin             en-parser-chunking.bin  en-token.bin
### In your case the contents of the shared folder may vary but the way to get to the folder is above.
Performing NLP actions inside the container
The good thing is that, without ever leaving your current folder, you can perform the following NLP actions (check out the Exploring NLP Concepts section in the README).
Usage help for any of the scripts: at any point you can query a script by calling it this way:
nlp-java@cf9d493f0722:~$ ./[script-name.sh] --help
For example:
nlp-java@cf9d493f0722:~$ ./detectLanguage.sh --help
gives us this usage text as output:
Detecting language in a single-line text or article
Usage: ./detectLanguage.sh --text [text]
--file [path/to/filename]
--help
--text plain text surrounded by quotes
--file name of the file containing text to pass as command arg
--help shows the script usage help text
- Detecting language in a single-line text or article (see legend of language abbreviations used)
nlp-java@cf9d493f0722:~$ ./detectLanguage.sh --text "This is an english sentence"
eng This is an english sentence
See Detecting languages section in the README for more examples and detailed output.
- Detecting sentences in a single line text or article.
nlp-java@cf9d493f0722:~$ ./detectSentence.sh --text "This is an english sentence. And this is another sentence."
This is an english sentence.
And this is another sentence.
See Detecting sentences section in the README for more examples and detailed output.
- Finding person name, organisation name, date, time, money, location, percentage information in a single line text or article.
nlp-java@cf9d493f0722:~$ ./nameFinder.sh --method person --text "My name is John"
My name is <START:person> John <END>
See Finding names section in the README for more examples and detailed output. There are a number of types of name finder examples in this section.
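If you only want the names rather than the whole marked-up sentence, the output can be post-processed with standard tools. Here is a minimal sketch, assuming the <START:type> ... <END> markup shown above. Note that extract_entities is a hypothetical helper of my own and not one of the scripts in the repo.

```shell
# Sketch: list the entities marked up in name finder output.
# Assumption: entities appear as "<START:type> words <END>", as in the
# sample output above. extract_entities is a hypothetical helper.
extract_entities() {
  # grep -o prints each marked-up span on its own line;
  # sed then rewrites "<START:type> words <END>" as "type: words".
  printf '%s\n' "$1" \
    | grep -oE '<START:[^>]+> [^<]+ <END>' \
    | sed -E 's/^<START:([^>]+)> (.*) <END>$/\1: \2/'
}

extract_entities 'My name is <START:person> John <END>'
```

For the sample sentence this prints person: John, and a line with several marked-up spans yields one entity per line.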
- Tokenize a line of text or an article into its smaller components (i.e. words, punctuation, numbers).
nlp-java@cf9d493f0722:~$ ./tokenizer.sh --method simple --text "this-is-worth,tokenising.and,this,is,another,one"
this - is - worth , tokenising . and , this , is , another , one
See Tokenise section in the README for more examples and detailed output.
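To get a feel for what the simple method does, here is a rough approximation using grep. It treats runs of letters, runs of digits and individual punctuation characters as separate tokens. This is my own sketch (simple_tokenize is a hypothetical helper), not how Apache OpenNLP implements it, and it only handles ASCII text.

```shell
# Sketch: approximate character-class tokenisation with grep.
# Assumption: ASCII input. simple_tokenize is a hypothetical helper and
# is not part of Apache OpenNLP or the repo's scripts.
simple_tokenize() {
  # Each alternative matches one token: a run of letters, a run of digits,
  # or a single character that is neither alphanumeric nor whitespace.
  # paste then joins the one-token-per-line output with single spaces.
  printf '%s\n' "$1" \
    | LC_ALL=C grep -oE '[[:alpha:]]+|[[:digit:]]+|[^[:alnum:][:space:]]' \
    | paste -s -d ' ' -
}

simple_tokenize "this-is-worth,tokenising.and,this,is,another,one"
```

For this input the sketch prints the same line as the tokenizer output shown above. It will differ on other inputs, for example where the same punctuation character is repeated.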
- Parse a line of text or an article and identify groups of words or phrases that go together (see Penn Treebank tag set for the legend of token types), also see https://nlp.stanford.edu/software/lex-parser.shtml.
nlp-java@cf9d493f0722:~$ ./parser.sh --text "The quick brown fox jumps over the lazy dog ."
(TOP (NP (NP (DT The) (JJ quick) (JJ brown) (NN fox) (NNS jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
See Parser section in the README for more examples and detailed output.
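The bracketed tree can be hard to read at a glance. As a small sketch, assuming the (TAG word) leaf format shown above, the following lists each word with the tag the parser assigned to it. Note that tree_leaves is a hypothetical helper of my own, not a script from the repo.

```shell
# Sketch: list the leaves of a bracketed parse tree as "word TAG" lines.
# Assumption: leaves look like "(TAG word)" with no nested parentheses.
# tree_leaves is a hypothetical helper.
tree_leaves() {
  # grep -o keeps only the innermost "(TAG word)" pairs;
  # sed swaps the two parts so the word comes first.
  printf '%s\n' "$1" \
    | grep -oE '\([^() ]+ [^() ]+\)' \
    | sed -E 's/^\(([^ ]+) ([^ ]+)\)$/\2 \1/'
}

tree='(TOP (NP (NP (DT The) (JJ quick) (JJ brown) (NN fox) (NNS jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))'
tree_leaves "$tree"
```

The first line printed is The DT and the last is the full stop with its own tag. Phrase-level nodes such as NP and PP are skipped because they contain nested brackets.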
- Tag parts of speech of each token in a line of text or an article (see Penn Treebank tag set for the legend of token types), also see https://nlp.stanford.edu/software/tagger.shtml.
nlp-java@cf9d493f0722:~$ ./posTagger.sh --method maxent --text "This is a simple text to tag"
This_DT is_VBZ a_DT simple_JJ text_NN to_TO tag_NN
See Tag Parts of Speech section in the README for more examples and detailed output.
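The word_TAG format is compact but awkward to feed into other tools. Here is a minimal sketch, assuming the output format shown above, that splits each token into a word column and a tag column. Note that split_tags is a hypothetical helper of my own.

```shell
# Sketch: turn "word_TAG word_TAG ..." into one "word<TAB>TAG" line per token.
# Assumption: the tag follows the last underscore of each token.
# split_tags is a hypothetical helper, not one of the repo's scripts.
split_tags() {
  printf '%s\n' "$1" | awk '{
    for (i = 1; i <= NF; i++) {
      # Split on the last underscore so words containing "_" stay intact.
      if (match($i, /_[^_]*$/)) {
        printf "%s\t%s\n", substr($i, 1, RSTART - 1), substr($i, RSTART + 1)
      }
    }
  }'
}

split_tags "This_DT is_VBZ a_DT simple_JJ text_NN to_TO tag_NN"
```

The tab-separated result can then be piped into tools such as cut, sort and uniq, for example to count how often each tag occurs.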
- Text chunking by dividing a text or an article into syntactically correlated groups of words, like noun groups and verb groups. Chunking is applied to a text or article that has already been tagged by the PoS tagger (see Penn Treebank tag set for the legend of token types, also see https://nlpforhackers.io/text-chunking/).
nlp-java@cf9d493f0722:~$ ./chunker.sh --text "This_DT is_VBZ a_DT simple_JJ text_NN to_TO tag_NN"
[NP This_DT ] [VP is_VBZ ] [NP a_DT simple_JJ text_NN ] [PP to_TO ] [NP tag_NN]
See Chunking section in the README for more examples and detailed output.
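To see the chunks one per line, the bracketed output can be split up with grep and sed. This is a small sketch, assuming chunks are written as [LABEL tokens ] with an upper-case label as in the output above. Note that list_chunks is a hypothetical helper of my own.

```shell
# Sketch: print each chunk from chunker output as "LABEL: tokens".
# Assumption: chunks look like "[NP a_DT simple_JJ text_NN ]" with an
# upper-case label. list_chunks is a hypothetical helper.
list_chunks() {
  # grep -o extracts each "[LABEL ... ]" group; sed strips the brackets
  # and any space before the closing bracket.
  printf '%s\n' "$1" \
    | LC_ALL=C grep -oE '\[[A-Z]+ [^]]*\]' \
    | LC_ALL=C sed -E 's/^\[([A-Z]+) +([^]]*[^] ]) *\]$/\1: \2/'
}

list_chunks "[NP This_DT ] [VP is_VBZ ] [NP a_DT simple_JJ text_NN ] [PP to_TO ] [NP tag_NN]"
```

Each line of the result starts with the chunk type, so the noun groups, for instance, can be picked out by filtering on lines that begin with NP.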
Exiting from the NLP Java/JVM docker container
It is as simple as this:
nlp-java@f8562baf983d:~/opennlp$ exit
exit
67.41 real 0.06 user 0.05 sys
And you are back to your local machine prompt.
Benchmarking
One of the salient features of this tool is that it records and reports metrics on its actions at different points of execution, i.e. the time taken at both micro and macro levels. Here’s a sample output to illustrate this feature:
Loading Token Name Finder model ... done (1.200s)
My name is <START:person> John <END>
Average: 24.4 sent/s
Total: 1 sent
Runtime: 0.041s
Execution time: 1.845 seconds
From the above, I have identified five metrics that are useful to me as a scientist, an analyst or even an engineer:
- (Load time) It took 1.200s to load the model into memory
- (Average) Sentences were processed at an average rate of 24.4 sentences per second
- (Total) 1 sentence was processed
- (Runtime) It took 0.040983606557377 seconds (1 sentence ÷ 24.4 sentences per second, reported as 0.041s) to process this 1 sentence
- (Execution time) The whole process ran for 1.845 seconds (startup, processing the sentence(s) and shutdown)
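The relationship between these figures can be checked with a little arithmetic. Here is a minimal sketch that pulls the numbers out of a captured run, derives the runtime from Total and Average, and estimates what is left over for startup and shutdown. It assumes the line formats shown in the sample output and that the load time and runtime are both included in the execution time without overlapping. Note that summarise_metrics is a hypothetical helper of my own.

```shell
# Sketch: extract the timing figures from a captured run and cross-check them.
# Assumptions: line formats match the sample output above, and load time and
# runtime are both part of the execution time. summarise_metrics is a
# hypothetical helper, not one of the repo's scripts.
summarise_metrics() {
  printf '%s\n' "$1" | awk '
    /^Loading .* done \(/ { load = $NF; gsub(/[^0-9.]/, "", load) }  # "(1.200s)" becomes 1.200
    /^Average:/           { rate = $2 }                              # sentences per second
    /^Total:/             { total = $2 }                             # sentences processed
    /^Runtime:/           { runtime = $2; sub(/s$/, "", runtime) }   # "0.041s" becomes 0.041
    /^Execution time:/    { wall = $3 }                              # seconds, start to finish
    END {
      printf "load=%.3f runtime=%.3f wall=%.3f\n", load, runtime, wall
      # Total divided by Average should reproduce the reported Runtime.
      if (rate > 0) printf "derived_runtime=%.3f\n", total / rate
      # Whatever is left is startup and shutdown overhead.
      printf "overhead=%.3f\n", wall - load - runtime
    }'
}

sample='Loading Token Name Finder model ... done (1.200s)
My name is <START:person> John <END>
Average: 24.4 sent/s
Total: 1 sent
Runtime: 0.041s
Execution time: 1.845 seconds'

summarise_metrics "$sample"
```

For the sample run, the derived runtime comes out at 0.041 seconds, matching the reported Runtime, and roughly 0.604 seconds remain for startup and shutdown under the assumptions above.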
Information like this is invaluable when it comes to making performance comparisons like:
- between two or more models (load-time and run-time performance)
- between two or more environments or configurations
- between applications performing the same NLP action but put together using different tech stacks (including different programming languages)
- finding correlations between different corpora of text data processed (quantitative and qualitative comparisons)
Empirical example
The BetterNLP library, written in Python, does something similar. See the Kaggle kernels Better NLP Notebook and Better NLP Summarisers Notebook (search for time_in_secs inside both notebooks to see the metrics reported).
Personally, I find this quite inspiring, and it also validates that this is a useful feature (or action) to offer to the end user.
Other concepts, libraries and tools
There are other Java/JVM-based NLP libraries mentioned in the Resources section below, but for brevity we won’t cover them. The links provided will lead to further information for your own pursuit.
Within the Apache OpenNLP tool itself, we have only covered command-line access and not the Java bindings. In addition, for brevity, we haven’t gone through all the NLP concepts or features of the tool and have only covered a handful of them. But the documentation and resources on the GitHub repo should help with further exploration.
You can also find out how to build the docker image for yourself, by examining the docker-runner script.
Conclusion
After going through the above, we can conclude the following about the Apache OpenNLP tool by exploring its pros and cons:
Pros
- The API is easy to use and understand
- Shallow learning curve and detailed documentation with lots of examples
- Covers a lot of NLP functionality, with more to explore in the docs than we covered above
- Easy shell scripts and Apache OpenNLP scripts have been provided to play with the tool
- Lots of resources available to learn more about NLP (see the Resources section below)
- Resources provided to quickly get started and explore the Apache OpenNLP tool
Cons
- Looking at the GitHub repo, development seems to be slow or to have stagnated (there is a wide gap between the last two commits, i.e. May 2019 and Oct 15, 2019)
- A few models are missing when going through the examples in the documentation (manual)
- The current models provided may need further training as per your use case(s), see this tweet:
Resources
Apache OpenNLP
- nlp-java-jvm-example GitHub project
- Apache OpenNLP | GitHub | Mailing list | @apacheopennlp
- Docs
- Download
- Legends to support the examples in the docs
- Find more in the Resources section in the README
About me
Mani Sarkar is a passionate developer, mainly in the Java/JVM space, who currently works with small teams and startups as a freelance software/data/ML engineer, strengthening them and helping them accelerate.
Twitter: @theNeomatrix369 | GitHub: @neomatrix369
Originally published at https://blog.valohai.com.