What’s your favourite thing about SpaCy? Mine’s SpaCy.
I read and summarise software engineering papers for fun, and today we’re having a look at A replicable comparison study of NER software: StanfordNLP, NLTK, OpenNLP, SpaCy, GATE (2019) by Schmitt et al.
Natural language processing (NLP) is a subfield of artificial intelligence that is dedicated to the understanding, processing, and generation of natural languages, like French and English.
Named entity recognition (NER) is a subtask of NLP that aims to identify entities (such as persons and locations) in texts. This can be used for things like machine translation, automated question answering, and automated text summarisation.
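To make that concrete, here's a minimal sketch of what NER looks like in practice, using SpaCy (one of the libraries in this study). It assumes you've installed SpaCy and downloaded its small English model, en_core_web_sm; the example sentence is mine, not from the paper.

```python
# Minimal NER sketch with SpaCy. Assumes:
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook announced Apple's new campus in Cupertino.")

for ent in doc.ents:
    # Prints each recognised entity with its predicted type,
    # e.g. PERSON, ORG, or GPE (geopolitical entity).
    print(ent.text, ent.label_)
```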
Why it matters
If you need NER, there’s no need to implement it yourself. There are several popular libraries that can do this for you nowadays. Five of these libraries, Stanford CoreNLP, NLTK, OpenNLP, SpaCy, and GATE, were already mentioned in the title.
Which library is right for you depends on various criteria, like its performance, cost, documentation, license, and the programming language in which it is implemented.
Many of these libraries have been evaluated in comparison studies, but sadly not in a way that makes it easy to compare findings.
How the study was conducted
This paper describes a comparison between the five aforementioned NER libraries in a way that is sufficiently clear and complete for its results to be replicated.
The process looks roughly like this:
1. Selection of two corpora that are not domain-specific, freely available, and in English: the Groningen Meaning Bank (GMB) and the CoNLL 2003 corpus.
2. Selection of five NER libraries that are free and open-source software, well-documented, available for Linux, and can recognise at least three types of entities: persons, organisations, and locations.
3. Comparison of each NER library's generated NER annotations with the annotations in the "gold data", which contains the annotations we'd expect. This is done by computing the precision, recall, and F-score for each library (see the sketch below this list).
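For the curious, here's a rough sketch of what that last step boils down to. This is my own illustration of exact-match scoring, not the paper's evaluation code; representing entities as (start, end, type) tuples is a choice I made for the example.

```python
# Illustrative scoring sketch (not the paper's actual evaluation code):
# exact-match comparison of predicted entity spans against gold spans.
def score(predicted, gold):
    """predicted/gold: sets of (start, end, entity_type) tuples."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

gold = {(0, 12, "PERSON"), (30, 36, "LOC")}
predicted = {(0, 12, "PERSON"), (40, 45, "ORG")}
print(score(predicted, gold))  # (0.5, 0.5, 0.5)
```

In words: precision asks "of the entities the library found, how many were right?", recall asks "of the entities in the gold data, how many were found?", and the F-score is the harmonic mean of the two.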
What discoveries were made
The table below shows the results of the comparison (all values are percentages). Don't worry too much about its size and all the numbers; I've included a hangover-proof summary below the table.
| Library | Entity | CoNLL 2003 Precision | CoNLL 2003 Recall | CoNLL 2003 F-score | GMB Precision | GMB Recall | GMB F-score |
|---|---|---|---|---|---|---|---|
| Stanford NLP | Location | 91.30 | 88.73 | 90.00 | 83.10 | 63.64 | 72.08 |
| | Organisation | 86.32 | 80.92 | 83.53 | 71.40 | 47.42 | 56.99 |
| | Person | 92.72 | 82.68 | 87.41 | 78.59 | 84.70 | 81.53 |
| | Overall | 90.06 | 73.67 | 81.05 | 79.81 | 63.74 | 70.88 |
| NLTK | Location | 52.47 | 65.47 | 58.26 | 77.13 | 77.10 | 77.12 |
| | Organisation | 36.20 | 24.80 | 29.44 | 42.06 | 35.54 | 38.53 |
| | Person | 61.09 | 66.11 | 63.50 | 38.07 | 55.87 | 45.28 |
| | Overall | 51.78 | 45.56 | 48.47 | 60.96 | 63.91 | 62.40 |
| GATE | Location | 59.63 | 78.63 | 67.82 | 79.03 | 48.16 | 59.85 |
| | Organisation | 50.58 | 21.29 | 29.96 | 45.08 | 37.68 | 41.05 |
| | Person | 69.53 | 62.67 | 65.92 | 46.53 | 53.70 | 49.86 |
| | Overall | 61.48 | 47.44 | 53.55 | 61.72 | 46.78 | 53.22 |
| OpenNLP | Location | 76.54 | 52.22 | 62.08 | 84.34 | 45.84 | 59.40 |
| | Organisation | 38.06 | 14.87 | 21.39 | 59.27 | 30.64 | 40.39 |
| | Person | 83.94 | 37.17 | 51.52 | 62.34 | 41.98 | 50.17 |
| | Overall | 68.68 | 30.44 | 42.18 | 37.35 | 41.71 | 39.41 |
| SpaCy | Location | 73.38 | 75.36 | 74.36 | 77.04 | 56.64 | 65.28 |
| | Organisation | 40.95 | 36.24 | 38.45 | 41.20 | 36.50 | 38.70 |
| | Person | 66.89 | 56.22 | 61.09 | 67.41 | 69.14 | 68.27 |
| | Overall | 60.94 | 49.01 | 54.33 | 66.15 | 54.32 | 59.66 |
Stanford NLP's library is the only one with (somewhat) high scores, and it blows the other libraries out of the water. The remaining four libraries perform at a roughly similar level.
Note that Stanford NLP's library performs especially well on the CoNLL 2003 dataset. This is because it comes with a classifier that was partially trained on CoNLL 2003! The scores for GMB are therefore more likely to be representative of real-world texts.
The results for Stanford NLP are similar to those from other studies. However, the accuracy for three of the other libraries (NLTK, GATE, and OpenNLP) (*) may differ by as much as 66% from the values reported in existing studies. It is not clear what causes such huge discrepancies.
(*) Apparently there weren't any existing studies that evaluated SpaCy's performance.