This article was originally published at https://programmerbackpack.com.
Interested in more stories like this? Follow me on Twitter at @b_dmarius, where I post every new article.
In an earlier article I talked about starting my Machine Learning studies with a personal project: a personal knowledge management system that can help me track the things I learn.
While defining the requirements for an app like this, I also look into new topics and share them here. Maybe someone else will find them useful too.
Let's say I am caught up in a research session and I stumble upon the name of a researcher which sounds familiar to me. I can of course look that person up on Google, but what if I want to know where I know this name from? Have I read something published by this author, or have I read some piece of news about them? It would be useful to have my research history saved somewhere, so I could look this person up in that history and find out that I've enjoyed some of this author's work before.
Information Extraction is a very difficult problem. Transforming natural language, which is highly nuanced and varies subtly from person to person, into something that computers can understand is insanely difficult, and it is a problem we are still very far from solving. Still, programmers are used to taking a big problem and solving it piece by piece until, hopefully, the whole task is solved.
Named Entity Recognition (NER) is a subtask of Information Extraction that is responsible for identifying entities in unstructured text and assigning each of them to one of a list of predefined categories. The list can be a standard one, or a custom one if we train our own model on a specific dataset. Most of the time, the entities identified are Persons, Organisations, Locations, Times, Monetary values and so on.
Named Entity Recognition actually consists of two substeps: Named Entity Identification and Named Entity Classification. That means we first find the entities mentioned in a given text, and only then assign each of them to a particular class in our list of predefined categories.
For example, take the following sentence:
"Bill Gates was the CEO of Microsoft until 2000."
Here we can identify Bill Gates, Microsoft and 2000 as our entities. We then classify them as Person, Organisation and Date respectively. We must take care not to identify Bill and Gates as two different entities, since both words refer to the same person!
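To make the two substeps concrete, here is a minimal sketch in plain Python. It is only a toy: the regular expression, the `KNOWN_ENTITIES` table and the function names are my own assumptions for illustration, not how a real NER model works. It does show identification and classification as separate steps, and it keeps "Bill Gates" together as a single entity.

```python
import re

# Hypothetical, hand-written lookup table; a real system would learn this.
KNOWN_ENTITIES = {
    "Bill Gates": "Person",
    "Microsoft": "Organisation",
}

# Runs of capitalised words (kept together as one span) or four-digit numbers.
CANDIDATE_PATTERN = re.compile(r"\b(?:[A-Z][a-z]+(?: [A-Z][a-z]+)*|\d{4})\b")


def identify(text):
    """Step 1: find candidate entity spans without deciding their type."""
    return CANDIDATE_PATTERN.findall(text)


def classify(span):
    """Step 2: assign a class to a span that was already identified."""
    if span.isdigit():
        return "Date"
    return KNOWN_ENTITIES.get(span, "Unknown")


sentence = "Bill Gates was the CEO of Microsoft until 2000."
entities = [(span, classify(span)) for span in identify(sentence)]
print(entities)
# [('Bill Gates', 'Person'), ('Microsoft', 'Organisation'), ('2000', 'Date')]
```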
How an NER model works really depends on who built it. I know that sounds superficial, but it's the truth. There is a lot of ongoing research into finding the perfect NER model, and researchers come up with different methods and approaches. I am also sure that a lot of research has not been published, because companies use proprietary technologies to try to build the best model there is.
But of course, there are some steps that every NER model should take, and this is what we are going to talk about now.
The first step in Named Entity Recognition is preparing the data to be parsed. As we discussed elsewhere, preparing the data for NLP is quite a long and complicated journey. We are talking about building a pipeline that can do the following for you:
- Sentence boundary segmentation
- Word tokenization
- Part of Speech tagging
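The three stages in the list above can be sketched with nothing but the standard library. This is a deliberately naive illustration: the splitting rules, the `TOY_TAGS` lookup and the capitalisation heuristic are assumptions made for this example, and a real pipeline would use trained models for each stage.

```python
import re

# Tiny hand-made tag lookup; purely illustrative, not a trained tagger.
TOY_TAGS = {"was": "VERB", "is": "VERB", "based": "VERB", "the": "DET",
            "of": "ADP", "until": "ADP", "in": "ADP"}


def split_sentences(text):
    """Naive sentence boundary segmentation: break after ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Word tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def tag(tokens):
    """Crude part-of-speech tagging by lookup plus a few surface rules."""
    tagged = []
    for token in tokens:
        if token.lower() in TOY_TAGS:
            pos = TOY_TAGS[token.lower()]
        elif token.isdigit():
            pos = "NUM"
        elif not token.isalnum():
            pos = "PUNCT"
        elif token[0].isupper():
            # Heuristic only: any other capitalised word is treated as a proper noun.
            pos = "PROPN"
        else:
            pos = "NOUN"
        tagged.append((token, pos))
    return tagged


text = "Bill Gates was the CEO of Microsoft until 2000. Microsoft is based in Redmond."
for sentence in split_sentences(text):
    print(tag(tokenize(sentence)))
```

The capitalisation rule gets some words wrong (it tags "CEO" as a proper noun, for instance), which is exactly why real taggers are statistical rather than rule-based.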
The second step in Named Entity Recognition is searching for the tokens we got from the previous step in a knowledge base. The knowledge base can be an ontology containing words, their meanings and the relationships between them.
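As a rough sketch of such a lookup, the snippet below searches token spans in a tiny dictionary that stands in for the knowledge base. The `KNOWLEDGE_BASE` contents, the function name and the greedy longest-match strategy are assumptions chosen for illustration; a real ontology is far richer than a flat dictionary.

```python
# Hypothetical miniature knowledge base: lower-cased token spans -> entity class.
KNOWLEDGE_BASE = {
    ("bill", "gates"): "Person",
    ("microsoft",): "Organisation",
}


def lookup_entities(tokens, knowledge_base, max_len=3):
    """Greedy longest-match search of token spans in the knowledge base."""
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest span first so "Bill Gates" wins over "Bill" alone.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + length])
            if span in knowledge_base:
                match = (" ".join(tokens[i:i + length]), knowledge_base[span])
                i += length
                break
        if match:
            found.append(match)
        else:
            i += 1
    return found


tokens = ["Bill", "Gates", "was", "the", "CEO", "of", "Microsoft", "until", "2000", "."]
print(lookup_entities(tokens, KNOWLEDGE_BASE))
# [('Bill Gates', 'Person'), ('Microsoft', 'Organisation')]
```

Note that "2000" is not found, because nothing in this toy knowledge base covers it.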
The search can also be done with deep learning models. This approach has the advantage of giving better results on new words that were not seen before (as opposed to the ontology, where we would get no results in this situation).
The third step in Named Entity Recognition applies when a search returns more than one result. In that case we need some statistical model to choose the best entity for our input.
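To illustrate the idea of choosing between several candidates, here is a small sketch that scores each candidate class by how many cue words appear in the surrounding context. The cue lists and the scoring rule are hand-picked assumptions for this example and only a crude stand-in for a trained statistical model, which would learn such weights from data.

```python
# Hypothetical cue words per class; a trained model would learn these from data.
CONTEXT_CUES = {
    "Person": {"president", "said", "born", "mr"},
    "Location": {"in", "state", "city", "visited"},
}


def choose_label(candidates, context_tokens, cues=CONTEXT_CUES):
    """Pick the candidate class whose cue words overlap most with the context."""
    context = {t.lower() for t in context_tokens}
    scores = {label: len(cues.get(label, set()) & context) for label in candidates}
    # On a tie, max() keeps the first candidate in the list.
    return max(candidates, key=lambda label: scores[label])


# "Washington" could be a person or a place; the context decides.
candidates = ["Person", "Location"]
print(choose_label(candidates, "She visited Washington in the spring".split()))
# Location
print(choose_label(candidates, "President Washington said little".split()))
# Person
```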
Lucky for us, we do not need to spend years researching to be able to use an NER model. We can use one of the best libraries in the industry at the moment, and that is spaCy. I highly encourage you to look it up. It has lots of functionality for basic and advanced NLP tasks. And doing NER is ridiculously easy, as you'll see.
First let's install spaCy and download the English model.
pip3 install spacy
python3 -m spacy download en_core_web_sm
Then open up your favourite editor. We will use two extracts from the Wikipedia page about Vue.js.
This will give us the following entities:
We can see that most of the entities have been identified correctly. There are no misidentifications (entities labelled as one thing when they should have been another), but there is one entity that was not identified at all ("AngularJS").
But all we needed were 4 lines of code and we got our Named Entity Recognition system! You can check the spaCy documentation for all the entity types it can identify.
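The original listing is not reproduced here, so as a rough sketch of what such a short spaCy script can look like, here is one that uses the Bill Gates sentence from earlier instead of the Vue.js extracts. The helper names `extract_entities` and `main` are my own. `spacy.load`, `doc.ents`, `ent.text` and `ent.label_` are part of spaCy's API, and `en_core_web_sm` is the model downloaded above.

```python
def extract_entities(nlp, text):
    """Run a spaCy pipeline over `text` and return (entity text, label) pairs."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def main():
    # Imported here so the helper above can be defined without spaCy installed.
    import spacy

    # Load the small English model downloaded in the previous step.
    nlp = spacy.load("en_core_web_sm")
    text = "Bill Gates was the CEO of Microsoft until 2000."
    for entity_text, label in extract_entities(nlp, text):
        print(entity_text, label)
```

Calling `main()` prints one line per entity found. The exact entities and labels depend on the model version, so no expected output is shown.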
We can visualise the results we get by adding only one line of code:
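The original one-liner is not reproduced here. One way to do this, assuming spaCy's built-in displaCy visualiser is what was used, is sketched below. The wrapper name `render_entities_html` is my own, while `displacy.render` with `style="ent"` is part of spaCy.

```python
def render_entities_html(doc):
    """Return an HTML snippet with the entities in `doc` highlighted."""
    # Imported lazily so that defining this helper does not require spaCy.
    from spacy import displacy

    # jupyter=False asks displaCy to return the markup instead of displaying it.
    return displacy.render(doc, style="ent", jupyter=False)
```

Here `doc` is the object returned by calling the loaded pipeline on a text. If you would rather view the result in a browser, `displacy.serve(doc, style="ent")` starts a small local web server instead.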
So in today's article we discussed a little bit about Named Entity Recognition and saw a simple example of how we can use spaCy to run a pretrained Named Entity Recognition model. Thank you so much for reading this article. I hope you enjoyed it as much as I enjoyed writing it!