Adam

Posted on Jun 1, 2019 • Updated on Jun 7, 2019 • Originally published at builtin.com

An easy introduction to natural language processing with Python

Computers are great at working with standardized and structured data like database tables and financial records. They are able to process that data much faster than we humans can. But us humans don’t communicate in “structured data” nor do we speak binary! We communicate using words, a form of unstructured data.

Unfortunately, computers suck at working with unstructured data because there’s no standardized techniques to process it. When we program computers using something like C++, Java, or Python, we are essentially giving the computer a set of rules that it should operate by. With unstructured data, these rules are quite abstract and challenging to define concretely.

There’s a lot of unstructured natural language on the internet; sometimes even Google doesn’t know what you’re searching for!

HUMAN VS COMPUTER UNDERSTANDING OF LANGUAGE

Human’s have been writing things down for thousands of years. Over that time, our brain has gained a tremendous amount of experience in understanding natural language. When we read something written on a piece of paper or in a blog post on the internet, we understand what that thing really means in the real-world. We feel the emotions that reading that thing elicits and we often visualize how that thing would look in real life.

Natural language processing (NLP) is a sub-field of artificial intelligence that is focused on enabling computers to understand and process human languages, to get computers closer to a human-level understanding of language. Computers don’t yet have the same intuitive understanding of natural language that humans do. They can’t really understand what the language is really trying to say. In a nutshell, a computer can’t read between the lines.

That being said, recent advances in machine learning have enabled computers to do quite a lot of useful things with natural language! Deep learning has enabled us to write programs to perform things like language translation, semantic understanding, and text summarization. All of these things add real-world value, making it easy for you to understand and perform computations on large blocks of text without the manual effort.

Let’s start with a quick primer on how NLP works conceptually. Afterwards we’ll dive into some Python code so you can get started with NLP yourself!

THE REAL REASON WHY NLP IS HARD

The process of reading and understanding language is far more complex than it seems at first glance. There are many things that go in to truly understanding what a piece of text means in the real-world. For example, what do you think the following piece of text means?

“Steph Curry was on fire last nice. He totally destroyed the other team”

To a human it’s probably quite obvious what this sentence means. We know Steph Curry is a basketball player; or even if you don’t we know that he plays on some kind of team, probably a sports team. When we see “on fire” and “destroyed” we know that it means Steph Curry played really well last night and beat the other team.

Computers tend to take things a bit too literally. Viewing things literally like a computer, we would see “Steph Curry” and based on the capitalization assume it’s a person, place, or otherwise important thing which is great! But then we see that Steph Curry “was on fire”…. A computer might tell you that someone literally lit Steph Curry on fire yesterday! … yikes. After that, the computer might say that Mr. Curry has physically destroyed the other team…. they no longer exist according to this computer… great…

The rest of this article with Python tutorials can be found here: https://builtin.com/data-science/easy-introduction-natural-language-processing