
amananandrai


News Articles Classifier NLP Notebook

I created a News Articles Classifier notebook in Python using the scikit-learn and NLTK libraries. scikit-learn provides machine learning algorithms for both supervised and unsupervised learning. Classification and regression are the two basic kinds of supervised learning tasks.

In this notebook, I classify news articles into four categories: "World News", "Sports News", "Business News" and "Science-Tech News". In machine learning, the categories are also known as labels or classes. Because this model has more than two classes, it is a multiclass classifier rather than a binary one.

NLTK stands for Natural Language Toolkit. It is a library for Natural Language Processing, used for text preprocessing tasks such as tokenization, stopword removal, stemming, and lemmatization.
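To make these preprocessing steps concrete, here is a minimal sketch, not the notebook's actual code. It uses NLTK's `RegexpTokenizer` and `PorterStemmer`, which work without downloading extra NLTK data. The `preprocess` helper and the tiny `STOPWORDS` set are assumptions made for illustration; NLTK ships a fuller list in `nltk.corpus.stopwords`, which needs a one-time download.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Tiny illustrative stopword list (an assumption, not NLTK's full list).
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and", "to", "for"}

tokenizer = RegexpTokenizer(r"[a-z]+")  # keep runs of lowercase letters only
stemmer = PorterStemmer()


def preprocess(text):
    """Lowercase, tokenize, drop stopwords, and stem a piece of text."""
    tokens = tokenizer.tokenize(text.lower())               # tokenization
    kept = [tok for tok in tokens if tok not in STOPWORDS]  # stopword removal
    return [stemmer.stem(tok) for tok in kept]              # stemming


print(preprocess("The markets are rallying on strong earnings"))
```

Lemmatization would follow the same pattern with `nltk.stem.WordNetLemmatizer`, but that requires the WordNet data to be downloaded first, so it is left out of this sketch.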

I have also made word clouds for each type of news article, i.e. world news, sports news, business news, and science and technology news.
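A word cloud draws each word at a size proportional to how often it occurs, so the data behind one is just a table of word counts. The sketch below computes those counts with the standard library only; the `sports_headlines` list and the `word_frequencies` helper are made up for illustration and are not taken from the notebook.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for one category's articles.
sports_headlines = [
    "team wins cup final",
    "coach praises team after final",
    "striker scores twice as team wins",
]


def word_frequencies(texts):
    """Count how often each whitespace-separated word occurs across texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


freqs = word_frequencies(sports_headlines)
print(freqs.most_common(3))  # the most frequent words would be drawn largest
```

Computing one such table per category is what makes the clouds differ between, say, sports and business news.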

I built the notebook on the Kaggle platform; the link is given at the end of this post.

Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. - Wikipedia

Kaggle notebooks, previously known as kernels, are a cloud-based workbench for running machine learning and data science programs. They support two languages, Python and R. The notebooks use the Jupyter notebook format and support Markdown blocks for writing text.

In this notebook I have used five different classification algorithms and compared their accuracies to see how each performs on this dataset. The dataset contains two files, one for training and the other for testing. The training set consists of 120,000 news articles, of which I have used 50% for training. The dataset is also balanced, which means it contains equal proportions of all classes.
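As a rough sketch of how one might check class balance and keep half of the training data, the snippet below uses `collections.Counter` and scikit-learn's `train_test_split`. The stand-in `texts` and `labels` are invented, and stratified sampling is an assumption made here; the post does not say how the 50% was selected.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 8 articles for each of the 4 classes.
labels = ["World News", "Sports News", "Business News", "Science-Tech News"] * 8
texts = [f"article {i}" for i in range(len(labels))]

# Equal counts per class indicate a balanced dataset.
print(Counter(labels))

# Keep 50% of the rows; stratify=labels preserves the class proportions.
train_texts, _, train_labels, _ = train_test_split(
    texts, labels, train_size=0.5, stratify=labels, random_state=0
)
print(Counter(train_labels))
```

Stratifying keeps the subsample balanced too, so accuracy remains a reasonable metric for comparing models.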

The classification algorithms used are:

1 - Multinomial Naive Bayes
2 - Decision Tree
3 - Gaussian Naive Bayes
4 - Stochastic Gradient Descent Classifier
5 - Light Gradient Boosting Machine Classifier
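The comparison can be sketched as a loop over scikit-learn estimators. This is an illustration under assumptions, not the notebook's code: the toy articles are invented, TF-IDF features are an assumed choice, and LightGBM is left out to keep the example dependent on scikit-learn only (its `LGBMClassifier` follows the same `fit`/`predict` interface).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; the real dataset has 120,000 training articles.
train_texts = [
    "president signs peace treaty", "un holds summit on borders",
    "team wins championship final", "striker scores winning goal",
    "stocks rally as profits rise", "bank raises interest rates",
    "new chip speeds up computers", "scientists launch space probe",
]
train_labels = ["World", "World", "Sports", "Sports",
                "Business", "Business", "Sci-Tech", "Sci-Tech"]
test_texts = ["team scores in final", "bank profits rise",
              "peace summit held", "space chip launch"]
test_labels = ["Sports", "Business", "World", "Sci-Tech"]

# Turn raw text into numeric features (TF-IDF is an assumption here).
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts).toarray()
X_test = vectorizer.transform(test_texts).toarray()
# Dense arrays are used because GaussianNB does not accept sparse input.

models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "SGD Classifier": SGDClassifier(random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    predictions = model.predict(X_test)
    accuracies[name] = accuracy_score(test_labels, predictions)

for name, acc in accuracies.items():
    print(f"{name}: {acc:.2f}")
```

On a dataset of realistic size, converting to dense arrays can exhaust memory, so in practice only Gaussian Naive Bayes would be given dense input.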

I have used Python's matplotlib and Seaborn libraries for plotting graphs and data visualization.

Please go through the notebook and give your valuable feedback. Also, if you like the notebook, upvote it on Kaggle. 👨😄

The link to the notebook is
https://www.kaggle.com/amananandrai/news-article-classifier-with-different-models
