Detecting Fake News with Python and Machine Learning

Context

The prevalence of fake news has increased with the recent rise of social media, especially the Facebook News Feed, and this misinformation is gradually seeping into the mainstream media. Several factors have been implicated in the spread of fake news, such as political polarization, post-truth politics, motivated reasoning, confirmation bias, and social media algorithms.
Fake news can reduce the impact of real news by competing with it. For example, a BuzzFeed analysis found that the top fake news stories about the 2016 U.S. presidential election received more engagement on Facebook than top stories from major media outlets. It can also undermine trust in serious media coverage. The term has at times been used to cast doubt upon credible news.
Multiple strategies for fighting the various types of fake news are being actively researched. Politicians in certain autocratic and democratic countries have demanded effective self-regulation and legally enforced regulation, in varying forms, of social media and web search engines.
On an individual scale, the ability to actively confront false narratives, as well as taking care when sharing information, can reduce the prevalence of falsified information. However, it has been noted that this is vulnerable to the effects of confirmation bias, motivated reasoning and other cognitive biases that can seriously distort reasoning, particularly in dysfunctional and polarized societies. Inoculation theory has been proposed as a method to render individuals resistant to undesirable narratives.

Key Terms

Before proceeding, it is important to become acquainted with certain key terms that will be used throughout this article.

Python

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.

Machine Learning

Machine learning is an application of AI that gives a system the ability to learn without being explicitly programmed. It works on data and learns from that data. This is very different from the traditional approach, in which a developer writes the rules by hand: in machine learning, we feed in the data and the machine derives the rules itself. There are three types of machine learning:

Supervised learning
Unsupervised learning
Reinforcement learning

Supervised learning means we train our model on labeled examples, so the machine first learns from those examples and then performs the task on unseen data.

Fake news

Put simply, fake news is information that misleads people. Nowadays fake news spreads rapidly, and people share it without verifying it. It is often created to further or impose certain ideas, frequently in service of a political agenda.
For media outlets, the ability to attract viewers to their websites is necessary to generate online advertising revenue, which gives some publishers an incentive to run sensational or false stories. This is why detecting fake news is necessary.

TfidfVectorizer

The TfidfVectorizer, from scikit-learn, converts a collection of raw documents into a matrix of TF-IDF features.
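As a minimal sketch of how this looks in practice, the snippet below runs TfidfVectorizer on a toy corpus. The three sentences are invented for illustration and are not from any real dataset; the vectorizer is used with its default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, invented for illustration only.
documents = [
    "the president signed the bill",
    "aliens signed the bill",
    "the economy grew last quarter",
]

vectorizer = TfidfVectorizer()  # default settings: lowercases, splits into words
tfidf_matrix = vectorizer.fit_transform(documents)  # sparse matrix of weights

# One row per document, one column per distinct word in the corpus.
print(tfidf_matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

Each row of the resulting matrix is the numerical representation of one document, which is the form a classifier can work with.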

IDF (Inverse Document Frequency)

The IDF measures how significant a word is across an entire corpus. It is based on the number of documents in which the word appears: a word that occurs in many documents receives a low IDF, while a word that occurs in only a few receives a high one.

TF (Term Frequency)

The TF, unlike the IDF, is computed per document: it is the number of times a word appears in a single document.
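The two quantities can be sketched by hand as follows. This uses the plain textbook form of IDF, log(N / df), on a made-up corpus; the function names are hypothetical. scikit-learn's TfidfVectorizer applies a smoothed variant by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already split into lowercase words.
corpus = [
    ["the", "president", "signed", "the", "bill"],
    ["aliens", "signed", "the", "bill"],
    ["the", "economy", "grew"],
]

def term_frequency(word, document):
    """Number of times `word` occurs in a single document."""
    return Counter(document)[word]

def inverse_document_frequency(word, corpus):
    """log(N / df), where df is the number of documents containing `word`.

    Assumes the word occurs in at least one document (df > 0).
    """
    df = sum(1 for document in corpus if word in document)
    return math.log(len(corpus) / df)

print(term_frequency("the", corpus[0]))              # count in the first document
print(inverse_document_frequency("the", corpus))     # appears in every document
print(inverse_document_frequency("aliens", corpus))  # appears in only one document
```

A word such as "the", which appears in every document, gets an IDF of zero here, while a rare word such as "aliens" gets a higher value.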

Passive Aggressive Classifier

Passive Aggressive algorithms are a family of online learning algorithms, meaning they update the model one example at a time (for example, on a stream of Twitter data). They remain passive when an example has been correctly classified and become aggressive when a misclassification takes place, thus constantly updating and adjusting the model.
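The passive and aggressive behavior can be illustrated with a hand-written sketch of the basic update rule for a linear classifier. This is not scikit-learn's implementation; the function name, the two-feature examples, and the +1/-1 label encoding are all assumptions made for illustration. In this form the model stays passive only when an example is classified correctly with a margin of at least 1.

```python
import numpy as np

def passive_aggressive_update(weights, x, y):
    """One step of the basic PA rule. y must be +1 or -1, x must be non-zero."""
    loss = max(0.0, 1.0 - y * np.dot(weights, x))  # hinge loss on this example
    if loss == 0.0:
        return weights  # passive: correct with enough margin, no change
    step = loss / np.dot(x, x)  # aggressive: just enough to fix this example
    return weights + step * y * x

# Hypothetical stream of examples: +1 stands for real news, -1 for fake news.
stream = [
    (np.array([1.0, 0.0]), 1),
    (np.array([0.0, 1.0]), -1),
    (np.array([2.0, 0.0]), 1),
]

weights = np.zeros(2)
for x, y in stream:
    weights = passive_aggressive_update(weights, x, y)

print(weights)
```

The first two examples trigger an update because the initial all-zero model has no margin on them; the third is already classified correctly, so the weights are left untouched.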

Natural Language Processing

Machine learning algorithms work only with numerical features, so we have to convert text data into numerical columns. To do so, we first have to preprocess the text, which is a task of natural language processing.
Text preprocessing is carried out by stemming, lemmatization, removing stopwords, removing special symbols and numbers, and so on. After cleaning the data, we feed the text into a vectorizer, which converts it into numerical features.

Cleaning Data

We can’t use text data directly because it contains unusable words, special symbols, and other noise. If we used it without cleaning, it would be very hard for the ML algorithm to detect patterns in the text, and it might sometimes raise an error. We therefore always have to clean text data first.
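A minimal cleaning step might look like the sketch below, which uses only the standard library. The stopword list is a tiny hand-picked set assumed for this example (real projects usually take one from an NLP library), and the clean_text name and the sample headline are invented.

```python
import re

# Tiny hand-picked stopword list, an assumption made for this sketch.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}

def clean_text(text):
    """Lowercase the text, drop non-letters, and remove stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # digits, punctuation, symbols -> space
    words = text.split()
    return " ".join(word for word in words if word not in STOPWORDS)

print(clean_text("BREAKING!!! 5 Aliens are in the White House #shocking"))
```

Note that this sketch keeps only the letters a to z, so it is suitable for English text only.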

Lemmatization:

Lemmatization converts a word or token to its base form (for example, "studies" becomes "study").

Split the Data

Splitting the data is an essential step in machine learning. We train our model on the training set and evaluate it on the test set. We split the data into training and test sets using the train_test_split function from scikit-learn.
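As a sketch, the split can be done as follows. The ten headlines and their labels are placeholders invented for illustration, and the 80/20 ratio and the random_state value are arbitrary choices rather than recommendations.

```python
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data: ten headlines with alternating labels.
texts = [f"headline {i}" for i in range(10)]
labels = ["REAL" if i % 2 == 0 else "FAKE" for i in range(10)]

# Hold out 20% of the examples for testing; fixing random_state makes
# the shuffle reproducible between runs.
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=7
)

print(len(x_train), len(x_test))
```

The model should only ever be fitted on x_train and y_train; the test set is kept aside so that the evaluation reflects performance on unseen data.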

Libraries

You may need to install the following libraries in Python. They can be installed with pip:
pip3 install pandas
pip3 install scikit-learn
pip3 install numpy

Classification Metrics

To check how well our model performs, we use metrics that measure the quality of its predictions. Many classification metrics are available in scikit-learn:

Confusion Matrix
Accuracy Score
Precision
Recall
F1-Score

Confusion matrix:

This is a table that shows, for each class, how many results were predicted correctly and how many were not.

Accuracy Score:

It is the number of correct predictions divided by the total number of predictions.
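Both metrics can be computed with scikit-learn, as in the sketch below. The true and predicted labels are made up for illustration and do not come from a trained model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and model predictions for six articles.
y_true = ["REAL", "FAKE", "REAL", "REAL", "FAKE", "FAKE"]
y_pred = ["REAL", "FAKE", "FAKE", "REAL", "FAKE", "REAL"]

# Rows are true classes and columns are predicted classes,
# both in the order given by `labels`.
matrix = confusion_matrix(y_true, y_pred, labels=["FAKE", "REAL"])
accuracy = accuracy_score(y_true, y_pred)

print(matrix)
print(accuracy)
```

Here four of the six predictions are correct. The confusion matrix additionally shows that one fake article was predicted as real and one real article was predicted as fake.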

Conclusions

We are in the era of Machine Learning. One of the great things about it is that, although preprocessing data and training a model can be difficult, it is possible to fine-tune the model and obtain state-of-the-art results on your dataset.
More complex and efficient methods could surely be applied to such datasets, for example by using the entire text or extracting different features.