This tutorial shows how to extract only the relevant HTML from an article or blog post, given its URL, in Python.
Most of us have used the Pocket app on our phones or browsers to save links and read them later. It is like a bookmarking app that also saves the link's contents. After adding a link to Pocket, you can see that it extracts only the main content of the article and discards everything else, such as the website's footer, menu, sidebar (if any) and user comments. We will not be getting into the algorithms for identifying the HTML tag with the most text content. You can read the discussion here on stackoverflow about the algorithms.
Newspaper
Newspaper is a Python module that deals with extracting text/html from URLs. Besides this, it can also extract an article's title, authors, publish time, images, videos etc. And if used in conjunction with nltk, it can also extract the article's keywords and summary.
In this tutorial, we will be using newspaper to extract the html tag containing the relevant contents of the URL, and then we will expose this functionality on the web using flask.
Get started
To keep the tutorial simple, we will be working in the global Python environment. In practice, you should follow all the steps described below inside a virtual environment.
- Install newspaper and flask using pip.
  - For python2.x, pip install newspaper flask
  - For python3.x, pip install newspaper3k flask
  - If there is some error while installing newspaper you can read the detailed guide specific to your platform here.
- Create a file called extractor.py with the following code.
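from newspaper import Article, Config

# Ask newspaper to keep the cleaned-up html of the main article body.
config = Config()
config.keep_article_html = True


def extract(url):
    article = Article(url=url, config=config)
    article.download()  # fetch the raw page
    article.parse()     # locate and clean the main content
    return dict(
        title=article.title,
        text=article.text,
        # article.html is the full page; article_html holds only the
        # extracted article markup (filled in by keep_article_html).
        html=article.article_html,
        image=article.top_image,
        authors=article.authors,
    )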
This is the only code we need for the main part thanks to newspaper, which does all the heavy lifting. It identifies the html element containing the most relevant data, cleans it up, and removes any script, style and other irrelevant tags that are not likely to make up the main article.
In the above code, Article is newspaper's way of representing the article from a URL. By default, newspaper does not save the article content's html, to save some extra processing. That is why we are importing Config from newspaper and creating a custom configuration telling newspaper to keep the article html.
- In the extract function that accepts a url, we first create an Article instance, passing in the url and the custom config.
- Then we download the full article html with article.download(). At this point, newspaper still hasn't processed the full html.
- Now we call article.parse(). After this, all the necessary data related to the article at the url will be generated. The data includes the article's title, full text, authors, image etc.
- Then we return the data that we need in a dict.
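To make the cleanup step more concrete, here is a minimal sketch of dropping script and style content using only the standard library. It is an illustration of the idea, not newspaper's actual implementation; the names SKIPPED_TAGS, TextOnlyParser and visible_text are hypothetical.

```python
from html.parser import HTMLParser

# Tags whose contents are never part of the readable article.
# This set is an assumption for the sketch, not newspaper's own list.
SKIPPED_TAGS = {"script", "style"}


class TextOnlyParser(HTMLParser):
    """Collect text nodes, ignoring anything inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text of html with script and style contents removed."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For example, visible_text("<p>Hello</p><script>var x = 1;</script>") keeps the paragraph text and discards the script body. A real extractor such as newspaper goes much further, since it also has to decide which element holds the main article.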
Exposing as an API
Now that we have created the functionality to extract articles, we will make it available on the web so that we can test it out in our browsers. We will be using flask to make an API. Here is the code; save it as app.py next to extractor.py.
from flask import (
    Flask,
    jsonify,
    request
)
from extractor import extract

app = Flask(__name__)


@app.route('/')
def index():
    # A minimal form that sends the url to /extract as a query parameter.
    return """
    <form action="/extract">
        <input type="text" name="url" placeholder="Enter a URL" />
        <button type="submit">Submit</button>
    </form>
    """


@app.route('/extract')
def extract_url():
    url = request.args.get('url', '')
    if not url:
        # 400 Bad Request: the required url parameter is missing.
        return jsonify(
            type='error', result='Provide a URL'), 400
    return jsonify(type='success', result=extract(url))


if __name__ == '__main__':
    app.run(debug=True, port=5000)
- We first create a Flask app.
- In its index route we return a simple form with a text input where users will paste the url and then submit the form to the /extract route specified in action.
- In the extract_url function, we get the url from request.args and check if it is empty. If it is empty, we return an error. Otherwise we pass the url to the extract function that we created and then return the result using jsonify.
- Now you can simply run python app.py and head over to http://localhost:5000 in your browser to test the implementation.
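Besides the browser, the API can also be called from Python. The following is a minimal sketch using only the standard library, assuming the server is running locally on port 5000 as configured above; API_BASE, build_extract_url and fetch_article are hypothetical names chosen for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the flask app is running locally with port=5000.
API_BASE = "http://localhost:5000"


def build_extract_url(article_url, api_base=API_BASE):
    """Return the /extract URL with article_url percent-encoded."""
    return f"{api_base}/extract?{urlencode({'url': article_url})}"


def fetch_article(article_url, api_base=API_BASE):
    """Call the running API and return the decoded JSON response."""
    with urlopen(build_extract_url(article_url, api_base)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Encoding the article URL with urlencode matters because the article URL may itself contain characters such as ? and &, which would otherwise be read as part of our own query string. Note that urlopen raises an HTTPError when the API answers with an error status.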
Few things to remember
- Instead of checking the url for just an empty string, we can also use a regex to verify that the url is in fact a url.
- We should also check that the url does not point to our own domain, since a url pointing at our own /extract route would make the server call the same extract_url function again and again in an infinite loop.
- newspaper will not always be able to extract the most relevant html. Its results depend on how the content is organized on the source url's website. Sometimes it may return one or two extra paragraphs, and sometimes fewer. But for most standard news websites and blogs, it usually returns the most relevant html.
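The first two checks can be sketched as follows. This example swaps the regex for urllib.parse from the standard library, which is a different technique from the one named above but serves the same purpose. The OWN_HOSTS set and both function names are assumptions for local development; adjust them to wherever the API is actually deployed.

```python
from urllib.parse import urlparse

# Hosts that serve this API itself; an assumption for local development.
OWN_HOSTS = {"localhost", "127.0.0.1"}


def is_valid_url(url):
    """True for absolute http(s) URLs that include a host name."""
    try:
        parts = urlparse(url)
    except ValueError:  # raised for some malformed inputs
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


def is_own_host(url, own_hosts=OWN_HOSTS):
    """True when url is valid and points back at one of our own hosts."""
    if not is_valid_url(url):
        return False
    # urlparse already lowercases the hostname.
    return urlparse(url).hostname in own_hosts
```

In extract_url, these could be used to return an error before calling extract whenever is_valid_url(url) is false or is_own_host(url) is true.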
Things to do next
The above demonstration is a very simple application that takes in a URL, returns the html and then forgets about it. But to make this more useful, you can take it a step further by:
- Adding a database to save each url and its extracted contents, so that you can return the result from the DB if the same URL is provided again.
- Adding some more advanced APIs, like returning a list of the most recently queried/saved URLs with their titles and other contents.
- Using the API service to create a web or Android/iOS app similar in features to Pocket.
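As a starting point for the database idea, here is a minimal sketch using sqlite3 from the standard library. The table layout and the names open_cache, cached_extract and recent_urls are all hypothetical. The extractor argument is any function that takes a url and returns a JSON-serializable dict, such as the extract function from extractor.py.

```python
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the cache database and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        " url TEXT PRIMARY KEY,"
        " data TEXT NOT NULL,"
        " saved_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def cached_extract(conn, url, extractor):
    """Return the stored result for url, calling extractor only on a miss."""
    row = conn.execute(
        "SELECT data FROM articles WHERE url = ?", (url,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    result = extractor(url)
    conn.execute(
        "INSERT INTO articles (url, data) VALUES (?, ?)",
        (url, json.dumps(result)),
    )
    conn.commit()
    return result


def recent_urls(conn, limit=10):
    """Return the most recently saved urls, newest first."""
    rows = conn.execute(
        "SELECT url FROM articles"
        " ORDER BY saved_at DESC, rowid DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [row[0] for row in rows]
```

Passing a file path to open_cache instead of the default ":memory:" keeps the saved articles between restarts. The results are stored as JSON text, which works here because the dict returned by extract contains only strings and lists.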
This post was originally published on bitwiser.in