Paymon Wang Lotfi

Create a Text Summarization API With Flask, Sumy, and Trafilatura

Text summarization APIs can be expensive, inconsistent, and inaccurate. Customize one for your use case with these Python libraries.

Writing the API

After experimenting with several summarization libraries, I found sumy to be the most accurate. No new text is generated: sumy simply scores each sentence by significance and returns the highest-scoring ones. Because the first paragraph of an article usually attempts to summarize the page, I also wanted to include it with every response.
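As a rough standalone sketch of that behavior (the input text and sentence count here are placeholders; sumy's Tokenizer depends on NLTK's punkt data):

import nltk
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

nltk.download('punkt')  # sentence tokenizer used by sumy

# placeholder input; any long article text works
text = "Sentence one of a long article. Sentence two. Sentence three..."

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LsaSummarizer(Stemmer("english"))
summarizer.stop_words = get_stop_words("english")

# print the three highest-scoring sentences, verbatim from the source
for sentence in summarizer(parser.document, 3):
    print(sentence)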

Although sumy’s scoring system performs well, its article extraction mechanism is not great. It will often interpret comments, advertisements, and unrelated subsections as part of the story. This is fine because the unrelated sentences will score very low and be excluded from the result. To consistently extract the first paragraph and determine the length of a story, trafilatura is far more effective.
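A quick sketch of the trafilatura side, using the same lead-paragraph heuristic (first paragraph longer than 150 characters) that the API code below relies on; the URL is the example article used later in this post:

import trafilatura

url = "https://www.nature.com/articles/d41586-020-02706-6"

downloaded = trafilatura.fetch_url(url)
text = trafilatura.extract(downloaded, include_comments=False, include_tables=False) if downloaded else None

if text:
    # paragraphs come back separated by newlines; treat the first
    # sufficiently long one as the article's lead paragraph
    first_paragraph = next((p for p in text.split("\n") if len(p) > 150), "")
    print(first_paragraph)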

The /url/ endpoint takes two parameters: the url and the relative length of the summary on a scale from 0 to 1. The /text/ endpoint takes plain text instead of a URL, which is useful for summarizing long comments. The code in this guide is only a basic template; a production API would add authentication, error handling, database connections, and other features.
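For example, once the app below is running locally on Flask's default port 5000, the /text/ endpoint can be exercised with the requests library (a sketch; the comment text is just a placeholder):

import requests

comment = (
    "First sentence of a long comment. Second sentence with more detail. "
    "Third sentence. Fourth sentence that wraps things up."
)

# request a fairly short summary of the comment
r = requests.post(
    "http://127.0.0.1:5000/text/",
    data={"text": comment, "length": 0.3},
)
print(r.text)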

import trafilatura
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from flask import Flask, request, jsonify, Response
from ratelimit import limits
import nltk

# sumy's Tokenizer relies on NLTK's punkt sentence tokenizer
nltk.download('punkt')

app = Flask(__name__)

@app.route('/url/', methods=['POST'])
@limits(calls=1, period=1)
def respond():

    # "url" is the article to summarize; "length" is the fraction of
    # its sentences to keep, on a scale from 0 to 1
    url = request.form.get("url", None)
    length = request.form.get("length", None)

    LANGUAGE = "english"
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)
    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    # trafilatura gives a cleaner extraction of the article body than
    # sumy's HTML parser (extract() replaces the older process_record()
    # from earlier trafilatura versions)
    downloaded = trafilatura.fetch_url(url)
    y = trafilatura.extract(downloaded, include_comments=False, include_tables=False, deduplicate=True, target_language="en", include_formatting=False)

    response = []

    # if trafilatura failed, fall back to sumy's own parse and skip the
    # lead paragraph; otherwise take the first sufficiently long
    # paragraph as the lead and size the summary by paragraph count
    if y is None:
        firstParagraph = ""
        l = len(parser.document.sentences)
        SENTENCES_COUNT = int(l*float(length))
    else:
        firstParagraph = ""
        l = len(y.split("\n"))
        SENTENCES_COUNT = int(l*float(length))
        for p in y.split("\n"):
            if len(p) > 150:
                firstParagraph = p
                break

    # prepend the lead paragraph, then append the top-scoring sentences,
    # skipping any sentence that already appears in the lead paragraph
    if firstParagraph != "":
        response.append(firstParagraph + "\n\n")
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        if str(sentence) not in firstParagraph:
            response.append(str(sentence) + "  ")

    res = "".join(response)

    return Response(res, mimetype="text/plain")


@app.route('/text/', methods=['POST'])
@limits(calls=1, period=1)
def respond_text():

    # summarize raw text passed in the "text" form field
    y = request.form.get("text", None)
    length = request.form.get("length", None)

    LANGUAGE = "english"
    parser = PlaintextParser.from_string(y, Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)
    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)

    response = []

    # approximate the sentence count by splitting on ". ", then keep
    # roughly length * count sentences (doubled)
    l = len(y.split(". "))
    SENTENCES_COUNT = int(l*float(length))*2

    for sentence in summarizer(parser.document, SENTENCES_COUNT):
        response.append(str(sentence) + "  ")

    res = "".join(response)

    return Response(res, mimetype="text/plain")



if __name__ == '__main__':

    app.run(threaded=True, port=5000)


Deployment

I deployed the app using the Heroku CLI, with the Procfile and requirements.txt shown below. It took a few iterations on these files to get the API from running locally to running live, using the Heroku logs as a guide. If you haven't deployed to Heroku before, the official Heroku documentation walks through the process.

requirements.txt:

trafilatura
flask
gunicorn
sumy
nltk
numpy
ratelimit
Procfile:

web: gunicorn app:app

Make a call to the API by sending a POST request to https://text-summarize-api.herokuapp.com/url/ with url and length form fields, e.g. url=https://www.nature.com/articles/d41586-020-02706-6 and length=0.3.
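Since the endpoints read their parameters from the POST form data, the equivalent call with the requests library looks like this (a sketch, assuming the Heroku app above is still running):

import requests

r = requests.post(
    "https://text-summarize-api.herokuapp.com/url/",
    data={
        "url": "https://www.nature.com/articles/d41586-020-02706-6",
        "length": 0.3,
    },
)
print(r.text)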

Conclusion

You might be better off subscribing to one of the many existing summarization APIs, especially if you're integrating one into a paid service. Keep in mind, however, that a custom API can perform comparably or better, can be tailored to your project, and is completely free at smaller scales.
