
Thoughts about sanitizing this Python RSS-scraping code?

Katie ・3 min read

I've banged out a quick-and-dirty Python script to generate HTML that displays the bilingual tech comic "CommitStrip" side-by-side in French and in English, so that my eyes can better compare the phrasing between the two versions.

It helps me learn the idioms more effectively.

I was thinking I'd throw it up as a single page on a hosting service like Heroku, at a secret URL, and visit it every so often. (So, ultimately, I'd have more Python, to make sure there's a server listening for a request to return HTML.)

Question is ... I'm not an experienced web developer.

What might I want to sanitize, and what would be overkill?

  • Having my browser visit the images, for example, seems no more dangerous than going to their web site and doing the same thing. But I guess I'd want to somehow make sure the "images" are just a simple URL of a format something like the one they typically follow? Does that seem necessary?
  • Should I worry about sanitizing the "response" overall in any way before it even hits my Python modules like ElementTree? (Is there any particular danger of CommitStrip's RSS feed being compromised when the rest of their web site isn't harming all their visitors?)
  • etc.

I have a general sense that you're not supposed to just "take stuff from strangers and display it in a browser," but I'm not sure how that plays out when it comes to "scraping web sites and redisplaying their contents."

Thanks for any tips!

(old code removed)

Updated code, as deployed successfully (but not sure how securely) to Heroku:

import os
import requests
from bs4 import BeautifulSoup
import bleach
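from flask import Flask, render_template
# flask.Markup was removed in Flask 3.0; MarkupSafe ships as a Flask dependency
from markupsafe import Markup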


# Initialize the Flask application
app = Flask(__name__)

contentstring = '{http://purl.org/rss/1.0/modules/content/}'
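# Fetch both feeds; the timeout keeps a slow upstream from hanging startup
resp_fr = requests.get('https://www.commitstrip.com/fr/feed/', timeout=10)
resp_en = requests.get('https://www.commitstrip.com/en/feed/', timeout=10)

# Keep only the tags and attributes used below; strip=True drops everything else
fcl = bleach.clean(resp_fr.text if resp_fr.status_code==200 else '', tags=['rss','item','img','guid','title'], attributes={'img':['src','class']}, strip=True)
ecl = bleach.clean(resp_en.text if resp_en.status_code==200 else '', tags=['rss','item','img','guid','title'], attributes={'img':['src','class']}, strip=True)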

comicsDict = {}
for (lang, postslist) in [('fr', BeautifulSoup(fcl, 'html.parser').find_all('item')), ('en', BeautifulSoup(ecl, 'html.parser').find_all('item'))]:
    for post in postslist:
        # The guid ends in '?p=<post id>'; this code assumes the French and English posts share that id
        masterid = post.find('guid').get_text().split('?p=')[1]
        if masterid not in comicsDict:
            comicsDict[masterid] = {}
        comicsDict[masterid][lang+'_'+'title']=post.find('title').get_text()
        comicsDict[masterid][lang+'_'+'imgurl']=post.find('img', {'class':'alignnone'})['src']

# index.html already supplies the <html>, <head>, and <body> wrapper, so only the comic rows are built here
htmlstring = Markup(
        ''.join(['<div class="comic" style="width:100%;"><div class="fr" style="float: left; width: 50%;"><img src="' + v['fr_imgurl'] + '" style="width:100%;"><figcaption>' + v['fr_title'] + '</figcaption></div> <div class="en" style="float: left; width: 50%;"><img src="' + v['en_imgurl'] + '" style="width:100%;"><figcaption>' + v['en_title'] + '</figcaption></div><br style="clear: both;" /></div>'
                 for k,v in comicsDict.items()
                 # Skip comics present in only one feed; indexing them would raise a KeyError
                 if 'fr_imgurl' in v and 'en_imgurl' in v])
    )




# Define content for the home page of our app
@app.route('/')
def index():
    return render_template('index.html', thepage=htmlstring)

# Start the server
#'''
# This seems to work well when live on Heroku (localhost, my firewall yells at me)
if __name__ == '__main__':
    port = int(os.environ.get("PORT", 5000))
    app.run(host="0.0.0.0", port=port)
#'''

'''
# This seems to work well when testing on localhost
if __name__ == '__main__':
    app.run()
'''

And the template, index.html (Flask looks for it in a templates/ folder):

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <title>Commitstrip Bilingue</title>
  </head>
  <body>
    <div class="wholepage">
        {{ thepage }}
    </div>
  </body>
</html>
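
One gap in the Python code above may be worth closing: get_text() turns entities such as &lt; back into literal characters, and wrapping the concatenated string in Markup tells Jinja not to escape it again. Below is a minimal sketch of escaping each value before it is concatenated, using the standard library's html.escape. The helper name comic_cell and the example URL are hypothetical, not part of the deployed code.

```python
from html import escape


def comic_cell(lang, title, imgurl):
    """Build one half of a side-by-side comic row with every value escaped."""
    return (
        '<div class="{lang}" style="float: left; width: 50%;">'
        '<img src="{src}" style="width:100%;">'
        '<figcaption>{caption}</figcaption></div>'
    ).format(
        # quote=True also escapes " and ', which matters inside attributes
        lang=escape(lang, quote=True),
        src=escape(imgurl, quote=True),
        caption=escape(title),
    )


# Hypothetical values: a title carrying markup and a made-up image URL
cell = comic_cell(
    'fr',
    'Le <script> maudit',
    'https://www.commitstrip.com/wp-content/uploads/example.jpg',
)
```

Another option would be to pass comicsDict to the template and let Jinja's autoescaping handle the values, which avoids Markup entirely.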

Posted by Katie (@katiekodes)

I'm looking to help you reach your #code goals in #French. My goal is to help you work faster by sharing what I know about #SQL, #Python, and #Salesforce in #English and #French

Discussion


Salut Katie!

A few observations:

  • I wouldn't worry much about the images; you trust the source, don't you? All the images on CommitStrip seem to live under commitstrip.com/wp-content/uploads..., so you could add a filter that checks they come from that domain

  • If CommitStrip is ever compromised and someone takes over the website and the RSS feed, there is a (remote) possibility of script injection, BUT you can tell BeautifulSoup to remove <script> tags with decompose() or extract()
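
A minimal sketch of the second point, assuming the feed text has already been fetched. The function name strip_scripts and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def strip_scripts(markup):
    """Return the markup with every <script> element removed."""
    soup = BeautifulSoup(markup, 'html.parser')
    for tag in soup.find_all('script'):
        # decompose() removes the tag and its contents from the tree;
        # extract() would also remove it, but hands the tag back to the caller
        tag.decompose()
    return str(soup)


# Hypothetical feed fragment with an injected script
sample = '<item><title>Hello</title><script>alert(1)</script><img src="a.png"/></item>'
cleaned = strip_scripts(sample)
```

Removing <script> alone does not cover event-handler attributes such as onerror; the attribute allowlist passed to bleach.clean in the post handles those.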

This way you can be sure you never inject a script tag into your HTML in place of an image. It's probably not necessary anyway.
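
The domain filter from the first point could look something like the sketch below. The exact host names, the https requirement, and the trailing slash on the path prefix are assumptions based on the URL pattern mentioned above; is_trusted_image_url is a hypothetical helper.

```python
from urllib.parse import urlparse

# Assumptions: adjust to whatever hosts and paths the feed really uses
ALLOWED_HOSTS = {'commitstrip.com', 'www.commitstrip.com'}
ALLOWED_PATH_PREFIX = '/wp-content/uploads/'


def is_trusted_image_url(url):
    """True only for https URLs on an allowed host under the uploads path."""
    parts = urlparse(url)
    return (
        parts.scheme == 'https'
        # Compare the whole hostname, so 'commitstrip.com.evil.example' is rejected
        and parts.hostname in ALLOWED_HOSTS
        and parts.path.startswith(ALLOWED_PATH_PREFIX)
    )
```

A check like this would run on each img src before it is stored, skipping any item whose URL fails it.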

I have a general sense that you're not supposed to just "take stuff from strangers and display it in a browser," but I'm not sure how that plays out when it comes to "scraping web sites and redisplaying their contents."

This depends on their content policy.

 

Thanks -- these are exactly the kinds of tips I was looking for!

 

Awesome share😀

 

Thanks! I've kind of fallen in love with Commit Strip as a language tool. They do great work, writing bilingual humor -- tough stuff!