
Comparing the same web scraper in Haskell, Python, Go

Ryan Westlund (yujiri8) ・4 min read

So this project started with a need - or, not really a need, but an annoyance I realized would be a good opportunity to strengthen my Haskell, even if the solution probably wasn't worth it in the end.

There's a blog I follow (Fake Nous) that uses WordPress, meaning its comment section mechanics and account system are as convoluted and nightmarish as Haskell's package management. In particular, I wanted to see if I could stop relying on kludgy WordPress notifications that only seem to work occasionally, and instead write a web scraper that fetches the page, finds the recent comments element, and checks whether a new comment has been posted.

I've done the bulk of the job now - I wrote a Haskell script that outputs the "Name on Post" string of the most recent comment. And I thought it'd be interesting to compare the Haskell solution to Python and Go solutions.

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TupleSections #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE MultiWayIf #-}
{-# LANGUAGE ViewPatterns #-}

import Network.HTTP.Req
import qualified Text.HTML.DOM as DOM
import qualified Text.XML.Cursor as Cursor
import qualified Text.XML.Selector as Selector
import qualified Data.XML.Types as Types
import qualified Text.XML as XML
import Data.Text (Text, unpack)
import Control.Monad

main = do
    resp <- runReq defaultHttpConfig $ req GET (https "fakenous.net") NoReqBody lbsResponse mempty
    let dom = Cursor.fromDocument $ DOM.parseLBS $ responseBody resp
        recentComments = XML.toXMLNode $ Cursor.node $ head $ Selector.query "#recentcomments" $ dom
        newest = head $ Types.nodeChildren recentComments
    putStrLn $ getCommentText newest

getCommentText commentElem =
    let children = Types.nodeChildren commentElem
    in foldl (++) "" $ unwrap <$> children

unwrap :: Types.Node -> String
unwrap (Types.NodeContent (Types.ContentText s)) = unpack s 
unwrap e = unwrap $ head $ Types.nodeChildren e

My Haskell clocs in at 25 lines, although if you remove the unused language extensions it comes down to 21 (the other four are in there just because they're "go to" extensions for me), so 21 is a fairer count. If you don't count imports as lines of code, it can be 13.
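As a rough illustration of how those three figures relate, here's a minimal Python sketch of the counting. It assumes a simplified notion of a "line of code" (any non-blank line that doesn't start with an ignored prefix), which is not exactly what cloc does. The name count_lines and the short sample source are made up for the example.

```python
def count_lines(source, ignore_prefixes=()):
    """Count non-blank lines, skipping lines that start with an ignored prefix."""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines never count
        if stripped.startswith(tuple(ignore_prefixes)):
            continue  # e.g. unused pragmas or imports
        count += 1
    return count

# A shortened stand-in for the Haskell script, not the real thing.
sample = """\
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TupleSections #-}

import Data.Text (Text, unpack)

main = putStrLn "hi"
"""

total = count_lines(sample)
without_unused = count_lines(sample, ["{-# LANGUAGE TupleSections"])
without_imports = count_lines(sample, ["{-# LANGUAGE TupleSections", "import "])
print(total, without_unused, without_imports)
```

On the sample this gives 4, 3 and 2, the same kind of step-down as 25, 21 and 13 for the full script.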

Writing this was actually not terribly difficult; of the 5 or so hours I probably put into it in the end, 90% of that time was spent struggling with package management (the worst aspect of Haskell). In the end I finally resorted to Stack even though this is a single-file script that should be able to compile with just ghc.

I'm proud of my work though, and thought it reflected fairly well on a language to do this so concisely. My enthusiasm dropped a bit when I wrote a Python solution:

import requests
from bs4 import BeautifulSoup

file = requests.get("https://fakenous.net").text

dom = BeautifulSoup(file, features='html.parser')
recentcomments = dom.find(id = 'recentcomments')
print(''.join(list(recentcomments.children)[0].strings))

6 lines to Haskell's 21, or 4 to 13. Damn. I'm becoming more and more convinced nothing will ever displace my love for Python.

Of course, you can attribute some of Haskell's relative size to having an inferior library, but still.

Here's a Go solution:

package main

import (
    "fmt"
    "net/http"

    "github.com/ericchiang/css"
    "golang.org/x/net/html"
)

func main() {
    var resp, err = http.Get("https://fakenous.net")
    must(err)
    defer resp.Body.Close()
    tree, err := html.Parse(resp.Body)
    must(err)
    sel, err := css.Compile("#recentcomments > *:first-child")
    must(err)
    // It will only match one element.
    for _, elem := range sel.Select(tree) {
        var name = elem.FirstChild
        var on = name.NextSibling
        fmt.Printf("%s%s%s\n", unwrap(name), unwrap(on), unwrap(on.NextSibling))
    }

}

func unwrap(node *html.Node) string {
    if node.Type == html.TextNode {
        return node.Data
    }
    return unwrap(node.FirstChild)
}

func must(err error) {
    if err != nil {
        panic(err)
    }
}

32 lines, including imports. So at least Haskell came in shorter than Go. I'm proud of you, Has- oh never mind, that's not a very high bar to clear.

It would be reasonable to object that the Python solution is so brief because it doesn't need a main function, but in real Python applications you generally still want that. But even if I modify it:

import requests
from bs4 import BeautifulSoup

def main():
    file = requests.get("https://fakenous.net").text
    dom = BeautifulSoup(file, features='html.parser')
    recentcomments = dom.find(id = 'recentcomments')
    return ''.join(list(recentcomments.children)[0].strings)

if __name__ == '__main__': print(main())

It only clocs in at 8 lines, including imports.

An alternate version of the Go solution that doesn't hardcode the number of nodes (since the Python and Haskell ones don't):

package main

import (
    "fmt"
    "net/http"

    "github.com/ericchiang/css"
    "golang.org/x/net/html"
)

func main() {
    var resp, err = http.Get("https://fakenous.net")
    must(err)
    defer resp.Body.Close()
    tree, err := html.Parse(resp.Body)
    must(err)
    sel, err := css.Compile("#recentcomments > *:first-child")
    must(err)
    // It will only match one element.
    for _, elem := range sel.Select(tree) {
        fmt.Printf("%s\n", textOfNode(elem))
    }

}

func textOfNode(node *html.Node) string {
    var total string
    var elem = node.FirstChild
    for elem != nil {
        total += unwrap(elem)
        elem = elem.NextSibling
    }
    return total
}

func unwrap(node *html.Node) string {
    if node.Type == html.TextNode {
        return node.Data
    }
    return unwrap(node.FirstChild)
}

func must(err error) {
    if err != nil {
        panic(err)
    }
}

Though it ends up being 39 lines.

Maybe Python's lead would decrease if I implemented the second half, having the scripts save the last comment they found in a file, read it on startup, and update if it's different and notify me somehow (email could be an interesting test). I doubt it, but if people like this post I'll finish them.

Edit: I finished them.
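To give an idea of what that second half involves, here's a minimal Python sketch of the "remember the last comment and only report changes" step. It is not the finished script: the function name check_for_new_comment and the state file name are assumptions, the comment strings are hard-coded instead of scraped, and printing stands in for sending an email.

```python
import tempfile
from pathlib import Path

def check_for_new_comment(latest, state_file):
    """Return True (and update the state file) if `latest` differs from the saved value."""
    path = Path(state_file)
    previous = path.read_text(encoding="utf-8") if path.exists() else None
    if latest == previous:
        return False
    path.write_text(latest, encoding="utf-8")
    return True

# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "last_comment.txt"
    first = check_for_new_comment("Gerardo on The Failings of Analytic Philosophy", state)
    repeat = check_for_new_comment("Gerardo on The Failings of Analytic Philosophy", state)
    changed = check_for_new_comment("Dave on How Can You Put a Price on Human Life?", state)
    print(first, repeat, changed)
```

The first call reports a new comment because no state file exists yet, the second reports nothing, and the third reports the change.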

Posted by: Ryan Westlund (@yujiri8)

I'm a programmer, writer, and philosopher. My Github account is yujiri8; all my content besides code is at yujiri.xyz.

Discussion

 

A Haskell one-liner:

(toListOf $ responseBody . to (decodeUtf8With lenientDecode) . html . allAttributed (folded . only "recentcomments") . children) <$> (get "https://fakenous.net")
 

Can you give some context for this? When I plug it in, even with all the imports I used, almost everything in there is undefined.

I'm ready with the full version that does the saving and emailing me for all three languages, but I'm holding off on posting now because I don't want to finalize if the Haskell can be improved by that much.

 

Apologies, I should have thought of this earlier. Anyway, adding more details:

-- file : Main.hs
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Lens (to, only, toListOf, folded)
import Data.Text.Encoding.Error (lenientDecode)
import Data.Text.Lazy.Encoding (decodeUtf8With)
import Network.Wreq (responseBody, get)
import Text.Taggy.Lens (html, children, allAttributed)

main = (toListOf $ responseBody . to (decodeUtf8With lenientDecode) . html . allAttributed (folded . only "recentcomments") . children) <$> (get "https://fakenous.net") >>= print 

The dependencies can be put in a dev-to.cabal file:

-- dev-to.cabal
cabal-version:       2.4
name:                dev-to
version:             0.1.0.0
license-file:        LICENSE
author:              Providence Salumu
maintainer:          Providence <dot> Salumu <at> smunix <dot> com
extra-source-files:  CHANGELOG.md

executable dev-to
  main-is:             Main.hs
  build-depends:       base ^>=4.13.0.0
                     , lens
                     , bytestring
                     , http-client
                     , text
                     , taggy
                     , taggy-lens
                     , wreq
  default-language:    Haskell2010

Doing the saving and emailing you would be a simpler addition.

You can clone my repo from github.com/smunix/dev-to

Ah. Still, that doesn't seem to be a complete solution. I ran it with cabal run and the output is the object:

[[NodeElement (Element {eltName = "li", eltAttrs = fromList [("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Gerardo"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1130#comment-1909")], eltChildren = [NodeContent "The Failings of Analytic Philosophy"]})]}),NodeElement (Element {eltName = "li", eltAttrs = fromList [("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://yujiri.xyz"),("rel","external nofollow ugc"),("class","url")], eltChildren = [NodeContent "Yujiri"]})]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1704#comment-1908")], eltChildren = [NodeContent "How Can You Put a Price on Human Life?"]})]}),NodeElement (Element {eltName = "li", eltAttrs = fromList [("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Paul Lake"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=327#comment-1907")], eltChildren = [NodeContent "Studies in Irrationality: Marxism"]})]}),NodeElement (Element {eltName = "li", eltAttrs = fromList [("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Dave"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1704#comment-1905")], eltChildren = [NodeContent "How Can You Put a Price on Human Life?"]})]}),NodeElement (Element {eltName = "li", eltAttrs = fromList 
[("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","http://www.daviddfriedman.com"),("rel","external nofollow ugc"),("class","url")], eltChildren = [NodeContent "David Friedman"]})]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1674#comment-1904")], eltChildren = [NodeContent "Do Religious People Believe Religion?"]})]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Gerardo"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1130#comment-1909")], eltChildren = [NodeContent "The Failings of Analytic Philosophy"]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://yujiri.xyz"),("rel","external nofollow ugc"),("class","url")], eltChildren = [NodeContent "Yujiri"]})]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1704#comment-1908")], eltChildren = [NodeContent "How Can You Put a Price on Human Life?"]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Paul Lake"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=327#comment-1907")], eltChildren = [NodeContent "Studies in Irrationality: Marxism"]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Dave"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1704#comment-1905")], 
eltChildren = [NodeContent "How Can You Put a Price on Human Life?"]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","http://www.daviddfriedman.com"),("rel","external nofollow ugc"),("class","url")], eltChildren = [NodeContent "David Friedman"]})]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1674#comment-1904")], eltChildren = [NodeContent "Do Religious People Believe Religion?"]})]]

Instead of the text.

I also wouldn't consider that one line. If I were to really use that code, I'd certainly break it into 2-4. Still, it is an impressive improvement! I'll have to look more into those libraries.
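For comparison, getting from a node tree like that to plain text is mostly a matter of concatenating every text node under the first list item. Here's a minimal Python sketch of the idea. It assumes a trimmed, well-formed copy of the markup implied by the output above, and it uses the standard library's xml.etree.ElementTree as a stand-in for a real HTML parser. The names snippet and newest_comment_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the structure shown in the output above (two of the five items).
snippet = (
    '<ul id="recentcomments">'
    '<li class="recentcomments">'
    '<span class="comment-author-link">Gerardo</span> on '
    '<a href="https://fakenous.net/?p=1130#comment-1909">'
    'The Failings of Analytic Philosophy</a>'
    '</li>'
    '<li class="recentcomments">'
    '<span class="comment-author-link">Paul Lake</span> on '
    '<a href="https://fakenous.net/?p=327#comment-1907">'
    'Studies in Irrationality: Marxism</a>'
    '</li>'
    '</ul>'
)

def newest_comment_text(markup):
    """Join all text nodes under the first child of the root element."""
    root = ET.fromstring(markup)
    newest = root[0]  # first <li> inside the <ul>
    return "".join(newest.itertext())

text = newest_comment_text(snippet)
print(text)  # Gerardo on The Failings of Analytic Philosophy
```

This only works here because the snippet happens to be valid XML; real pages generally need a forgiving HTML parser.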

 

One may also argue that your python code uses the beautifulsoup library which has already done the hard work of parsing the html/xml for you!

(though in fairness, I don't know much about haskell or go to comment on how "bare metal" those pieces of code are).

 

True, Beautiful Soup seems much higher-level than the other libraries. Though I am using at least two non-standard libraries in every language. Go has great high-level HTTP in the stdlib but needed 2 HTML/traversal libraries just to get there. For Haskell I'm using 5 libraries: req, html-conduit, dom-selector, xml-conduit and xml-types. There might be a way to cut down on those, but I really couldn't find it, because some of those libraries are just like 'provides HTML helpers for XML types' or something.

 

I would recommend Colly (github.com/gocolly/colly) to get a better comparison since you are using BeautifulSoup for Python. Both scraper libraries have superb APIs.

 

Wow! I didn't know about that library. That does much more for me here than even BeautifulSoup! New&Improved Go version:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    var col = colly.NewCollector()
    col.OnHTML("#recentcomments > *:first-child", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })
    col.Visit("https://fakenous.net")
}

That gets it down to about the same number of "meaningful" lines as Python. Technically I could drop 2 more lines by putting the function inline, but I wouldn't do that IRL.

 

I was expecting a performance write-up.
Lol

 

Yes, I would tend to think "reasonably well written" Haskell would outperform Python, but I think this is too small of an example to really make a determination. I can't comment on Google Go. By "reasonably well written" I mean Haskell that is written without too many "rookie" mistakes like I might make. While Haskell (in general) likely has good performance, I think newbies such as myself can sometimes wind up doing things that are logically correct but inappropriate from an efficiency perspective.

 
 

I appreciate this post, but I still tend to agree with @Providence Salumu that Haskell can also do this in one or two lines depending on what package you might use.

Now, it's a good thing they don't give downvotes here because I'm about to anger many people. I'm not going to comment on Google Go, but while it is true Python has a lot of neat packages written for it, Python is (In My Opinion) a scripting language that is "broken by design"...

No matter how hard you try, Python will never be able to do certain things that Haskell can do (like purity). On the other hand, Haskell can likely be modified to do just about anything Python can.

I continue to be frustrated by coworkers that insist Haskell is difficult to learn and is obscure just because Python got better marketing over the course of the last 20 or so years. Python was adopted by the masses (In My Opinion) because people were convinced that it was a "good" language when in reality, it wasn't all that great. Yes, Python is easy to learn, but that doesn't make it a good language...

 

Now, it's a good thing they don't give downvotes here because I'm about to anger many people.

Lol, nothing to fear from me. Spicy opinions are fun. I've got a few myself that I haven't posted here mainly for fear of the reception. I'm gonna have to disagree with this one though...

No matter how hard you try, Python will never be able to do certain things that Haskell can do (like purity). On the other hand, Haskell can likely be modified to do just about anything Python can.

I think this is really moot if not backward. Functional purity (and really all language features) isn't a goal, but a tool for reaching our goals, so it's wrong to describe it as something "Python will never be able to do no matter how hard you try". Lacking that feature doesn't reduce the domain of problems Python can solve. And it surely has features Haskell can never replicate, like breakpoint, default arguments, or proper struct inheritance.

Not to mention, isn't it a tiny minority of languages that support language-enforced functional purity? Even among other compiled languages?

There are use cases Python can't do that Haskell can, like compiling to a shared library to be called from another language. But that applies to all scripting languages. Do you think all scripting languages are broken by design?

To be fair, if you do, I don't find that totally unreasonable. I prefer compiled languages and abhor not having type checking. My opinion of Python is more that it has bad core design in a couple areas, but is so much more practical than languages that try to be perfect and fail. Basically "the best that the wrong way of doing things can provide", while most other languages are "the worst that the right way of doing things can provide".

About that... something I started thinking recently was that Haskell is the only language I know that tries to be perfect. The others don't seem like their designers ever intended to make something that would revolutionize programming. Go is the epitome of this, as its designers have said something like "Go isn't meant to advance programming theory, it's meant to advance programming practice" (translation: we don't want a good design, we just want to special-case the common stuff).

I continue to be frustrated by coworkers that insist Haskell is difficult to learn and is obscure just because Python got better marketing over the course of the last 20 or so years. Python was adopted by the masses (In My Opinion) because people were convinced that it was a "good" language when in reality, it wasn't all that great. Yes, Python is easy to learn, but that doesn't make it a good language...

It's true that being easy to learn doesn't make it a good language, but I think it does count toward it. After all, tools exist to make work more efficient, so a tool that takes more time to learn is, all other things the same, a worse tool. And there is no way Haskell's learning curve is all due to inadequate tutorials and documentation. It's conceptually arcane.

 

I don't necessarily agree with you totally, but your goal of objectivity is at least refreshing so I liked your post. Thanks.

 

Would love to see more posts like this!

 

Great post, haven't tried Haskell yet, looks interesting. Can you do a performance test on each version? The LOC is surely a factor but knowing the performance would be even better.

 

Python seems to average about 1.6 seconds. The first run was 3 seconds which is probs because of filesystem caching or TLS resumption. Go is averaging about 1.25 and Haskell about 1.35.

I don't think performance really means much here though, because on such a short program, factors like the time to start the interpreter and parse source code, write to the console, etc, are much more significant than they should be. The Haskell binary dynamically loads 11 system libraries while the Go binary only loads 2 dynamically, and that might account for the speed difference there. I've heard dynamic linking increases startup costs.
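For anyone who wants to repeat the measurement, here's a minimal Python sketch of one way to average wall-clock time over several runs. It is an assumption about method, not a record of how the numbers above were taken. The function name average_runtime is made up, and the command being timed is a placeholder that just starts a Python interpreter and exits. Swap in the path to each compiled binary or script.

```python
import subprocess
import sys
import time

def average_runtime(command, runs=5):
    """Run `command` `runs` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Placeholder command: an interpreter that does nothing, i.e. pure startup cost.
baseline = average_runtime([sys.executable, "-c", "pass"], runs=3)
print(f"{baseline:.3f} s")
```

Timing an empty program like this also gives a baseline for how much of each figure is startup cost and not scraping.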