You may know Hacker News (HN), a news aggregator for tech articles.
Let's scrape it with Python. The PyQuery module lets you query HTML pages with jQuery-like selectors, so you can collect all the story links:
#!/usr/bin/python3
from pyquery import PyQuery as pq
doc = pq(url="https://news.ycombinator.com/front?day=2019-07-14")
for link in doc('a.storylink'):
    print(link.attrib['href'])
That prints the links for the day "2019-07-14" to the screen. But you want them in a file.
Okey dokey.
You can save the output into a CSV file. A CSV file stores values with a delimiter in between, usually a comma, but we'll use a semicolon.
#!/usr/bin/python3
from pyquery import PyQuery as pq
date = "2019-07-14"
doc = pq(url="https://news.ycombinator.com/front?day=" + date)
links = []
for link in doc('a.storylink'):
    links.append(link.attrib['href'])
# Write one line per link: date;link
with open('output.csv', 'w') as csvfile:
    for link in links:
        csvfile.write(date + ";" + link + "\n")
Simple, right? :) Run it and you'll have all the links in a nicely formatted CSV file.
A CSV file can be read with an office program (any spreadsheet), or you can load it with Python's pandas.
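As a quick sketch of the pandas route: the snippet below writes a tiny sample file in the same date;link format the scraper produced (so it runs on its own, without scraping first), then loads it with pandas.read_csv. The example URLs are placeholders, not real HN links.

```python
import pandas as pd

# A tiny sample in the same format the scraper wrote: date;link per line.
with open('output.csv', 'w') as f:
    f.write('2019-07-14;https://example.com/story1\n')
    f.write('2019-07-14;https://example.com/story2\n')

# The file has no header row, so we name the columns ourselves.
df = pd.read_csv('output.csv', sep=';', header=None, names=['date', 'link'])
print(df)
```

From there you can filter, deduplicate, or group the links per day like any other DataFrame.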