Today is my 77th day of my #100daysofcode and #python learning journey. As usual, I spent some hours learning about pandas data visualization on DataCamp.
For the rest of the time, I kept working on my first project (news scraping). Today I scraped news from Gorkha Patra Online. I could scrape news from a few different pages, but each news section (national, economics, business, province, etc.) needs its own code, so scraping a single news portal takes a lot of time. Below is the code I used to scrape news from the national section.
Python code with BeautifulSoup
First, I import the required dependencies:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from bs4 import BeautifulSoup as BS
import requests
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
The URL of the required section is given below:
url = "https://gorkhapatraonline.com/national"
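Since the other sections mentioned above (economics, business, province, ...) appear to follow the same URL pattern, the section URLs can be generated instead of hard-coded. This is just a sketch; the exact section slugs are assumptions:

```python
base = "https://gorkhapatraonline.com"
# section slugs taken from the sections mentioned above; the exact paths are assumptions
sections = ["national", "economics", "business", "province"]
urls = [f"{base}/{section}" for section in sections]
# urls[0] is the national-section URL used below
```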
Parse the news title, URL, author, date, and contents. First fetch the section page and prepare a dictionary to collect the results:
http = urllib3.PoolManager()
web_page = http.request('GET', url, headers={'User-Agent': 'Mozilla/61.0'})
soup = BS(web_page.data, 'html5lib')
ndict = {"Title": [], "URL": [], "Date": [], "Author": [],
         "Author URL": [], "Content": [], "Category": [], "Description": []}
for content in soup.select(".business"):
    newsurl = content.find('a').get('href')
    trend2 = content.select_one(".trending2")
    title = trend2.find("p").text.strip()
    # the author and the date live in the same <small> tag,
    # separated by a run of non-breaking spaces and a newline
    small = trend2.find('small').text.strip()
    author = small.split('\xa0\xa0\xa0\xa0\n')[0].strip()
    date = small.split('\xa0\xa0\xa0\xa0\n')[1].strip()
    description = trend2.select_one(".description").text.strip()
    # now go to this news URL and fetch the full article
    web_page = http.request('GET', newsurl, headers={'User-Agent': 'Mozilla/61.0'})
    news_soup = BS(web_page.data, 'html5lib')
    author_url = news_soup.select_one(".post-author-name").find("a").get("href")
    news_content = ""
    for p in news_soup.select_one(".newstext").findAll("p"):
        news_content += "\n" + p.text
    category = url.split("/")[-1]
    ndict["Title"].append(title)
    ndict["URL"].append(newsurl)
    ndict["Date"].append(date)
    ndict["Author"].append(author)
    ndict["Author URL"].append(author_url)
    ndict["Content"].append(news_content)
    ndict["Category"].append(category)
    ndict["Description"].append(description)
    print(f"""
    Title: {title}, URL: {newsurl}
    Date: {date}, Author: {author},
    Category: {category},
    Author URL: {author_url},
    Description: {description},
    Content: {news_content}
    """)
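The author and date are pulled out of a single `<small>` tag by splitting on the run of non-breaking spaces. Here is a minimal self-contained illustration of that split (the sample HTML is made up to mimic the site's layout):

```python
from bs4 import BeautifulSoup

# made-up sample mimicking the site's <small> tag layout
html = "<small>  Some Author\xa0\xa0\xa0\xa0\n  March 16, 2021  </small>"
small = BeautifulSoup(html, "html.parser").find("small").text.strip()
author = small.split('\xa0\xa0\xa0\xa0\n')[0].strip()
date = small.split('\xa0\xa0\xa0\xa0\n')[1].strip()
# author -> "Some Author", date -> "March 16, 2021"
```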
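Since pandas is already imported, the filled-in ndict can be turned into a DataFrame and saved for later analysis. A minimal sketch (the tiny sample dictionary and the output filename are assumptions, not part of the original post):

```python
import pandas as pd

# tiny sample standing in for the scraped ndict; real keys, made-up values
ndict = {"Title": ["Sample headline"], "URL": ["https://gorkhapatraonline.com/national/1"],
         "Date": ["March 16, 2021"], "Author": ["Some Author"], "Author URL": ["#"],
         "Content": ["Full article text"], "Category": ["national"],
         "Description": ["Short summary"]}

df = pd.DataFrame(ndict)
df.to_csv("national_news.csv", index=False)  # one row per scraped article
```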
Day 77 Of #100DaysOfCode and #Python
— Durga Pokharel (@mathdurga) March 16, 2021
Worked On My First Project (Scrapping news of gorkhapatraonline using beautifulSoup)#WomenWhoCode #CodeNewbie #100DaysOfCode #DEVCommunity pic.twitter.com/T2JZyl2XqF