Today is my 75th day of #100DaysOfCode and #Python learning. I also used some of the time to learn about pandas on DataCamp and to pick up the basics of the R programming language.
Most of my time went to my first project. Yesterday I scraped news from only the news category; today I was able to scrape news from different categories like sports, business, and world, and I used pandas to visualize the data obtained from each category. I am keeping my project files on Google Drive. Coming from a non-technical field, saving CSV files to Drive is new for me. I learned how to save a CSV file to Drive; to do so, we first need to mount our Google Drive.
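Here is a minimal sketch of that step, assuming the notebook runs in Google Colab (where drive.mount is available) and using a placeholder filename:

from google.colab import drive

# Mount Google Drive into the Colab filesystem (prompts for authorization once)
drive.mount('/content/drive')

# Write a dataframe to CSV inside the mounted drive; the path is a placeholder
df.to_csv('/content/drive/My Drive/news.csv', index=False)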
Below is my updated code for today.
Scraping Different Categories of News
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from bs4 import BeautifulSoup as BS
import requests
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
categories = {"news": "https://ekantipur.com/news/",
              "business": "https://ekantipur.com/business/",
              "world": "https://ekantipur.com/world/",
              "sports": "https://ekantipur.com/sports/"}
# PoolManager has no addheaders attribute; pass default headers when creating it
http = urllib3.PoolManager(headers={'User-Agent': 'Mozilla/5.0'})
ndict = {"Title": [], "URL": [], "Date": [],
         "Author": [], "Author URL": [], "Content": [], "Category": []}
show = False
for category, url in categories.items():
    web_page = http.request('GET', url)
    soup = BS(web_page.data, 'html5lib')
    for title in soup.findAll("h2"):
        if title.a:
            title_link = title.a.get("href")
            # Relative links need the site root prepended
            if title_link.split(":")[0] != "https":
                title_link = url.split(f"/{category}")[0] + title.a.get("href")
            title_text = title.text
            news_page = http.request('GET', title_link)
            news_soup = BS(news_page.data, 'html5lib')
            date = news_soup.find("time").text
            author_url = news_soup.select_one(".author").a.get("href")
            author_name = news_soup.select_one(".author").text
            # Grab the first paragraph of the article body
            content = ""
            for row in news_soup.select(".row"):
                for item in row.contents:
                    # row.contents can hold plain strings, so check for a tag first
                    if hasattr(item, "select") and item.select(".normal"):
                        content = item.p.text
                        break
                break
            ndict["Title"].append(title_text)
            ndict["URL"].append(title_link)
            ndict["Date"].append(date)
            ndict["Author"].append(author_name)
            ndict["Author URL"].append(author_url)
            ndict["Content"].append(content)
            ndict["Category"].append(category)
            if show:
                print(f"""
                Title: {title_text}, URL: {title_link}
                Date: {date}, Author: {author_name}, Category: {category},
                Author URL: {author_url},
                Content: {content}
                """)
The dataframe of the above data is:
df = pd.DataFrame(ndict, columns=list(ndict.keys()))
df
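To visualize the data obtained from the different categories, one simple option is a bar chart of how many articles were scraped per category. This is a minimal sketch using the matplotlib import above; the chart title and axis labels are my own choices:

# Count scraped articles per category and plot the counts as a bar chart
df["Category"].value_counts().plot(kind="bar", title="Articles per category")
plt.xlabel("Category")
plt.ylabel("Number of articles")
plt.show()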