DEV Community


WEB SCRAPING WITH PYTHON FOR BEGINNERS

222010301039
・4 min read

Introduction

The amount of data in the world is enormous; one often-quoted figure puts it at about 1.2 million terabytes. Data is one of the most important assets in the software industry, especially in business, because it supports decision making, marketing, and research into new products. Software teams are therefore often asked to pull information out of web pages. For example, a marketing team, or detectives investigating a murder case, may want data about users from social media sites such as Facebook, Instagram or YouTube. The task is hard for two reasons. First, pages change over time, for example from static HTML to content rendered by JavaScript, so the spiders (the programs that crawl the pages) need regular checks to stay stable. Second, the site often does not offer an API for the data the user wants.

Web scraping makes this easier: it extracts the information from a web page and stores it in a form the user can work with, so data can be collected from almost any page on the internet.

PURPOSE:

One reason to use web scraping is research for a new product, since data makes it easier for companies to make decisions. It is also widely used to collect user data from social media sites such as Facebook, Instagram, YouTube and Twitter, and to gather job-related details such as postings and interviews in one place, so that a user can review the records of employers.

Python makes the extraction process easier thanks to its ease of use, its libraries and its community.

What is Web Scraping:

Web scraping is the process of retrieving unstructured data from web pages and converting it into structured data through further processing. It can be done through an API, with online tools, or by writing your own code. Although the process can be difficult, Python lets you extract data from most websites in a fairly simple way.
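To make "unstructured to structured" concrete, here is a minimal sketch that uses only Python's standard library. The sample HTML and the names `LinkCollector` and `extract_links` are made up for illustration; they are not part of the script built later in this post.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every <a href="..."> as a {"text": ..., "href": ...} record."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being built while inside an <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = {"text": "", "href": href}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.records.append(self._current)
            self._current = None


def extract_links(html):
    """Turn raw HTML (unstructured) into a list of dicts (structured)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.records


# Made-up sample page, used only for this illustration
SAMPLE_HTML = """
<ul>
  <li><a href="/channel/one">Channel One</a></li>
  <li><a href="/channel/two"> Channel Two </a></li>
  <li><a>No link here</a></li>
</ul>
"""

for row in extract_links(SAMPLE_HTML):
    print(row)
```

The anchor without an `href` is skipped, so only two records come out. Real pages are messier, which is why dedicated libraries are normally used for this job.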

Steps to extract data from web pages:

1 Choosing the website to extract data from:

Before starting, make sure you know which website you want to extract data from. Let's say the user wants to extract channel links (channel URLs) from YouTube. The first thing is to make sure a scraping package is installed. It can be any of several, such as Selenium, Beautiful Soup or Scrapy, but it must be installed before the next steps. The code in this post uses Selenium:

For Windows - pip install selenium

For Mac - pip3 install selenium

For Ubuntu - $ sudo apt-get update

$ sudo apt-get install python3-selenium
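After installing, it can help to confirm that Python can actually find the package. The sketch below is one possible check, not a required step; `is_installed` is a hypothetical helper name, and `bs4` and `scrapy` are the import names of Beautiful Soup and Scrapy.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Report which of the scraping packages mentioned above are available
for name in ("selenium", "bs4", "scrapy"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Only the package you plan to use needs to show up as installed.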

2 Inspecting the page:

In this step, open the website, right-click the page and choose the Inspect option near the bottom of the menu. A panel opens at the side showing the HTML tags the page is built from. Select the element that holds the data you want to extract for your task or project. Let's say you want to extract a div tag.
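As a rough illustration of what "extracting a div tag" means in code, the sketch below pulls the text out of every div with a given class, using only the standard library. The class name `channel-name`, the sample markup and the helper names are assumptions made up for this example.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside every <div> that carries a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.results = []
        self._depth = 0  # greater than 0 while inside a matching <div>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1  # a nested div inside a match
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces
                self.results.append(" ".join("".join(self._buffer).split()))


def extract_div_text(html, target_class):
    parser = DivTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up markup standing in for what the Inspect panel shows
SAMPLE_PAGE = (
    '<div class="header">Menu</div>'
    '<div class="channel-name style-scope"><span>Top</span> Gear</div>'
    '<div class="footer">Bye</div>'
)

print(extract_div_text(SAMPLE_PAGE, "channel-name"))
```

The class you pass in is whatever you saw on the element in the Inspect panel.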

3 Writing the code:

Now write the code for exactly what you want to extract. The packages from Step 1 are installed, so next import the libraries as shown below:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import json

option = Options()
# The browser window is visible by default; uncomment the next line to hide it
# option.add_argument("--headless")

For the configuration of the web driver:

driver = webdriver.Chrome(options=option)
driver.implicitly_wait(5)  # wait up to 5 seconds for elements to appear

URL code:

baseURL = "https://youtube.com"
keyword = "Top Gear"
driver.get(f"{baseURL}/search?q={keyword}")
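A keyword with spaces or symbols is safer to put into a URL once it has been encoded. The sketch below shows one way to build the same search URL with the standard library; `build_search_url` is a hypothetical helper, not something the original script defines.

```python
from urllib.parse import urlencode


def build_search_url(base_url, keyword):
    """Join the base URL and an encoded search query."""
    query = urlencode({"q": keyword})  # encodes spaces, "&" and similar characters
    return f"{base_url.rstrip('/')}/search?{query}"


print(build_search_url("https://youtube.com/", "Top Gear"))
# prints https://youtube.com/search?q=Top+Gear
```

Stripping the trailing slash from the base URL avoids a double slash whether or not the base ends with one.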

Now open the terminal, type "python YoutubeScrape.py" and press Enter. The browser opens the search page directly. Next, move the search into a function:

def getChannelUrl():
    driver.get(f"{baseURL}/search?q={keyword}")
    time.sleep(3)  # give the results time to load
    allChannelList = []  # placeholder until the CSS selector is added below
    links = []
    return links

if __name__ == "__main__":
    getChannelUrl()

Now open the web page and inspect it as in Step 2, but this time, instead of copying the HTML, right-click the tag that holds the channel link in the inspector and copy its CSS selector. Then paste it into the following code:

    # "css selector" is the locator strategy, the same value as By.CSS_SELECTOR
    allChannelList = driver.find_elements("css selector", "#text.style-scope.ytd-channel-name a.yt-simple-endpoint.style-scope.yt-formatted-string")

The call above returns a list of page elements, not URLs. We use map() with a lambda to read the href attribute of each element, and dict.fromkeys() to drop any URL that appears twice while keeping the original order.

    links = list(dict.fromkeys(map(lambda a: a.get_attribute("href"), allChannelList)))

if __name__ == "__main__":
    allChannelUrls = getChannelUrl()
    print(allChannelUrls)

Once again, open the terminal, enter "python YoutubeScrape.py" and wait until the list of channel URLs is printed.
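The de-duplication step in getChannelUrl() can be tried on its own, without a browser. In this sketch `FakeAnchor` is a made-up stand-in for the elements Selenium returns, and the URLs are invented.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


class FakeAnchor:
    """Hypothetical stand-in for a page element with an href attribute."""

    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None


anchors = [
    FakeAnchor("https://youtube.com/c/a"),
    FakeAnchor("https://youtube.com/c/b"),
    FakeAnchor("https://youtube.com/c/a"),  # duplicate of the first link
]

links = unique_in_order(map(lambda a: a.get_attribute("href"), anchors))
print(links)
# prints ['https://youtube.com/c/a', 'https://youtube.com/c/b']
```

A set would also remove duplicates, but it would not keep the links in the order they appeared on the page.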

Once the output appears in the terminal, the next task is to visit each channel URL we collected and gather the channel details:

def getChannelDetails(urls):
    details = []
    return details

Now, to work through the list of URLs we received as output, pass it in as an argument and loop over it:

def getChannelDetails(urls):
    details = []
    for url in urls:
        pass  # the per-channel extraction is added in the next step
    return details

The loop runs once for each URL that was collected.

4 Extracting URL:

Congratulations on making it this far. Only one task is left: extracting the details from each channel page, namely the channel name, link and description. I hope you enjoy the process, so let's get started.

The user now has to pull the details from the website. For YouTube, for example, that means collecting the channel link, description and name.
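The overall shape of this step, one record per channel URL, can be sketched without a browser. Here `collect_details` and `fake_fetch` are hypothetical names, and the fetch function returns canned data instead of reading a real page.

```python
def collect_details(urls, fetch_channel):
    """Build one record per URL.

    `fetch_channel` is any callable that takes the About-page URL and
    returns a (name, description) pair.
    """
    details = []
    for url in urls:
        about_url = f"{url.rstrip('/')}/about"
        name, description = fetch_channel(about_url)
        details.append({"cname": name, "curl": url, "cdesc": description})
    return details


def fake_fetch(about_url):
    # Stand-in for the browser step; returns made-up data
    return ("Demo channel", f"Fetched from {about_url}")


records = collect_details(["https://youtube.com/c/demo/"], fake_fetch)
print(records)
```

Passing the fetch step in as a function keeps the loop easy to test; in the real script that step is done with the Selenium driver.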

The process can be done like this:

def getChannelDetails(urls):
    details = []
    for url in urls:
        driver.get(f"{url}/about")
        # find_element returns a single element, so .text can be read from it
        cname = driver.find_element("css selector", "#text.style-scope.ytd-channel-name").text
        # Note: this selector points at the subscriber-count element
        cDess = driver.find_element("css selector", "#subscriber-count.style-scope.ytd-c4-tabbed-header-renderer").text
        clink = url
        # The trailing " a" selects the link tags inside the links holder
        otherLinkObj = driver.find_elements("css selector", "#links-holder.style-scope.ytd-c4-tabbed-header-renderer a")
        otherLinks = list(dict.fromkeys(map(lambda a: a.get_attribute("href"), otherLinkObj)))

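We append "/about" to the channel URL because the About page is where the channel's details are listed. Still inside the loop, store the values in a dictionary and add it to the list: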

        obj = {
            "cname": cname,
            "curl": clink,
            "cdesc": cDess,
            "otherLinks": otherLinks,
        }
        details.append(obj)
    return details

if __name__ == "__main__":
    allChannelUrls = getChannelUrl()
    allChannelDetails = getChannelDetails(allChannelUrls)
    print(json.dumps(allChannelDetails, indent=4))

Now open the terminal once again and type the same command as before, "python YoutubeScrape.py". The script prints all the data extracted from the website.
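A short note on the last line of the script: json.dumps returns a string, while json.dump writes to a file-like object. The sketch below shows both on a made-up record; the channel data is invented and an in-memory buffer stands in for a real file.

```python
import io
import json

# Made-up record in the same shape the script produces
details = [
    {
        "cname": "Example Channel",
        "curl": "https://youtube.com/c/example",
        "cdesc": "A made-up description",
        "otherLinks": ["https://example.com"],
    }
]

# dumps: build a formatted string, ready to print
text = json.dumps(details, indent=4)
print(text)

# dump: write the same JSON into a file-like object
buffer = io.StringIO()
json.dump(details, buffer, indent=4)
```

To save the results to disk, the buffer can be replaced by a file opened with `open("channels.json", "w")`, where the file name is just an example.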
