"Hackers love to use scraping to harvest data." ~ Ankit Dobhal
The original blog is here -> blog
Hello my Computer Geek Friend!! This is a blog about scraping Wikipedia content using Python & bs4 (a Python module). So what exactly is web scraping, & where does the term come from? Let's try to understand!!
Web Scraping - :
Web scraping is a data scraping process used for extracting data from websites. Although web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. The idea is as old as the World Wide Web itself. Search engines like Google rely on crawling to build their search results.
Scraping With Python - :
Web scraping & crawling can be done with the help of dedicated software, but nowadays Python is gaining popularity in this field. As we all know, Python is one of the most famous & powerful scripting languages, especially among hackers & shell coders. Python has some amazing & powerful modules & libraries that make the scraping process easy & useful. Two important ones are requests & BeautifulSoup.
I have a basic understanding of how to send a GET request to a website using Python, so first of all I open up my VS Code editor and create a file named wikipy.py. Then I import the sys library (for command-line arguments), the requests library (for downloading the page with a GET request to Wikipedia), & my favorite library, BeautifulSoup, imported as bs4 (to extract content from the Wikipedia page).
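Here is a minimal sketch of how the top of wikipy.py might look. The helper name get_search_term is a hypothetical one chosen for illustration and is not taken from the original script. It assumes requests and beautifulsoup4 are already installed.

```python
import sys

import requests  # third-party: pip install requests
import bs4       # third-party: pip install beautifulsoup4


def get_search_term(argv):
    """Join everything after the script name into one search term.

    Hypothetical helper: `python wikipy.py Alan Turing` gives "Alan Turing".
    """
    return " ".join(argv[1:])


if __name__ == "__main__":
    # sys.argv[0] is the script name; the rest is what the user typed.
    print(get_search_term(sys.argv))
```

Joining all the arguments lets a multi-word topic be typed without quotes on the command line.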
Now it's time to use the get method to request data from the Wikipedia server. But wait, I want to create a Wikipedia searcher that scrapes data according to my command-line argument. So let's create a variable named res to store the result of the get request to the Wikipedia search URL, with my command-line argument appended to it.
note: I use the raise_for_status() method so that if the response comes back with an error status code (4xx or 5xx), an exception is raised and the whole script terminates.
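The request step could be sketched roughly like this. The base URL, the User-Agent string, and the function names build_url and fetch are all assumptions made for illustration, since the exact search URL used in the original script is not shown here.

```python
from urllib.parse import quote

import requests

# Assumed base URL; the original script's exact search URL is not shown.
WIKI_URL = "https://en.wikipedia.org/wiki/"

# Hypothetical User-Agent value; Wikimedia asks clients to identify themselves.
HEADERS = {"User-Agent": "wikipy-example/0.1 (educational script)"}


def build_url(term):
    """Turn a search term such as 'Alan Turing' into an article URL."""
    return WIKI_URL + quote(term.strip().replace(" ", "_"))


def fetch(term):
    """Download the page for `term` and fail loudly on an HTTP error."""
    res = requests.get(build_url(term), headers=HEADERS, timeout=10)
    res.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
    return res


if __name__ == "__main__":
    res = fetch("Web scraping")
    print(res.status_code)
```

Keeping the URL building separate from the network call makes it easy to check the URL without hitting the server.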
res downloads the whole page, but it is complicated to extract data from it because it comes back as raw HTML, so now it is time to use BeautifulSoup. I am creating a variable named wiki to hold the parsed page.
note: As you can see, in the wiki variable I call the BeautifulSoup function with two parameters. So what exactly are they? Let's understand. res.text is the text of the page downloaded with the help of the res variable, & html.parser is Python's built-in HTML parser, which BeautifulSoup uses to turn that text into a structured tree I can search.
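To see what the two parameters do without touching the network, here is a small sketch that parses a hard-coded HTML string. The sample markup is made up for illustration and stands in for res.text.

```python
import bs4

# Stand-in for res.text: a tiny, made-up HTML page.
html = "<html><body><h1>Title</h1><p>First paragraph.</p></body></html>"

# First argument: the HTML text. Second: which parser to use.
wiki = bs4.BeautifulSoup(html, "html.parser")

# The parsed tree can now be navigated by tag name.
print(wiki.h1.getText())
print(wiki.p.getText())
```

In the real script the first argument would be res.text instead of the sample string.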
I want to scrape the p tag content for the page named by my command-line argument, because most of the body text of a Wikipedia page sits inside p tags. You can check this with the developer tools in Chrome & Firefox.
Now I am using the .select() function to select the p tags & a for loop to loop through them, then finally printing the text inside each p tag with the .getText() function.
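The select-and-loop step might look something like the sketch below. It runs on a made-up HTML snippet so it works offline, and the function name paragraph_texts is a hypothetical helper, not part of the original script.

```python
import bs4


def paragraph_texts(html):
    """Return the text of every <p> tag in the given HTML string."""
    wiki = bs4.BeautifulSoup(html, "html.parser")
    # select("p") takes a CSS selector and returns a list of matching tags.
    return [p.getText() for p in wiki.select("p")]


# Made-up sample; in the real script this would be res.text.
sample = (
    "<div><p>Python is a language.</p>"
    "<span>skip me</span>"
    "<p>It is <b>popular</b>.</p></div>"
)

for text in paragraph_texts(sample):
    print(text)
```

Note that getText() also collects text from nested tags such as b, and that get_text() is the snake_case spelling of the same method.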
Thank you all for visiting my blog. You can also check my gist for the wikipy script; the link is below!!
Follow me on GitHub & LinkedIn for more exciting blogs and scripts!
This post is taken from my blog website; visit the original blog ->