Wow! Scraping Wikipedia Content With 10 Lines of Code!

#python #vscode

"Hackers loves to use scraping to harvest data.~Ankit Dobhal"
The original blog post is linked at the end of this article.

Welcome to My Blog

Hello, my computer geek friend! This is a blog post about scraping Wikipedia content using Python and bs4 (the Beautiful Soup module for Python). So what exactly is web scraping, and where does the term come from? Let's try to understand!
Web Scraping - :
Web scraping is a form of data scraping used to extract data from websites. Although it can be done manually by a person using a browser, the term typically refers to automated processes implemented with a bot or web crawler. The practice is about as old as the World Wide Web itself. Search engines such as Google rely on crawling to build their search results.

Scraping With Python - :
Web scraping and crawling can be done with dedicated software, but nowadays Python is gaining popularity in this field. As we all know, Python is one of the most popular and powerful scripting languages, especially among hackers and shell coders. Python has some amazing modules and libraries that make scraping easy and useful. Two of the most important are requests and BeautifulSoup.

Let's write a Python script to scrape Wikipedia content (a Wikipedia searcher):

I have a basic understanding of how to make GET requests to websites using Python, so first of all I open up my VS Code editor and create a file named wikipy.py. Then I import the sys library (for command-line arguments), the requests library (for sending a GET request to Wikipedia and downloading the page), and my favorite library, BeautifulSoup, imported as bs4 (to extract content from the Wikipedia page).
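A minimal sketch of this first step, assuming the search term is whatever follows the script name on the command line. The helper name `search_term` is hypothetical and not taken from the original script.

```python
import sys       # command-line arguments
import requests  # HTTP GET requests
import bs4       # Beautiful Soup, for parsing the downloaded HTML


def search_term(argv):
    """Join everything after the script name into one search term.

    Hypothetical helper: it lets multi-word topics be passed without quotes.
    """
    return " ".join(argv[1:])


if __name__ == "__main__":
    # sys.argv[0] is the script name; the rest is the topic to look up.
    print("Search term:", search_term(sys.argv))
```

Both third-party libraries can be installed with pip; the package names are requests and beautifulsoup4.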

Now it's time to use the get method to request data from the Wikipedia server. But wait, I want to create a Wikipedia searcher that scrapes data according to my command-line argument. So let's create a variable named res to store the result of the get call, passing it the Wikipedia search URL with my command-line argument appended.

note: I use the raise_for_status() method so that if the server returns an error status code, the method raises an exception and the whole script terminates.
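The request step could look like the sketch below. The article does not show the exact URL, so the `https://en.wikipedia.org/wiki/` prefix, the `page_url` and `fetch_page` helper names, the timeout, and the placeholder User-Agent string are all assumptions made for illustration.

```python
import sys
from urllib.parse import quote

import requests

# Assumption: articles are addressed as /wiki/<Title> on English Wikipedia.
BASE_URL = "https://en.wikipedia.org/wiki/"

# Placeholder value; Wikimedia asks automated clients to identify themselves.
HEADERS = {"User-Agent": "wikipy-example/0.1 (learning script)"}


def page_url(term):
    """Build the page URL: spaces become underscores, the rest is URL-escaped."""
    return BASE_URL + quote(term.strip().replace(" ", "_"))


def fetch_page(term):
    """Download the page and return the response object (res in the article)."""
    res = requests.get(page_url(term), headers=HEADERS, timeout=10)
    # Raises requests.HTTPError for 4xx/5xx responses, e.g. a missing page.
    res.raise_for_status()
    return res


if __name__ == "__main__" and len(sys.argv) > 1:
    res = fetch_page(" ".join(sys.argv[1:]))
    print(res.status_code, len(res.text), "characters downloaded")
```

Because raise_for_status() raises an exception that nothing here catches, a missing page ends the script with a traceback rather than continuing with an error page.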

res downloads the whole page, but it is complicated to extract data from it because the content is raw HTML. So now it's time to use BeautifulSoup to extract the data. I am creating a variable named wiki to hold the parsed page.

note: For the wiki variable I call the BeautifulSoup function with two parameters. So what exactly are they? Let's understand. res.text is the text of the page downloaded into the res variable, and html.parser is the parser that turns that HTML text into a structure I can search.
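To see what the two arguments do without touching the network, the sketch below parses a small hand-written HTML string standing in for res.text. The sample markup is made up for illustration.

```python
import bs4

# Stand-in for res.text: in the real script this string comes from Wikipedia.
sample_html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>Demo</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""

# First argument: the HTML as text. Second: the name of the parser to use.
# "html.parser" ships with Python, so no extra install is needed.
wiki = bs4.BeautifulSoup(sample_html, "html.parser")

# The result is a tree of tags that can be navigated and searched.
print(wiki.title.string)  # text of the <title> tag
print(wiki.h1.name)       # tag name of the first <h1>
```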

I want to scrape the content of the p tags for the page named by the command-line argument, because the main text of a Wikipedia page sits inside p tags. You can check this with the developer tools in Chrome or Firefox.

Now I use the .select() function to select the p tags and a for loop to iterate through them, then finally print the text inside each p tag with the .getText() function.
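The selection step can be tried on its own with the sketch below. It runs on made-up markup rather than a real Wikipedia page, and the `paragraph_texts` helper name is hypothetical.

```python
import bs4

# Made-up markup standing in for a downloaded Wikipedia page.
sample_html = """
<div id="content">
  <h1>Demo</h1>
  <p>First paragraph.</p>
  <p>Second paragraph with <a href="#">a link</a>.</p>
</div>
"""


def paragraph_texts(html):
    """Return the text of every <p> tag, in document order."""
    wiki = bs4.BeautifulSoup(html, "html.parser")
    # select() takes a CSS selector; "p" matches all paragraph tags.
    elems = wiki.select("p")
    # getText() is the same method as get_text(): it drops tags, keeps text.
    return [elem.getText() for elem in elems]


for text in paragraph_texts(sample_html):
    print(text)
```

Note that the text of nested tags, such as the link in the second paragraph, is included while the markup itself is dropped.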

Yeah, we did it in just 10 lines of code. Bravo!!!
It's time to run the script with a command-line argument.
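Since the original gist is not reproduced here, the sketch below is one way the steps could be assembled; it is not the author's exact code. The URL prefix, the timeout, the placeholder User-Agent header, and the `main` wrapper are assumptions. Saved as wikipy.py, it could be run as `python wikipy.py Web scraping`.

```python
import sys

import bs4
import requests


def main(argv):
    # Assumed URL scheme: https://en.wikipedia.org/wiki/<Title_with_underscores>
    url = "https://en.wikipedia.org/wiki/" + "_".join(argv[1:])
    # Placeholder User-Agent; Wikimedia asks scripts to identify themselves.
    headers = {"User-Agent": "wikipy-example/0.1 (learning script)"}
    res = requests.get(url, headers=headers, timeout=10)
    res.raise_for_status()  # stop here on an HTTP error status
    wiki = bs4.BeautifulSoup(res.text, "html.parser")
    for elem in wiki.select("p"):
        print(elem.getText())


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

The script prints every paragraph of the page to the terminal, so redirecting the output to a file is an easy way to keep a copy.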

Thank you all for visiting my blog! You can also check my gist for the wikipy script; the link is below!!
wiki.py
Follow me on GitHub and LinkedIn for more exciting blogs and scripts!
This post is adapted from my blog website; visit the original post ->
https://ankitdobhal.github.io/posts/2019/10/Scraping%20Wikipedia%20With%20Python/

Top comments (8)

Andy Preston • Oct 5 '19

Instead of scraping Wikipedia and consuming the foundation's bandwidth & server capacity, why not take advantage of the offline mirrors available? en.wikipedia.org/wiki/Wikipedia:Da...

Jose Nario • Oct 5 '19

I was just about to suggest the same. +1

Rob • Oct 6 '19

How about a way to download the app icon from Google Play? Say I have a file with a list of several package names that I want icon links for. Get a list of links into an Excel file for each package name so I can download the images.

Trying to automate this so I don't have to manually search each package and right-click to save the app icon.

ScribbleG • Oct 5 '19

The million dollar question is how can you save the page locally with the related images?

powerexploit • Oct 6 '19

You can use file handling

metz2000 • Oct 14 '19

Most browsers can save a web page with all resources (images, CSS, JavaScript), and you can use an embedded browser or a tool like Selenium to automate it.