DEV Community

powerexploit

Wow! Scraping Wikipedia Content With 10 Lines of Code!

"Hackers love to use scraping to harvest data." ~ Ankit Dobhal
The original blog is here -> blog

Welcome to My Blog

Hello, my computer geek friend! This is a blog about scraping Wikipedia content using Python and bs4 (a Python module). So what exactly is web scraping, and where does the term come from? Let's try to understand!
Web Scraping - :
Web scraping is a data scraping process used for extracting data from websites. Although web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. The practice is as old as the World Wide Web itself. Search engines like Google rely on a crawling process to build their search results.

Scraping With Python - :
Web scraping and crawling can be done with dedicated software, but nowadays Python is gaining popularity in the field. As we all know, Python is one of the most famous and powerful scripting languages, especially among hackers and shell coders. Python has some amazing and powerful modules and libraries that make the scraping process easy and useful. Two of them matter here: requests and BeautifulSoup.

Let's write a Python script to scrape Wikipedia content, or a Wikipedia searcher:

I have a basic understanding of how to send a GET request to a website using Python, so first of all I open up my VS Code editor and create a file named wikipy.py. Then I import the sys library (for command line arguments), the requests library (for downloading pages and sending the GET request to Wikipedia), and my favorite library, BeautifulSoup, from the bs4 package (to extract content from the Wikipedia page).
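A minimal sketch of how the top of wikipy.py might look. The helper name `search_term` is hypothetical and only there for illustration; the original script may simply read `sys.argv` inline.

```python
import sys                      # command-line arguments live in sys.argv

import requests                 # HTTP client used to download the page
from bs4 import BeautifulSoup   # parser wrapper used to extract content


def search_term(argv):
    """Join every argument after the script name into one search term.

    Hypothetical helper: the name is an assumption made for this sketch.
    """
    return " ".join(argv[1:])


# sys.argv[0] is always the script name, so it is skipped.
print(search_term(["wikipy.py", "Alan", "Turing"]))
```

Joining all the arguments means a multi-word topic does not need to be wrapped in quotes on the command line.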
Now it's time to use the get method to request data from the Wikipedia server. But wait, I want to create a Wikipedia searcher that scrapes data according to my command line argument. So let's create a variable named res to store the result of the get call to the Wikipedia search URL, with my command line argument appended to it.
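Here is a rough sketch of that request step. The URL prefix, the `build_url` and `fetch` helper names, the timeout, and the User-Agent value are all my assumptions for illustration, since the post does not spell them out. Wikimedia asks automated clients to identify themselves, so the sketch sends a placeholder User-Agent.

```python
from urllib.parse import quote

import requests

# Assumption: article pages live under this prefix.
WIKI_BASE = "https://en.wikipedia.org/wiki/"
# Placeholder value: replace it with something that identifies your script.
HEADERS = {"User-Agent": "wikipy-demo/0.1 (learning script)"}


def build_url(words):
    """Join the words with underscores and percent-encode the title."""
    return WIKI_BASE + quote("_".join(words))


def fetch(words):
    """Send the GET request and return the requests.Response object."""
    return requests.get(build_url(words), headers=HEADERS, timeout=10)


print(build_url(["Alan", "Turing"]))
```

In the script itself, `res = fetch(sys.argv[1:])` would then hold the downloaded page.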
note: I use the raise_for_status() method so that if the server returns an error status code, the method raises an exception and the whole script terminates.
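To see what raise_for_status() does without touching the network, this sketch builds two Response objects by hand and sets their status codes. The `check` helper is hypothetical. It catches the exception only to make the behaviour visible, whereas the script in this post lets the exception end the program.

```python
import requests


def check(res):
    """Return True for a successful response, False if an HTTPError is raised."""
    try:
        res.raise_for_status()   # raises requests.HTTPError on 4xx/5xx codes
    except requests.HTTPError as err:
        print(f"Request failed: {err}")
        return False
    return True


# Offline stand-ins for real responses (assumption: built by hand for the demo).
ok = requests.Response()
ok.status_code = 200
missing = requests.Response()
missing.status_code = 404

print(check(ok), check(missing))
```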
res downloads the whole page, but extracting data from it is complicated because the page comes back as raw HTML. So now it's time to use BeautifulSoup to extract the data. I am creating a variable named wiki to hold the parsed page.
note: As you can see, in the wiki variable I call the BeautifulSoup function with two parameters. So what exactly are they? Let's understand. res.text is the text of the page downloaded via the res variable, and html.parser names Python's built-in HTML parser, which BeautifulSoup uses to turn that text into a structured tree of tags.
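A small sketch of the parsing step, using a hard-coded HTML string in place of res.text so it runs offline. The sample markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Stand-in for res.text (assumption: invented markup, not a real Wikipedia page).
html = "<html><head><title>Demo</title></head><body><p>First</p></body></html>"

# First argument: the page text. Second argument: which parser to use.
wiki = BeautifulSoup(html, "html.parser")

# The parsed tree can now be navigated by tag name.
print(wiki.title.string)
print(wiki.p.get_text())
```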

I want to scrape the p tag content for the page named by the command line argument, because the body text of a Wikipedia page sits inside p tags. You can check this with the developer tools in Chrome or Firefox.
Now I am using the .select() function to select the p tags and a for loop to iterate through them, then finally printing the text inside each p tag with the .getText() function.
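The select-and-print loop can be sketched like this, again on made-up HTML so it runs without a network connection. The `paragraphs` helper is hypothetical, and it calls get_text(), which is the same method as the getText() spelling used in this post.

```python
from bs4 import BeautifulSoup


def paragraphs(html):
    """Return the text of every <p> element in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # select("p") takes a CSS selector and returns a list of matching tags.
    return [p.get_text() for p in soup.select("p")]


# Invented sample markup: two paragraphs and one element that should be skipped.
sample = "<div><p>One <b>bold</b></p><span>skip</span><p>Two</p></div>"
for text in paragraphs(sample):
    print(text)
```

Note that get_text() flattens nested tags, so the bold word stays in the paragraph while the span outside any p tag is ignored.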

Yeah, we did it in just 10 lines of code. Bravo!!!
It's time to run the script with a command line argument >>
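Putting the steps together, the whole script could look roughly like the sketch below. This is my reconstruction under assumptions, not a copy of the gist: the URL prefix, the User-Agent value, the timeout, and the usage check are additions for illustration, so it runs a little longer than 10 lines.

```python
import sys
from urllib.parse import quote

import requests
from bs4 import BeautifulSoup

# Assumptions: the URL prefix and the placeholder User-Agent value.
WIKI_BASE = "https://en.wikipedia.org/wiki/"
HEADERS = {"User-Agent": "wikipy-demo/0.1 (learning script)"}


def main(argv):
    """Print an article's paragraphs; return 0 on success, 1 on bad usage."""
    if len(argv) < 2:
        print("usage: python wikipy.py <search term>")
        return 1
    url = WIKI_BASE + quote("_".join(argv[1:]))
    res = requests.get(url, headers=HEADERS, timeout=10)
    res.raise_for_status()      # stop here on 4xx/5xx responses
    wiki = BeautifulSoup(res.text, "html.parser")
    for p in wiki.select("p"):
        print(p.get_text())
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Saved as wikipy.py, it would be run as `python wikipy.py Alan Turing`, with everything after the file name treated as the search term.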

Thank you all for visiting my blog. You can also check my gist for the wikipy script; the link is below!!
wiki.py
Follow me on GitHub & LinkedIn for more exciting blogs and scripts!

This post is reproduced from my blog website. Visit the original blog ->
https://ankitdobhal.github.io/posts/2019/10/Scraping%20Wikipedia%20With%20Python/

Top comments (8)

Andy Preston

Instead of scraping Wikipedia and consuming the foundation's bandwidth & server capacity, why not take advantage of the offline mirrors available? en.wikipedia.org/wiki/Wikipedia:Da...

Jose Nario

I was just about to suggest the same. +1

Rob

How about a way to download the app icon from Google Play? Say I have a file with a list of several package names that I want icon links for. I'd like to get a list of links into an Excel file, one for each package name, so I can download the images.

I'm trying to automate this so I don't have to manually search for each package and right-click to save the app icon.

ScribbleG

The million dollar question is how can you save the page locally with the related images?

powerexploit

You can use file handling

metz2000

Most browsers can save a web page with all its resources (images, CSS, JavaScript), and you can use an embedded browser or a tool like Selenium to automate it.

layefa Amakubukuro

Very informative

comradeanonymousexe

How do I select only 2 or 3 "p" elements?
