
Carlos A. Martinez


Web Scraping

Web Scraping

Web scraping, also called web harvesting or web data extraction, is a form of data scraping used to extract data from websites.

Web Crawling

A web crawler, sometimes called a **spider** or **spiderbot** and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web. Crawlers are typically operated by search engines for the purpose of web indexing (web spidering).
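
To make "systematically browses" concrete, here is a minimal sketch of a breadth-first crawler. It is an illustration under assumptions, not how any particular search engine works: the `crawl` function, the `LinkCollector` class, the `fetch` callable, and the in-memory `FAKE_SITE` pages are all hypothetical names made up for this example, and the fake site stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl of same-host pages; returns URLs in visit order.

    `fetch` is a hypothetical callable: URL -> HTML string, or None on failure.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny in-memory "site" (made-up pages) stands in for the network.
FAKE_SITE = {
    "https://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.test/a": '<a href="/b">B</a> <a href="https://other.test/">out</a>',
    "https://example.test/b": '<a href="/">home</a>',
}

pages = crawl("https://example.test/", FAKE_SITE.get)
print(pages)
```

The `seen` set keeps the crawler from revisiting pages, and the host check keeps it from wandering onto other sites. A real crawler would also need rate limiting and a check against the site's robots.txt.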

  • Prosecuting Computer Crimes

Robots.txt

robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

Example: https://booking.com/robots.txt
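
As a rough sketch of how a crawler might honor these rules, Python's standard library ships `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the `my-bot` user agent, and the `example.test` URLs below are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real file would be fetched from the site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Ask whether a given user agent may fetch a given URL.
can_home = parser.can_fetch("my-bot", "https://example.test/index.html")
can_private = parser.can_fetch("my-bot", "https://example.test/private/data.html")
print(can_home, can_private)
```

To check a live site instead, `RobotFileParser` also provides `set_url()` and `read()`, which download and parse the file over the network.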

Web Scraping Sandbox

download_page.py

import requests
from bs4 import BeautifulSoup

URL = 'https://toscrape.com/'

web_scraping_sandbox = requests.get(URL)
web_scraping_sandbox.status_code 
# output --> 200
web_scraping_sandbox.text

# output -->
<!DOCTYPE html>
<html lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Scraping Sandbox</title>
        <link href="./css/bootstrap.min.css" rel="stylesheet">
        <link href="./css/main.css" rel="stylesheet">
    </head>
    <body>
        <div class="container">
            <div class="row">
                .
                .
                .
            </div>
        </div>
    </body>
</html>
web_scraping_sandbox.headers
# output --> 
{'Date': 'Fri, 13 Oct 2023 02:20:03 GMT', 'Content-Type': 'text/html', 'Content-Length': '3939', 'Connection': 'keep-alive', 'Last-Modified': 'Wed, 08 Feb 2023 21:02:33 GMT', 'ETag': '"63e40de9-f63"', 'Accept-Ranges': 'bytes', 'Strict-Transport-Security': 'max-age=0; includeSubDomains; preload'}
web_scraping_sandbox.request.method
# output --> 'GET'

Adding BeautifulSoup

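# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(web_scraping_sandbox.text, 'lxml')
print(type(soup))
# output --> <class 'bs4.BeautifulSoup'>

# find() returns the first matching tag, or None if there is no match
print(soup.find('h2'))
# output --> <h2>Books</h2>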
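
Where `find` returns only the first match, `find_all` returns every match. The sketch below assumes a small made-up HTML string (`SAMPLE_HTML`, not the real sandbox page) so it runs without a network request, and uses the built-in `html.parser` backend so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Inline HTML (made up) so the example runs without a network request.
SAMPLE_HTML = """
<html><body>
  <h1>Sample Page</h1>
  <h2>Books</h2>
  <a href="/books">Browse books</a>
  <h2>Quotes</h2>
  <a href="/quotes">Browse quotes</a>
  <a>Anchor without a link</a>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Text of every <h2> heading, with surrounding whitespace stripped.
headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# href of every <a> tag that actually has an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(headings)
print(links)
```

Passing `href=True` filters out anchors that have no `href`, which avoids a `KeyError` when indexing `a["href"]`.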

Under Construction
