Hello, dev masters! Welcome to another section of this web scraping series. I have already completed three sections, so let's enjoy this fourth one.
Scrapy at a glance!
Scrapy is a Python web scraping framework, and it is the best choice if you need to build
a web crawler for large-scale scraping needs. Scrapy uses spiders, which are self-contained
crawlers that are given a complete set of instructions. Scrapy makes it much easier to
build and scale large crawling projects by letting developers reuse their code. Scrapy
also provides an interactive shell, known as the “Scrapy shell”, that developers can use
to test their assumptions about a website's behavior.
First of all, you need to install the following packages, which I have summarized below.
I would suggest using Linux rather than other platforms, and I will also explain the
Selenium WebDriver.
Run these commands in your terminal.
To install Scrapy:
$ sudo apt install python-scrapy
To install pip:
$ sudo apt-get update && sudo apt-get install python-pip
To install Selenium:
$ sudo pip install selenium
To install the Chrome WebDriver:
$ sudo apt-get install chromium-chromedriver
$ sudo ln -s /usr/lib/chromium-browser/chromedriver /usr/bin/chromedriver
$ sudo apt-get install libxi6 libgconf-2-4
Now what you need to do is:
- Create a new Scrapy project.
- Write a spider to crawl a site and extract data.
- Export the scraped data using the command line.
To create your first project, open the terminal, change to the directory where you
want your project to live, and run the following command: scrapy startproject project_name
This will create a directory which looks like this:
scrapy.cfg            # deploy configuration file
project_name/         # project's Python module; you'll import your code from here
    items.py          # project items definition file
    middlewares.py    # project middlewares file
    pipelines.py      # project pipelines file
    settings.py       # project settings file
    spiders/          # a directory where you'll later put your spiders
Writing our first Spider
All spiders in Scrapy are just Python classes that Scrapy uses to pull information from a website. Every spider must inherit from scrapy.Spider and define the initial requests to make: how to follow links in the pages, how to extract data from the downloaded pages, how to parse that data, and so on.
I hope that is clear; if not, simply think of a spider as a class in which we
write different functions, each of which is used for parsing URLs.
Now we will code our first spider. Save it in a file named test.py (or any name with a .py extension) under the project_name/spiders directory of your project:
Thanks for reading! I hope you enjoyed it. I'll continue this series here. Good luck, everyone!