On a hot, dry afternoon we sat discussing investment avenues during this pandemic and realized they are limited. However, we kept coming back to shares as a solid medium to explore. This presented a good project idea as well as a good opportunity to learn more about the workings of the financial sector. Here in Kenya, the main bourse is the Nairobi Securities Exchange (NSE), with 60+ companies listed as of 2021. The NSE operates from Monday to Friday, 9.00 am to 3.00 pm, except holidays. The main aim of this article is to develop a web scraper and a notification script that notifies us when a certain ticker reaches a specific price or rises above a certain price threshold.
Having done web scraping projects before, I have researched an extensive list of libraries, frameworks and other tools. You can check out my write-up on a news scraper. I had a little experience with Scrapy and it seemed like the perfect fit for this project. Scrapy is a web scraping framework, so it makes assumptions about how to handle certain aspects, ranging from folder structure to its own CLI and how data is stored.
This makes it great for structuring large projects or even multiple scrapers in one project. It does have a steep learning curve, but the in-depth documentation and fairly large community more than make up for it. For storing data a JSON file would usually be adequate, but a database makes it easy to persist and query data later on. We'll be making use of PostgreSQL, mainly because I have already used it in other projects and it serves our needs nicely.
Prerequisites before getting started
To follow along with this post and code the same features, you're going to need a few things:
- Python and pip (I am currently using 3.9.2). Any version above 3.5 should work.
- An Africa's Talking account.
- API key and username from your account. Create an app and take note of the API key.
Once you've got the above sorted:
- Create a new directory and change into it.
mkdir nse_scraper
cd nse_scraper
- Create a new virtual environment for the project or activate the previous one.
- Using the Python package manager (pip), install the beautifulsoup4, scrapy, africastalking (Python SDK), python-dotenv, sqlalchemy and psycopg2 libraries. Save the installed libraries in a requirements.txt file.
python -m venv .
source bin/activate
pip install africastalking beautifulsoup4 scrapy python-dotenv sqlalchemy psycopg2
pip freeze > requirements.txt
- As mentioned above, we are using PostgreSQL as our database of choice, so we need a library to interface with the database. psycopg2 is a good option, although there are others. Although not necessary, we'll be making use of SQLAlchemy as our Object Relational Mapper (ORM). This allows us to use Python objects (classes, functions) to make transactions instead of raw SQL.
- Install the PostgreSQL database to save all of our scraped data. Depending on which platform you code on, you could do it natively on your system. Personally I am using Docker, as it makes containers easy to manage and prevents my system from being cluttered. This article is an awesome resource on how to get PostgreSQL and pgAdmin 4 installed as containers.
Alternatively, check the finished code on Github
Spiders Everywhere 🕷️🕸️
Scrapy operates on the concept of spiders: we define our own custom spiders to crawl and scrape data.
Scrapy has its own commands that make creating a project and spiders quick and easy.
Now we will create a Scrapy project and generate a spider with the required boilerplate code using the CLI.
scrapy startproject nse_scraper
Running the startproject command will create a folder with the structure outlined below. There is a top folder with the project name (nse_scraper) that contains the Scrapy configuration and a subfolder with the same name containing the actual crawling code.
python-projects $ tree nse_scraper
nse_scraper
├── nse_scraper
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
2 directories, 7 files
NB: I don't want to go into too much detail about Scrapy because there are many tutorials for the tool online, and because I normally use requests with lxml to make (very simple) data crawlers.
Many people prefer to use BeautifulSoup or other higher level data crawl libraries so feel free to go for that.
I picked Scrapy in this particular case because it creates a nice scaffold when working with crawlers and databases but this can be completely done from scratch as well.
cd nse_scraper
scrapy genspider afx_scraper https://afx.kwayisi.org/nseke/
Created spider 'afx_scraper' using template 'basic' in module:
nse_scraper.spiders.afx_scraper
You could choose to not use the generator and write the Scrapy files yourself but for simplicity I use the boilerplate that comes with Scrapy.
Now navigate to the top level project folder and create the spider (afx_scraper) using genspider. In my case I will be crawling data about NSE share prices from afx (afx.kwayisi.org). There is the main NSE website or even the mystocks website; however, both require a subscription to get real-time stock quotes. Since this project is meant to be a DIY scraper with minimal costs, afx was the most viable option. As a bonus, they structure their data in a table and regularly update the prices.
As seen below:
If we take a look at the file structure again inside the spiders folder, a new file afx_scraper.py has been created.
python-projects/nse_scraper $ tree
.
├── nse_scraper
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── afx_scraper.py
└── scrapy.cfg
The content of afx_scraper.py is the minimum code required to get started with crawling data.
#afx_scraper.py
import scrapy

class AfxScraperSpider(scrapy.Spider):
    name = 'afx_scraper'
    # domains only: a full URL here is ignored and logs a URLWarning
    allowed_domains = ['afx.kwayisi.org']
    start_urls = ['https://afx.kwayisi.org/nseke/']

    def parse(self, response):
        pass
Scraper Setup
The first element we want to crawl is the table element holding all the data; we then loop through and get each ticker symbol, share name and price. The code to get the data is added to the parse function. Looking through the developer tools in our browser, we see that the table element has a tbody element that holds tr elements, the HTML table rows. Each row contains td elements, the table data cells. These are the elements we want to scrape.
Scrapy allows for two ways of selecting elements in a html document:
- Using CSS selectors
- Using Xpath.
We'll start off with a CSS selector as it's straightforward. We assign the result, which references the rows of data, to a row variable. Because the individual values are displayed in similar HTML tags, we then need XPath to extract each one.
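To make the row-and-cell structure concrete, here is a minimal standalone sketch. It uses Python's built-in xml.etree.ElementTree as a stand-in for Scrapy's selectors (it supports only a small XPath subset and needs well-formed markup), and the table contents are made-up placeholder values, not real NSE quotes. The column positions (1, 2 and 4) follow the ones used in the spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup mirroring the table > tbody > tr > td layout.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>ABC</td><td>Alpha Beta Co</td><td>1200</td><td>12.50</td></tr>
    <tr><td>XYZ</td><td>Xylo Zinc Ltd</td><td>300</td><td>4.75</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return (ticker, name, price) tuples, one per table row."""
    table = ET.fromstring(html.strip())
    rows = []
    for tr in table.findall("./tbody/tr"):
        # XPath positions are 1-based: td[1] is the first cell in the row.
        ticker = tr.find("td[1]").text
        name = tr.find("td[2]").text
        price = tr.find("td[4]").text
        rows.append((ticker, name, price))
    return rows


for row in extract_rows(SAMPLE_HTML):
    print(row)
```

Selecting cells row by row like this keeps each ticker, name and price together, which is a slightly different design from collecting three separate lists and pairing them up afterwards.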
#afx_scraper.py
    def parse(self, response):
        print("Processing: " + response.url)
        # Extract data using css selectors
        row = response.css('table tbody tr ')
        # use XPath and regular expressions to extract stock name and price
        raw_ticker_symbol = row.xpath('td[1]').re('[A-Z].*')
        raw_stock_name = row.xpath('td[2]').re('[A-Z].*')
        raw_stock_price = row.xpath('td[4]').re('[0-9].*')
        # create a function to remove html tags from the returned list
        print(raw_ticker_symbol)
For each row above we use XPath to extract the required elements. The result is a combined list of data that also includes data from the top tables, such as top gainers and losers. In order to filter out what we don't need, we use regular expressions. For raw_ticker_symbol and raw_stock_name we only need text starting at an uppercase letter, so we pass [A-Z].* as our regex rule. As for the price data, we need numbers, so we pass [0-9].* as our regex rule.
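To see what these two patterns actually capture, here is a small sketch using Python's re module as a rough stand-in for Scrapy's .re() method. The cell markup is a hypothetical simplification; the real page's tags and attributes may differ. The tighter patterns at the end are only an illustration of an alternative, not what the spider uses.

```python
import re

# Hypothetical cell markup; the real page's attributes may differ.
ticker_cell = "<td>ABC</td>"
price_cell = "<td>12.50</td>"

# [A-Z].* starts at the first capital letter and, because .* is greedy,
# runs to the end of the string, so the closing tag is captured too.
ticker_match = re.findall(r"[A-Z].*", ticker_cell)
price_match = re.findall(r"[0-9].*", price_cell)

print(ticker_match)  # ['ABC</td>']
print(price_match)   # ['12.50</td>']

# Illustrative alternative: patterns that stop before the closing tag.
tight_ticker = re.findall(r"[A-Z]+", ticker_cell)
tight_price = re.findall(r"[0-9][0-9,.]*", price_cell)

print(tight_ticker)  # ['ABC']
print(tight_price)   # ['12.50']
```

The leftover markup in the first two results is why the matches still need a cleaning step before they can be stored.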
Creepy Crawlers
Now the scraper is ready to be executed and retrieve the items. Run the crawler and verify that it is indeed returning the items you expect. There is no output that stores the items yet, but the log tells me that there were 66 items that actually had a symbol, name and price defined ('item_scraped_count': 66). Note that I set the log level to INFO to prevent information overload in the console.
2021-04-30 23:02:09 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: nse_scraper)
2021-04-30 23:02:09 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.2.0, Python 3.9.4 (default, Apr 20 2021, 15:51:38) - [GCC 10.2.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Linux-5.11.14-147-tkg-bmq-x86_64-with-glibc2.33
2021-04-30 23:02:09 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'nse_scraper',
'EDITOR': '/usr/bin/micro',
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'nse_scraper.spiders',
'SPIDER_MODULES': ['nse_scraper.spiders']}
2021-04-30 23:02:09 [scrapy.extensions.telnet] INFO: Telnet Password: 616b228b56a699b0
2021-04-30 23:02:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2021-04-30 23:02:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-04-30 23:02:09 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-04-30 23:02:09 [scrapy.middleware] INFO: Enabled item pipelines:
['nse_scraper.pipelines.NseScraperPipeline']
2021-04-30 23:02:09 [scrapy.core.engine] INFO: Spider opened
2021-04-30 23:02:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-04-30 23:02:09 [py.warnings] WARNING: /home/zoo/.pyenv/versions/stock-price-scraper/lib/python3.9/site-packages/scrapy/spidermiddlewares/offsite.py:65: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://afx.kwayisi.org in allowed_domains.
warnings.warn(message, URLWarning)
2021-04-30 23:02:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
Processing: https://afx.kwayisi.org/nseke/
2021-04-30 23:02:17 [scrapy.core.engine] INFO: Closing spider (finished)
2021-04-30 23:02:17 [scrapy.extensions.feedexport] INFO: Stored json feed (66 items) in: test.json
2021-04-30 23:02:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 443,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 9754,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 7.457298,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 4, 30, 20, 2, 17, 77130),
'item_scraped_count': 66,
'log_count/INFO': 11,
'log_count/WARNING': 1,
'memusage/max': 75550720,
'memusage/startup': 75550720,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2021, 4, 30, 20, 2, 9, 619832)}
2021-04-30 23:02:17 [scrapy.core.engine] INFO: Spider closed (finished)
Let's Clean the Data!
The data we get is not usable in its current format as it contains HTML tags, classes, attributes etc. Thus we need to clean it.
#afx_scraper.py
# import BeautifulSoup at the top of the file
from bs4 import BeautifulSoup

# create functions to remove html tags from the returned list
def clean_stock_name(raw_name):
    clean_name = BeautifulSoup(raw_name, "lxml").text
    clean_name = clean_name.split('>')
    return clean_name[1]

def clean_stock_price(raw_price):
    clean_price = BeautifulSoup(raw_price, "lxml").text
    return clean_price

        # back inside parse(), after the extraction code
        # Use list comprehension to unpack required values
        stock_name = [clean_stock_name(r_name) for r_name in raw_stock_name]
        stock_price = [clean_stock_price(r_price) for r_price in raw_stock_price]
        stock_symbol = [clean_stock_name(r_symbol) for r_symbol in raw_ticker_symbol]
        # using list slicing to remove the unnecessary data
        stock_symbol = stock_symbol[6:]
        cleaned_data = zip(stock_symbol, stock_name, stock_price)
        for item in cleaned_data:
            scraped_data = {
                'ticker': item[0],
                'name': item[1],
                'price': item[2],
            }
            # yield info to scrapy
            yield scraped_data
We first import the BeautifulSoup library from the bs4 package. This will give us an easier time cleaning the data. The first function, clean_stock_name(), accepts a value raw_name. We call the BeautifulSoup constructor, pass our value as an argument and specify lxml as our parser. For further details on how Beautiful Soup works and the different parsers, check out the documentation. We then specify that we want only the text and assign it to our clean_name variable. While cleaning the name, we still had additional characters that we didn't need, so we call the .split() method and return the required string.
The second function, clean_stock_price(), pretty much repeats the process outlined above; the only adjustment is that we don't need the extra step of calling the string split method.
We then call the functions on each value of raw_ticker_symbol, raw_stock_name and raw_stock_price, and assign the results to appropriately named variables: stock_symbol, stock_name and stock_price. The stock symbol list contains more entries than we need, so we use list slicing to drop the extra ones and reassign the result to the variable. We use the zip function to combine the three lists into tuples of (symbol, name, price). Finally we create a dictionary scraped_data and assign the relevant keys to the cleaned values. By using the yield keyword our parse function is now a generator, able to return values as they are needed. This is especially critical to performance when crawling multiple pages.
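The slice, zip and yield steps can be tried in isolation. The sketch below is only an illustration: build_items and its skip parameter are hypothetical names, and the sample values are placeholders rather than scraped data.

```python
def build_items(symbols, names, prices, skip=0):
    """Yield one dict per stock by pairing the three parallel lists."""
    # Drop leading entries that come from other tables on the page.
    symbols = symbols[skip:]
    # zip stops at the shortest input, so misaligned lists silently truncate.
    for ticker, name, price in zip(symbols, names, prices):
        yield {"ticker": ticker, "name": name, "price": price}


# Hypothetical values; the first two symbols stand in for unwanted extras.
symbols = ["TOP1", "TOP2", "ABC", "XYZ"]
names = ["Alpha Beta Co", "Xylo Zinc Ltd"]
prices = ["12.50", "4.75"]

items = build_items(symbols, names, prices, skip=2)
print(next(items))  # {'ticker': 'ABC', 'name': 'Alpha Beta Co', 'price': '12.50'}
print(list(items))  # [{'ticker': 'XYZ', 'name': 'Xylo Zinc Ltd', 'price': '4.75'}]
```

Because the function is a generator, nothing is computed until a value is requested, which is the behaviour Scrapy relies on when it consumes items from parse.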
Let's Store All the Data!
First of all I define the schema of the element that I am crawling in the items.py.
There is no fancy schema yet but this can obviously be improved in the future when more items are being retrieved and the actual datatypes do make a difference.
# items.py
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
from scrapy.item import Item, Field

class NseScraperItem(Item):
    # define the fields for your item here like:
    stock_name = Field()
    stock_price = Field()
    stock_symbol = Field()
The middlewares.py is left untouched for the project. The important bit for storing data in a database is inside models.py. As described before I use SQLAlchemy to connect to the PostgreSQL database. The database details are stored in settings.py (see below) and are used to create the SQLAlchemy engine. I define the Items model with the three fields and use the create_items_table to create the table.
# nse_scraper/nse_scraper/models.py
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.engine.base import Engine
from scrapy.utils.project import get_project_settings
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

def db_connect() -> Engine:
    """
    Creates database connection using database settings from settings.py.
    Returns sqlalchemy engine instance
    """
    return create_engine(get_project_settings().get("DATABASE"))

def create_items_table(engine: Engine):
    """
    Create the Items table
    """
    Base.metadata.create_all(engine)

class StockData(Base):
    """
    Defines the items model
    """
    __tablename__ = "stock_data"
    id = Column("id", Integer, primary_key=True, autoincrement=True)
    stock_ticker = Column("stock_ticker", String)
    stock_name = Column("stock_name", String)
    stock_price = Column("stock_price", Float)
Inside pipelines.py the spider is connected to the database. When the pipeline starts, it initializes the database connection by creating the engine, creates the table and sets up a SQLAlchemy session factory. The process_item function is part of the default code and is executed for every item yielded by the scraper. In this case it means it will be triggered every time a stock is retrieved with a ticker, name and price. Remember to always commit() when adding (or removing) items to the table.
# nse_scraper/nse_scraper/pipelines.py
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from sqlalchemy.orm import sessionmaker
from nse_scraper.models import StockData, create_items_table, db_connect

class NseScraperPipeline:
    def __init__(self):
        """
        Initializes database connection and sessionmaker.
        Creates stock_data table
        """
        engine = db_connect()
        create_items_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """
        process item and store to database
        """
        session = self.Session()
        stock_data = StockData()
        stock_data.stock_name = item["name"]
        stock_data.stock_price = float(item["price"].replace(',', ''))
        stock_data.stock_ticker = item["ticker"]
        try:
            session.add(stock_data)
            session.commit()
            # query again
            obj = session.query(StockData).first()
            # print(obj.stock_ticker)
        except Exception as e:
            session.rollback()
            print(f"we have a problem, houston {e}")
            raise
        finally:
            session.close()
        return item
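The price conversion inside process_item is worth a closer look, since scraped prices arrive as text and may contain thousands separators. The helper below is a hypothetical variation, not part of the pipeline above: it returns None for text that is not a number instead of raising, so a caller could choose to skip such an item.

```python
def parse_price(raw_price):
    """Convert a scraped price string such as '1,234.50' to a float.

    Returns None when the text is not a number, so the caller can decide
    whether to skip the item instead of crashing the pipeline.
    """
    # Remove surrounding whitespace and thousands separators first.
    cleaned = raw_price.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_price("1,234.50"))  # 1234.5
print(parse_price(" 8.50 "))    # 8.5
print(parse_price("n/a"))       # None
```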
Finally, settings.py is short and contains the configuration for the crawler.
The only items I have added are the DATABASE and LOG_LEVEL variables. You could choose to add your security details in this file, but I would recommend keeping them secret and storing them elsewhere. I have used a .env file to store my credentials, then used the python-dotenv library to retrieve them. Note: the .env file should be in the same folder as the settings.py file, or you can pass its file path to load_dotenv().
# nse_scraper/nse_scraper/settings.py
# Scrapy settings for nse_scraper project
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import os
from dotenv import load_dotenv
load_dotenv()
BOT_NAME = 'nse_scraper'
SPIDER_MODULES = ['nse_scraper.spiders']
NEWSPIDER_MODULE = 'nse_scraper.spiders'
# POSTGRES SETTINGS
host = os.getenv("POSTGRES_HOST")
port = os.getenv("POSTGRES_PORT")
username = os.getenv("POSTGRES_USER")
password = os.getenv("POSTGRES_PASS")
database = os.getenv("POSTGRES_DB")
drivername = "postgresql"
DATABASE = f"{drivername}://{username}:{password}@{host}:{port}/{database}"
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'nse_scraper.pipelines.NseScraperPipeline': 300,
}
LOG_LEVEL = "INFO"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'nse_scraper (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
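One thing to watch with a connection string built by plain string formatting is a password containing characters such as @ or /, which would break the URL. The sketch below shows one way to guard against that with the standard library. The function name build_database_url, the default port fallback and the example credentials are all assumptions for illustration; only the POSTGRES_* variable names come from the settings above.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=os.environ):
    """Assemble a PostgreSQL URL from POSTGRES_* variables.

    Raises KeyError naming the first missing required variable.
    """
    user = quote_plus(env["POSTGRES_USER"])
    # Escape characters such as '@' or '/' that would break the URL.
    password = quote_plus(env["POSTGRES_PASS"])
    host = env["POSTGRES_HOST"]
    port = env.get("POSTGRES_PORT", "5432")  # PostgreSQL's default port
    database = env["POSTGRES_DB"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Placeholder credentials for illustration only.
example_env = {
    "POSTGRES_USER": "scraper",
    "POSTGRES_PASS": "p@ss/word",
    "POSTGRES_HOST": "localhost",
    "POSTGRES_DB": "nse_data",
}
print(build_database_url(example_env))
# postgresql://scraper:p%40ss%2Fword@localhost:5432/nse_data
```

Failing early with a KeyError also makes a missing .env entry obvious, rather than producing a URL with the literal text None in it.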
Your scraper is now ready to run:
scrapy crawl afx_scraper
You should now see stock data in your database. Optionally you could output to a json file to quickly preview the data retrieved.
scrapy crawl afx_scraper -o stock.json
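For a quick look at the exported file, something like the following could work. It is a sketch under a few assumptions: the feed is a JSON array of objects with the ticker, name and price keys yielded by the spider, prices are stored as text, and the function names load_prices and above_threshold are made up for this example. The demo writes its own temporary file with placeholder values so it runs without a real crawl.

```python
import json
import os
import tempfile


def load_prices(path):
    """Map ticker -> price (float) from a Scrapy JSON feed file."""
    with open(path, encoding="utf-8") as feed:
        items = json.load(feed)  # the feed is a JSON array of item dicts
    return {
        item["ticker"]: float(str(item["price"]).replace(",", ""))
        for item in items
    }


def above_threshold(prices, ticker, threshold):
    """Return True when the ticker is present and priced at or above threshold."""
    price = prices.get(ticker)
    return price is not None and price >= threshold


# Placeholder data standing in for a real stock.json.
sample = [
    {"ticker": "ABC", "name": "Alpha Beta Co", "price": "12.50"},
    {"ticker": "XYZ", "name": "Xylo Zinc Ltd", "price": "4.75"},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "stock.json")
    with open(sample_path, "w", encoding="utf-8") as handle:
        json.dump(sample, handle)
    prices = load_prices(sample_path)

print(above_threshold(prices, "ABC", 10.0))  # True
print(above_threshold(prices, "XYZ", 10.0))  # False
```

The threshold check is the same kind of comparison a notification script would need once real prices are available.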
This article was originally meant to cover setup, data scraping and notification; however, it's already long and it's easier to break it down into two parts. Part two will cover database queries, SMS notification using Africa's Talking, deployment and scheduling of the web scraper.
If you have any questions or comments, let me know in the comments or on Twitter.