Web Scraping Sprott U Fund with BS4 in 10 Lines of Code

#python #beautifulsoup #dataengineering

I started my second career as a nuclear fuel (uranium) trader around a decade ago. A few years in, frustrated with my company's refusal to upgrade its systems beyond 7 spreadsheets with redundant information scattered throughout, I started learning about databases, data engineering, and how to automate things with Python. One of the data points I currently scrape as background, contextual data for my uranium-focused dashboard (until I get the time to put it into a component!) comes from the daily updated website of a market newcomer, the Sprott Uranium Fund. Here is a tutorial on how I do it using the Python package bs4.

First, we import our packages:

import requests
from bs4 import BeautifulSoup

Then we request the website using the requests package. If the response comes back with a successful status code of 200, we use BeautifulSoup to parse it.

url = 'https://sprott.com/investment-strategies/physical-commodity-funds/uranium/'
r = requests.get(url)
if r.status_code == 200:
    soup: BeautifulSoup = BeautifulSoup(r.content, "html.parser")
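One caveat with the snippet above: soup is only assigned when the status code is 200, so on any other response the later lines that use it fail with a NameError. Here is a minimal sketch of a variant that fails loudly at the request instead. The helper name fetch_soup and the 10-second timeout are illustrative assumptions, not part of the original script.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://sprott.com/investment-strategies/physical-commodity-funds/uranium/"


def fetch_soup(url: str, timeout: float = 10.0) -> BeautifulSoup:
    """Download a page and parse it, raising if the request fails.

    fetch_soup is a hypothetical helper name, not part of requests or bs4.
    """
    # Without a timeout, requests.get can wait indefinitely on a stalled server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

Calling soup = fetch_soup(URL) then either returns a parsed page or raises an exception you can catch and log.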

Congratulations! You now have the webpage locally in your computer's memory. But how do we extract their share price and the volume of Uranium the fund is currently holding?

You can go to that URL and open your browser's developer tools to inspect the elements, view the source code for the whole page, or use BeautifulSoup's prettify() method to see it in your Jupyter Notebook with print(soup.prettify()).
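The prettified page is long, so it can help to jump straight to a label you can see in the browser. This is a small sketch on a stand-in document. The variable names and the 200-character window are arbitrary choices for illustration.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document; on the real page you would use the soup built above
tiny_soup = BeautifulSoup(
    '<div><h3 class="fundHeader_title">Premium/Discount</h3>'
    '<div class="fundHeader_value">-2.55%</div></div>',
    "html.parser",
)
pretty = tiny_soup.prettify()

# Find a label that is visible on the page and print the markup around it
position = pretty.find("Premium/Discount")
window = pretty[max(0, position - 200): position + 200]
print(window)
```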

You'll find the share price and Uranium volume about 1/5 of the way down the page. Here is a sample of what I'm looking at:

<div class="cell small-6 large-3 fundHeader_data">
            <h3 class="fundHeader_title">
             Premium/Discount
            </h3>
            <div class="fundHeader_value">
             -2.55%
            </div>
            <!-- <div class="fundHeader_detail">52 wk: <strong>$31.45 - $45.98</strong></div>-->
           </div>
           <div class="cell small-6 large-3 fundHeader_data">
            <h3 class="fundHeader_title mt05">
             Total lbs of U
             <sub>
              3
             </sub>
             O
             <sub>
              8
             </sub>
            </h3>
            <div class="fundHeader_value">
             40,780,707
            </div>

The values are stored in div elements with the class "fundHeader_value". To get all of them, we use BeautifulSoup's find_all method and store the result in a variable called fund_values (a list), from which we can then pick out the share price and Uranium stocks.

fund_values = soup.find_all('div', class_='fundHeader_value')
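To check which position holds which number, you can print each index next to its text. The sketch below runs on a trimmed copy of the sample markup shown earlier rather than the live page, so it only has two values. The names sample_html, sample_soup, and sample_values are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from this post, not the full live page
sample_html = """
<div class="cell small-6 large-3 fundHeader_data">
  <h3 class="fundHeader_title">Premium/Discount</h3>
  <div class="fundHeader_value">-2.55%</div>
</div>
<div class="cell small-6 large-3 fundHeader_data">
  <h3 class="fundHeader_title mt05">Total lbs of U<sub>3</sub>O<sub>8</sub></h3>
  <div class="fundHeader_value">40,780,707</div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_values = sample_soup.find_all("div", class_="fundHeader_value")

# Print each position next to its text to see which index holds which number
for index, tag in enumerate(sample_values):
    print(index, tag.get_text(strip=True))
# 0 -2.55%
# 1 40,780,707
```

Running the same loop over fund_values from the real page shows every index at once.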

The share price sits at index 4 of that list (the fifth item, since Python lists are zero-indexed), so you use list indexing and read the tag's contents attribute to get it in a form you can manipulate in Python.

shareprice = fund_values[4].contents

If you print the variable shareprice, you'll get a lot of stuff you don't want in there.

['\r\n                                    $US11.81\r\n                                                ', <span class="fundHeader_icon fundHeader_icon--pos"><i data-feather="trending-up"></i></span>, '\n']
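The list is mixed because .contents returns every direct child of the tag, both text nodes and nested tags. The sketch below shows this on hand-written markup that imitates the share price cell. The real page's markup may differ, and cell_html is an illustrative name.

```python
from bs4 import BeautifulSoup, NavigableString

# Hand-written markup that imitates the share price cell; the real page may differ
cell_html = (
    '<div class="fundHeader_value">\r\n    $US11.81\r\n    '
    '<span class="fundHeader_icon fundHeader_icon--pos">'
    '<i data-feather="trending-up"></i></span>\n</div>'
)

cell = BeautifulSoup(cell_html, "html.parser").find("div", class_="fundHeader_value")

# .contents lists the direct children: text nodes and nested tags alike
for child in cell.contents:
    kind = "text" if isinstance(child, NavigableString) else "tag"
    print(kind, repr(str(child)))
```

The price is in the first text node, which is why the next step takes item 0.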

First, we want the first item in this list, so shareprice[0]. We then want to get rid of the other stuff around it, namely whitespace and line breaks (the \r\n carriage return and newline characters). To make sure we're manipulating a string object, we convert it with str(shareprice[0]). Python has a very handy method for "stripping" away leading and trailing whitespace, .strip(), so we call that on our string: str(shareprice[0]).strip().

That gives us $US11.81 as a string. If that's what you want, you can stop there, but if you want to put it into a chart or store it as a number in a database, you also need to get rid of the $US. Luckily, Python has another method for "replacing" the part of the string you don't want with nothing. You just have to put .replace('$US','') on it and it returns '11.81'. That is still a string, so pass it to float() when you need an actual number.

That was a long explanation for one line of code, but it shows how concisely Python can get things done!

shareprice_value = str(shareprice[0]).strip().replace('$US','')
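The cleaned value is still a string, so it needs one more conversion before you can chart it or do arithmetic. This is a minimal sketch using a hard-coded raw string in place of shareprice[0]. The names raw_price, price_text, and price are illustrative.

```python
# Stand-in for str(shareprice[0]); the value is copied from the output above
raw_price = "\r\n        $US11.81\r\n        "

# strip() removes surrounding whitespace, replace() drops the currency prefix
price_text = raw_price.strip().replace("$US", "")
price = float(price_text)

print(type(price_text).__name__, price_text)  # str 11.81
print(type(price).__name__, price)            # float 11.81
```

If you need exact decimal arithmetic for money, decimal.Decimal(price_text) from the standard library is an alternative to float().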

How about the Uranium volume? Easy... rinse and repeat. The only difference is that it has commas instead of $US and sits at index 6 of fund_values (the seventh item).

u3o8 = fund_values[6].contents
u3o8_stock = str(u3o8[0]).strip().replace(',','')
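Fixed indexes like 4 and 6 break silently if the site adds or reorders a box. One alternative is to look values up by their title. The sketch below assumes each title and value share a fundHeader_data parent, as in the sample markup shown earlier. The helper value_for_label is hypothetical, and it is run here on a trimmed sample, not the live page.

```python
from typing import Optional

from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from this post, not the full live page
sample_html = """
<div class="cell small-6 large-3 fundHeader_data">
  <h3 class="fundHeader_title">Premium/Discount</h3>
  <div class="fundHeader_value">-2.55%</div>
</div>
<div class="cell small-6 large-3 fundHeader_data">
  <h3 class="fundHeader_title mt05">Total lbs of U<sub>3</sub>O<sub>8</sub></h3>
  <div class="fundHeader_value">40,780,707</div>
</div>
"""


def value_for_label(soup: BeautifulSoup, label_start: str) -> Optional[str]:
    """Return the value text of the first cell whose title starts with label_start.

    Hypothetical helper; assumes title and value sit in one fundHeader_data div.
    """
    for cell in soup.find_all("div", class_="fundHeader_data"):
        title = cell.find("h3", class_="fundHeader_title")
        value = cell.find("div", class_="fundHeader_value")
        if title is None or value is None:
            continue
        # Collapse runs of whitespace so line breaks in the markup don't matter
        label = " ".join(title.get_text().split())
        if label.startswith(label_start):
            return value.get_text(strip=True)
    return None


sample_soup = BeautifulSoup(sample_html, "html.parser")
pounds_text = value_for_label(sample_soup, "Total lbs")
pounds = int(pounds_text.replace(",", ""))
print(pounds)  # 40780707
```

The function returns None when no title matches, so a layout change shows up as an obvious failure instead of a wrong number.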

So there you have it: you have scraped the fund's website in 10 lines of code (12 if you count the extra 2 for the Uranium volume).

Raise my dopamine levels with a Like. I'll try to write more technical stuff here.

Here is the full code ([find it on GitHub as well](https://github.com/CincyBC/bootstrap-to-airflow)):

import requests
from bs4 import BeautifulSoup

url = 'https://sprott.com/investment-strategies/physical-commodity-funds/uranium/'
r = requests.get(url)
if r.status_code == 200:
    soup: BeautifulSoup = BeautifulSoup(r.content, "html.parser")

    # Everything below needs soup, so it stays inside the status check
    fund_values = soup.find_all('div', class_='fundHeader_value')
    shareprice = fund_values[4].contents
    shareprice_value = str(shareprice[0]).strip().replace('$US','')

    u3o8 = fund_values[6].contents
    u3o8_stock = str(u3o8[0]).strip().replace(',','')
