IMPORTXML() to webscrape :)
Did you know you can webscrape using Google Sheets? Me neither, but it's the year 2022 and we have 9 more months in this year to learn so many new random things.
I enjoy reading about personal finance and one such person I enjoy reading posts from is Bob at Tawcan.com. He has a post detailing how he uses
IMPORTXML() to scrape dividend data from Google Finance and Yahoo Finance. This is really useful since the built in
GOOGLEFINANCE() function doesn't have this information.
I ended up modifying what he had and also had to learn about
Here is one example of a cell function he uses to get the dividend and yield information from Yahoo Finance:
IMPORTHTML() you only get a choice of scraping for
In the above example,
B2 contains a stock ticker symbol like
RY which is concatenated with the
url to the page we want scraped. We end up getting the
2nd table on the page with
IMPORTHTML(). Sheets ends up importing the table which isn't very useful for us so we further parse the table using
INDEX() to get the exact cell we want. In this case the dividend yield information is the 6th row, and 2nd column with the index starting from
1. That returns a dividend and yield so we further
SPLIT() that information.
If you check out Bob's post, you'll see that he constantly needs to update his formulas because Google Finance and Yahoo Finance keep updating layouts. He also uses
IMPORTXML() at some point because the pages are dynamic and don't usually have that information statically available.
I decided to find other sources for dividend and yield and came across dividend history. What's nice is that this source also has dividend and yield information for ETFs which isn't the case for Google and Yahoo Finance. I don't know how often the layout is updated but given how simple it is and how it serves a single purpose, I assume it won't be updated too often.
=INDEX(SPLIT(IMPORTXML("https://dividendhistory.org/payout/"&IF($B2="NASDAQ", "", "TSX/")&$A2, "//p[contains(text(), 'Yield')]", "en_US"), ":%"), 1, 2)
In my use case, I like to keep ticker symbols in column A, and their exchanges in column B. This is important because Canadian stocks (TSX) are found in
payout/TSX/<tickersymbol> whereas American stocks are in
payout/<tickersymbol>. I also don't use
CONCATENATE() and prefer to just use
The second argument in
IMPORTXML() is the
Thankfully HTML is basically just a very specific XML amirite? Anyway the query looks for any
<p> that contains the text value
'Yield' between the tags.
And since I only want the numbers I
SPLIT() at the
% and take the second element (remember in Sheets, index starts at 1).
=INDEX(IMPORTHTML("https://dividendhistory.org/payout/"&IF($B2="NASDAQ", "", "TSX/")&$A2, "table", 2, "us_EN"), 4, 3)
The dividend is easier since I can use
IMPORTHTML() and select the exact cell from the
2nd table on the page.
This isn't perfect because some tickers don't have any news or announcements, as is the case for
In this case, the dividend history table is actually the
1st table on the page. We end up with a reference error because I'm looking for a
2nd table that doesn't exist (out of bounds!):
My first solution to this problem was thinking maybe all ETFs or even REITs don't usually have announcements so I can track a column for the
TYPE of investment and pick
2 accordingly. This proved to be an incorrect assumption.
My second solution was to make the first
IMPORTHTML() request and if
ISERR(<request>) then make a second request searching for the
=IF(ISERR(G2),INDEX(IMPORTHTML("https://dividendhistory.org/payout/"&IF($B2="NASDAQ", "", "TSX/")&$A2, "table", 1, "us_EN"), 4, 3), G2)
This is really inefficient because it means making up to 2 requests for a single value. I stored the first request in column
G and call it the "test" column:
Anyway, that was a fun little rabbit hole and I think I'll find many uses for webscraping on sheets that may or may not involve personal finance!