DEV Community

milktea02
milktea02

Posted on

Webscraping in Google Sheets (to do personal finance)

TL;DR: use IMPORTHTML() and IMPORTXML() to webscrape :)

TOC:

Did you know you can webscrape using Google Sheets? Me neither, but it's the year 2022 and we have 9 more months in this year to learn so many new random things.

Bob's your uncle

I enjoy reading about personal finance and one such person I enjoy reading posts from is Bob at Tawcan.com. He has a post detailing how he uses IMPORTHTML() and IMPORTXML() to scrape dividend data from Google Finance and Yahoo Finance. This is really useful since the built in GOOGLEFINANCE() function doesn't have this information.

I ended up modifying what he had and also had to learn about XPath and XQuery.

Here is one example of a cell function he uses to get the dividend and yield information from Yahoo Finance:

=SPLIT(INDEX(IMPORTHTML(concatenate("https://finance.yahoo.com/quote/",B2),"table",2),6,2)," ")
Enter fullscreen mode Exit fullscreen mode

Using IMPORTHTML() you only get a choice of scraping for "table" or "list".

In the above example, B2 contains a stock ticker symbol like RY which is concatenated with the url to the page we want scraped. We end up getting the 2nd table on the page with IMPORTHTML(). Sheets ends up importing the table which isn't very useful for us so we further parse the table using INDEX() to get the exact cell we want. In this case the dividend yield information is the 6th row, and 2nd column with the index starting from 1. That returns a dividend and yield so we further SPLIT() that information.

If you check out Bob's post, you'll see that he constantly needs to update his formulas because Google Finance and Yahoo Finance keep updating layouts. He also uses IMPORTXML() at some point because the pages are dynamic and don't usually have that information statically available.

My Solution

I decided to find other sources for dividend and yield and came across dividend history. What's nice is that this source also has dividend and yield information for ETFs which isn't the case for Google and Yahoo Finance. I don't know how often the layout is updated but given how simple it is and how it serves a single purpose, I assume it won't be updated too often.

Screenshot of a page on dividendhistory.org showing dividend information for The Royal Bank of Canada stock

Scraping for Yield

=INDEX(SPLIT(IMPORTXML("https://dividendhistory.org/payout/"&IF($B2="NASDAQ", "", "TSX/")&$A2, "//p[contains(text(), 'Yield')]", "en_US"), ":%"), 1, 2)
Enter fullscreen mode Exit fullscreen mode

In my use case, I like to keep ticker symbols in column A, and their exchanges in column B. This is important because Canadian stocks (TSX) are found in payout/TSX/<tickersymbol> whereas American stocks are in payout/<tickersymbol>. I also don't use CONCATENATE() and prefer to just use "<string>"&A1&"<string>".

The second argument in IMPORTXML() is the xquery_path.

//p[contains(text(), 'Yield')]
Enter fullscreen mode Exit fullscreen mode

Thankfully HTML is basically just a very specific XML amirite? Anyway the query looks for any <p> that contains the text value 'Yield' between the tags.

Yield: 1.23%
Enter fullscreen mode Exit fullscreen mode

And since I only want the numbers I SPLIT() at the : and % and take the second element (remember in Sheets, index starts at 1).

Scraping for Dividend ($)

=INDEX(IMPORTHTML("https://dividendhistory.org/payout/"&IF($B2="NASDAQ", "", "TSX/")&$A2, "table", 2, "us_EN"), 4, 3)
Enter fullscreen mode Exit fullscreen mode

The dividend is easier since I can use IMPORTHTML() and select the exact cell from the 2nd table on the page.

This isn't perfect because some tickers don't have any news or announcements, as is the case for VGRO:

Another screenshot of dividend history but for VGRO ETF ticker

In this case, the dividend history table is actually the 1st table on the page. We end up with a reference error because I'm looking for a 2nd table that doesn't exist (out of bounds!):

#REF
Enter fullscreen mode Exit fullscreen mode

My first solution to this problem was thinking maybe all ETFs or even REITs don't usually have announcements so I can track a column for the TYPE of investment and pick 1 or 2 accordingly. This proved to be an incorrect assumption.

My second solution was to make the first IMPORTHTML() request and if ISERR(<request>) then make a second request searching for the 1st table.

=IF(ISERR(G2),INDEX(IMPORTHTML("https://dividendhistory.org/payout/"&IF($B2="NASDAQ", "", "TSX/")&$A2, "table", 1, "us_EN"), 4, 3), G2)
Enter fullscreen mode Exit fullscreen mode

This is really inefficient because it means making up to 2 requests for a single value. I stored the first request in column G and call it the "test" column:

Image description

Anyway, that was a fun little rabbit hole and I think I'll find many uses for webscraping on sheets that may or may not involve personal finance!

References

https://www.tawcan.com/using-google-spreadsheet-dividend-investment/#Even_Newer_Formula1
https://support.google.com/docs/answer/3093281?hl=en
https://www.w3schools.com/xml/xpath_syntax.asp

Top comments (0)