In my last post I wrote about using Google Sheets to webscrape dividend data from dividendhistory.org.
Today I'm going to share with you my adventures to improve efficiency and how to store data using sheets TRANSPOSE()
and FLATTEN()
functions and search using MATCH()
.
TOC
The Improved Scrape
Before, to scrape the dividend history table I was using the IMPORTHTML()
function but also noted that some times I may make up to two calls because the web page may only have one table instead of two causing essentially an index out of range error.
After reviewing the page source I found out the table has an id
:
<table id='dividend_table' class='table table-striped table-bordered'>
<thead>
<tr>
<th>Ex-Dividend Date</th>
<th>Payout Date</th>
<th>Cash Amount</th>
<th>% Change</th>
</tr>
</thead>
<tbody>
<tr>
<td><i>2022-10-29</i></td>
<td><i>2022-11-26</i></td>
<td><i>$1.33**</i></td>
<td><i>unconfirmed/estimated</i></td>
</tr>
<tr>
<td><i>2022-07-30</i></td>
<td><i>2022-08-26</i></td>
<td><i>$1.33**</i></td>
<td><i>unconfirmed/estimated</i></td>
</tr>
<tr>
<td>2022-04-29</td>
<td>2022-05-26</td>
<td>$1.33</td>
<td></td>
</tr>
...
...
</table>
d'oh!
I can use IMPORTXML()
and just query for this table directly!
=IMPORTXML("https://dividendhistory.org/payout/TSX/BMO", "//table[@id='dividend_table']/*/*")
And using that query looks like this in Sheets:
You can continue to use the INDEX(range, row, col)
function to grab the exact cells you want but for myself I wanted to do a few fun things.
Packing the data
Now, I only really need some rows of data and I don't really care for the % Change
column, so we use ARRAY_CONSTRAIN(range, rows, cols)
=ARRAY_CONSTRAIN(IMPORTXML("https://dividendhistory.org/payout/TSX/BMO", "//table[@id='dividend_table']/*/*"), 6, 3)
I want to further pack the data so that it occupies it's own row or column which can be done using TRANSPOSE()
and FLATTEN()
and you'll see that order will matter:
We want to apply TRANSPOSE()
first then FLATTEN()
:
=FLATTEN(TRANSPOSE(ARRAY_CONSTRAIN(IMPORTXML("https://dividendhistory.org/payout/TSX/BMO", "//table[@id='dividend_table']/*/*"), 6, 3)))
Which makes it look like the columns are combined to a stack (hehe stack):
Searching the Data
If you ever had to take a course on algorithms and data structures you might remember doing a lot of stuff related to pointers and offsets and this is why I stored the data to fit one column!
When am I going to get paid next?
I can make use of the MATCH(key, range, [search_type])
formula to search for a relative offset.
The search_type
is useful because I want to find the next payout date. To find the date that is equal to or greater than today I can set the search_type
to -1
and luckily for us the dates are already sorted in descending order:
Sample of what we're going to do with the data:
As of writing, TODAY()
is 2022-04-08
and we can see that the relative offset of the next payout date (2022-05-26
) when searching from B14
to B18
is 3
from the top, so located in B16
=MATCH(TODAY(), B14:B18, -1)
We can also get the exact row where Payout Date
starts at using ROW()
(useful if you drag things here and there instead of hardcoding):
=ROW(B13)
So now we know that the next payout date is located in cell B(13 + 3
but how can we get the information from that cell address?
We can use the INDIRECT()
function to reference a cell using an address that's in TEXT
.
Basically we want a generic way to write:
=INDIRECT("B16")
The cool thing is we can use COLUMN
and INDIRECT()
also accepts address in RyCx
format if we set the optional is_A1_notation
flag to FALSE
. Again I like using &
to build strings in sheets:
=INDIRECT("R" & SUM(B1 + B3) & "C" & COLUMN(), FALSE)
And we can do the same for getting the next Cash Amount by using a different starting offset but same relative offset:
=INDIRECT("R" & SUM(B2 + B3) & "C" & COLUMN(), FALSE)
What about the other data?
In my last post I also scraped the yield
. I also did this and with a bit of data massaging also tacked it onto the column as you can see here in my raw data spreadsheet example:
So ultimately I end up making only two calls to dividend history per ticker symbol which is much better than up-to two calls for the dividend amount, and one call for the yield. We also get all the useful data from the page.
Here is an example of what you can do with the raw data and only taking the relevant data:
There's all sorts of stuff you can do for example, taking the frequency and calculating how much dividend you could expect from a single share over a year. You can use IF()
to decide if the frequency would be 12 (monthly) or 4 (quarterly).
And here is a link to the example if you would like to make a copy and check out the stuff for yourself. It's been pre-populated with Canadian banks and some American companies that pay dividends.
Top comments (2)
Thank you. I cant believe no one has commented on this in 2 yrs. Excellent tutorial on how you used these little known (to me) functions to wrangle all that great stock and dividend information.
One glitch: NASDAQ listings seem to add one line of information to the table id="news-table"
"Next Earnings: ..."
which does not appear in TSX news table and as a result throws off your table at the bottom of Raw Dividend Data
Thanks for your kind comment and catching this bug!