We continue our way from west to east with this post on scraping the Colorado Secretary of State business search. This is the eighth state in the Secretary of State scraping series and it’s just…okay.
I am actually a big fan of Colorado, though. I lived there for about a year after high school and it was a great experience. Beautiful scenery, nice weather, and great people. I have some really fond memories from Colorado.
Investigation
Colorado’s search is pretty good compared to a lot of other states. It offers an advanced search, which includes a date range. This is where I started in my investigation.
I was hoping at first that I would just be able to search a blanket date range with no keywords. Unfortunately, as you can see above, I was required to use a keyword. So I started with some phrase like “insurance” to see what it would return.
When I clicked on one of those links it took me to the business details page, which turned out to be a plain GET request with all the parameters I needed right in the url. I knew from this point on that this would be a fairly simple scrape.
I could also clearly see that the id was purely numeric. That usually means the easiest way to find the most recently registered businesses is to take the latest id and increment it every day to get the newest businesses.
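Since the id is the only part of the details url that changes, the url can be built from it. Here is a minimal sketch; `buildDetailsUrl` is a hypothetical helper name, and the query parameters are simply the ones that show up in the url used later in this post.

```typescript
// Hypothetical helper: builds the business details url for a numeric id.
// The parameter names and values are copied from the url seen in the browser.
function buildDetailsUrl(id: number): string {
  const params = new URLSearchParams({
    quitButtonDestination: "BusinessEntityResults",
    nameTyp: "ENT",
    masterFileId: String(id),
    srchTyp: "ENTITY",
  });
  return `https://www.sos.state.co.us/biz/BusinessEntityDetail.do?${params.toString()}`;
}
```

Keeping the id as the single argument makes it easy to loop over a range of ids.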
The code
const url = `https://www.sos.state.co.us/biz/BusinessEntityDetail.do?quitButtonDestination=BusinessEntityResults&nameTyp=ENT&masterFileId=${id}&srchTyp=ENTITY`;
const axiosResponse = await axios.get(url);
I started off with just a simple axios GET request with the url it included. This was not successful. It did not return the data I expected. It wasn’t giving me an error but I couldn’t easily pick out what the problem was. The response just looked like a tiny bit of header html and then a bunch of javascript.
I was a little frustrated at first. I couldn’t see anything that was different in my request. Finally, I just decided to go with puppeteer. I was so confident that puppeteer would work that I just ran it headless. When I started it up, though, it also returned nothing. Now I knew something was weird. I booted it up with headless turned off and voilà. It said my session had expired after two hours.
Now I felt silly. I was almost positive this was because I didn’t attach the cookie. I did so like this:
const url = `https://www.sos.state.co.us/biz/BusinessEntityDetail.do?quitButtonDestination=BusinessEntityResults&nameTyp=ENT&masterFileId=${id}&srchTyp=ENTITY`;
const axiosResponse = await axios.get(url, {
    headers: {
        "cookie": "_ga=GA1.3.1381612432.1583500038; TS01f3ddad=01c6cfed7058d3570a7b8344ca028cd52010a1bcf052a882a6195931d8d4801dcbc519811c; _gid=GA1.3.271714947.1584784465; _gcl_au=1.1.1920694888.1584784465; menuheaders=2c; JSESSIONID=0000-lUDWo0hoJF3KBZaTQhz2ky:1b2r5m433; TS01132dd1=01c6cfed70e9ee8af77449cdde2589a53daffb8b6a7e3c32cc35681d8cf2b339e867af342cb703972c8ec98ae365ee36bdf5252efe; TS01132dd1_31=01526889668b923b0ce8fbdc5bbe11681884b4cc6d40141358085bb74c48f95130a2d87e7e0f8e8253f5fff6da0b4baf2c9cf5b86b; TS01132dd1_77=08b7f17dc7ab280060a8e77ff41a702cebf64b14f9b271745c648c87e299d967048fbefd790857a34527f2366b0a851a08c277556d8238006bcd64b36d369c03b866c4d7cb9bf11cc86207c2ca16057ca0063033673c043daefbec33b0fcfc4e998b9c0d9ab1dfafdd374e5b20550eb6"
    }
});
And…it all worked. No problem. I think it is noteworthy that it said my session expired after two hours. This tells me that this cookie probably won’t stay valid forever, so it’s the first thing I’ll check if the scrape stops returning data in the future.
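If the pasted cookie does go stale, one option might be to ask the site for a fresh session instead of copying one from the browser. The sketch below only covers the part I’m sure about: turning `set-cookie` response values into a `cookie` request header. `toCookieHeader` is a hypothetical helper, and whether Colorado hands out a usable session on a plain first request is an assumption I have not tested.

```typescript
// Hypothetical helper: turns `set-cookie` response values into a single
// `cookie` request header, keeping only the leading name=value pair of each
// and dropping attributes such as Path or HttpOnly.
function toCookieHeader(setCookie: string[]): string {
  return setCookie
    .map((line) => line.split(";")[0].trim())
    .filter((pair) => pair.length > 0)
    .join("; ");
}
```

With axios on Node, the values would come from `response.headers["set-cookie"]` on an earlier request, and the result would go in the `cookie` header of the details request.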
Incrementing to find the end
From here I used my classic technique to try and find where the end of the ids was. I already had a very good idea that I was near the end since I was able to search by a date range to begin with.
const id = 20201243843;
const increment = 3;
for (let i = 0; i < 20; i++) {
    try {
        await getBusinessDetails(id + (i * increment));
    }
    catch (e) {
        console.log('something broke', id + (i * increment));
    }
}
I set up a simple loop that runs 20 times and increments the id by some number each time. I started with 100, since I knew I was close to the end, and checked the date of incorporation of each business. If the dates are ascending, I know I’m on the right track. I keep going this way until I stop getting businesses. Then I take the highest id that returned a business, reduce the increment to 10, and try again. I keep this up until I’m incrementing by 1 and I hit the end.
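I did this by hand, changing the increment between runs, but the same idea can be sketched as one function. Everything here is hypothetical: `findLastId`, the `Probe` type, and the default increments and tries are illustration only, and the probe stands in for whatever check tells you a business exists at an id.

```typescript
// A probe resolves to true when a business exists at the given id.
type Probe = (id: number) => Promise<boolean>;

// Hypothetical sketch of the coarse-to-fine search: step forward with a large
// increment until a full pass finds nothing, then repeat from the highest id
// that had a business using a smaller increment.
async function findLastId(
  start: number,
  probe: Probe,
  increments: number[] = [100, 10, 1],
  triesPerPass = 20,
): Promise<number> {
  let highest = start;
  for (const increment of increments) {
    let foundInPass = true;
    while (foundInPass) {
      foundInPass = false;
      const base = highest;
      for (let i = 1; i <= triesPerPass; i++) {
        const candidate = base + i * increment;
        if (await probe(candidate)) {
          highest = candidate;
          foundInPass = true;
        }
      }
    }
  }
  return highest;
}
```

This is only as good as the assumption that there are no long stretches of empty ids before the real end. A gap longer than 20 ids at the finest increment would stop it early.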
Hitting the end turned out to be a bit more difficult than it was with New Mexico and other states where I have used this approach. It is also why this scrape for business leads isn’t quite as good as others. I’m not sure of the reason, but Colorado seems to have gaps between ids.
As you can see above, a bunch of ids are empty and then there is a business. This was what it was like for pretty much all of the ids. The dates were definitely ascending but there was no guarantee that incrementing by one would always hit another business.
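One way to live with the gaps is to stop only after a long run of empty ids instead of at the first miss. This is a sketch under assumptions: `scanWithGaps` and `Probe` are hypothetical names, and the default of 50 consecutive misses is a guess, not a number I measured for Colorado.

```typescript
// A probe resolves to true when a business exists at the given id.
type Probe = (id: number) => Promise<boolean>;

// Hypothetical sketch: walk forward one id at a time and give up only after
// `maxMisses` empty ids in a row. Returns every id that had a business.
async function scanWithGaps(
  start: number,
  probe: Probe,
  maxMisses = 50,
): Promise<number[]> {
  const found: number[] = [];
  let misses = 0;
  for (let id = start + 1; misses < maxMisses; id++) {
    if (await probe(id)) {
      found.push(id);
      misses = 0; // a hit resets the run of empty ids
    } else {
      misses++;
    }
  }
  return found;
}
```

The trade-off is request volume: a larger `maxMisses` is less likely to stop early but sends more requests that return nothing.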
Selecting with CSS
A lot of these secretary of state pages use mostly an older style of html formatting with a lot of tables and very little in the way of CSS selectors. As a result, I’ve become pretty good at using just nth-of-type to select through the tables, table rows, and table cells.
This scrape was no exception. The css looks huge but it actually makes quite a bit of sense and is really easy to test in the browser.
const title = $('form > table >tbody > tr> td> table > tbody > tr:nth-of-type(1) tr:nth-of-type(2) td:nth-of-type(2)').text();
const date = $('form > table >tbody > tr> td> table > tbody > tr:nth-of-type(1) tr:nth-of-type(3) td:nth-of-type(4)').text();
const status = $('form > table >tbody > tr> td> table > tbody > tr:nth-of-type(1) tr:nth-of-type(3) td:nth-of-type(2)').text();
const address = $('form > table >tbody > tr> td> table > tbody > tr:nth-of-type(1) tr:nth-of-type(6) td:nth-of-type(2)').text();
const agent = $('form > table >tbody > tr> td> table > tbody > tr:nth-of-type(7) tr:nth-of-type(2) td:nth-of-type(2)').text();
I just keep trying the selectors until I find a specific instance of what I’m looking for. It’s starting to feel almost foolproof.
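All five selectors share the same long prefix and differ only in three numbers, so they could be generated by a small helper. This is just a sketch; `cellSelector` and `TABLE_PREFIX` are hypothetical names, and the spacing around `>` is normalized, which does not change what the selector matches.

```typescript
// Shared prefix of every selector used on the details page.
const TABLE_PREFIX = "form > table > tbody > tr > td > table > tbody";

// Hypothetical helper: builds the nth-of-type selector for one table cell.
// `section` is the outer row, `row` the nested row, and `cell` the td in it.
function cellSelector(section: number, row: number, cell: number): string {
  return `${TABLE_PREFIX} > tr:nth-of-type(${section}) tr:nth-of-type(${row}) td:nth-of-type(${cell})`;
}
```

With that, the title would be `$(cellSelector(1, 2, 2)).text()` and the agent `$(cellSelector(7, 2, 2)).text()`.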
And…honestly, that’s it. Here’s the final function but it’s pretty simple. Make the request, use cheerio, parse the html. Colorado displayed something when the business id wasn’t found and since there were a lot of those, I watched for that message and would log it out when I hit it.
async function getBusinessDetails(id: number) {
    const url = `https://www.sos.state.co.us/biz/BusinessEntityDetail.do?quitButtonDestination=BusinessEntityResults&nameTyp=ENT&masterFileId=${id}&srchTyp=ENTITY`;
    const axiosResponse = await axios.get(url, {
        headers: {
            "cookie": "_ga=GA1.3.1381612432.1583500038; TS01f3ddad=01c6cfed7058d3570a7b8344ca028cd52010a1bcf052a882a6195931d8d4801dcbc519811c; _gid=GA1.3.271714947.1584784465; _gcl_au=1.1.1920694888.1584784465; menuheaders=2c; JSESSIONID=0000-lUDWo0hoJF3KBZaTQhz2ky:1b2r5m433; TS01132dd1=01c6cfed70e9ee8af77449cdde2589a53daffb8b6a7e3c32cc35681d8cf2b339e867af342cb703972c8ec98ae365ee36bdf5252efe; TS01132dd1_31=01526889668b923b0ce8fbdc5bbe11681884b4cc6d40141358085bb74c48f95130a2d87e7e0f8e8253f5fff6da0b4baf2c9cf5b86b; TS01132dd1_77=08b7f17dc7ab280060a8e77ff41a702cebf64b14f9b271745c648c87e299d967048fbefd790857a34527f2366b0a851a08c277556d8238006bcd64b36d369c03b866c4d7cb9bf11cc86207c2ca16057ca0063033673c043daefbec33b0fcfc4e998b9c0d9ab1dfafdd374e5b20550eb6"
        }
    });
    const $ = cheerio.load(axiosResponse.data);
    const title = $('form > table >tbody > tr> td> table > tbody > tr:nth-of-type(1) tr:nth-of-type(2) td:nth-of-type(2)').text();
    const date = $('form > table >tbody > tr> td> table > tbody > tr:nth-of-type(1) tr:nth-of-type(3) td:nth-of-type(4)').text();
    const status = $('form > table >tbody > tr> td> table > tbody > tr:nth-of-type(1) tr:nth-of-type(3) td:nth-of-type(2)').text();
    const address = $('form > table >tbody > tr> td> table > tbody > tr:nth-of-type(1) tr:nth-of-type(6) td:nth-of-type(2)').text();
    const agent = $('form > table >tbody > tr> td> table > tbody > tr:nth-of-type(7) tr:nth-of-type(2) td:nth-of-type(2)').text();
    const pageMessages = $('#pageMessages .page_messages').text();
    const business = {
        title: title,
        date: date,
        status: status,
        address: address,
        agent: agent,
        id: id
    };
    if (pageMessages) {
        console.log('Looks like not found', pageMessages.trim(), id);
    }
    else {
        console.log('response', business);
    }
}
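The function above only logs what it finds. If the results need to be collected, one possible next step is to turn the scraped fields into either a cleaned-up record or `null` for a missing id. This is a sketch, not what I actually ran: `Business`, `collapse`, and `toResult` are hypothetical names, and I’m assuming the text pulled out of table cells may carry stray whitespace. It also treats a whitespace-only page message as no message, which is slightly stricter than the check above.

```typescript
interface Business {
  title: string;
  date: string;
  status: string;
  address: string;
  agent: string;
  id: number;
}

// Collapses runs of whitespace (including newlines) into single spaces.
const collapse = (text: string): string => text.replace(/\s+/g, " ").trim();

// Hypothetical helper: returns null when the page showed a message (treated
// as "not found"), otherwise the business with its text fields cleaned up.
function toResult(raw: Business, pageMessages: string): Business | null {
  if (collapse(pageMessages) !== "") {
    return null;
  }
  return {
    id: raw.id,
    title: collapse(raw.title),
    date: collapse(raw.date),
    status: collapse(raw.status),
    address: collapse(raw.address),
    agent: collapse(raw.agent),
  };
}
```

Returning `null` for empty ids makes it easy for a calling loop to count misses.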
Looking for business leads?
Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome business leads. Learn more at Cobalt Intelligence!
The post Jordan Scrapes Secretary of States: Colorado appeared first on JavaScript Web Scraping Guy.