It is time for episode 19 in the Secretary of State scraping series. Today we do some web scraping of the Arkansas Secretary of State, found here. I really don’t know much about Arkansas but that featured image certainly does look gorgeous.
Investigation
I try to look for the most recently registered businesses. They are the businesses most likely to be getting set up with new services and products, and they probably don’t have existing relationships yet. I think these are typically going to be the more valuable leads.
If the state doesn’t offer a date range with which to search, I’ve discovered a trick that works pretty okay. I just search for “2020”. 2020 is kind of a catchy number, and because we are currently in that year, people tend to start businesses that have that number in the name.
Once I find one of these that was registered recently, I look for a business id somewhere. It’s typically a query parameter in the url or form data in the POST request. Either way, if I can increment that id by one and still get a company that was recently registered, I know I can find recently registered businesses simply by increasing the id with which I search.
This is exactly the tactic used in Arkansas:
Searching for 2020 reveals a list of businesses with 2020 in the name. Going through a few turns up one that was registered recently, only a few months ago.
Now let’s look at the details page for this business.
Bam. We’re in business. You can see an id in the url. Incrementing it showed that businesses became more recent as the number got bigger.
Finding businesses with the time-tested method worked like a charm.
The code
The code is simple. We just loop through ids and then parse the html.
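// Resolves after the given number of milliseconds
function timeout(ms: number) {
    return new Promise((resolve) => setTimeout(resolve, ms));
}

(async () => {
    const startingId = 566000;
    // Check 20 consecutive ids, beginning with startingId
    for (let i = 0; i < 20; i += 1) {
        await getDetails(startingId + i);
        // Longer timeout needed because of DDoS protection from the website
        await timeout(3000);
    }
})();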
We added a longer wait time here to make sure we aren’t risking getting blocked. Three seconds may be longer than necessary, so adjust it to whatever works for you.
In this example we only loop over a small fixed range of ids, but if you pull the newly registered businesses daily, you will want the loop to stop once it stops finding new businesses.
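Here is a minimal sketch of that stopping rule, assuming a lookup returns null when an id has no business behind it. The names scrapeUntilStale, fetchBusiness, and maxConsecutiveMisses are my own placeholders and not part of the original code. I haven’t checked what the Arkansas site actually returns for an unused id, so the null convention is an assumption too. The loop tolerates a few misses in a row before giving up, in case the ids are not perfectly contiguous.

```typescript
// Hypothetical shape for a scraped record (not from the original code).
type Business = { sosId: number; title: string };

// Assumption: the lookup resolves to null when an id has no business.
type FetchBusiness = (sosId: number) => Promise<Business | null>;

// Resolves after the given number of milliseconds.
const wait = (ms: number) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms));

// Walks ids upward from startingId and stops after
// maxConsecutiveMisses empty ids in a row.
async function scrapeUntilStale(
    startingId: number,
    fetchBusiness: FetchBusiness,
    maxConsecutiveMisses = 5,
    delayMs = 3000
): Promise<Business[]> {
    const found: Business[] = [];
    let misses = 0;

    for (let sosId = startingId; misses < maxConsecutiveMisses; sosId += 1) {
        const business = await fetchBusiness(sosId);
        if (business === null) {
            misses += 1;
        } else {
            // A hit resets the counter, so isolated gaps do not end the run.
            misses = 0;
            found.push(business);
        }
        // Pause between requests to stay polite to the site.
        await wait(delayMs);
    }

    return found;
}
```

To use it, you would wrap getDetails so that it returns the business, or null when the title comes back empty, and pass that wrapper in as fetchBusiness. Saving the last id that produced a hit gives you the starting point for the next day’s run.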
The details code is also very simple.
import axios from "axios";
import * as cheerio from "cheerio";

async function getDetails(sosId: number) {
    const axiosResponse = await axios.get(`https://www.sos.arkansas.gov/corps/search_corps.php?DETAIL=${sosId}`);
    const $ = cheerio.load(axiosResponse.data);

    // Each field sits in the second cell of a fixed row in the details table
    const title = $("tr:nth-of-type(2) td:nth-of-type(2)").text();
    const formationDate = $("tr:nth-of-type(11) td:nth-of-type(2)").text();
    const status = $("tr:nth-of-type(7) td:nth-of-type(2)").text();
    const agentName = $("tr:nth-of-type(9) td:nth-of-type(2)").text();
    const address = $("tr:nth-of-type(8) td:nth-of-type(2)").text();

    const business = { title, formationDate, sosId, status, agentName, address };

    console.log("business", business);
}
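Since the whole point is recently registered businesses, the formationDate string scraped above could be used to filter results. This is only a sketch: I am assuming the date comes back formatted like MM/DD/YYYY, which I have not confirmed against the Arkansas page, and parseFormationDate and isRecentlyFormed are hypothetical helper names, not part of the original code.

```typescript
// Assumption: formation dates look like "MM/DD/YYYY". Adjust the pattern
// if the site uses a different format.
function parseFormationDate(raw: string): Date | null {
    const match = raw.trim().match(/^(\d{1,2})\/(\d{1,2})\/(\d{4})$/);
    if (!match) {
        return null;
    }
    const month = Number(match[1]);
    const day = Number(match[2]);
    const year = Number(match[3]);
    const date = new Date(Date.UTC(year, month - 1, day));

    // Reject impossible dates such as 02/31/2020, which Date would roll over.
    const rolledOver =
        date.getUTCFullYear() !== year ||
        date.getUTCMonth() !== month - 1 ||
        date.getUTCDate() !== day;
    return rolledOver ? null : date;
}

// True when the business was formed within the last maxAgeDays days.
// Unparseable or future dates count as not recent.
function isRecentlyFormed(
    raw: string,
    maxAgeDays: number,
    now: Date = new Date()
): boolean {
    const formed = parseFormationDate(raw);
    if (formed === null) {
        return false;
    }
    const msPerDay = 24 * 60 * 60 * 1000;
    const ageDays = (now.getTime() - formed.getTime()) / msPerDay;
    return ageDays >= 0 && ageDays <= maxAgeDays;
}
```

Passing now explicitly keeps the check easy to test. In a daily job you would leave it off and let it default to the current time.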
Arkansas was a very cut-and-dried scrape, which was nice. The end!
Looking for business leads?
Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome web data. Learn more at Cobalt Intelligence!
The post Jordan Scrapes Secretary of State: Arkansas appeared first on JavaScript Web Scraping Guy.