Okay, I’ll admit it. I really don’t know anything about West Virginia. I’m still scraping its secretary of state for business leads. If you look at a map, it’s definitely west of Virginia so the name checks out.
I chose it at random for scraping, and it turned out to be an easy scrape using some of the techniques I’ve built up over the other secretary of state pages I’ve scraped.
Investigation
I try to look for the most recently registered businesses. They are very likely trying to get set up with new services and products and probably don’t have existing relationships. I think these are typically going to be the more valuable leads.
If the state doesn’t offer a date range with which to search, I’ve discovered a trick that works pretty okay. I just search for “2020”. 2020 is kind of a catchy number, and because we are currently in that year, people tend to start businesses with that number in the name.
Once I find one of these that was registered recently, I look for a business id somewhere. It’s typically a query parameter in the url or form data in the POST request. Either way, if I can increment that id by one and still get a company that was recently registered, I know I can find recently registered businesses simply by increasing the id with which I search.
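The increment check described above can be sketched roughly like this. It is only a sketch under assumptions: the helper names (`candidateIds`, `isRecent`, `looksSequential`) and the 30-day window are hypothetical and not part of the scraper, and the dates are assumed to have already been scraped for each id.

```typescript
// Hypothetical helper: the ids to probe, starting from a known recent one.
function candidateIds(startingId: number, count: number): number[] {
  return Array.from({ length: count }, (_, i) => startingId + i);
}

// Hypothetical helper: true when the date text parses and is no older than maxAgeDays.
// Assumption: the text is in a format Date.parse understands. ISO strings such as
// "2020-03-01" are safe; other formats are parsed in an implementation-dependent way.
function isRecent(dateText: string, now: Date, maxAgeDays: number): boolean {
  const parsed = Date.parse(dateText);
  if (Number.isNaN(parsed)) {
    return false;
  }
  const ageInDays = (now.getTime() - parsed) / (1000 * 60 * 60 * 24);
  return ageInDays >= 0 && ageInDays <= maxAgeDays;
}

// Hypothetical helper: the ids look sequential by registration date when
// every date scraped from consecutive ids is recent.
function looksSequential(dates: string[], now: Date, maxAgeDays: number): boolean {
  return dates.length > 0 && dates.every((d) => isRecent(d, now, maxAgeDays));
}
```

If `looksSequential` comes back true for the dates scraped from a handful of consecutive ids, that is a decent hint that walking the ids upward will keep turning up new registrations.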
West Virginia, fortunately, had an advanced search which included adding a date range.
Selecting any of the results revealed what I was looking for: a business id in the query parameter that appeared to be numeric. Incrementing it by one shows another recently registered business. BAM. Newly registered businesses found.
The code
This part is crazy simple. I depend on Axios to make the GET request and cheerio to parse the html. I start with a basic function looping through 20 ids to check that they are indeed incrementing.
(async () => {
    // const startingId = 11045521;
    const startingId = 493294;

    for (let i = 0; i < 20; i++) {
        await getBusinessDetails(startingId + i);
    }
})();
And then the getBusinessDetails function just takes the id, makes the get request with the incremented id and gets the fields we want.
import axios from 'axios';
import * as cheerio from 'cheerio';

async function getBusinessDetails(id: number) {
    const url = `https://apps.sos.wv.gov/business/corporations/organization.aspx?org=${id}`;

    // Fetch the organization page and load the html into cheerio
    const axiosResponse = await axios.get(url);
    const $ = cheerio.load(axiosResponse.data);

    // Each section of data lives in its own table, so pick the table, then the row and cell
    const title = $('#lblOrg').text();
    const date = $('table:nth-of-type(1) tr:nth-of-type(3) td:nth-of-type(4)').text();
    const address = $('table:nth-of-type(3) tr:nth-of-type(3) td:nth-of-type(1)').text();
    const officer = $('table:nth-of-type(4) tr:nth-of-type(3) td:nth-of-type(1)').text();

    const business = {
        title,
        date,
        address,
        officer
    };

    console.log('business', business);
}
The html is super simple here. Each section of data is within a table so I use nth-of-type to find the one I want and then I just pluck from the rows and cells to grab the data I want from those. Very simple scrape. The end.
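Since the three selectors share the same table, row, cell shape, the pattern could be captured in a small helper. This is only a sketch: `cellSelector` is a hypothetical name and is not part of the scraper above, and the positions are 1-based because that is how nth-of-type counts.

```typescript
// Hypothetical helper: build the "nth table, nth row, nth cell" selector string.
// All three positions are 1-based, matching CSS nth-of-type.
function cellSelector(table: number, row: number, cell: number): string {
  return `table:nth-of-type(${table}) tr:nth-of-type(${row}) td:nth-of-type(${cell})`;
}

// The same selectors used in getBusinessDetails, expressed through the helper.
const dateSelector = cellSelector(1, 3, 4);
const addressSelector = cellSelector(3, 3, 1);
const officerSelector = cellSelector(4, 3, 1);
```

Each string can be passed straight to `$()` in place of the hard-coded selector, which keeps the table and cell positions easy to tweak if the page layout shifts.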
These posts are starting to get smaller, it seems. I think this is partially because I’m getting better at this. If I’m missing some things that you would be interested in, please let me know and I’ll be happy to go into more depth.
Looking for business leads?
Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome business leads. Learn more at Cobalt Intelligence!
The post Jordan Scrapes Secretary of States: West Virginia appeared first on JavaScript Web Scraping Guy.