Today we are scraping the New Mexico Secretary of State business search. It was certainly no gem like Washington was but it still offered some pretty neat things. This is the seventh article in the Secretary of State scraping series.
New Mexico may seem like an odd choice but I am from the westernish part of the United States. I’ve already done Oregon, Washington, California, Idaho, and Wyoming. I touched on Utah and Nevada and I may revisit them but they weren’t too fun from the little I looked.
Investigation
I always start off any scrape with a basic investigation of what it will take to scrape the data. With secretary of state pages, this begins at the search page. The New Mexico search page is just okay. It has a reCaptcha which could potentially be scary though it definitely wouldn’t stop me. See my post on how to avoid being blocked while web scraping.
My general goal when scraping secretary of state sites is to first see if I can get newly registered businesses. Because they are new, these businesses are less likely to have an established provider for things like business liability insurance or credit card services.
With that goal in mind, I proceeded with a search. I like to search for businesses with names that contain “2020” since these are more likely to be recent registrations.
From here, I checked the result I suspected was most likely to be a recent business, 2020 Investment Inc. Looking at the network request for this business gave me what I was looking for…a business id. It is important to notice that the businessId shown here is not the same as the public entityId.
With a number like this and the form data, it’s time to start testing to see if some simple incrementation will yield businesses with gradually later registration dates.
The investigation code
Before getting into the function that actually makes the call, I’d like to talk a bit about the strategy I used to find the upper limit of the ids.
// Probe 20 ids spaced `increment` apart, starting from a known-good id.
const increment = 10000;
for (let i = 0; i < 20; i++) {
    const candidateId = id + (i * increment);
    try {
        await getBusinessDetails(candidateId);
    }
    catch (e) {
        console.log('something broke', candidateId);
    }
}
I would just set up a simple 20x loop and increment the id by some number. I started with 10k and would check the date of incorporation of each business. If they are ascending, I know I’m on the right track. I keep going this way until I stop getting businesses. Then I take the highest id that worked, reduce the increment to 1000, and try again the same way. I keep this up until I’m incrementing by 1 and I hit the end.
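The narrowing strategy above can also be automated. Here is a minimal sketch of that idea, not the code I actually ran. The names findLastId and exists are hypothetical: exists is assumed to be an async function that resolves to true when an id returns a real business, for example by wrapping getBusinessDetails and treating a thrown error or an empty result as false.

```javascript
// Sketch only: walk forward in shrinking steps until the last valid id is found.
// `exists` is a hypothetical async predicate supplied by the caller.
async function findLastId(startId, exists, increments = [10000, 1000, 100, 10, 1]) {
    let current = startId; // assumed to be an id that is known to exist
    for (const increment of increments) {
        // Keep stepping while the next candidate still returns a business.
        while (await exists(current + increment)) {
            current += increment;
        }
        // The first miss means the end lies within one step; try a smaller one.
    }
    return current;
}
```

This assumes the ids are contiguous up to the end. A large gap in the id range would stop the search early, so it is still worth eyeballing the dates of incorporation along the way.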
With this method, I was able to determine that as of March 20, 2020, the last id was 612241 and it was registered on March 18, 2020. It should be noted that while the rule is generally true that the higher id is a later registration, it’s not always exactly true. For example, id 612240 was registered on March 20, 2020.
It is interesting to note that one run of ids ended in March of 2016 at 382227. There was a huge gap in numbers after that until 500k+, where the ids picked back up.
The data code
So they had the reCaptcha on the business search page but it really didn’t end up affecting me at all. I just made direct calls to the details page with the same payload that was sent by the browser.
const url = 'https://portal.sos.state.nm.us/BFS/online/corporationbusinesssearch/CorporationBusinessInformation';
const payload = `txtCommonPageNo=&hdnTotalPgCount=424&txtCommonPageNo=&hdnTotalPgCount=5&businessId=${id}&__RequestVerificationToken=WezJVY0GvxnAWyB0gLF74dWyoHimmADXqtBQ6wMp9U2RZKi6zBFQaoH2MmUFwKnuSZK2ZU5RsapHKPaA0q2DP5r3zWFIkYW0Aq5pYfKy1uY1`;
const axiosResponse = await axios.post(url, payload);
The id is the number we pass in to this function.
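If the long template string is hard to read, the same form body can be assembled with URLSearchParams, which is built into Node. This is only a sketch: buildPayload is a hypothetical helper, the field names and page counts are copied from the browser request above, and I have not checked which of those fields the server actually requires or how long a verification token stays valid.

```javascript
// Sketch only: build the same form-encoded body field by field.
// `buildPayload` is a hypothetical helper; `token` is whatever
// __RequestVerificationToken value was captured from the browser.
function buildPayload(id, token) {
    const params = new URLSearchParams();
    // The browser request repeats these two fields, so append() is used
    // instead of set() to keep the duplicates.
    params.append('txtCommonPageNo', '');
    params.append('hdnTotalPgCount', '424');
    params.append('txtCommonPageNo', '');
    params.append('hdnTotalPgCount', '5');
    params.append('businessId', String(id));
    params.append('__RequestVerificationToken', token);
    return params.toString();
}
```

The returned string can be passed to axios.post in place of the template string.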
From here, it was just using Cheerio and CSS selectors. While the explanation of this is simple, the actual choice of CSS selectors was a little tricky.
There weren’t any unique selectors for any of the fields I used, so it was just getting the nth-of-type for all of the different tr and table elements. There were also several different places the date of incorporation could be.
Here’s the code I ended up using. Fair warning, these are big CSS selectors.
const $ = cheerio.load(axiosResponse.data);
const entityId = $('.right_content > table tbody > tr:nth-of-type(3) table:nth-of-type(1) tr:nth-of-type(2) td:nth-of-type(2) strong').text();
const entityName = $('.right_content > table tbody > tr:nth-of-type(3) table:nth-of-type(1) tr:nth-of-type(3) td:nth-of-type(2) strong').text();
const address = $('.right_content > table > tbody > tr:nth-of-type(5) tr:nth-of-type(3) strong').text();
const registeredAgentName = $('.right_content > table > tbody > tr:nth-of-type(6) tr:nth-of-type(2) strong').text();
let dateOfIncorporation = $('.right_content > table tbody > tr:nth-of-type(3) table:nth-of-type(3) tr:nth-of-type(4) td:nth-of-type(2) strong strong').text();
if (!dateOfIncorporation || dateOfIncorporation.trim() === 'Not Applicable') {
    dateOfIncorporation = $('.right_content > table tbody > tr:nth-of-type(3) table:nth-of-type(3) tr:nth-of-type(2) td:nth-of-type(2) strong').text();
}
if (!dateOfIncorporation || dateOfIncorporation.trim() === 'Not Applicable') {
    dateOfIncorporation = $('.right_content > table tbody > tr:nth-of-type(3) table:nth-of-type(3) tr:nth-of-type(3) td:nth-of-type(4) strong').text();
}
if (!dateOfIncorporation || dateOfIncorporation.trim() === 'Not Applicable') {
    dateOfIncorporation = $('.right_content > table tbody > tr:nth-of-type(3) table:nth-of-type(3) tr:nth-of-type(3) td:nth-of-type(2) strong').text();
}
if (!dateOfIncorporation || dateOfIncorporation.trim() === 'Not Applicable') {
    dateOfIncorporation = $('.right_content > table tbody > tr:nth-of-type(3) table:nth-of-type(3) tr:nth-of-type(2) td:nth-of-type(4) strong').text();
}
const business = {
    entityId: entityId,
    entityName: entityName,
    address: address,
    dateOfIncorporation: dateOfIncorporation,
    dbId: id,
    agent: registeredAgentName
};
console.log('business', business);
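The chain of date fallbacks above could also be written as a loop over candidate selectors. This is a sketch of that alternative, not what I shipped. firstUsableText is a hypothetical helper, and getText is assumed to be a function that returns the text for a selector, such as (selector) => $(selector).text() with Cheerio.

```javascript
// Sketch only: return the first candidate whose text is neither empty
// nor the 'Not Applicable' placeholder. `getText` is supplied by the caller.
function firstUsableText(selectors, getText) {
    for (const selector of selectors) {
        const text = (getText(selector) || '').trim();
        if (text && text !== 'Not Applicable') {
            return text;
        }
    }
    // Nothing usable was found in any of the candidate cells.
    return '';
}
```

Unlike the original chain, this returns the trimmed value, and it returns an empty string when every candidate is blank or 'Not Applicable'. The order of the selectors array decides which cell wins, so it should match the order of the if blocks.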
And….that’s it!
Looking for business leads?
Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome business leads. Learn more at Cobalt Intelligence!
The post Jordan Scrapes Secretary of States: New Mexico appeared first on JavaScript Web Scraping Guy.