Scraping the Ohio secretary of state is where I am headed today. It proved to be one of the more interesting secretary of state scrapes. This is part of the Secretary of State series.
While I have been to Ohio, I’ve only been to smaller parts of it and mostly just passed through. I chose it because it is the 7th most populous state.
Investigation
I try to look for the most recently registered businesses. They are the businesses that are very likely trying to get set up with new services and products and probably don’t have existing relationships. I think these are typically going to be the more valuable leads.
If the state doesn’t offer a date range to search by, I’ve discovered a trick that works pretty well: I just search for “2020”. It’s a catchy number, and because that’s the current year, people tend to put it in the names of the businesses they start.
Once I find one of these that was registered recently, I look for a business id somewhere. It’s typically a query parameter in the url or form data in the POST request. Either way, if I can increment that id by one and still get a company that was recently registered, I know I can find recently registered businesses simply by increasing the id I search with.
Ohio was a little different from my typical investigation. The registered ids do increment, with bigger numbers being more recently registered, but there were some other caveats.
I immediately loved the search that Ohio provides. You can select whether you want active businesses or not (which I do) and then you can sort! This made it very easy to see that the entity numbers were incrementing with more recent being the bigger number.
Selecting the “Show Details” button opened a modal! And where there is a modal there is ajax. And with ajax, there is often a chance for json. And…look!
Business details! Sweet. But look at that endpoint. What is that?
https://businesssearchapi.ohiosos.gov/Rtj0lqmmno6vaBwbRxU7TunJY6RmAt0bipK4478286?_=1590672273857
That number on the end definitely wasn’t the entity number. I thought at first maybe it was a database id and tried incrementing it without any luck. Finally I realized that the entity number was indeed in there, I was just overlooking it.
https://businesssearchapi.ohiosos.gov/Rtj0lqmmno6vaBwbRxU7TunJY6RmAt0bipK 4478286?_=1590672273857
The rest…well, I just don’t know.
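To make that URL structure concrete, here is a minimal sketch that builds and takes apart a details url. The helper names `buildDetailsUrl` and `parseEntityNumber` are mine, not anything Ohio provides, and I’m assuming the `_` parameter is just a cache-busting timestamp and that the long prefix stays the same between requests.

```javascript
// Prefix copied from the details url observed above. What it encodes is unknown.
const DETAILS_PREFIX = 'https://businesssearchapi.ohiosos.gov/Rtj0lqmmno6vaBwbRxU7TunJY6RmAt0bipK';

// Hypothetical helper: prefix + entity number + "_" timestamp (assumed cache buster).
function buildDetailsUrl(entityNumber, now = Date.now()) {
    return `${DETAILS_PREFIX}${entityNumber}?_=${now}`;
}

// Hypothetical helper: pull the entity number back out of a details url,
// assuming it is everything between the known prefix and the query string.
function parseEntityNumber(url) {
    if (!url.startsWith(DETAILS_PREFIX)) {
        return null;
    }
    const rest = url.slice(DETAILS_PREFIX.length).split('?')[0];
    return /^\d+$/.test(rest) ? Number(rest) : null;
}
```

Calling `buildDetailsUrl(4478286, 1590672273857)` reproduces the url from the network tab, which is at least consistent with the idea that the entity number is simply appended to the prefix.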
Endpoints
I happened to notice another ajax call being made that was interesting.
https://businesssearch.ohiosos.gov/ajax/endPoints.json?_=1590672273856
And the results were just a bunch of urls. One of the urls is what is used to get the details ajax call. I attempted all of these urls following the same format as the details url. For example: https://businesssearchapi.ohiosos.gov/zyjLcCmoqeZffOs1ajJdsiek3tmuj9QtZVn 4478286
They are named “div” something so I’m guessing that if this business has, for example, a priorname then that endpoint would be called and use the resulting information. The business I tested against had no need of this and so it just returned empty data.
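If you wanted to try every endpoint for a given business, something like the sketch below could work. I’m assuming here that endPoints.json parses into a flat object of name to base url, which I haven’t confirmed, and `buildEndpointUrls` is a made-up helper name. The keys in the usage note are placeholders, not the real names from the file.

```javascript
// Hypothetical helper: append the entity number to every base url in an
// endpoints object. The flat { name: baseUrl } shape is an assumption.
function buildEndpointUrls(endpoints, entityNumber) {
    const urls = {};
    for (const [name, baseUrl] of Object.entries(endpoints)) {
        // Same format as the details url: base url immediately followed by the entity number.
        urls[name] = `${baseUrl}${entityNumber}`;
    }
    return urls;
}
```

For example, `buildEndpointUrls({ placeholderName: 'https://businesssearchapi.ohiosos.gov/zyjLcCmoqeZffOs1ajJdsiek3tmuj9QtZVn' }, 4478286)` gives back an object whose `placeholderName` value is the url I tested above.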
The code
Now for the pretty easy part. When I made basic axios requests, the api immediately returned 503s. I started adding in headers and the minimum ones that are required are origin and user-agent.
const axiosResponse = await axios.get(url, {
    headers: {
        origin: 'https://businesssearch.ohiosos.gov',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    }
});
I tried the origin as “https://pizza.com” and it didn’t work. It has to be that specific origin.
Once I started looping through entity numbers and incrementing, I would occasionally get 503s. It wasn’t consistently the same businesses, however. Often it was just every other request. This led me to find that Ohio does some kind of throttling. If you make requests more often than once every 1-1.5 seconds, it returns a 503. I added a 2 second timeout between requests and didn’t have any further problems.
const axios = require('axios');

// Resolves after the given number of milliseconds.
const timeout = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
    const startingId = 4479034;
    for (let i = 0; i < 25; i++) {
        // Details endpoint prefix followed directly by the entity number.
        const url = `https://businesssearchapi.ohiosos.gov/Rtj0lqmmno6vaBwbRxU7TunJY6RmAt0bipK${startingId + i}?_=1590620350441`;
        try {
            const axiosResponse = await axios.get(url, {
                headers: {
                    origin: 'https://businesssearch.ohiosos.gov',
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
                }
            });
            const businessData = axiosResponse.data.data;
            console.log('Registrant', businessData[1].registrant[0].charter_num, businessData[1].registrant[0].contact_name);
            console.log('First panel', businessData[4].firstpanel[0].business_name, businessData[4].firstpanel[0].effect_date);
            // Other fields you might want:
            // businessData[1].registrant[0].effective_date_time
        }
        catch (e) {
            console.log('Error', e.response ? e.response.status : e);
        }
        // Wait 2 seconds so we stay under Ohio's throttle.
        await timeout(2000);
    }
})();
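The 2 second pause was enough for me, but if you wanted to be a bit more defensive about the throttling, a retry on 503 is one option. This is only a sketch and not something I ran against Ohio. `getWithRetry` is a made-up helper, and it takes the request as a function so it doesn’t care whether you use axios or something else. It assumes errors carry an axios-style `response.status`.

```javascript
// Resolves after the given number of milliseconds.
const timeout = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: call request() and retry only when it fails with a 503,
// waiting delayMs between attempts. Any other error is rethrown right away.
async function getWithRetry(request, { retries = 3, delayMs = 2000 } = {}) {
    for (let attempt = 0; ; attempt++) {
        try {
            return await request();
        }
        catch (e) {
            const status = e.response ? e.response.status : undefined;
            if (status !== 503 || attempt >= retries) {
                throw e;
            }
            await timeout(delayMs);
        }
    }
}
```

Inside the loop you would swap the direct call for something like `await getWithRetry(() => axios.get(url, { headers }))`, where `headers` is the same origin and user-agent pair as before.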
And…done. From there it was just parsing the JSON. Very simple. Fun scrape!
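For the parsing itself, here is a rough sketch of how I might pull out the same fields without hard-coding the array positions. The only thing I know about the response shape is what the code above uses (`data[1].registrant[0]` and `data[4].firstpanel[0]`), so searching the array by key name is an assumption that those keys don’t show up anywhere else. `findSection` and `extractBusiness` are made-up names.

```javascript
// Hypothetical helper: find the first array entry that has a non-empty
// array under the given key and return that array's first item.
function findSection(data, key) {
    if (!Array.isArray(data)) {
        return null;
    }
    const entry = data.find((item) => item && Array.isArray(item[key]) && item[key].length > 0);
    return entry ? entry[key][0] : null;
}

// Hypothetical helper: takes the parsed response body (axiosResponse.data)
// and returns the fields logged in the loop above, or null when one is missing.
function extractBusiness(responseBody) {
    const data = responseBody && responseBody.data;
    const registrant = findSection(data, 'registrant') || {};
    const firstPanel = findSection(data, 'firstpanel') || {};
    return {
        charterNum: registrant.charter_num ?? null,
        contactName: registrant.contact_name ?? null,
        businessName: firstPanel.business_name ?? null,
        effectDate: firstPanel.effect_date ?? null
    };
}
```

With this, an entity number that comes back with empty data gives you an object full of nulls instead of throwing, which makes it easier to skip gaps while looping.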
Looking for business leads?
Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome business leads. Learn more at Cobalt Intelligence!
The post Jordan Scrapes Secretary of States: Ohio appeared first on JavaScript Web Scraping Guy.