Jordan Hansen

Posted on • Originally published at javascriptwebscrapingguy.com on

Jordan Scrapes Secretary of State: North Carolina

Demo code here

Today we do web scraping on the North Carolina Secretary of State. I’ve been to North Carolina once and it seemed like a great state. Really pretty with some beautiful beaches. This is the 15th (!!) entry in the Secretary of State web scraping series.

Investigation

I try to look for the most recently registered businesses. They are the businesses that are very likely trying to get set up with new services and products and probably don’t have existing relationships. I think these are typically going to be the more valuable leads.

If the state doesn’t offer a date range with which to search, I’ve discovered a trick that works pretty okay. I just search for “2020”. 2020 is kind of a catchy number, and because we are currently in that year, people tend to start businesses with that number in their names.

Once I find one of these that is registered recently, I look for a business id somewhere. It’s typically a query parameter in the url or form data in the POST request. Either way, if I can increment that id by one and still get a company that is recently registered, I know I can find recently registered businesses simply by increasing the id with which I search.

North Carolina was not much different. They allow you to search with pretty standard stuff. No date range, sadly, so the above “2020” trick works pretty well.

North Carolina secretary of state business search for 2020

Bingo, just like that we find a business registered in July of this year. Worked like a charm.

Whenever I’m first investigating a site, I always check the network requests. Often you can find that there are direct requests to an API that has the data you need. When I selected this company, 2020 Analytics LLC, I saw this network request and I thought I was in business.

ajax request for the business registration profile

This request didn’t return any easy-to-parse JSON, sadly, only HTML. Still, I should be able to POST that Sos ID to this request, get what I wanted, and just increment from there.

Maybe you’re seeing what I missed.

Database id vs Sos id

The id shown in that photo was a lot bigger than the Secretary of State id. 16199332 vs 2006637. I started making requests and plucking out the filing date and the business title starting with 16199332.

The results were pretty inconsistent. The first indication that something was up was that the results weren’t exactly sequential. One business would be registered on 7/21/2020 and then 10 numbers later a business was registered on 6/24/2020.
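A quick way to confirm that kind of mismatch is to collect a handful of (id, filing date) pairs and flag any date that goes backwards as the id goes up. This is only a minimal sketch: the `FilingRecord` shape, both function names, and the sample ids are made up for illustration, and it assumes the dates come back in the M/D/YYYY form shown above.

```typescript
// Hypothetical record shape: an id plus a filing date in M/D/YYYY form
interface FilingRecord {
    id: number;
    filingDate: string;
}

// Turn 'M/D/YYYY' into a sortable number such as 20200721
function toSortableDate(filingDate: string): number {
    const [month, day, year] = filingDate.split('/').map(Number);
    return year * 10000 + month * 100 + day;
}

// Return every record filed earlier than the record just before it (ordered by id)
function findOutOfOrder(records: FilingRecord[]): FilingRecord[] {
    const sorted = records.slice().sort((a, b) => a.id - b.id);
    return sorted.filter((record, index) =>
        index > 0 &&
        toSortableDate(record.filingDate) < toSortableDate(sorted[index - 1].filingDate));
}

// Illustrative ids; the dates are the two mentioned above
const sample: FilingRecord[] = [
    { id: 1000, filingDate: '7/21/2020' },
    { id: 1010, filingDate: '6/24/2020' },
];

console.log(findOutOfOrder(sample));
```

If that list comes back non-empty, a bigger id does not mean a newer business, which is a good hint that the id isn't the one to increment.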

I’m not exactly sure what is happening behind the scenes that makes entries land in the database in that order. In any case, I soon realized that something wasn’t matching up.

I wanted to call directly to this details page but for that I needed to get the database id somehow. Fortunately, North Carolina has a way to search by Sos id.

Search by Sos id

The resulting HTML looks like this:

HTML for sos id search

Because I’m searching by Sos id, it only returned one result. I just grabbed and parsed this anchor tag to pluck out the database id from that ShowProfile function. That makes two requests: one to get the database id, and another to use that database id to get the business details.

The code

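// Imports assumed from the calls below: axios for HTTP, cheerio for HTML parsing,
// and the form-data package (its getHeaders() is used on the requests)
import axios from 'axios';
import * as cheerio from 'cheerio';
import FormData from 'form-data';

// Assumed helper (not shown in the original post): resolves after the given number of milliseconds
function timeout(ms: number) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

(async () => {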
    const startingSosId = 2011748;

    // Increment by large amounts so we can find the most recently registered businesses
    for (let i = 0; i < 5000; i += 100) {
        // We use the query post to get the database id
        const databaseId = await getDatabaseId(startingSosId + i);

        // With the database id we can just POST directly to the details endpoint
        if (databaseId) {
            await getBusinessDetails(databaseId);
        }

        // Good neighbor timeout
        await timeout(1000);
    }
})();

This is the base of my scraping code. It showcases how I’m incrementing by larger jumps to quickly determine where the end is. I go out and get the database id and then use that to get the business details.
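If you want to pin down the end more precisely, one option is to jump forward in big steps until a lookup misses and then binary search between the last hit and the first miss. This is a sketch and not the code from the demo repo. `findLastRegisteredId` and the `exists` callback are hypothetical names, and it assumes Sos ids are handed out without gaps, which may not hold on the real site.

```typescript
// Hypothetical stand-in for "does a search for this Sos id return a business?"
type ExistsCheck = (sosId: number) => Promise<boolean>;

async function findLastRegisteredId(
    start: number,
    step: number,
    maxSteps: number,
    exists: ExistsCheck
): Promise<number | null> {
    if (!(await exists(start))) {
        return null;
    }

    // Coarse pass: jump forward by `step` until a lookup misses
    let lastHit = start;
    let firstMiss: number | null = null;
    for (let i = 1; i <= maxSteps; i++) {
        const candidate = start + i * step;
        if (await exists(candidate)) {
            lastHit = candidate;
        } else {
            firstMiss = candidate;
            break;
        }
    }

    // Never ran past the end within maxSteps, so report the furthest hit
    if (firstMiss === null) {
        return lastHit;
    }

    // Fine pass: binary search between the last hit and the first miss
    let low = lastHit;
    let high = firstMiss;
    while (high - low > 1) {
        const mid = Math.floor((low + high) / 2);
        if (await exists(mid)) {
            low = mid;
        } else {
            high = mid;
        }
    }

    return low;
}

// Usage with a fake lookup where every id up to 2012345 exists (made-up cutoff)
findLastRegisteredId(2011748, 100, 50, async (id) => id <= 2012345)
    .then((lastId) => console.log('Last registered Sos id:', lastId));
```

Against the real site, `exists` would wrap the Sos id search and should keep the good neighbor timeout between requests.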

async function getDatabaseId(sosId: number) {
    const url = 'https://www.sosnc.gov/online_services/search/Business_Registration_Results';

    const formData = new FormData();
    formData.append('SearchCriteria', sosId.toString());
    formData.append('__RequestVerificationToken', 'qnPxLQeaFPiEj4f1so7zWF8e5pTwiW0Ur8A0qkiK_45A_3TL__wTjYlmaBmvWvYJVd2GiFppbLB39eD0F6bmbEUFsQc1');
    formData.append('CorpSearchType', 'CORPORATION');
    formData.append('EntityType', 'ORGANIZATION');
    formData.append('Words', 'SOSID');

    const axiosResponse = await axios.post(url, formData,
        {
            headers: formData.getHeaders()
        });

    const $ = cheerio.load(axiosResponse.data);

    const onclickAttrib = $('.double tbody tr td a').attr('onclick');
    if (onclickAttrib) {
        const databaseId = onclickAttrib.split("ShowProfile('")[1].replace("')", '');

        return databaseId;
    }
    else {
        console.log('No business found for SosId', sosId);
        return null;
    }
}

Getting the database id looks like this. It simply selects the anchor tag shown above and parses the function call to grab the database id.
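The split-and-replace approach works as long as the attribute is exactly the ShowProfile call. A slightly more defensive variation is a regular expression that returns null when the pattern isn’t there. This is a sketch under the assumption that the onclick value looks like `ShowProfile('16199332')`, and `extractDatabaseId` is a name I made up for it.

```typescript
// Pull the id out of an onclick value such as "ShowProfile('16199332')".
// Returns null when the attribute is missing or doesn't contain that call.
function extractDatabaseId(onclickAttrib: string | undefined): string | null {
    if (!onclickAttrib) {
        return null;
    }

    const match = onclickAttrib.match(/ShowProfile\('([^']+)'\)/);

    return match ? match[1] : null;
}

console.log(extractDatabaseId("ShowProfile('16199332')"));
console.log(extractDatabaseId('somethingElse()'));
```

The difference from the split version is that any extra text after the call is ignored, and an unexpected attribute gives back null instead of throwing.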

The most enjoyable part was working on the business details. This section had a lot of the data that I wanted, but the fields weren’t always in the same order, and a company didn’t always have the same fields.

North Carolina Secretary of State information fields

So I used a trick I’ve used before where I just loop through all of the elements in this section, get the text from the label section, and put the value where it needs to go based on that label.

const informationFields = $('.printFloatLeft section:nth-of-type(2) div:nth-of-type(1) span');

for (let i = 0; i < informationFields.length; i++) {
    if (informationFields[i].attribs.class === 'greenLabel') {
        // This is kind of perverting cheerio objects
        const label = informationFields[i].children[0].data.trim();
        const value = informationFields[i + 1].children[0].data.trim();

        switch (label) {
            case 'SosId:':
                business.sosId = value;
                break;
            case 'Citizenship:':
                business.citizenShip = value;
                break;
            case 'Status:':
                business.status = value;
                break;
            case 'Date Formed:':
                business.filingDate = value;
                break;
            default:
                break;
        }
    }
}

I had to do a little of what is almost abuse of cheerio’s normally very easy API. The problem is that, as you can see at the top, I’m selecting all the spans in this information section. I needed to loop through each one, and I couldn’t find a way to access the text() function without using a proper css selector. For example, $('something').text() is easy. But as I looped I didn’t want to select any further. I wanted that element. And that’s why I ended up with children[0].data.
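The label-then-value idea doesn’t depend on cheerio at all, so here it is on its own with a lookup table in place of the switch. Treat it as a sketch: `SpanInfo`, `labelToField`, and `mapLabeledFields` are hypothetical names, and it assumes each span has already been flattened to its class name and text. The sample values are illustrative.

```typescript
interface Business {
    sosId?: string;
    citizenShip?: string;
    status?: string;
    filingDate?: string;
}

// Hypothetical flattened form of a span: its class attribute and its text
interface SpanInfo {
    className: string;
    text: string;
}

// Label text on the page mapped to the property it should fill
const labelToField = new Map<string, keyof Business>([
    ['SosId:', 'sosId'],
    ['Citizenship:', 'citizenShip'],
    ['Status:', 'status'],
    ['Date Formed:', 'filingDate'],
]);

function mapLabeledFields(spans: SpanInfo[]): Business {
    const business: Business = {};

    // Stop one short of the end so a trailing label never reads past the list
    for (let i = 0; i < spans.length - 1; i++) {
        if (spans[i].className !== 'greenLabel') {
            continue;
        }

        const field = labelToField.get(spans[i].text.trim());
        if (field) {
            // The value is the span right after the label
            business[field] = spans[i + 1].text.trim();
        }
    }

    return business;
}

// Order doesn't matter and unknown labels are skipped
console.log(mapLabeledFields([
    { className: 'greenLabel', text: 'Status:' },
    { className: '', text: 'Current-Active' },
    { className: 'greenLabel', text: 'SosId:' },
    { className: '', text: '2006637' },
]));
```

Adding another field then only means adding one row to the table.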

Here’s the full function:

async function getBusinessDetails(databaseId: string) {
    const url = 'https://www.sosnc.gov/online_services/search/_Business_Registration_profile';

    const formData = new FormData();
    formData.append('Id', databaseId);
    const axiosResponse = await axios.post(url, formData,
        {
            headers: formData.getHeaders()
        });

    const $ = cheerio.load(axiosResponse.data);

    const business: any = {
        businessId: databaseId
    };

    business.title = $('.printFloatLeft section:nth-of-type(1) div:nth-of-type(1) span:nth-of-type(2)').text();
    if (business.title) {
        business.title = business.title.replace(/\n/g, '').trim();
    }
    else {
        console.log('No business title found. Likely no business here', databaseId);
        return;
    }
    const informationFields = $('.printFloatLeft section:nth-of-type(2) div:nth-of-type(1) span');

    for (let i = 0; i < informationFields.length; i++) {
        if (informationFields[i].attribs.class === 'greenLabel') {
            // This is kind of perverting cheerio objects
            const label = informationFields[i].children[0].data.trim();
            const value = informationFields[i + 1].children[0].data.trim();

            switch (label) {
                case 'SosId:':
                    business.sosId = value;
                    break;
                case 'Citizenship:':
                    business.citizenShip = value;
                    break;
                case 'Status:':
                    business.status = value;
                    break;
                case 'Date Formed:':
                    business.filingDate = value;
                    break;
                default:
                    break;
            }
        }
    }

    console.log('business', business);
} 

And…that’s it! It turned out pretty nice.

Looking for business leads?

Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome web data. Learn more at Cobalt Intelligence!

The post Jordan Scrapes Secretary of State: North Carolina appeared first on JavaScript Web Scraping Guy.