DEV Community

restyler
restyler

Posted on

How to scrape Instagram followers with Node.js, put results to MySQL, and discover micro-influencers

Scraping Instagram is not easy nowadays (writing this in 2021). Instagram is really strict and getting fair amount of data from it is cumbersome - datacenter-located proxies are all banned in seconds, and accounts are flagged and get captchas. At the same time Instagram data can be very valuable for data mining. In my case, I recently needed to discover quite a big amount of IG micro-influencers to help promote some small brand across Instagram.

Who is micro-influencer?

It depends, but my definition of IG micro-influencer is: public profile of physical person, with 3k-100k followers, alive, with fresh content, and with fair amount of likes and comments on recent posts (at least 150 likes on recent posts). Discovering such profiles in certain niche can be very time consuming, so let's try to automate this!
We will scrape followers of some bigger brand competitor (let's say, we can scrape coffee brand followers base and then extract good life-style bloggers from the list). Another possible approach would be to scrape some hashtag like #latte or #arabica but this way you will probably get many coffee brands, not lifestyle bloggers - but it really depends.

I will use Node.js and cloud proxy to make the task easier.

We will also use MySQL to store all the data, and knex package to write data to the database.
Why MySQL and not csv? I prefer to put complete JSON of Instagram response to storage, because you never know which piece of JSON data you might need. And MySQL is now pretty convenient for JSON querying and extraction.

Let's go!

mkdir followers_scraper
cd followers_scraper
npm init

#fill in all fields or just click ENTER several times

npm i dotenv knex mysql2 node-fetch
Enter fullscreen mode Exit fullscreen mode

Libraries we are going to use:

  1. knex - to write data to MySQL
  2. dotenv - to load config from .env files
  3. mysql2 - knex driver
  4. node-fetch - to make HTTP requests to Instagram

Our scraper needs to have two steps:

  1. Scrape the list of followers and put them to MySQL
  2. Go through each public profile from the list and scrape amount of followers for this profile, and get the latest posts of the account to calculate engagement metrics. We will later use this data to decide if the profile might be a good micro-influencer.

Database structure design

(Create MySQL database, and execute this in your SQL editor - I use Sequel Ace on my Mac for this)
We will need 2 tables: jobs and ig_profiles.
ep2_ field prefix stands for step #2 (enrichment).
jobs.cursor field will hold Instgram pagination cursor between the requests. This will allow to proceed with followers scraping even if we kill the long-running step1 script in the middle of the scraping process.

SET NAMES utf8mb4;

CREATE TABLE `jobs` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `type` varchar(200) DEFAULT NULL,
  `userId` int(11) DEFAULT NULL,
  `input` json DEFAULT NULL,
  `cursor` varchar(255) DEFAULT NULL,
  `lastError` text,
  `reqNum` int(11) DEFAULT '0',
  `ep2_reqNum` int(11) DEFAULT NULL,
  `status` varchar(100) DEFAULT 'created',
  `ep2_status` varchar(100) DEFAULT NULL,
  `lastErrorAt` timestamp NULL DEFAULT NULL,
  `finishedAt` timestamp NULL DEFAULT NULL,
  `createdAt` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  `pausedAt` timestamp NULL DEFAULT NULL,
  `updatedAt` timestamp NULL DEFAULT NULL,
  `ep2_pausedAt` timestamp NULL DEFAULT NULL,
  `ep2_updatedAt` timestamp NULL DEFAULT NULL
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

CREATE TABLE `ig_profiles` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `pk` varchar(200) DEFAULT NULL,
  `jobId` int(10) unsigned DEFAULT NULL,
  `username` varchar(300) DEFAULT NULL,
  `isPrivate` tinyint(1) DEFAULT NULL,
  `isBusiness` tinyint(1) DEFAULT NULL,
  `ep2_isDone` tinyint(1) NOT NULL DEFAULT '0',
  `isVerified` tinyint(1) DEFAULT NULL,
  `followerCount` bigint(20) DEFAULT NULL,
  `followingCount` bigint(20) DEFAULT NULL,
  `fullName` varchar(300) DEFAULT NULL,
  `email` varchar(200) DEFAULT NULL,
  `anonEmail` varchar(200) DEFAULT NULL,
  `phone` varchar(200) DEFAULT NULL,
  `category` varchar(200) DEFAULT NULL,
  `data` json DEFAULT NULL,
  `anonData` json DEFAULT NULL,
  `ep2_doneAt` timestamp NULL DEFAULT NULL,
  `createdAt` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  `updatedAt` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `jobId_4` (`jobId`,`username`),
  KEY `jobId` (`jobId`),
  KEY `jobId_2` (`jobId`,`ep2_isDone`),
  KEY `jobId_3` (`jobId`,`username`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Enter fullscreen mode Exit fullscreen mode

Set up Cloud proxy

Subscribe to Cloud proxy here:
https://rapidapi.com/restyler/api/instagram40

This proxy API handles heavy task of rotating proxies and doing retries, transparently - so we don't have to deal with it in our own code.
Consider using plan with right amount of requests you might need for your scraping. Simple calculation for a 50k profile scraping:
we need ~500 requests for /followers endpoint (100 followers are usually returned for each request) and ~25000 requests for /account-info endpoint to get public profiles information.

Main files

Now, let's create 5 files in root directory of our project:

  1. .env (holds our config variables)
  2. knexfile.js (mysql config)
  3. step1.js (retrieves the list of followers)
  4. step2.js (retrieves more info on each public profile)
  5. utils.js (for reusable functions)

.env

This file will hold all the environment variables for our project

# get your API key on https://rapidapi.com/restyler/api/instagram40
RAPIDAPI_KEY=YOUR-PROXY-KEY

DB_HOST=127.0.0.1
DB_NAME=ig_scraper_demo
DB_USERNAME=ig_scraper_demo
DB_PASSWORD=ig_scraper_demo
Enter fullscreen mode Exit fullscreen mode

utils.js:

Here we will store basic functions to work with jobs.

const { production } = require('./knexfile')
const knex = require('knex')(production);


const JOB_STATUSES = {
    CREATED: 'created',
    RUNNING: 'running',
    PAUSED: 'paused',
    FINISHED: 'finished',
    FAILED: 'failed'
}

const sleep = (ms) => {
    return new Promise(resolve => setTimeout(resolve, ms));
}

const createJob = async ( { type, userId, input, status} ) => {
    let createdAt = new Date();
    return await knex('jobs').insert({
        type,
        userId,
        status,
        reqNum: 0,
        input,
        createdAt,
        updatedAt: createdAt
    })
}

const updateJob = async ( { id, cursor, status,reqNum, lastError, lastErrorAt, finishedAt, pausedAt, 
    ep2_status, ep2_pausedAt, ep2_updatedAt, ep2_reqNum } ) => {
    let updatedAt = new Date();
    await knex('jobs').update( { cursor, status, reqNum, lastError, lastErrorAt, updatedAt, finishedAt, pausedAt, 
        ep2_status, ep2_pausedAt, ep2_updatedAt, ep2_reqNum
      } ).where({ id })
}

let getJobByID = async ( id ) => {
    return await knex('jobs').select('*').where({id}).first()
}

module.exports = {

    JOB_STATUSES,
    sleep,
    createJob, 
    updateJob,
    getJobByID,

}
Enter fullscreen mode Exit fullscreen mode

knexfile.js

We will use dotenv library which loads our config from .env file

require('dotenv').config()

module.exports = {
  production: {
    client: 'mysql2',
    connection: {
      host: process.env.DB_HOST,
      timezone     : 'Z',
      database: process.env.DB_NAME,
      user: process.env.DB_USERNAME,
      password: process.env.DB_PASSWORD
    },
    migrations: {
      directory: __dirname + '/db/migrations'
    },
    pool: {
      min: 2,
      max: 10,
    }
  },
  development: {
    client: 'mysql2',
    connection: {
      host: process.env.DB_HOST,
      timezone     : 'Z',
      database: process.env.DB_NAME,
      user: process.env.DB_USERNAME,
      password: process.env.DB_PASSWORD
    },
    migrations: {
      directory: __dirname + '/db/migrations'
    },
    pool: {
      min: 2,
      max: 10,
    }
  }





}
Enter fullscreen mode Exit fullscreen mode

step1.js

Now, let's proceed with step1.js.
Note how we use cleanUpServer() function so we can conveniently stop our scraping script (which will be launched from terminal) and it will update job state in mysql before script exits.

const { production } = require('./knexfile')

const knex = require('knex')(production);

// dotenv was already run inside knexfile but we launch it again for consistency
require('dotenv').config()

const fetch = require('node-fetch')

const igutils = require('./utils')


let JOB_ID = process.env.JOB_ID
console.log('Step1, JOB ID:', JOB_ID)
let ACCOUNT = process.env.ACCOUNT
if (!ACCOUNT && !JOB_ID) throw new Error('Specify either ACCOUNT or JOB_ID env')
console.log('Step1, ACCOUNT to scrape:', ACCOUNT)
let LIMIT = process.env.LIMIT || 300




;(async () => {

let cleanUpServer = (job, eventType) => {
  console.log('updating job status before exit.. job id:', job.id, eventType)

  //if (process.stderr)
  igutils.updateJob({ id: job.id, status: igutils.JOB_STATUSES.PAUSED, pausedAt: (new Date) }).then(() => {
    process.exit(0)
  }).catch((e) => {
    console.error(e)
    process.exit(1)
  })

}


let rapidGetAccountInfoByUsername = async (account) => {
  let defaultIgHeaders = { 'x-rapidapi-key': process.env.RAPIDAPI_KEY }
  let res = await fetch('https://instagram40.p.rapidapi.com/account-info?wrap=1&username=' + encodeURIComponent(account), { headers: defaultIgHeaders })

  let json = await res.json()


  return { res, json }
}

let rapidGetFollowersByUserId = async (account, max_id) => {
  let defaultIgHeaders = { 'x-rapidapi-key': process.env.RAPIDAPI_KEY }

  let max_id_q = max_id ? '&max_id=' + encodeURIComponent(max_id) : ''
  let res = await fetch('https://instagram40.p.rapidapi.com/followers?userid=' + encodeURIComponent(account) + max_id_q, { headers: defaultIgHeaders })

  let json = await res.json()


  return { res, json }
}





  let job, accRes, ACCOUNT_ID
  let justCreated = false
  if (!JOB_ID) {


    accRes = await rapidGetAccountInfoByUsername(ACCOUNT)

    if (!accRes.res || accRes.res.status != 200 || !accRes.json) {
      let redirectLocation = null  
      console.log('Fatal error happened:', accRes)
      if (accRes.res && accRes.res.status == 302) {
          redirectLocation = accRes.res.headers.get('location')
      }
      throw new Error('failed getAccountInfo, response status: ', accRes.res ? accRes.res.status : 0, 'loc:', redirectLocation)
    }
    ACCOUNT_ID = accRes.json.graphql.user.id
    console.log('creating job..', { ACCOUNT, LIMIT, ACCOUNT_ID  })
    job = await igutils.createJob({ type: 'followers', status: igutils.JOB_STATUSES.RUNNING, input: JSON.stringify({ ACCOUNT, LIMIT, ACCOUNT_ID  }) })
    JOB_ID = job[0]

    job = await igutils.getJobByID(JOB_ID)
    console.log('new job id: ', JOB_ID)
    justCreated = true
  } else {
    console.log('proceeding job..', { JOB_ID  })
    job = await igutils.getJobByID(JOB_ID)
    if (job.finishedAt) {
      console.log('this job was finished at:', job.finishedAt, process.env.IGNORE_FINISHED ? 'proceeding' : ' exiting... pass IGNORE_FINISHED=1 to ignore')

      if (!process.env.IGNORE_FINISHED) {
        process.exit(0)
      }
    }
    ACCOUNT_ID = job.input.ACCOUNT_ID

    igutils.updateJob({ id: job.id, status: igutils.JOB_STATUSES.RUNNING })
  }
  process.on('unhandledRejection', (up) => { 
    console.error(up)
    throw up;
  });

  [`SIGINT`, `SIGUSR1`, `SIGUSR2`, `uncaughtException`, `SIGTERM`].forEach((eventType) => {
    process.on(eventType, cleanUpServer.bind(null, job, eventType))
  })

  console.log(`account id: ${ACCOUNT_ID}`)
  //let res = await fetch(`https://www.instagram.com/${username}/?__a=1`)

  console.log('starting...')
  //res = await res.json()
  let userId = ACCOUNT_ID

  let after = null, has_next = true, followersCollected = 0

  if (!justCreated) {
    after = job.cursor
    has_next = job.cursor
  }

  while (has_next) {

      let start = Date.now()
      let res, json, errMsg

      try {

        let maxAttempts = 3
        do {
          if (maxAttempts < 3) {
            await igutils.sleep(10000)
          }
          maxAttempts--
          //try {
            ({ res, json } = await rapidGetFollowersByUserId(userId, after))
            console.log('res timing:', Date.now() - start, 'ms')
            if (res.status != 200) {
              console.log('res status: ', res.status, 'location:', res.headers.get('location'))
            }
          //} catch (e) {
            //console.error('catched exception on followers req:')
            //console.error(e)
          //}

        } while (res.status != 200 && maxAttempts != 0 )

        if (res.status != 200) {
            //text = await res.text()
            console.error('invalid response status during followers request:', res.status, 'text:', json)
            throw new Error('invalid response status during followers request:' + res.status)
        }

      } catch (e) {
        console.error(e)
        //errMsg = e.message
      }

      has_next = json.next_max_id
      after = json.next_max_id
      let createdAt = new Date
      await knex('ig_profiles').insert(json.users.map((u) => {
        return {
          jobId: JOB_ID,
          pk: u.pk,
          username: u.username,
          isPrivate: u.is_private,
          fullName: u.full_name,
          isVerified: u.is_verified,
          createdAt
        }
      }))

      await igutils.updateJob({ id: JOB_ID, cursor: has_next ? has_next : '', reqNum: ++job.reqNum })

    followersCollected += json.users.length

    console.log(`collected ${followersCollected} followers`)
    if (followersCollected > LIMIT) {
      console.log('exiting due to hitting limit.')
      has_next = false
    } else {
      let additionalDelay = Math.min(followersCollected*30, 20000)
      let delay = 2000 + Math.random()*8000 + additionalDelay
      console.log(`sleeping ${Math.ceil(delay/1000)} seconds, added delay: ${(additionalDelay/1000).toFixed(2)}s`)
      await igutils.sleep(delay)
    }

  }

  await igutils.updateJob({ id: JOB_ID, status: igutils.JOB_STATUSES.FINISHED,  finishedAt: (new Date) })
  console.log('Followers collected:', followersCollected)





  process.exit(0)

})()

Enter fullscreen mode Exit fullscreen mode

Perfect! We can now run step1.js from console:

ACCOUNT=coffeebrand LIMIT=5000 node step1.js
Enter fullscreen mode Exit fullscreen mode

and it will start scraping 5000 followers from the coffeebrand account.

If we need to stop the script we can hit ctrl+C (or cmd+C on Mac) and note how jobs table will hold the status of scraping job.

step2.js

After we've scraped the list of followers let's scrape each public profile for more info!

const { production } = require('./knexfile')

const knex = require('knex')(production);

// dotenv was already run inside knexfile but we launch it again for consistency
require('dotenv').config()

const fetch = require('node-fetch')

const igutils = require('./utils')


let JOB_ID = process.env.JOB_ID
if (!JOB_ID) {
    throw new Error('Specify JOB_ID')
}
let LIMIT = process.env.LIMIT || 200
console.log('Step2, JOB ID:', JOB_ID)

let rapidGetAccountInfoByUsername = async (account) => {
    let defaultIgHeaders = { 'x-rapidapi-key': process.env.RAPIDAPI_KEY }
    let res = await fetch('https://instagram40.p.rapidapi.com/account-info?username=' + encodeURIComponent(account), { headers: defaultIgHeaders })

    let json = await res.json()

    //await fs.writeFile('account.json', JSON.stringify(data))
    //let data = JSON.parse(await fs.readFile('account.json'))

    return { res, json }
}

;(async () => {

    job = await igutils.getJobByID(JOB_ID)

    igutils.updateJob({ id: JOB_ID, ep2_status: igutils.JOB_STATUSES.RUNNING, ep2_pausedAt: null })


    let cleanUpServer = (job, eventType) => {
      console.log('updating job status before exit.. job id:', job.id, eventType)

      //if (process.stderr)
      igutils.updateJob({ id: job.id, ep2_status: igutils.JOB_STATUSES.PAUSED, ep2_pausedAt: (new Date) }).then(() => {
          process.exit(0)
      }).catch((e) => {
          console.error(e)
          process.exit(1)
      })

    }  


    process.on('unhandledRejection', (up) => { 
        console.error(up)
        throw up;
    });

    [`SIGINT`, `SIGUSR1`, `SIGUSR2`, `uncaughtException`, `SIGTERM`].forEach((eventType) => {
        process.on(eventType, cleanUpServer.bind(null, job, eventType))
    })


    let followers
    let processed = 0
    do {
        console.log('requesting new chunk...')

        followers = await knex('ig_profiles').select('*').where({
            jobId: JOB_ID,
            isPrivate: false,
            ep2_isDone: false
        }).orderBy('createdAt', 'asc').limit(5)

        console.log('followers to process (non-private)', followers.length)

        for (let follower of followers) {
            let accRes = await rapidGetAccountInfoByUsername(follower.username)

            console.log('checking account:', follower.username, 'business:', accRes.json.is_business_account)

            let u = accRes.json

            await knex('ig_profiles').where({id: follower.id}).update({
                ep2_isDone: true,
                ep2_doneAt: (new Date), 
                isBusiness: accRes.json.is_business_account,
                anonData: JSON.stringify(accRes.json),
                followingCount: u.edge_follow ? u.edge_follow.count : 0,
                followerCount: u.edge_followed_by ? u.edge_followed_by.count : 0
            })


            await igutils.updateJob({ id: JOB_ID, ep2_reqNum: ++job.ep2_reqNum, ep2_updatedAt: (new Date) })

            processed++

        }

    } while (followers.length && processed <= LIMIT)

    await igutils.updateJob({ id: JOB_ID, ep2_status: igutils.JOB_STATUSES.FINISHED, ep2_updatedAt: (new Date) })
    process.exit(0)
})()
Enter fullscreen mode Exit fullscreen mode

We can now launch second script!
JOB_ID=xxx node step2.js and it will get all the required information!

Extracting micro-influencers

Okay, now we have our database filled in with scraped profiles data. Most of information is put to anonData JSON field, and luckily MySQL has tools to extract JSON fields and we can filter out profiles by various parameters!
https://dev.mysql.com/doc/refman/5.7/en/json-function-reference.html

Before you run SQL report...

replace jobId=1 with your jobId
also, consider adapting followerCount > 3000
and (second_post_likes+second_post_comments)>150
condition for your use case.

I recommend analyzing "second_post" and not the latest one because the latest one can be posted just seconds ago and might not get all the likes and comments yet so you will end up filtering out good profiles this way.


select concat("https://instagram.com/", b.username) as link, external_url, isBusiness, followerCount, email, username, fullName, bio, post_count, second_post_likes, second_post_comments, second_post_days_ago from (select ig_profiles.*,anonData->"$.biography" as bio, anonData->>"$.external_url" as external_url, 
anonData->>"$.edge_owner_to_timeline_media.count" as post_count,  
anonData->>"$.edge_owner_to_timeline_media.edges[1].node.edge_liked_by.count" as second_post_likes,
anonData->>"$.edge_owner_to_timeline_media.edges[1].node.edge_media_to_comment.count" as second_post_comments,
FROM_UNIXTIME(anonData->>"$.edge_owner_to_timeline_media.edges[1].node.taken_at_timestamp") as second_post_time,

ROUND((UNIX_TIMESTAMP() - anonData->>"$.edge_owner_to_timeline_media.edges[1].node.taken_at_timestamp")/(60*60*24)) as second_post_days_ago
from ig_profiles where jobId=1 ) b where followerCount > 3000
and (second_post_likes+second_post_comments)>150
and second_post_days_ago<200 order by followerCount desc
Enter fullscreen mode Exit fullscreen mode

That's it! In case you are having troubles launching the code, try source code of this tutorial:
https://github.com/restyler/ig_scraper
and let me know in comments!

Video tutorial:

Discussion (0)