
Ricardo A. Mercado


How I made a web scraper because LinkedIn

Having lots of LinkedIn connections can be convenient. You and each connection agreed to be connected through the platform, and in doing so you share some public information, including your email (though in most cases you can choose not to). This is all nice and dandy until you actually want to use the data you have from your connections... depending on what data you want.

Problem

Let's say you want to export all of your connections' data from LinkedIn. You can do this by following their instructions found here. It generates a CSV file containing the following information for each connection:
First Name, Last Name, Email Address, Company, Position, Connected On

So what's the issue here? Well, even though it gives you an Email Address column in the CSV, it doesn't really provide any of your connections' emails! I guess they used to provide it and never updated the export CSV to remove that column. I also checked out their public API and found nothing related to your connections' emails, but I did find this StackOverflow discussion, which indicated that they did in fact use to provide that info but no longer do. WTF LinkedIn? So I decided to just scrape all of my connections' emails. I mean, I can access them manually, but it would take a shitload of time to get the emails of all my 2000+ connections.
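To see how empty that column really is, a quick check along these lines can help. This is a minimal Python sketch and not part of my script (which uses NightmareJS). It assumes the export starts with the header row listed above, and the sample rows are made up for illustration.

```python
import csv
import io


def count_emails(csv_text):
    """Return (rows_with_email, total_rows) for a connections export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0
    with_email = 0
    for row in reader:
        total += 1
        # "Email Address" is the column name from the export header above.
        if (row.get("Email Address") or "").strip():
            with_email += 1
    return with_email, total


# Made-up sample data, for illustration only.
SAMPLE = """First Name,Last Name,Email Address,Company,Position,Connected On
Ada,Lovelace,,Analytical Engines,Engineer,01 Jan 2018
Alan,Turing,alan@example.com,Bletchley,Researcher,02 Feb 2018
Grace,Hopper,,Navy,Rear Admiral,03 Mar 2018
"""

filled, total = count_emails(SAMPLE)
print(f"{filled} of {total} connections have an email in the export")
```

If a real export carries extra lines before the header, they would need to be skipped first; that detail is not covered here.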

Solution

What did I need the script to do to achieve this? Well, first I needed it to log in, then search for the connection's name, enter its profile page, and get the email. Simple... right?

1st Attempt

Using LinkedIn's search input, getting the emails worked until they semi-blocked my account for suspicious behavior due to too many search requests. This was about 500 connections in.

2nd Attempt

Maybe I just had to be more careful with the number of searches in a given amount of time. So I added an option to set an interval (defaults to 1 hour) and another to set the number of emails to look up per interval (defaults to 50).
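The throttling idea can be sketched roughly like this. It is a Python illustration of batching with a pause between batches, not the code from my NightmareJS script; the names `run_in_batches`, `lookup`, and `sleep` are hypothetical, and `lookup` stands in for whatever actually fetches one email.

```python
import time


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(names, lookup, batch_size=50, interval_seconds=3600,
                   sleep=time.sleep):
    """Call `lookup` for every name, pausing between batches.

    `sleep` is injectable so the pause can be replaced in a dry run.
    """
    results = []
    chunks = list(batches(names, batch_size))
    for index, chunk in enumerate(chunks):
        for name in chunk:
            results.append(lookup(name))
        # No need to wait after the last batch.
        if index < len(chunks) - 1:
            sleep(interval_seconds)
    return results


# Dry run: a fake lookup, and a fake sleep that only records the waits.
waits = []
names = [f"connection-{i}" for i in range(120)]
found = run_in_batches(names, lookup=str.upper, sleep=waits.append)
print(len(found), "lookups,", len(waits), "pauses")
```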

LinkedIn are some sneaky bastards: they semi-blocked me again! I searched for information on this semi-blocking and found that it is specifically designed to keep automated bots from doing stuff on the site. Great...

3rd Attempt

I thought that maybe the search limit only applied to general searches, so let's try clicking directly on the connection when it appears on the suggestion box that appears after typing in the connection's name.

Well, turns out the library I'm using to scrape the page (NightmareJS) did not detect that DOM element, so I couldn't do anything with it. sigh....

4th and Final Attempt


After some head scratching, and some thought of just quitting the little project, I finally came up with another approach: going directly to my connections section and using the connections search input, which only searches my connections. And this finally worked, with no search limit!!

After all the emails are scraped, I just create an email.txt file with all of them in it. And that was it!
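That last step could look something like the following. Again, this is a Python sketch rather than the actual script; the function name `write_emails` is hypothetical, and dropping blanks and duplicates is an extra I am assuming here, not something the script is known to do.

```python
import tempfile
from pathlib import Path


def write_emails(emails, path):
    """Write one email per line to `path`; return how many were written."""
    seen = set()
    unique = []
    for email in emails:
        cleaned = email.strip()
        # Skip blanks and case-insensitive duplicates (an assumption).
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())
            unique.append(cleaned)
    text = "\n".join(unique) + ("\n" if unique else "")
    Path(path).write_text(text, encoding="utf-8")
    return len(unique)


# Demo with made-up addresses, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "email.txt"
    written = write_emails(
        ["a@example.com", " a@example.com ", "", "b@example.com"], target
    )
    print(written, target.read_text(encoding="utf-8").splitlines())
```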

TL;DR
I wanted to get all of my LinkedIn connections' emails. LinkedIn does not offer an option to retrieve them when exporting your connections data, so I created a web scraper to get them.

For anyone interested in checking out the script, you can access it here.

NOTE

If LinkedIn updates their page and changes the class of an element used in the script, it will stop working. You can check the source code, verify whether any class has changed on LinkedIn, and update the script to make it work again.

Thanks for reading!

Top comments (27)

Anthony Bouvier

1) Why? What use is having all of their emails? Especially 2000+ of them at once? Maybe this is why LinkedIn stopped exporting that data?

2) You know you broke the User Agreement, right?

linkedin.com/legal/user-agreement and search for "scrape".

I'm a big fan of scrapers. I've written tons of them too.

But you have to pay attention to TOS/EULAs/etc.

Ricardo A. Mercado

1) Just because you can't think of a use for having all of their emails doesn't mean there aren't uses for having them.

2) I guess they'll have to suspend/ban me.

Anthony Bouvier

1) I didn't say there aren't uses. I asked what yours was. Since we're having a technical discussion, I figured the typical "why am I doing this" would be a good part of the back and forth. As you mentioned in the article you think they used to export this info, but stopped. So maybe this is a time to step back and say "should I?". Also a healthy part of the discussion.

2) I suppose. Rather, I think it'd be best to once again examine the possible why and note that you are purposefully breaking an agreement you signed up for. For a fun comparison, what are the terms of service or user agreement used by AccountBerry? Do you have a similar agreement that might not allow for scraping either? And what if someone did anyway? You may not notice, but what if you did because they coded in error and slammed your system?

Like I've said, I've created lots of spiders/bots/scrapers. It is fun. And there are great reasons to make them.

But a discussion of the ethics of building them to use to scrape data from sites that you agreed not to scrape is an interesting article-worthy thing to think about. Hopefully an aspiring scraper-maker reads your article and this discussion and keeps it in mind.

Ricardo A. Mercado

1) Can't be too specific, but it is for data analytics purposes. Why wouldn't they want them to be exported if I could get them by going to each connection one by one manually? The scraper basically automates that tedious process. I mean, connections agreed to share certain info, and email is just one piece of that information (they could even set it so the email is not shown).

2) I completely understand your point and I agree completely. I did break the agreement unknowingly (until you pointed it out), but there was no malicious intent. I only automated a process I am allowed to do manually. I find that if you write some code to automate a process you can do manually, then there shouldn't be any restriction on it. It's like a post I read yesterday: a person had 400 unread messages and couldn't select them all to mark them as read, so he just opened the dev tools and wrote some simple code to loop through all the messages and click them. My response "I guess they'll have to suspend/ban me." is based on the fact that what is done is done.

Maybe adding "For educational purposes" changes the whole context of what is written?

Ermir Beqiraj

Lol.. "For educational purposes" & "Don't try this at home, especially in the kitchen"

Jordan Hansen

I actually loved this. Nice article.

So do you perform a login with Nightmarejs and then just search from there?

I realize it's against the TOS, but I do believe it's still legal.

arstechnica.com/tech-policy/2017/0...

The above article says you're good legally but I believe anything behind a password is where the line is drawn. I'm not sure if that means other people's passwords (hacking their accounts?) or your own. I've taken the former approach and I think the use you are doing is a perfect example of something that would be legal. You have access to all of the data already, this just speeds it up.

Anyway, great article!

Ricardo A. Mercado • Edited

Thanks!

Yeah, you are prompted to fill in your personal LinkedIn credentials. The script logs you in and gets the emails from your personal connections. It's basically automating a process I could do manually.

lilmissblockchain

Very good article Ricardo. Thank you.

Karolis

I had a similar need a few months ago :) I created a Chrome extension to accomplish several things for me:

  1. Search for people that I would like to connect with, and connect
  2. Endorse all their skills

It was quite an interesting exercise for me as I haven't tried developing browser extensions before. Also, I have never encountered any rate limiting so I deem browser extensions to be quite safe to use.

Ricardo A. Mercado

Awesome! Is the chrome extension public?

lilmissblockchain • Edited

Sounds super interesting, would love to read a blog about this.

James Turner

FYI, it seems that LinkedIn does actually allow you to download emails via the CSV you mentioned; however, each connection must opt in to that.

LinkedIn Email Settings

Ricardo A. Mercado

Interesting! Thanks for pointing this out.

Crewxx • Edited

PLEASE SORRY FOR THE DUMB QUESTION, AS YOU KNOW NOT ALL IS TECH SAVVY, I'M JUST IN NEED OF GETTING MY CONTACT WHICH IS STRESSFUL GETTING THEM ONE AFTER THE OTHER. PLEASE I HAVE BEEN TRYING TO FIGURE OUT THE PROCESS IN MAKING THE CHANGES YOU TALKED ABOUT BUT I HAVE NO IDEA ON STEPS TO TAKE.

PLEASE KINDLY WORK ME THROUGH THE PROCESS, A DIRECTION OF WHERE TO CHECK TO CONFIRM THE LINKEDIN CHANGES AND REPLACING WOULD BE REALLY APPRECIATED.

Jan Wedel

You seem to have accidentally enabled your Caps lock...

Crewxx

Nope, not really just wanted a bold text. Any help please, thanks.

Roger K.

While I kind of agree, I also don’t agree.

I also don’t connect with people I don’t know. That has nothing to do with his behaviour; it’s just a matter of practicing harm reduction for myself. If any one of my connections (minus the recruiters, of course) were to do the same as the author, I would assume it’s a reasonable use case and be fine with it.

Fundamentally I have no issue with someone wanting the access they were granted, but if you connect with randoms then you get what you get. Maybe it’s a little like dating ;)

lilmissblockchain

Didn't every date start out as a random connection though? 🤔

ItsASine (Kayla)

Email isn't a completely unused field, though it looks like they only provide publicly available emails rather than any you're privy to as a connection.

I downloaded my 216 connections and had 1 email address (a chronic startup founder, so he wants to be seen) and 1 completely empty line other than connection date. I just reused that field as one for describing, manually, how I know them since for some awful reason LinkedIn removed the ability to tag people.

ZatriX • Edited

Hm... Firstly - thanks a lot, Ricardo!

Some code needed to be changed indeed, to account for renamed fields, but then it did start working.

The problem I'm having atm, however, is it seems to get stuck after scraping about 180 records (see screen). It gives a few errors extracting (emails exist on the profile) and then just sits there.

Any ideas?

screen

Taha Yasin KÜÇÜK

TL;DR was too long 😀

Marco Afonso

Nightmare vs cheeriojs?

Ricardo A. Mercado

Either; whichever works best for what you need to do. I used Nightmare because it was the first one that came to mind.
