Bradley Schoeneweis

Posted on Jun 2, 2021 • Edited on Dec 28, 2021 • Originally published at bradleyschoeneweis.com

Visualizing your LinkedIn Connections using Python

#python #pandas #datavisualization #networkx

Using Python's Pandas, NetworkX, and pyvis to understand and visualize companies within a directly connected LinkedIn network.

tl;dr

Goal

To understand and visualize the companies within my directly connected network on LinkedIn

Process overview

LinkedIn data sources - retrieving LinkedIn Network data from a "Get a copy of your data" CSV export
Diving into the data - exploring, cleaning, and aggregating the data with Pandas
Creating the network - creating a network graph using NetworkX
Visualization - visualizing the network with pyvis
Improving the output - cleaning up the network graph with additional filtering

Results

Hover over the nodes for more details

Python dependencies

# Python standard library
from difflib import get_close_matches

# 3rd party
import networkx as nx
import pandas as pd
from pyvis.network import Network

Recently, I was exploring my LinkedIn network to see what some of my colleagues from high school and undergrad are currently up to.

As I was scrolling through the connections page, I noticed LinkedIn gives you options to filter and searching with ease, but it doesn't really provide tools to learn about your network as a whole.

So I decided to see if there was an easy way to export my network data to see what I could do with a few hours of exploring the data.

LinkedIn data sources

My first thought was to checkout out the LinkedIn's Developer API.

Something I do fairly frequently at my current job is integrating various 3rd-party REST APIs into our platform, so I wanted to see all the functionality and possibilities that this API would provide.

After reading through some documentation, I decided this wasn't a direction I wanted to pursue. Most of their developer products require approval, so I decided to look into other options.

Another thought I had was to write a quick scraping script to pull down the HTML of my connections page and parse out names and companies, but I assumed there had to be a more simple way to get this data.

Finally, after a bit of research, I found that there are various "Get a copy of your data" reports that you can run within LinkedIn. In order to get to these reports, you can do the following:

On the homepage toolbar, click the Me dropdown
Under the Account section, click Settings & Privacy
Click on Get a copy of your data, and you can view the various reports
Select the reports you're interested in, for this, I just checked Connections

After requesting the report, it should only take a few minutes before you get an email saying your report is ready for export.

Diving into the data

To reiterate our goal, we want to get a broad understanding of the companies within the first layer of our network (direct connections). Now, let's load up Python and learn more about this data in this CSV.

Reading in the data

Once the CSV is downloaded, we can open it up with Pandas and take a look (output will be commented below).

import pandas as pd

# We want to skip the first three rows because of Notes at the top
df = pd.read_csv('Connections.csv', skiprows=3)

df.columns
# ['First Name', 'Last Name', 'Email Address', 'Company', 'Position', 'Connected On',]

df.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 376 entries, 0 to 375
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   First Name     375 non-null    object
 1   Last Name      375 non-null    object
 2   Email Address  1 non-null      object
 3   Company        371 non-null    object
 4   Position       371 non-null    object
 5   Connected On   376 non-null    object
dtypes: object(6)
memory usage: 17.8+ KB
"""

I won't post the name's of any individuals or full rows to respect the privacy of my connections, but when I searched through the my Connections CSV, I noticed a few initial patterns that would help clean up the data.

Cleaning up the data

At first glance, the first thing I notice is connections who don't list a current company, so let's get rid of those.

df = df[df['Company'].notna()].sort_values(by='Company')

After sorting, another thing I noticed was that some of these company names belong to the same company, but the individuals wrote them differently.

An example of this is 'IBM Global Solution Center' and 'IBM'; for our purposes, these should both be classified as IBM.

Let's run through a fuzzy match run using difflib's get_close_matches to try and bucket some of these similar company names.

from difflib import get_close_matches

companies = df['Company'].drop_duplicates()

# cutoff=0.7 is a similarity ranking, and n=10 just takes the top 10 values
similar_companies = {x: get_close_matches(x, companies, n=10, cutoff=0.7)
                     for x in companies}

# We are only interested in the entries that had another match
similar_companies = {x: [name for name in y if name != x]
                     for x, y in similar_companies.items() if len(y) > 1}

Now, this solution is not perfect, but it will help draw out some similar companies. You should still run a manual inspection of the data (the IBM example I gave above is one that doesn't show up in the fuzzy match results).

Based upon the results, let's group together some of the companies that had matches.

df['Company'] = df['Company'].replace({
    'KPMG US': similar_companies['KPMG US'],
    'Self-employed': similar_companies['Self-employed'],
    'IBM Global Solution Center': 'IBM',
})

The next thing you may have noticed is that in our similar_companies dictionary, we cleaned up a Self-employed entry.

To stay aligned with our goal, let's drop these entries, as well as your current company.

companies_to_drop = ['self employed', 'your current company']
df = df[~df['Company'].str.lower().isin(companies_to_drop)]

Aggregating the data

Now that our data is cleaned up a bit, let's aggregate and sum the number of connections for each of the companies.

df_company_counts = df['Company'].value_counts().reset_index()
df_company_counts.columns = ['Company', 'Count']  # For ease of understanding
df_company_counts = df_company_counts.sort_values(by='Count', ascending=False)

Creating the network

We have the numbers we want for each company, now let's jump into using NetworkX to recreate a network.

The first step will be to initialize our graph, and add yourself as the central node, as it is your network.

import networkx as nx

G = nx.Graph()
G.add_node('Me')

Then, we'll loop through our df_company_counts DataFrame and add each company as a node.

You'll notice some HTML tags in the title below, this is just to make it more readable for later

for _, row in df_company_counts.iterrows():
    # The title will be for more information later on
    title = '<b>{0}</b> ({1})<br><hr>Positions:<br>'.format(row['Company'],
                                                            row['Count'])

    # In addition to the full company name, let's add each position in a
    # list to see the roles our connections have at these companies
    position_list = ''.join('<li>{}</li>'.format(x)
                            for x in df[df['Company'] == row['Company']]['Position'])
    title += '<ul>{0}</ul>'.format(position_list)

    # For ease of viewing, limit company names to 15 letters
    node_name = row['Company']
    if len(node_name) > 15:
        node_name = node_name[:15] + '...'

    # Add the node and an edge connection ourself to the new node
    G.add_node(node_name, weight=row['Count'], size=row['Count'] * 2, title=title)
    G.add_edge('Me', node_name)

And just like that, we've created our network of connections.

Visualization

Our network graph is created, so let's get into visualizing the network.

There are a few options for visualizing networks including matplotlib.pyplot, but I found that pyvis was the easiest to use for several reasons:

pyvis generates an HTML file
Customization is made very easy
The graph is interactive by default

Let's look into generating this HTML file.

from pyvis.network import Network

nt = Network('100%', '100%', bgcolor='#222222', font_color='white')
nt.from_nx(G)
nt.repulsion()  # Spaces out the nodes
nt.show('nx.html')

And it's that simple! We specify a width and height, optional styling attributes, and then we can generate the network graph visual straight from what we created with NetworkX.

Now we can see the network we generated.

You can hover over each node to see the total number of connections that work at the respective company, and below is a list of the positions held by your connections.

As you can see, this is a bit hard to read into since there are a lot of nodes. Try and imagine reading this with +1,000 connections.

Improving the output

There are a few ways that our network could be narrowed down.

Being a Software Developer, the thought that first occurred to me was to try and dial in on tech-related companies through known positions titles.

To do this, I thought of a list of buzzwords/common job titles that I've seen across LinkedIn, and filtered down the initial DataFrame.

Then, we go through the same process we did in previous sections of generating and displaying the graph.

Again, this is not perfect, but it's a good starting point.

# Filter down from a list of popular tech positions
positions = [
    'developer', 'engineer', 'ai', 'analytics', 'software', 'cloud', 'cto',
    'sde', 'sre', 'saas', 'product', 'engineering', 'scientist', 'data',
]
df = df[df['Position'].str.contains('|'.join(positions), case=False)]
df_company_counts = df['Company'].value_counts().reset_index()
df_company_counts.columns = ['Company', 'Count']
df_company_counts = df_company_counts.sort_values(by='Count', ascending=False)

# Re-initialize the graph and add the nodes/edges again
G = nx.Graph()
G.add_node('Me')

for _, row in df_company_counts.iterrows():
    title = '<b>{0}</b> ({1})<br><hr>Positions:<br>'.format(row['Company'], row['Count'])
    position_list = ''.join('<li>{}</li>'.format(x)
                            for x in df[df['Company'] == row['Company']]['Position'])
    title += '<ul>{0}</ul>'.format(position_list)
    node_name = row['Company']
    if len(node_name) > 15:
        node_name = node_name[:15] + '...'

    # Since there are less nodes, let's increase the sizes
    G.add_node(node_name, weight=row['Count'], size=row['Count'] * 5, title=title)
    G.add_edge('Me', node_name)

# Generate the visualization
nt = Network('100%', '100%', bgcolor='#222222', font_color='white')
nt.from_nx(G)
nt.repulsion()
nt.show('nx.html')

Now, let's look at the updated results.

Much better! This is more readable and easier to interact with.

And just like that, we achieved our goal of gaining a broader understanding of the companies in our LinkedIn network.

Possible improvements for those interested

Scraping the profile location of each of your connections to segment by location
Compiling a list of companies you'd like to work for/are interested in and creating a filtering system
Researching salary data for positions and gathering average pay by company

DEV Community