How to Scrape LinkedIn for the Latest Job Postings: A Step-by-Step Guide 🖥️💼

#android #kotlin #webscraping

In today's tutorial, I’m going to show you how to efficiently scrape LinkedIn for up-to-date job postings—straight from the source! 🔍 Whether you're a developer looking to automate your job search or simply want a powerful tool for staying ahead of the competition, this guide has got you covered.

What's more, the core logic behind this process is the same technology powering my app, CoverAI: AI-Powered Cover Letter Generator. 🚀

With CoverAI, not only can you instantly generate tailored cover letters, but you can also integrate your resume and analyze job postings to extract key details—saving you time while increasing your chances of landing that dream job. 🌟

If you’re ready to streamline your application process and make every job application count, check out CoverAI.

We're going to use Jsoup, a Java HTML parser that is fully compatible with Kotlin.

Setup

libs.versions.toml

[versions]
jsoup = "1.17.2"

[libraries]
jsoup = { module = "org.jsoup:jsoup", version.ref = "jsoup"}

build.gradle.kts

implementation(libs.jsoup)

Base definition

If you plan to use web scraping for other purposes as well (apart from the one discussed here), it's a good idea to put the shared logic into an interface:

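import org.jsoup.Connection
import org.jsoup.nodes.Document
interface Scraper {
    // Runs a GET request on this Connection and parses the response into a Document
    fun Connection.getPageDocument(): Document? = this.ignoreContentType(true).get()
}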

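The code above executes a GET request on a Jsoup Connection. Calling ignoreContentType(true) tells Jsoup not to reject responses whose content type it would not normally parse (by default it only accepts text/*, application/xml, and application/*+xml).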
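Connection.get() throws an IOException when the request fails (for example on a timeout or an HTTP error status), so the nullable return type above never produces null by itself. Here is a minimal sketch of one way to map such failures to null, assuming that is the behaviour you want. The name getPageDocumentOrNull is hypothetical and not part of the original interface.

```kotlin
import org.jsoup.Connection
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import java.io.IOException

// Hypothetical variant of getPageDocument (an assumption, not the author's code):
// returns null instead of throwing when the request fails.
fun Connection.getPageDocumentOrNull(): Document? =
    try {
        ignoreContentType(true).get()
    } catch (e: IOException) {
        // Covers timeouts, refused connections, and HTTP error statuses
        null
    }

fun main() {
    // Port 1 on localhost is assumed to be closed, so this request is expected to fail
    val document = Jsoup.connect("http://localhost:1/").timeout(2_000).getPageDocumentOrNull()
    println(document?.title() ?: "request failed")
}
```

Whether to swallow the exception or let it propagate is a design choice. Returning null keeps call sites short but hides the reason for the failure.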

In order to efficiently store job posting data, define an appropriate data class:

import java.util.UUID

data class LinkedInJob(
    val id: String = UUID.randomUUID().toString(), // random ID generated per instance
    val title: String,
    val company: String,
    val location: String,
    val link: String
)

Next we'll define a LinkedInJobsScraper class:

class LinkedInJobsScraper(
    private val connection: (String) -> Connection
) : Scraper {
    fun getJobs(country: String, jobTitle: String, hiringOrganization: String): List<LinkedInJob> {
        val url = constructUrl(country, jobTitle, hiringOrganization)
        // Bail out with an empty list if no document could be loaded
        val document = connection(url).getPageDocument() ?: return emptyList()
        // The rest of the body (job card extraction and the return) is shown further below
    }

    private fun constructUrl(country: String, jobTitle: String, hiringOrganization: String): String {
        // Parentheses make replace() apply to the whole URL, not only to the location part
        return ("https://www.linkedin.com/jobs/search?keywords=$jobTitle $hiringOrganization" +
                "&location=$country").replace(" ", "%20")
    }
}

The connection parameter has a function type: it is a function that takes a String URL as input and returns a Connection object. getJobs calls it to create a connection for the constructed URL and then fetches the Document.
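To illustrate what such a function could look like, here is a minimal sketch of a connection factory built on Jsoup.connect. The user agent string and the 10-second timeout are illustrative assumptions, not values from the original app.

```kotlin
import org.jsoup.Connection
import org.jsoup.Jsoup

// Hypothetical factory matching the (String) -> Connection type.
// User agent and timeout are placeholder values chosen for this example.
val connectionFactory: (String) -> Connection = { url ->
    Jsoup.connect(url)
        .userAgent("Mozilla/5.0 (compatible; JobScraperDemo/1.0)")
        .timeout(10_000) // milliseconds
}

fun main() {
    val connection = connectionFactory("https://www.linkedin.com/jobs/search?keywords=Android&location=Germany")
    // No request is sent yet; the Connection only holds the configuration
    println(connection.request().url())
    println(connection.request().timeout())
}
```

A factory like this could be passed as LinkedInJobsScraper(connectionFactory), and a plain function reference such as Jsoup::connect should fit the same parameter type. Injecting the function also makes it possible to swap in a different connection setup in tests.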

Here's what a URL requesting jobs could look like: https://www.linkedin.com/jobs/search?keywords=Android%20Developer%20Google&location=Germany

Replacing spaces with %20 is important because a literal space is not a valid character in a URL. URLs use percent-encoding: a character that is not allowed is written as % followed by its byte value in hexadecimal, and the space character is 0x20 in ASCII, hence %20. Keep in mind that replacing spaces alone does not cover other reserved characters, such as & or #, which would also need encoding if they appear in a job title or company name.
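As a sketch of a more general approach, the query values can be percent-encoded with java.net.URLEncoder from the Java standard library. The helper names encodeQueryValue and buildJobSearchUrl are hypothetical and only stand in for constructUrl here.

```kotlin
import java.net.URLEncoder

// Hypothetical helper: percent-encodes a single query parameter value.
// URLEncoder follows HTML form encoding and writes a space as "+",
// so "+" is swapped for "%20" afterwards (a literal plus is already "%2B" by then).
fun encodeQueryValue(value: String): String =
    URLEncoder.encode(value, "UTF-8").replace("+", "%20")

// Hypothetical stand-in for constructUrl that produces the same URL shape
fun buildJobSearchUrl(country: String, jobTitle: String, hiringOrganization: String): String {
    val keywords = encodeQueryValue("$jobTitle $hiringOrganization")
    val location = encodeQueryValue(country)
    return "https://www.linkedin.com/jobs/search?keywords=$keywords&location=$location"
}

fun main() {
    println(buildJobSearchUrl("Germany", "Android Developer", "Google"))
    // https://www.linkedin.com/jobs/search?keywords=Android%20Developer%20Google&location=Germany
    println(encodeQueryValue("R&D Engineer"))
    // R%26D%20Engineer
}
```

Encoding each value separately keeps the & and = separators of the query string intact while still escaping those characters when they appear inside a value.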

The last thing left to do related to scraping is extracting appropriate text with CSS queries:

Inside the getJobs function:

        var jobCards = document.select("div.base-card.base-card--link.job-search-card")
        if (jobCards.isEmpty()) {
            // Fallback selector for the alternative page structure
            jobCards = document.select("li a.base-card")
        }
        val linkedInJobs = jobCards.map { card ->
            val titleElement = card.selectFirst("h3.base-search-card__title")
            val companyElement = card.selectFirst("h4.base-search-card__subtitle")
            val locationElement = card.selectFirst("span.job-search-card__location")
            val linkElement = card.selectFirst("a.base-card__full-link")

            LinkedInJob(
                title = titleElement?.text() ?: "",
                company = companyElement?.text() ?: "",
                location = locationElement?.text() ?: "",
                // attr() returns an empty string when the attribute is missing
                link = linkElement?.attr("href") ?: card.attr("href")
            )
        }
        return linkedInJobs

I've tested job card extraction with different CSS queries and found that the structure of the HTML downloaded from LinkedIn can vary. That is why the code falls back to document.select("li a.base-card") when the first selector matches nothing, and to card.attr("href") when no a.base-card__full-link element is found inside a card.
Inside the map function we select titleElement, companyElement, locationElement, and linkElement from each individual card Element, build a LinkedInJob from their values, and collect the results in the linkedInJobs list.
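The selectors can be tried out offline by parsing a static HTML string with Jsoup.parse. The sketch below uses made-up markup that only imitates the class names from the queries above, so it is not a copy of LinkedIn's real page. JobCard and parseJobCards are hypothetical names used to keep the example self-contained.

```kotlin
import org.jsoup.Jsoup

// Hypothetical, trimmed-down stand-in for LinkedInJob
data class JobCard(val title: String, val company: String, val location: String, val link: String)

// Made-up markup that only imitates the class names used in the selectors
val sampleHtml = """
    <ul>
      <li>
        <div class="base-card base-card--link job-search-card">
          <a class="base-card__full-link" href="https://example.com/jobs/1"></a>
          <h3 class="base-search-card__title"> Android Developer </h3>
          <h4 class="base-search-card__subtitle">Example GmbH</h4>
          <span class="job-search-card__location">Berlin, Germany</span>
        </div>
      </li>
    </ul>
""".trimIndent()

fun parseJobCards(html: String): List<JobCard> {
    val document = Jsoup.parse(html)
    return document.select("div.base-card.base-card--link.job-search-card").map { card ->
        JobCard(
            // text() trims and normalizes whitespace; a missing element becomes ""
            title = card.selectFirst("h3.base-search-card__title")?.text() ?: "",
            company = card.selectFirst("h4.base-search-card__subtitle")?.text() ?: "",
            location = card.selectFirst("span.job-search-card__location")?.text() ?: "",
            link = card.selectFirst("a.base-card__full-link")?.attr("href") ?: ""
        )
    }
}

fun main() {
    parseJobCards(sampleHtml).forEach(::println)
}
```

Keeping the parsing step separate from the network call like this makes it easier to notice when a selector stops matching after a page layout change.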
