DEV Community

Arman Idrisi

Extracting Data from npmjs User Profile with Python and BeautifulSoup

Introduction

Learn how to extract data from an npmjs user profile using Python and BeautifulSoup. This tutorial will guide you through the process of fetching and parsing HTML content to extract information such as the user's profile image, username, name, social links, total number of packages, and details of the latest packages published by the user.

Prerequisites

Before we begin, make sure you have the following:

  • Python installed on your machine
  • The BeautifulSoup library installed (pip install beautifulsoup4)
  • The requests library installed (pip install requests)

Getting Started

Let's start by importing the necessary libraries and defining a function to extract the text from HTML elements:

from bs4 import BeautifulSoup
import requests

def extract_text(element):
    return element.get_text().strip() if element else ''
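
As a quick check of how this helper behaves, the sketch below runs it on a small made-up HTML snippet (not taken from npmjs). It illustrates that surrounding whitespace is stripped and that a missing element (select_one returns None when nothing matches) produces an empty string instead of an error.

```python
from bs4 import BeautifulSoup

def extract_text(element):
    return element.get_text().strip() if element else ''

# Made-up markup for illustration only
sample = BeautifulSoup("<p class='bio'>  Hello, npm!  </p>", "html.parser")

# A selector that matches, and one that does not
found = extract_text(sample.select_one("p.bio"))
missing = extract_text(sample.select_one("p.does-not-exist"))

print(repr(found))
print(repr(missing))
```

Because the helper never raises on a missing element, the extraction code that follows can call it without wrapping every lookup in its own check.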

Fetching the User Profile

The first step is to fetch the HTML content of the user's profile page. We will prompt the user to enter the username and construct the URL accordingly:

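user = input("> Enter username: ").strip()
url = f"https://www.npmjs.com/~{user}"
# Fail fast on network stalls and on HTTP errors such as 404 (unknown user)
response = requests.get(url, timeout=10)
response.raise_for_status()
html = response.text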
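
User input can contain stray spaces or a leading "~", so it may help to normalize the username before building the URL. The following is a minimal sketch using only the standard library; build_profile_url and BASE_URL are hypothetical names introduced here for illustration, not part of the original script.

```python
from urllib.parse import quote

BASE_URL = "https://www.npmjs.com"

def build_profile_url(user: str) -> str:
    """Return the profile URL for a username (hypothetical helper)."""
    # Drop surrounding whitespace and a leading "~" if the user typed one
    cleaned = user.strip().lstrip("~")
    if not cleaned:
        raise ValueError("username must not be empty")
    # Percent-encode anything that is not safe inside a URL path segment
    return f"{BASE_URL}/~{quote(cleaned, safe='')}"

print(build_profile_url("  ~some-user "))
```

The returned string can be passed to requests.get in place of the f-string shown above.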

Parsing the HTML

Next, we need to parse the HTML content using BeautifulSoup:

soup = BeautifulSoup(html, "html.parser")

Extracting the User's Profile Image

Let's start by extracting the user's profile image URL. We can identify the relevant HTML element using CSS selectors and retrieve the src attribute:

img_element = soup.select_one("div._73a8e6f0 a img")
img = "https://npmjs.com" + img_element.get("src") if img_element else "NA"
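
Plain string concatenation assumes the src attribute is always a site-relative path. If the page ever serves an absolute or protocol-relative image URL, the result would be malformed. A possible alternative, sketched here with the standard library's urljoin, handles all three cases; absolute_image_url is a hypothetical helper name and the sample paths are made up.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.npmjs.com"

def absolute_image_url(src):
    """Resolve an img src against the site root (hypothetical helper)."""
    if not src:
        # Mirror the tutorial's "NA" placeholder for missing values
        return "NA"
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return urljoin(BASE_URL, src)

relative = absolute_image_url("/npm-avatar/example")
absolute = absolute_image_url("https://cdn.example.com/avatar.png")
protocol_relative = absolute_image_url("//cdn.example.com/avatar.png")
print(relative, absolute, protocol_relative, sep="\n")
```

With this helper, the assignment above could become img = absolute_image_url(img_element.get("src")) if img_element else "NA".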

Extracting the Username and Name

We can extract the username and name in a similar manner. Identify the respective HTML elements and extract their text content:

username_element = soup.select_one("h2.b219ea1a")
username = extract_text(username_element) if username_element else "NA"

name_element = soup.select_one("div._73a8e6f0 div.eaac77a6")
name = extract_text(name_element) if name_element else "NA"

Extracting Social Links

To extract the social links, we need to identify the relevant HTML elements and retrieve the href attribute of the associated <a> tags:

social_elements = soup.select("ul._07eda527 li._43cef18c a._00cd8e7e")
social = [e.get("href") for e in social_elements]

Extracting the Total Number of Packages

We can extract the total number of packages by identifying the corresponding HTML element and extracting its text content:

total_packages_element = soup.select_one("div#tabpanel-packages h2.f3f8c3f4 span.c5c8a11c")
total_packages = extract_text(total_packages_element) if total_packages_element else "NA"
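
The value extracted here is a string, which is awkward if you later want to sort or compare profiles by package count. The sketch below converts such text to an integer. The exact text npmjs renders is not shown in this tutorial, so the sample inputs are assumptions, and parse_count is a hypothetical helper.

```python
import re

def parse_count(text):
    """Return the first integer found in text, or None (hypothetical helper)."""
    # Match a run of digits, allowing thousands separators such as "1,234"
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))

print(parse_count("42"))
print(parse_count("1,234 Packages"))
print(parse_count("NA"))
```

Returning None when no number is present keeps the "NA" fallback distinguishable from a genuine count of zero.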

Extracting Details of Latest Packages

Next, we can extract the title, description, and publication details of each of the user's latest packages. We iterate over the matching list items and pull the desired fields from each one:

package_elements = soup.select("div._0897331b ul._0897331b li._2309b204")
packages = []
for element in package_elements:
    title_element = element.select_one("h3.db7ee1ac")
    description_element = element.select_one("p._8fbbd57d")
    published_element = element.select_one("span._66c2abad")

    package = {
        'title': extract_text(title_element),
        'description': extract_text(description_element),
        'published': extract_text(published_element)
    }
    packages.append(package)
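
To see what this loop produces without hitting the network, the same pattern can be run against a small hand-written snippet. The markup and the pkg class name below are placeholders made up for illustration, not npmjs's real structure. The second item deliberately omits its description and publish line to show that missing fields come back as empty strings.

```python
from bs4 import BeautifulSoup

def extract_text(element):
    return element.get_text().strip() if element else ''

# Made-up markup: the class name is a placeholder, not one of npmjs's real ones
sample_html = """
<ul>
  <li class="pkg"><h3>alpha</h3><p>First package</p><span>published 1.0.0</span></li>
  <li class="pkg"><h3>beta</h3></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")
packages = []
for element in soup.select("li.pkg"):
    # select_one returns None for a missing child; extract_text turns that into ''
    packages.append({
        'title': extract_text(element.select_one("h3")),
        'description': extract_text(element.select_one("p")),
        'published': extract_text(element.select_one("span")),
    })

print(packages)
```

Testing against a fixed snippet like this also makes it easier to notice when a selector stops matching after the live site changes its class names.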

Creating the Data Dictionary

Finally, let's create a dictionary containing all the extracted data:

data = {
    'image': img,
    'username': username,
    'name': name,
    'social': social,
    'total_packages': total_packages,
    'latest_packages': packages
}

Printing the Extracted Data

To verify that the data extraction process is working correctly, we can print the data dictionary:

print(data)
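
Printing a nested dictionary directly puts everything on one line, which gets hard to read once the package list grows. One option, sketched below, is to format it as JSON with the standard library. The dictionary values here are placeholders standing in for a real scrape result.

```python
import json

# Placeholder values standing in for a real scrape result
data = {
    'image': "NA",
    'username': "some-user",
    'name': "NA",
    'social': [],
    'total_packages': "0",
    'latest_packages': [],
}

# indent makes nested structures readable; ensure_ascii=False keeps non-ASCII names intact
formatted = json.dumps(data, indent=2, ensure_ascii=False)
print(formatted)
```

The same string can be written to a file if you want to keep the result for later analysis.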

Conclusion

In this tutorial, you learned how to extract data from an npmjs user profile using Python and BeautifulSoup. We covered the steps involved in fetching the HTML content, parsing it using BeautifulSoup, and extracting various pieces of information such as the user's profile image, username, name, social links, total number of packages, and details of the latest packages published by the user. This knowledge can be applied to similar scenarios where you need to scrape data from websites for analysis or other purposes.

I hope you found this tutorial helpful! If you have any questions or feedback, please leave a comment below. Happy coding!

You can find source code here
