Silvester

Scraping movie data

We start by importing the libraries we will need. Requests and BeautifulSoup are the standard libraries for scraping data from websites, while the csv library is for writing the scraped data to a CSV file.

```python
import requests
from bs4 import BeautifulSoup
import csv
```

The headers dictionary reduces the chances of the website rejecting your scraping requests, since it makes the request look like it comes from a regular browser rather than a bot. You can find suitable headers by right-clicking your current page and clicking Inspect. From there, go to the Network tab and select a request with a status of 200. The Headers panel appears on the right-hand side of the screen; scroll to the bottom and you will find the request headers, including the User-Agent.



```python
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0"}
```

The code below checks whether the HTTP request to the IMDb page was successful (status code 200). On success, it uses BeautifulSoup to parse the HTML content of the page. The script then finds the movie list items inside an unordered list and iterates through them, extracting each movie's rank, title, year, duration, parental advisory, and rating, and writes this information to a CSV file named "movies_data.csv" in a structured format. If the request fails, it prints an error message with the HTTP status code.

```python
# Create a session and send the request
with requests.Session() as session:
    link = session.get('https://m.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm', headers=headers)

# Check if the request was successful (status code 200)
if link.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(link.text, 'html.parser')

    # Find all list items within the unordered list
    # (note: class names like "sc-94da5b1b-0" are auto-generated and may
    # change when IMDb updates its front end)
    movies_items = soup.find("ul", class_="ipc-metadata-list").find_all("li")

    # Open a CSV file for writing
    with open("movies_data.csv", mode="w", encoding="utf-8", newline="") as file:
        # Create a CSV writer
        writer = csv.writer(file)

        # Write the header row
        writer.writerow(["Rank", "Title", "Year", "Duration", "Parental Advisory", "Rating"])

        # Iterate through each list item and write to the CSV file
        for movie_item in movies_items:
            rank = movie_item.find("div", class_="sc-94da5b1b-0").get_text(strip=True).split('(')[0]
            title = movie_item.find("a", class_="ipc-title-link-wrapper").get_text(strip=True)

            # The year, duration, and advisory share one details div,
            # so fetch it once and slice it by fixed positions
            details = movie_item.find("div", class_="sc-c7e5f54-7").get_text(strip=True)
            year = details[:4]
            duration = details[4:10]
            parental_advisory = details[10:]

            rating = movie_item.find("span", class_="ipc-rating-star").get_text(strip=True).split('(')[0]

            # Write a row to the CSV file
            writer.writerow([rank, title, year, duration, parental_advisory, rating])

    print("Data written to movies_data.csv successfully.")
else:
    print(f"Failed to retrieve the page. Status code: {link.status_code}")
```
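
After the script finishes, a quick way to sanity-check the output is to read the CSV back with the same csv module. The sketch below writes one illustrative row in the scraper's output format (sample values, not real IMDb data) and reads it back:

```python
import csv

# Write one illustrative row in the scraper's output format
# (sample data, not real scraped values)
with open("movies_data.csv", mode="w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Rank", "Title", "Year", "Duration", "Parental Advisory", "Rating"])
    writer.writerow(["1", "Example Movie", "2023", "2h 10m", "PG-13", "7.5"])

# Read the file back and confirm the structure
with open("movies_data.csv", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

print(rows[0])        # the header row
print(len(rows) - 1)  # number of movie rows
```

If the header row or the row count looks wrong, the slicing positions in the scraper are the first thing to re-check, since IMDb's markup can change.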
