DEV Community

Vic

Scraping a Real Estate Website

This Python script uses the Scrapy, requests, and price_parser libraries to scrape a website that lists properties for sale. It extracts details about each property such as price, title, address, number of baths and rooms, area, owner info, owner url, and coordinates (latitude, longitude).

Libraries

  • Scrapy: An open-source web-crawling framework for Python.
  • requests: A library to send all kinds of HTTP requests.
  • price_parser: A library to extract price and currency from raw text strings.

Let's dissect this script step-by-step:

Import Libraries

from scrapy import Selector
import requests
from urllib.parse import urljoin
from price_parser import Price

The above lines import the necessary Python libraries for the script.

Setting the Initial Variables

response = requests.get("https://www.pisos.com/venta/pisos-cedeira/")
sel = Selector(response)

home_url = "https://www.pisos.com"

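The script sends a GET request to the listing page and wraps the response in Scrapy's Selector class, which gives an object that can be queried with CSS selectors to parse the HTML. home_url is kept separately so that relative ad links can be turned into absolute URLs later.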

Number Filtering Function

def number_filtering(number):
    if type(number) == int:
        return number
    if type(number) == float:
        return(round(number))
    if type(number) == str:
        number = Price.fromstring(number)
        number = number.amount
        if number is None:
            return None
        try:
            return int(number)
        except Exception:
            return float(number)

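This function normalizes a value to a number. Integers are returned unchanged and floats are rounded to the nearest integer. Strings are parsed with Price.fromstring, and the function returns None when no amount is found. Otherwise the parsed amount, which price_parser exposes as a Decimal, is passed to int(). Since int() truncates a Decimal instead of raising, the float fallback is not reached for ordinary prices.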
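The int() conversion is worth a closer look, because int() on a Decimal drops the fractional part silently. The following is a minimal sketch, not the script's own code, of a conversion that keeps whole amounts as int and fractional ones as float. The helper name decimal_to_number is hypothetical.

```python
from decimal import Decimal
from typing import Optional, Union


def decimal_to_number(amount: Optional[Decimal]) -> Union[int, float, None]:
    """Hypothetical helper: int for whole amounts, float otherwise."""
    if amount is None:
        return None
    # int() on a Decimal truncates toward zero and does not raise,
    # so test for a fractional part explicitly.
    if amount == amount.to_integral_value():
        return int(amount)
    return float(amount)


print(int(Decimal("1234.56")))                # plain int() drops the fraction
print(decimal_to_number(Decimal("250000")))   # whole amount stays an int
print(decimal_to_number(Decimal("1234.56")))  # fractional amount becomes a float
```

For listing prices, which are normally whole numbers, both approaches give the same result. The difference only matters for values such as areas with decimals.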

Get Text Between Substrings Function

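def get_text_between(full_string, start_substring, end_substring):
    # Check for the start marker before adding its length,
    # otherwise a failed find() (-1) is masked by the addition.
    start = full_string.find(start_substring)
    if start == -1:
        return ""
    start += len(start_substring)
    end = full_string.find(end_substring, start)
    return "" if end == -1 else full_string[start:end]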

This function takes three arguments: the full string and two substrings. It finds the text located between the two substrings.
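As a quick check of this behaviour, the sketch below runs a version of the function that tests for the start marker before adding its length, so a missing marker yields an empty string. The sample string is hypothetical and only imitates the markers searched for later in the script.

```python
def get_text_between(full_string, start_substring, end_substring):
    """Return the text between two markers, or "" if either is missing."""
    start = full_string.find(start_substring)
    if start == -1:
        return ""
    start += len(start_substring)
    end = full_string.find(end_substring, start)
    return "" if end == -1 else full_string[start:end]


# Hypothetical script text, shaped like the markers the script looks for.
sample = "var _Lat = 43.6601; var _Long = -8.0563;"

print(get_text_between(sample, "_Lat = ", ";"))
print(get_text_between(sample, "_Long = ", ";"))
print(repr(get_text_between(sample, "_Zoom = ", ";")))  # marker absent
```

Note that the result is always a string, so the coordinates still need a float() conversion if they are used numerically.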

Get Latitude and Longitude Function

def get_lat_lon(response):
    selector = Selector(response)
    # Query the inline script once; default to "" so a missing script
    # does not pass None to get_text_between.
    script_text = selector.css("script[type='text/javascript'] ::text").get(default="")
    lat = get_text_between(script_text, "_Lat = ", ";")
    lon = get_text_between(script_text, "_Long = ", ";")
    return lat, lon

This function extracts the latitude and longitude values from the JavaScript included in the page's HTML.
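The same extraction can also be done with a regular expression, which is a different technique from the substring search the script uses. This is a minimal sketch under the assumption that the script text contains assignments like `_Lat = 43.6601;`. The sample text and the helper name extract_coordinate are hypothetical.

```python
import re

# Hypothetical inline script; the real page's script may be laid out differently.
script_text = """
var _Lat = 43.6601;
var _Long = -8.0563;
"""


def extract_coordinate(script, name):
    """Return the float assigned to `name`, or None when it is absent."""
    pattern = rf"{re.escape(name)}\s*=\s*(-?\d+(?:\.\d+)?)"
    match = re.search(pattern, script)
    return float(match.group(1)) if match else None


lat = extract_coordinate(script_text, "_Lat")
lon = extract_coordinate(script_text, "_Long")
print(lat, lon)
```

The regex version tolerates extra whitespace around the equals sign and returns floats directly, at the cost of being a little harder to read than two find() calls.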

Parse Ad Function

def parse_ad(ad_response):
    ...
    print(f"Price: {price}")
    print(f"Title: {title}")
    print(f"Address: {address}")
    print(f"N_baths: {n_baths}")
    print(f"N_rooms: {n_rooms}")
    print(f"Area: {area}")
    print(f"Owner info: {owner_info}")
    print(f"Owner url: {owner_url}")
    print(f"Description: {description}")
    print(f"Source id: {source_id}")
    print(f"Latitude: {lat}")
    print(f"Longitude: {lon}")
    print("=============================================================")

This function parses the HTML of an ad and prints out the data about the property. It extracts the price, title, address, number of baths and rooms, area, owner info, owner url, description, source id, and coordinates (latitude, longitude) from the ad's HTML.
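Printing is fine for a demonstration, but the same fields could also be collected in a record so they can be stored or exported later. The following is only a sketch of that idea: the PropertyAd class is hypothetical, the field types are assumptions, and the values are made up for illustration. It does not show how the script extracts the fields.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PropertyAd:
    """Hypothetical record mirroring the fields printed by parse_ad."""
    price: Optional[int] = None
    title: Optional[str] = None
    address: Optional[str] = None
    n_baths: Optional[int] = None
    n_rooms: Optional[int] = None
    area: Optional[int] = None
    owner_info: Optional[str] = None
    owner_url: Optional[str] = None
    description: Optional[str] = None
    source_id: Optional[str] = None
    lat: Optional[str] = None
    lon: Optional[str] = None


# Made-up values for illustration only.
ad = PropertyAd(price=120000, title="Example flat", n_rooms=3, lat="43.66", lon="-8.05")

# asdict() turns the record into a plain dict, ready for printing or export.
for field_name, value in asdict(ad).items():
    print(f"{field_name}: {value}")
```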

Parse All Ads

all_ads = sel.css("div.ad-preview")
for ad in all_ads:
    url = ad.css("a::attr(href)").get()
    ad_response = requests.get(urljoin(home_url, url))
    parse_ad(ad_response)

Finally, the script iterates over all ad preview divs, sends a request to each ad's URL, and then parses the response with the parse_ad() function.
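The URL-building step of this loop can be sketched on its own, without any network access. The helper build_ad_urls below is hypothetical, and the hrefs are made-up stand-ins for what ad.css("a::attr(href)").get() might return. It skips previews where no link was found and avoids requesting the same ad twice.

```python
from urllib.parse import urljoin

home_url = "https://www.pisos.com"


def build_ad_urls(base_url, hrefs):
    """Hypothetical helper: absolute, de-duplicated ad URLs in page order."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href:  # .get() returns None when no href is found
            continue
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


# Made-up hrefs for illustration only.
hrefs = [
    "/comprar/piso-ejemplo-1/",
    None,
    "/comprar/piso-ejemplo-1/",
    "/comprar/piso-ejemplo-2/",
]
for url in build_ad_urls(home_url, hrefs):
    print(url)
```

Each resulting URL could then be passed to requests.get() and parse_ad() exactly as in the loop above.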

Full code -> https://gist.github.com/VictorLG98/994874841e52213cf20e7c2a91ee781a

Video on my Youtube -> linktree