My goal
I was looking for a flat to buy, and I wanted to find something cheaper than the market. The real estate market is quite efficient, so it is almost impossible to find an underpriced listing manually, which is why I decided to automate the process.
Solution
Idea
To understand whether a specific item is “cheap” or not, we need historical data on previous deals for that object, or at least for similar objects in the same location. There is no public service that provides deal data, but there are plenty of property websites that let you browse sales offers. I decided to write an application that parses one of these sites and saves all properties into a database. Then I could run queries and analyse the price trend in a specific area.
Tools
Based on my experience, I chose Java to implement it. I started with a command-line application using Spring Boot and Spring Batch. The high-level data flow looked as follows.
Architecture
Let's go through these components one by one.
Properties website
This is the website to parse. At the time I was interested in property in Russia, so I used a local portal: https://www.avito.ru. There are multiple categories, including flats. The structure is as follows: each category contains a list of ads spread over multiple pages, with 50 items per page. Each item contains the information about a specific property that I needed.
Page Parser
This is the first component of my application. It receives a category URL as an input: /moskva/kvartiry/prodam-ASgBAgICAUSSA8YQ?cd=1&p={page}. As you can see, there are two parameters: the first one, cd, is always the same, and the second one, p, is responsible for the page number. Then, using the jsoup library, I read each page in a loop and collected the URLs.
// Each ad card on the category page
Elements items = document.select(".item");
for (Element item : items) {
    // The title link of the card holds the relative URL of the ad page
    Elements itemElement = item.select(".item-description-title-link");
    String relativeItemReference = itemElement.attr("href");
    urls.add(relativeItemReference);
}
After reading each page, I sent the list of URLs to the next component.
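Putting the pieces together, the paging loop could look roughly like this. This is only a sketch: the PageParser class name, the pageCount parameter, and the CATEGORY_URL constant are my own illustrative placeholders, not names from the original project.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PageParser {

    // The cd parameter stays constant; only p changes from page to page
    private static final String CATEGORY_URL =
            "https://www.avito.ru/moskva/kvartiry/prodam-ASgBAgICAUSSA8YQ?cd=1&p=%d";

    public List<String> collectUrls(int pageCount) throws IOException {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            Document document = Jsoup.connect(String.format(CATEGORY_URL, page)).get();
            Elements items = document.select(".item");
            for (Element item : items) {
                // Relative link to the ad page, e.g. /moskva/kvartiry/..._2338886814
                urls.add(item.select(".item-description-title-link").attr("href"));
            }
        }
        return urls;
    }
}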
ID exists filter
Each item has a URL like /moskva/kvartiry/2-k._kvartira_548m_911et._2338886814, which contains an identifier at the end (2338886814). This is the unique ID of the ad. I used it as a key in a cache to avoid parsing the same item twice.
Some items could still be parsed twice, though, because the cache write happened later in the pipeline, so multiple ads with the same ID could pass this gate before the first one was recorded.
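A minimal version of such a filter could look like the sketch below. The extractId helper, the filterNew and markSeen methods, and the in-memory set standing in for the cache are my own assumptions for illustration; the real project may use an external cache.
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public class IdExistsFilter {

    // In-memory stand-in for the cache of already seen ad IDs
    private final Set<String> seenIds = ConcurrentHashMap.newKeySet();

    // The ad ID is the last underscore-separated token of the relative URL
    private String extractId(String relativeItemReference) {
        return relativeItemReference.substring(relativeItemReference.lastIndexOf('_') + 1);
    }

    // Keep only URLs whose ID has not been seen yet
    public List<String> filterNew(List<String> urls) {
        return urls.stream()
                .filter(url -> !seenIds.contains(extractId(url)))
                .collect(Collectors.toList());
    }

    // Called later, after the item has actually been parsed and saved
    public void markSeen(String relativeItemReference) {
        seenIds.add(extractId(relativeItemReference));
    }
}
Because markSeen is only called later, after the item has been fully processed, two ads with the same ID that arrive close together can both pass filterNew, which is exactly the leak described above.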
Item Parser
After the filter, all unique IDs go to the next component, the Item Parser. It uses the ID to open the item page and read all the data from it.
// Each attribute row has the form "Name: value", e.g. "Общая площадь: 54.8 м²"
Elements attributes = doc.select(".item-params-list-item");
Map<String, String> attrs = attributes.stream().collect(Collectors.toMap(
        attr -> attr.text().split(":")[0].trim(),
        attr -> attr.text().split(":")[1].trim()));
// "Общая площадь" is the total area, "Жилая площадь" is the living area;
// "0" as the default keeps parseDouble from failing when an attribute is missing
estate.setTotalSpace(Double.parseDouble(attrs.getOrDefault("Общая площадь", "0").split(" ")[0]));
estate.setLiveSpace(Double.parseDouble(attrs.getOrDefault("Жилая площадь", "0").split(" ")[0]));
...
As a result, an object with all the property info is built and passed forward to the saver component.
Saver
This component is the last one in my pipeline. It receives items from the Item Parser, converts them to JSON, and then saves them to Elasticsearch, using batches to improve performance.
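The post does not show the saver code, so the following is only a rough sketch of what batched indexing could look like with the Elasticsearch high-level REST client (7.x) and Jackson for the JSON conversion. The EstateSaver class, the "estates" index name, and the Estate getters are my assumptions.
import com.fasterxml.jackson.databind.ObjectMapper;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.IOException;
import java.util.List;

public class EstateSaver {

    private final RestHighLevelClient client;
    private final ObjectMapper objectMapper = new ObjectMapper();

    public EstateSaver(RestHighLevelClient client) {
        this.client = client;
    }

    // Convert each parsed item to JSON and send the whole batch in one bulk request
    public void saveBatch(List<Estate> estates) throws IOException {
        BulkRequest bulk = new BulkRequest();
        for (Estate estate : estates) {
            String json = objectMapper.writeValueAsString(estate);
            bulk.add(new IndexRequest("estates")
                    .id(estate.getId())          // assuming the ad ID is used as the document ID
                    .source(json, XContentType.JSON));
        }
        client.bulk(bulk, RequestOptions.DEFAULT);
    }
}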
As a result, I was able to build multiple Kibana dashboards with price and popularity metrics. One of the most useful components is the interactive map, which renders the ads that have coordinates (I extracted the coordinates from the ad's description). It helped me find the perfect property in a good area at a good price.
Problems
During this experiment, I faced some problems and tried different solutions, which I want to share.
IP address blocking
As you can guess, nobody wants their data to be scraped, so this site has several layers of protection. During development everything worked fine because I was making a small number of requests, but as soon as I started testing I ran into a huge number of 403 errors.
First, I tried using extra headers and cookies to simulate a real user with a browser.
Document imageDoc = Jsoup
        .connect(url)
        .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36")
        .header("referer", "https://www.avito.ru" + relativeItemReference)
        .header("accept", "*/*")
        .ignoreContentType(true)
        .get();
It didn't help. I think they have much more intelligent checks than simply verifying the userAgent.
So my next attempt was to find the smallest delay between requests that would avoid blocking. To find it, I used free VPN services so I could quickly change IP addresses. Experimentally, I arrived at a minimum delay of 25 seconds. But that means I could parse only ~3500 items per day, which is definitely not enough.
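As a sanity check on that number: 24 × 3600 s ÷ 25 s ≈ 3456 requests per day. A single-threaded fetch loop with such a fixed delay can be sketched like this (SlowFetcher and fetchAll are hypothetical names I use for illustration, not the project's code):
import java.util.List;
import java.util.function.Consumer;

public class SlowFetcher {

    // 86,400 seconds per day / 25 s per request ≈ 3,456 items per day on one thread
    private static final long REQUEST_DELAY_MS = 25_000;

    public void fetchAll(List<String> urls, Consumer<String> parseItem) throws InterruptedException {
        for (String url : urls) {
            parseItem.accept(url);          // delegate the actual item parsing
            Thread.sleep(REQUEST_DELAY_MS); // wait 25 seconds before the next request
        }
    }
}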
To increase the parsing speed, I decided to parallelize my algorithm and use a separate proxy for each thread.
doc = Jsoup.connect(url)
        .proxy(proxy.getHost(), proxy.getPort())
        .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2")
        .header("Content-Language", "en-US")
        .timeout(timeout)
        .get();
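The post doesn't show how the work was split across threads, so the following is only a rough sketch of one possible shape: a fixed thread pool with one worker per proxy, all draining a shared queue of URLs. ParallelParser, ProxyConfig and parseWithProxy are my own illustrative names.
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelParser {

    // One worker per proxy; each worker keeps pulling URLs from the shared queue
    public void parseInParallel(List<ProxyConfig> proxies, List<String> urls) {
        Queue<String> queue = new ConcurrentLinkedQueue<>(urls);
        ExecutorService pool = Executors.newFixedThreadPool(proxies.size());
        for (ProxyConfig proxy : proxies) {
            pool.submit(() -> {
                String url;
                while ((url = queue.poll()) != null) {
                    parseWithProxy(url, proxy);
                }
            });
        }
        pool.shutdown();
    }

    private void parseWithProxy(String url, ProxyConfig proxy) {
        // ... the jsoup call shown above, using proxy.getHost() and proxy.getPort()
    }

    // Hypothetical holder for a proxy host/port pair
    static class ProxyConfig {
        private final String host;
        private final int port;

        ProxyConfig(String host, int port) {
            this.host = host;
            this.port = port;
        }

        String getHost() { return host; }
        int getPort() { return port; }
    }
}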
Now my only limitation was the number of proxies available. I didn't want to pay for them, so I used free public proxies, and many of them were slow or unstable.
My next improvement was to pick the best proxy from my list for each request. I made a scheduled job that checked each proxy every N minutes and saved useful metadata such as connection speed and the number of errors.
@Scheduled(fixedDelay = 100, initialDelay = 1)
@Transactional
public void checkProxy() {
    // Take the proxy that hasn't been checked for the longest time
    ProxyEntity foundProxy = getProxyWithOldestUpdate()
            .orElseThrow(() -> new RuntimeException("Proxy not found"));
    int retries = 0;
    while (retries < retryCount) {
        log.debug("Attempt {}", retries + 1);
        if (checkProxy(foundProxy)) {
            log.debug("Proxy [{}] UP", foundProxy.getHost());
            foundProxy.setActive(true);
            break;
        }
        RequestUtils.wait(1000);
        retries++;
    }
    // All attempts failed: mark the proxy as inactive
    if (retries == retryCount) {
        foundProxy.setActive(false);
        log.debug("Proxy [{}] DOWN", foundProxy.getHost());
    }
    foundProxy.setCheckDate(LocalDateTime.now());
    proxyRepository.save(foundProxy);
}
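The snippet above only collects the health data; how the "best" proxy is then chosen per request is not shown. With Spring Data, one straightforward option would be a derived query over the stored metadata. This is only a sketch: the speed field, the active flag's mapping, and the Long ID type of ProxyEntity are my assumptions.
import java.util.Optional;

import org.springframework.data.jpa.repository.JpaRepository;

// Picks the fastest proxy that is currently marked active
public interface ProxyRepository extends JpaRepository<ProxyEntity, Long> {

    Optional<ProxyEntity> findFirstByActiveTrueOrderBySpeedDesc();
}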
Duplicate items
Sometimes people create a new ad for the same property, so from the system's point of view there are two different ads with different IDs. That is fine for the property website, but not for statistics and data analysis.
To get rid of duplicated items in my database, I simply compared ads by title and description using string comparison. This can produce false positives: sometimes I removed not a real duplicate but a different ad that happened to have the same text. This is the opposite trade-off, because it is totally fine for data analysis but would be critical for a property website. Anyway, it solved my problem.
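The comparison itself is as simple as it sounds; roughly like the sketch below, where the Estate getters and the trim/lower-case normalization are my own additions.
import java.util.Objects;

public class DuplicateChecker {

    // Two ads are treated as the same property if their title and description match
    public boolean isDuplicate(Estate a, Estate b) {
        return Objects.equals(normalize(a.getTitle()), normalize(b.getTitle()))
                && Objects.equals(normalize(a.getDescription()), normalize(b.getDescription()));
    }

    private String normalize(String text) {
        return text == null ? "" : text.trim().toLowerCase();
    }
}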