How to make a simple webcrawler with JAVA ….(and jsoup)

Hans
Father of one, environmentalist, runner , round-earther, nerd
・3 min read

While Python is arguably the number one language when it comes to webscraping, good ole JAVA has its perks. At least for a JAVA developer like me who hasn't quite delved into Python yet. If you are in a hurry, don't worry: the complete code is found at the end of this post.

Anywho, I wanted to figure out how to make a webcrawler with JAVA, just for the lulz really. Turns out it was way easier than expected. First of all you need to start a new JAVA project and download jsoup:

Link

Now, as soon as IntelliJ has done its magic making your project, put the downloaded jsoup .jar file in the project root and add it to the project as a library.
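(If you'd rather use Maven than a loose .jar, jsoup is also on Maven Central; the version below was current when I wrote this, so check for the latest:)

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version>
</dependency>
```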


Now it's time for some nice programming principles, right? The imports needed for this project are:


import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;
import java.util.ArrayList;

Next up, it's time for constants. I chose to index links (hrefs) on the CNN website. I chose CNN because I am not a big fan of FOX News.


public class Crawler {
    public static final String CNN = "https://edition.cnn.com/";

    public static void main(String[] args) {
        System.out.println("Web Crawler");
    }
}

And then we need two methods utilising a cluster of methods from the jsoup library.
The first one is a recursive method which indexes the "a href" links on a given page. To keep the method from rambling on into infinity, like a good recursive method will do, I chose to make it stop two levels down.

private static void crawl(int level, String url, ArrayList<String> visited) {
    if (level <= 2) {
        Document doc = request(url, visited);
        if (doc != null) {
            for (Element link : doc.select("a[href]")) {
                String nextLink = link.absUrl("href");
                if (!visited.contains(nextLink)) {
                    crawl(level + 1, nextLink, visited);
                }
            }
        }
    }
}
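A word of warning about the recursion depth: the recursive call must pass `level + 1`, not `level++`. The post-increment hands the *old* value of `level` to the next call, so the depth cap never kicks in. Here is a jsoup-free sketch of the same depth-limited recursion over an in-memory link graph; the page names and the `pages` map are made up purely for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DepthLimitDemo {
    // A fake "web": each page maps to the pages it links to.
    static Map<String, List<String>> pages = Map.of(
            "home", List.of("news", "sport"),
            "news", List.of("article"),   // "article" sits at level 3
            "article", List.of(),
            "sport", List.of());

    // Same shape as the crawler: visit the page, then recurse one level deeper.
    static void crawl(int level, String page, List<String> visited) {
        if (level <= 2 && !visited.contains(page)) {
            visited.add(page); // "index" the page
            for (String next : pages.getOrDefault(page, List.of())) {
                crawl(level + 1, next, visited);
            }
        }
    }

    public static void main(String[] args) {
        List<String> visited = new ArrayList<>();
        crawl(1, "home", visited);
        System.out.println(visited); // prints [home, news, sport] -- "article" is beyond level 2
    }
}
```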

And then we need a method that makes the connection, prints "Link: " plus the url being indexed, and returns the page as a jsoup Document.

private static Document request(String url, ArrayList<String> visited) {
    try {
        Connection con = Jsoup.connect(url);
        Document doc = con.get();
        if (con.response().statusCode() == 200) {
            System.out.println("Link: " + url);
            System.out.println(doc.title());
            visited.add(url); // mark the url as indexed
            return doc;
        }
        return null;
    } catch (IOException e) {
        // unreachable hosts, timeouts and non-HTML responses end up here
        return null;
    }
}
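One thing worth knowing: `visited.contains()` on an ArrayList scans the whole list, so the duplicate check gets slower as the crawl grows. A HashSet (my suggestion, not part of the original code) does the same job with constant-time lookups, and its `add()` even reports duplicates for you:

```java
import java.util.HashSet;
import java.util.Set;

public class VisitedDemo {
    public static void main(String[] args) {
        Set<String> visited = new HashSet<>();

        // add() returns true only when the url was not seen before,
        // so the duplicate check and the insert become one O(1) call
        System.out.println(visited.add("https://edition.cnn.com/")); // true
        System.out.println(visited.add("https://edition.cnn.com/")); // false, already visited
    }
}
```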

Finally, the full code is ready to run:

package com.company;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;

public class Crawler {
    public static final String CNN = "https://edition.cnn.com/";

    public static void main(String[] args) {
        crawl(1, CNN, new ArrayList<String>());
    }

    private static void crawl(int level, String url, ArrayList<String> visited) {
        if (level <= 2) {
            Document doc = request(url, visited);
            if (doc != null) {
                for (Element link : doc.select("a[href]")) {
                    String nextLink = link.absUrl("href");
                    if (!visited.contains(nextLink)) {
                        crawl(level + 1, nextLink, visited);
                    }
                }
            }
        }
    }

    private static Document request(String url, ArrayList<String> visited) {
        try {
            Connection con = Jsoup.connect(url);
            Document doc = con.get();
            if (con.response().statusCode() == 200) {
                System.out.println("Link: " + url);
                System.out.println(doc.title());
                visited.add(url);
                return doc;
            }
            return null;
        } catch (IOException e) {
            return null;
        }
    }
}

Cheers
https://boreatech.medium.com/
