Final Project Post on DEV | Demo Link on YouTube | Link to code on GitHub
In this post, we will explore how to create a web search engine using Apache Solr and then enhance it with auto-complete and auto-suggest features. Let's divide the work into a few steps for easier understanding:
- Crawl Initial Data:
First, we'll crawl our initial training data using crawler4j. We limit the crawler to the HTML pages of the LA Times news website, with a maximum crawl depth of 16 and a maximum of 20,000 pages. For efficient crawling it is a good idea to run multiple crawler threads; crawler4j supports multi-threading, and I set the number of crawlers to 7 in my project. This configuration can be changed as needed when defining the crawler. The source code for this crawler can be found at Web Crawler.
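Below is a minimal sketch of what such a crawler4j setup might look like (assuming crawler4j 4.x). The storage folder, seed URL, filter regex, and the `MyCrawler` body are illustrative placeholders, not the project's exact code.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class CrawlerLauncher {

    public static class MyCrawler extends WebCrawler {
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            String href = url.getURL().toLowerCase();
            // Follow only latimes.com links and skip obvious non-HTML resources.
            return href.startsWith("https://www.latimes.com/")
                    && !href.matches(".*\\.(css|js|gif|jpe?g|png|pdf|zip|mp3|mp4)$");
        }

        @Override
        public void visit(Page page) {
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData html = (HtmlParseData) page.getParseData();
                // In the real project the raw HTML would be saved to disk for the indexing step.
                System.out.println(page.getWebURL().getURL() + " -> " + html.getHtml().length() + " bytes");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-data"); // placeholder path for intermediate crawl data
        config.setMaxDepthOfCrawling(16);                // maximum crawl depth of 16
        config.setMaxPagesToFetch(20000);                // stop after 20,000 pages

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Seed the crawl with the LA Times front page.
        controller.addSeed("https://www.latimes.com/");

        // Run 7 crawler threads.
        controller.start(MyCrawler.class, 7);
    }
}
```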
- Indexing web pages using Solr:
After we have crawled and downloaded the HTML pages from the site, we can index them with Solr as described in Index Data Using Solr.
Apache Tika is a library for document type detection and content extraction from a wide range of file formats. Internally, Tika relies on existing document parsers and type-detection techniques to identify and extract data. For HTML pages, for example, Tika's HtmlParser strips out the markup and keeps only the page content. Tika is a powerful tool, especially when you have crawled many kinds of documents, e.g. PDFs, images, and videos, and it is included with the Solr installation.
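As a rough illustration of this step, here is a sketch that extracts text from the downloaded HTML files with Tika's facade API and pushes documents to a Solr core via SolrJ. The core name, crawl folder, and field names (`id`, `title`, `content`) are assumptions, not the project's actual schema.

```java
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;

public class HtmlIndexer {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Assumed core name and crawl output folder; adjust to your own setup.
        try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/latimes").build();
             DirectoryStream<Path> pages =
                     Files.newDirectoryStream(Path.of("/tmp/crawl-data/pages"))) {

            for (Path page : pages) {
                Metadata metadata = new Metadata();
                String text;
                try (InputStream in = Files.newInputStream(page)) {
                    // Tika detects the document type and, for HTML, strips the markup.
                    text = tika.parseToString(in, metadata);
                }

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", page.getFileName().toString());
                doc.addField("title", metadata.get(TikaCoreProperties.TITLE));
                doc.addField("content", text);
                solr.add(doc);
            }
            solr.commit();
        }
    }
}
```

Since Tika also ships inside Solr, an alternative is to post the raw files to Solr's extracting request handler (Solr Cell) and let the server do the extraction.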
Solr uses Lucene, its underlying search library, to rank and return relevant results for a given query. We can also configure Solr to rank results differently. In this example, I additionally configure Solr to rank results using the PageRank algorithm instead of the default Lucene scoring. Using the downloaded HTML pages, we build a directed link graph, compute the PageRank of every downloaded page with the networkx library, and feed this information into Solr.
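The post computes PageRank with Python's networkx; to keep the examples here in one language, the following is only a minimal power-iteration sketch of the same computation in Java over a toy link graph. The graph, damping factor, and iteration count are illustrative; the resulting scores would then be fed into Solr, for instance as a per-document field used for ranking.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PageRankSketch {
    public static void main(String[] args) {
        // Toy directed link graph: page -> pages it links to (illustrative only).
        Map<String, List<String>> links = Map.of(
                "a.html", List.of("b.html", "c.html"),
                "b.html", List.of("c.html"),
                "c.html", List.of("a.html"),
                "d.html", List.of("c.html"));

        double d = 0.85;  // damping factor
        int n = links.size();
        Map<String, Double> rank = new HashMap<>();
        links.keySet().forEach(p -> rank.put(p, 1.0 / n));

        for (int iter = 0; iter < 50; iter++) {
            Map<String, Double> next = new HashMap<>();
            links.keySet().forEach(p -> next.put(p, (1 - d) / n));
            for (var e : links.entrySet()) {
                // Each page passes a damped share of its rank to the pages it links to.
                double share = d * rank.get(e.getKey()) / e.getValue().size();
                for (String target : e.getValue()) {
                    next.merge(target, share, Double::sum);
                }
            }
            rank.clear();
            rank.putAll(next);
        }

        // These scores would be stored alongside each document in Solr.
        rank.forEach((page, score) -> System.out.printf("%s %.4f%n", page, score));
    }
}
```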
I then created a simple PHP page that takes a query and the desired ranking algorithm as input and displays the search results to the user in a table.
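The front end itself is PHP, but the underlying Solr query it issues can be sketched in the same Java style used above. The `pageRankScore` field name is a hypothetical stand-in for however the PageRank values are stored in the index.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class SearchClient {
    public static void main(String[] args) throws Exception {
        String userQuery = "wildfire";   // text typed by the user
        boolean usePageRank = true;      // ranking algorithm chosen by the user

        // Assumed core name; adjust to your own setup.
        try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/latimes").build()) {
            SolrQuery query = new SolrQuery(userQuery);
            query.setRows(10);
            if (usePageRank) {
                // Hypothetical float field holding the precomputed PageRank score.
                query.setSort("pageRankScore", SolrQuery.ORDER.desc);
            }   // otherwise Solr uses its default Lucene relevance ranking

            for (SolrDocument doc : solr.query(query).getResults()) {
                System.out.println(doc.getFieldValue("title") + "  " + doc.getFieldValue("id"));
            }
        }
    }
}
```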
- Enhancing Search Engine:
AutoSuggest - There are various auto-suggest and spell-correction algorithms available online. I used the spell-correction program developed by Peter Norvig, available at Spell Corrector. Using the Apache Tika parser, I generated a big.txt corpus from our initial training data, which the corrector uses to score candidate corrections.
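Norvig's corrector is written in Python; purely to illustrate the idea in the same language as the other examples, here is a simplified Java sketch that considers only candidates at edit distance 1 and picks the one seen most often in a big.txt-style corpus. The corpus path is a placeholder, and the full approach (edit distance 2, probability model) follows Norvig's original.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpellCorrector {
    private final Map<String, Integer> counts = new HashMap<>();

    public SpellCorrector(Path corpus) throws Exception {
        // Count word frequencies in the training corpus (e.g. big.txt built with Tika).
        Matcher m = Pattern.compile("[a-z]+").matcher(Files.readString(corpus).toLowerCase());
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
    }

    /** All strings one edit (delete, transpose, replace, insert) away from word. */
    private Set<String> edits1(String word) {
        Set<String> edits = new HashSet<>();
        String letters = "abcdefghijklmnopqrstuvwxyz";
        for (int i = 0; i <= word.length(); i++) {
            String left = word.substring(0, i), right = word.substring(i);
            if (!right.isEmpty()) edits.add(left + right.substring(1));                    // delete
            if (right.length() > 1)
                edits.add(left + right.charAt(1) + right.charAt(0) + right.substring(2));  // transpose
            for (char c : letters.toCharArray()) {
                if (!right.isEmpty()) edits.add(left + c + right.substring(1));            // replace
                edits.add(left + c + right);                                               // insert
            }
        }
        return edits;
    }

    /** Return the known candidate with the highest corpus frequency, or the word itself. */
    public String correct(String word) {
        if (counts.containsKey(word)) return word;
        return edits1(word).stream()
                .filter(counts::containsKey)
                .max(Comparator.comparingInt(counts::get))
                .orElse(word);
    }

    public static void main(String[] args) throws Exception {
        SpellCorrector sc = new SpellCorrector(Path.of("big.txt"));  // placeholder path
        System.out.println(sc.correct("speling"));
    }
}
```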
AutoComplete - There are several ways to implement autocomplete with Solr. One option is the FuzzyLookupFactory suggester from Solr/Lucene, which creates suggestions even for slightly misspelled input. It assumes that what you send as the suggest.q parameter is the beginning of the suggestion, and it matches indexed terms that start with the provided characters. So, if the query is "ca" it will return words starting with "ca", e.g. "california" and "carolina".
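Wiring this up requires a suggest search component and request handler in solrconfig.xml; assuming such a handler is exposed at /suggest with a dictionary named, say, mySuggester (both names are assumptions about the configuration), a prefix query could be issued from SolrJ roughly like this.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AutoCompleteClient {
    public static void main(String[] args) throws Exception {
        // Assumed core name and suggest handler; both depend on your solrconfig.xml.
        try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/latimes").build()) {
            SolrQuery query = new SolrQuery();
            query.setRequestHandler("/suggest");
            query.set("suggest", "true");
            query.set("suggest.dictionary", "mySuggester");  // hypothetical dictionary name
            query.set("suggest.q", "ca");                    // prefix typed by the user

            QueryResponse response = solr.query(query);
            // The raw response contains the suggester's ranked completions,
            // e.g. "california", "carolina", ...
            System.out.println(response.getResponse().get("suggest"));
        }
    }
}
```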
Finally, I updated my initial PHP page to include these enhanced features. Once this is done, you have a simple web search engine, as shown in the video below.
Complete source code for this project can be found at Web Search Engine.