Dhruva Shaw

Google Like Search Engine

Konohagakure Search

https://github.com/Sainya-Ranakshetram-Submission/search-engine

Minato Namikaze Konohagakure Yondaime Hokage

We were asked to do the following:

Develop an efficient search engine with the following features:

  • It should have distributed crawlers to crawl private/air-gapped networks (data sources in these networks might include websites, files, and databases) and must work behind sections of networks secured by firewalls.
  • It should use AI/ML/NLP/BDA for better search (queries and results).
  • It should abide by secure coding practices and SANS Top 25 web vulnerability mitigation techniques.
  • Feel free to improvise your solution and be creative with your approach.

Goal

Have a search engine which takes a keyword/expression as input and crawls the web (internal network or internet) to get all the relevant information. The application shouldn't have any vulnerabilities; make sure it complies with the OWASP Top 10.

Outcome

Write code which will scrape data, match it with the query, and give out relevant/related information.

Note

  • Make search as robust as possible (e.g., it can correct misspelt queries, suggest similar search terms, etc.); be creative in your approach.
  • Results obtained from the search engine should display all the relevant matches for the search query/keyword, along with the time taken by the search engine to fetch those results.
  • There is no constraint on programming language.

To submit

  • A README with steps to install and run the application
  • The entire code repo
  • Your solution/model implemented in Docker only
  • A video of the working search engine
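The requirement to show the time taken alongside the results can be sketched as a small wrapper around any search function. This is only an illustration: `timed_search` and `toy_search` are hypothetical names, and the in-memory `pages` dictionary stands in for the real index, which is not shown in this post.

```python
import time


def timed_search(search_fn, query):
    """Run search_fn(query) and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = search_fn(query)
    elapsed = time.perf_counter() - start
    return results, elapsed


def toy_search(query):
    """Hypothetical stand-in for the real search backend."""
    pages = {
        "https://example.com/a": "distributed crawlers for private networks",
        "https://example.com/b": "spelling correction for search queries",
    }
    terms = query.lower().split()
    # Keep only pages whose text contains every query term.
    return [url for url, text in pages.items() if all(t in text for t in terms)]


if __name__ == "__main__":
    results, elapsed = timed_search(toy_search, "search queries")
    print(f"{len(results)} result(s) in {elapsed:.6f} seconds")
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock, which makes it a reasonable choice for measuring short durations.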

Features

  • Spelling correction suggestions
  • Auto-suggestions
  • 3 different types of crawler
  • Distributed crawlers
  • A site submission form
  • Blazingly fast

And so on...
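To give a feel for the first two features, here is a minimal sketch of spelling correction and prefix-based suggestions using only the Python standard library. It is not the project's actual implementation (which relies on spaCy and NLTK data, as the install steps show); the `VOCABULARY` list and the function names are assumptions made for illustration.

```python
import difflib

# Hypothetical vocabulary; a real deployment might build this from indexed
# pages or from a word list such as the NLTK "words" corpus.
VOCABULARY = ["crawler", "search", "engine", "docker", "redis", "rabbitmq", "postgres"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.7):
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_query(query):
    """Correct each word of the query independently."""
    return " ".join(correct_word(w) for w in query.lower().split())


def suggest(prefix, vocabulary=VOCABULARY, limit=5):
    """Return up to `limit` vocabulary words that start with the prefix."""
    prefix = prefix.lower()
    return sorted(w for w in vocabulary if w.startswith(prefix))[:limit]


if __name__ == "__main__":
    print(correct_query("serch engin"))
    print(suggest("r"))
```

The `cutoff` value controls how aggressive the correction is: a higher value leaves more unknown words untouched, a lower one corrects more but risks wrong replacements.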

Vulnerabilities the application addresses

It addresses the following SANS Top 25 Most Dangerous Software Errors and OWASP Top 10 vulnerabilities:

  1. Injection
  2. Broken Authentication
  3. Sensitive Data Exposure
  4. XML External Entities
  5. Broken Access Control
  6. Security Misconfiguration
  7. Cross-Site Scripting
  8. Insecure Deserialization
  9. Using Components with Known Vulnerabilities
  10. Insufficient Logging and Monitoring
  11. Improper Restriction of Operations within the Bounds of a Memory Buffer
  12. Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting')
  13. Improper Input Validation
  14. Information Exposure
  15. Out-of-bounds Read
  16. Improper Neutralization of Special Elements used in an SQL Command ('SQL Injection')
  17. Use After Free
  18. Integer Overflow or Wraparound
  19. Cross-Site Request Forgery (CSRF)
  20. Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal')
  21. Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection')
  22. Out-of-bounds Write
  23. Improper Authentication
  24. NULL Pointer Dereference
  25. Incorrect Permission Assignment for Critical Resource
  26. Unrestricted Upload of File with Dangerous Type
  27. Improper Restriction of XML External Entity Reference
  28. Improper Control of Generation of Code ('Code Injection')
  29. Use of Hard-coded Credentials
  30. Uncontrolled Resource Consumption
  31. Missing Release of Resource after Effective Lifetime
  32. Untrusted Search Path
  33. Deserialization of Untrusted Data
  34. Improper Privilege Management
  35. Improper Certificate Validation
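Two of the items above, SQL injection and cross-site scripting, are easy to illustrate. The sketch below uses the standard library's `sqlite3` and `html` modules rather than the project's own stack (Django's ORM and templates handle parameterization and escaping by default); the `pages` table and function names are hypothetical.

```python
import html
import sqlite3


def make_demo_db():
    """Create a throwaway in-memory database with one hypothetical page."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (url TEXT, title TEXT)")
    conn.execute(
        "INSERT INTO pages VALUES (?, ?)",
        ("https://example.com", "Hello <b>world</b>"),
    )
    return conn


def find_pages(conn, term):
    """Look up pages by title with a parameterized query.

    The search term is passed as a bound parameter, never formatted into
    the SQL string, so it cannot change the structure of the statement.
    """
    cur = conn.execute(
        "SELECT url, title FROM pages WHERE title LIKE ?",
        (f"%{term}%",),
    )
    return cur.fetchall()


def render_result(url, title):
    """Escape untrusted text before placing it into HTML."""
    return f'<a href="{html.escape(url, quote=True)}">{html.escape(title)}</a>'


if __name__ == "__main__":
    conn = make_demo_db()
    print(find_pages(conn, "' OR 1=1 --"))  # treated as plain text, not SQL
    for url, title in find_pages(conn, "Hello"):
        print(render_result(url, title))
```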

Building Docker Image

Just run

docker build .

Also check this out
If you wish, you can tag the image as needed.

After building the image, run the Docker image.

Hosting Guide (without Docker)

To run Konohagakure Search you need Python 3.9, the latest version of Go, PostgreSQL, RabbitMQ, and Redis.

See their installation instructions and install them properly.

After installing the software mentioned above, open a terminal and run the following commands:

1. Clone the repository

Clone the repository using git

git clone https://github.com/Sainya-Ranakshetram-Submission/search-engine.git

2. Install the virtual environment

pip install --upgrade virtualenv
cd search-engine
virtualenv env
# Windows:
env\Scripts\activate
# Linux/macOS:
# source env/bin/activate

3. Install the dependencies

pip install --upgrade -r requirements.min.txt
python -m spacy download en_core_web_md
python -m nltk.downloader stopwords
python -m nltk.downloader words
go install -v github.com/projectdiscovery/subfinder/v2/cmd/subfinder@latest

4. Set up the environment variables

Rename example.env to .env and set the environment variables to suit your setup.

5. Create a database

Now open pgAdmin and create a database named search_engine. After creating the database, update the DATABASE_URL value accordingly in the .env file.
Note: please read this also.
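If you are unsure what a DATABASE_URL value contains, the sketch below splits one into its parts. The URL format shown (`postgres://user:password@host:port/dbname`) is the common convention and is an assumption here, as are the user name and password; check example.env in the repository for the exact form the project expects.

```python
from urllib.parse import unquote, urlsplit


def parse_database_url(url):
    """Split a database URL into its components.

    Assumes the conventional scheme://user:password@host:port/name layout.
    """
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),
        "host": parts.hostname or "",
        "port": parts.port or 5432,  # 5432 is the default PostgreSQL port
        "name": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    # Hypothetical credentials, for illustration only.
    example = "postgres://konoha:s3cret@localhost:5432/search_engine"
    for key, value in parse_database_url(example).items():
        print(f"{key}: {value}")
```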

6. Start RabbitMQ and Redis instances

Read their docs on how to start them: Redis, RabbitMQ.
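Once both services are started, a quick way to confirm they are reachable is to try opening a TCP connection to each. The sketch below assumes both run on localhost with their default ports (6379 for Redis, 5672 for RabbitMQ); adjust `SERVICES` if your setup differs. It only checks that something is listening, not that the service is healthy.

```python
import socket

# Default ports, assuming both services run locally.
SERVICES = {"redis": ("localhost", 6379), "rabbitmq": ("localhost", 5672)}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=SERVICES):
    """Map each service name to whether its port is reachable."""
    return {name: is_port_open(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```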

7. Migrate the data

python manage.py migrate

To load the default dataset of about 10 lakh (1,000,000) websites for the crawler to crawl, run:

python manage.py migrate_default_to_be_crawl_data

I have also provided some crawled datasets for reference; you can find them in data_backup.

8. Compress the static files

Now run the following command in the console:

python manage.py collectcompress

9. Create a superuser for the site

python manage.py createsuperuser

It asks for some required information; once you provide it, a superuser is created for the site.

10. Running the celery worker and beat

Now run this command in the terminal

python manage.py add_celery_tasks_in_panel

Now open two terminals and run one of these commands in each:

celery -A search_engine worker --loglevel=INFO

celery -A search_engine beat -l INFO --scheduler django_celery_beat.schedulers:DatabaseScheduler

11. Run the application

Before running the application, make sure that Redis is up and running :)

  • For Windows, macOS, and Linux

Without IP address bound

    uvicorn search_engine.asgi:application --reload --lifespan off

IP address bound

     uvicorn search_engine.asgi:application --reload --lifespan off --host 0.0.0.0

If you are on Linux, you can also run this command instead of the one above:

    gunicorn search_engine.asgi:application -k search_engine.workers.DynamicUvicornWorker --timeout 500

Python custom commands reference

  • add_celery_tasks_in_panel : Adds the Celery tasks to the Django panel
  • crawl_already_crawled : Re-crawls sites that are already crawled and stored in the database
  • crawl_to_be_crawled : Crawls newly entered sites in the database (the sites that still need to be crawled)
  • migrate_default_to_be_crawl_data : Loads the base data of websites that need to be crawled, about 10 lakh (1,000,000) sites

Distributed Crawlers

For the distributed web crawlers, refer to the Scrapy documentation.
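One common way to distribute a crawl is to partition the list of sites so that each crawler instance gets its own share. The sketch below shows that idea only; it is an assumption, not a description of how this project splits work. `worker_for` and `partition` are hypothetical helpers, and hashing by hostname is just one possible partitioning rule.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def worker_for(url, num_workers):
    """Pick a worker index for a URL based on a stable hash of its hostname.

    Hashing the hostname keeps all pages of one site on the same worker,
    which makes per-site politeness (rate limiting) easier to enforce.
    """
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(urls, num_workers):
    """Group URLs into buckets keyed by worker index."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[worker_for(url, num_workers)].append(url)
    return dict(buckets)


if __name__ == "__main__":
    urls = [
        "https://example.com/",
        "https://example.com/about",
        "https://example.org/",
        "https://example.net/docs",
    ]
    for worker, assigned in sorted(partition(urls, 3).items()):
        print(worker, assigned)
```

`hashlib` is used instead of the built-in `hash()` because string hashes in Python are randomized per process, so different machines would otherwise disagree about which worker owns a site.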

Running crawler manually from command line

There are 3 different ways to do this:

1. crawl_already_crawled

This is a custom Django management command. It re-crawls the sites that are already crawled and stored, then updates them.

python manage.py crawl_already_crawled

2. crawl_to_be_crawled

This is a custom Django management command. It crawls the sites that were entered either with the migrate_default_to_be_crawl_data custom command or through the submit_site/ endpoint.

python manage.py crawl_to_be_crawled
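The split between "to be crawled" and "already crawled" sites can be pictured as a queue plus a store. The sketch below is a simplified, standard-library-only illustration of that idea, not the project's code (which uses Scrapy and a PostgreSQL database); `Frontier`, `LinkExtractor`, and the `fetch` callback are hypothetical names.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


class Frontier:
    """Track sites waiting to be crawled and sites already crawled."""

    def __init__(self, seeds=()):
        self.to_be_crawled = deque(seeds)
        self.crawled = {}  # url -> list of outgoing links

    def crawl_next(self, fetch):
        """Crawl one queued URL using fetch(url) -> HTML; return it, or None if idle."""
        if not self.to_be_crawled:
            return None
        url = self.to_be_crawled.popleft()
        parser = LinkExtractor(url)
        parser.feed(fetch(url))
        self.crawled[url] = parser.links
        # Queue newly discovered links that have not been seen yet.
        for link in parser.links:
            if link not in self.crawled and link not in self.to_be_crawled:
                self.to_be_crawled.append(link)
        return url
```

Here `fetch` is any function that returns the HTML for a URL, so the same class can be exercised with canned pages instead of real network requests.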

3. Scrapy Command Line Crawler

This is a Scrapy project that crawls a site from the command line.
In the command below, replace example.com with the site you want to crawl (without http or https).

scrapy crawl konohagakure_to_be_crawled_command_line -a allowed_domains=example.com
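Since the argument must be a bare domain, it can help to normalize whatever the user typed before passing it on. The sketch below strips the scheme, port, and path, and also shows the usual meaning of an allowed-domains filter: the domain itself and its subdomains pass, everything else is rejected. Both helpers are hypothetical and written with the standard library only; they mimic the general behavior of Scrapy's allowed_domains but are not taken from Scrapy or from this project.

```python
from urllib.parse import urlsplit


def bare_domain(value):
    """Reduce input like 'https://example.com/x' to 'example.com'."""
    value = value.strip()
    if "://" not in value:
        # Prefix with '//' so urlsplit treats the input as a network location.
        value = "//" + value
    return (urlsplit(value).hostname or "").lower()


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


if __name__ == "__main__":
    domain = bare_domain("https://Example.com/some/page")
    print(domain)
    print(is_allowed("https://blog.example.com/post", [domain]))
    print(is_allowed("https://notexample.com/", [domain]))
```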

Youtube Video Explaining all

Github Repo

https://github.com/Sainya-Ranakshetram-Submission/search-engine