
Ali Sherief

Thoughts about webpage indexing and crawling

If you've ever built a website before, you probably know that before anyone can find it, it has to be indexed by Google. The question is: how long does it take Google to index the pages on your site?

This article is a brain dump of all the thoughts I had about how crawling and the index work while I was finding the answer to that question. Since Google Search (excluding Images and YouTube) is used far more than Bing, with Google holding 65% of search volume in 2018 versus Bing's 1.5% according to this, I'm going to write exclusively about Google's indexer here.

This also isn't an SEO guide; I don't cover what kinds of keywords you can use to boost your PageRank here.

Google's indexer

Basic idea

Google has a piece of crawler software called Googlebot, which crawls websites at a rate that depends on their popularity. A site's popularity is determined from its traffic and the number of unique visits over some time frame. Googlebot reads a special file on the website called robots.txt, which points to information Google needs to index the site, such as sitemaps. It also lists URLs on the website that should not be indexed.
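
To make that concrete, here is a minimal robots.txt sketch (the disallowed paths are hypothetical) showing both roles: pointing crawlers at the sitemap and listing URLs they should stay away from:

```
# Hypothetical robots.txt illustrating both roles described above
User-agent: *        # applies to every crawler, including Googlebot
Disallow: /drafts/   # don't crawl anything under /drafts/
Disallow: /admin/

Sitemap: https://zenulabidin.github.io/sitemap.xml
```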

A sitemap is an XML file containing entries of URLs and their last-modified dates. Googlebot loads each URL and lets the JavaScript on the page run, i.e. it doesn't just parse raw HTML files. It then reads the resulting DOM and sends it to Google's indexer for indexing.
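
As a hypothetical example (the page URL and date are made up), a sitemap entry looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal sitemap sketch: one <url> entry per page, with its last-modified date -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://zenulabidin.github.io/posts/example-post.html</loc>
    <lastmod>2020-11-15</lastmod>
  </url>
</urlset>
```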

Does this work for single-page apps (SPAs)? Yes: Google will crawl and index the URL exactly as it's given, so you can pass in URLs containing hash characters.

The thing is, if you've made a brand-new site, you don't know how long it will take for Googlebot to drive by and crawl it. You have a few options, bearing in mind that duplicate requests for indexing won't speed up indexing at all:

  • Go to Google Search Console and upload your sitemap to it, after you verify your ownership of the site, of course.
  • Ensure your robots.txt file has an entry like Sitemap: https://zenulabidin.github.io/sitemap.xml. This only causes Googlebot to notice the sitemap file when it eventually comes along.
  • Send a ping to Google telling it to retrieve your sitemap, by navigating to a URL like http://www.google.com/ping?sitemap=https://zenulabidin.github.io/sitemap.xml, where the sitemap parameter is your sitemap's location (see the sketch after this list).
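
Assuming Python and the ping URL format shown above, that last step can be automated with a small script like this (a sketch, not an official client):

```python
# Sketch: ask Google to fetch a sitemap via the ping endpoint shown above.
# The sitemap URL is just an example; replace it with your own.
from urllib.parse import quote
from urllib.request import urlopen

def ping_google(sitemap_url: str) -> int:
    # URL-encode the sitemap location and pass it as the `sitemap` parameter
    ping_url = "http://www.google.com/ping?sitemap=" + quote(sitemap_url, safe="")
    with urlopen(ping_url) as response:
        return response.status  # 200 means Google received the ping

if __name__ == "__main__":
    print(ping_google("https://zenulabidin.github.io/sitemap.xml"))
```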

Some other info about the crawler

Googlebot is smart enough to know when a page is trying to manipulate its ranking by containing spam links and content. These pages and sites often get blacklisted and removed from the index, which means they won't appear in Google searches at all. Spam user comments on forum and blog sites may also trigger Google's ban hammer on the site. The appeal process with Google takes a week or two and isn't fun to sit through after you've gone through the trouble of getting your site indexed.

There is a limit on how frequently Googlebot will crawl a site. As explained on the official Google Webmaster Central Blog:

Simply put, this represents the number of simultaneous parallel connections Googlebot may use to crawl the site, as well as the time it has to wait between the fetches. The crawl rate can go up and down based on a couple of factors:

  • Crawl health: if the site responds really quickly for a while, the limit goes up, meaning more connections can be used to crawl. If the site slows down or responds with server errors, the limit goes down and Googlebot crawls less.

  • Limit set in Search Console: website owners can reduce Googlebot's crawling of their site. Note that setting higher limits doesn't automatically increase crawling.

And it's important to note that crawling uses your server's bandwidth, unless you are hosting on a static site publisher like GitHub Pages or Netlify. So leaving a lot of site pages to be crawled unnecessarily will put strain on your servers. This is where robots.txt exclusions come in handy.
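
If you want to check how a well-behaved crawler would interpret those exclusions, Python's standard library includes a robots.txt parser; here's a small sketch (the site and paths are just examples):

```python
# Sketch: check which URLs a robots.txt-honouring crawler may fetch.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://zenulabidin.github.io/robots.txt")
rp.read()  # download and parse the robots.txt file

for path in ("/", "/drafts/notes.html"):
    url = "https://zenulabidin.github.io" + path
    verdict = "allowed" if rp.can_fetch("Googlebot", url) else "excluded"
    print(url, "->", verdict)
```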

Why your pages aren't in search results

If you just created the pages in question, then wait until Googlebot crawls the website again.

Depending on the Google search queries you're using, the lack of listings that you think is caused by unindexed pages may actually be caused by the site having little or no traffic. If you want to secure a spot in the search results for that query, you need to market and spread your site in places like social media where people will see it.

I'm not sure exactly how long it takes for a page to get indexed, but assuming your pages are already indexed and you modify them once in a while, they should be crawled by Googlebot and subsequently re-indexed by the next day.

Open questions

  • How does Googlebot know how many people visited a site?
  • If Googlebot is not the one that collects this information, then does it come from the search queries made against the index?

Image by Robert Balog from Pixabay
