Our ability to search online and get relevant results is quite a technical achievement, especially at the scale that search engines need to work at. They need to build large indexes of websites and content so they can process our queries and bring us to the content we are after.
I previously talked about the "Robots.txt" file and the relationship between web crawlers and website operators. That is one piece of the puzzle, helping crawlers and operators communicate about what shouldn't be indexed on a site.
Another piece of the puzzle is the sitemap: a file that tells web crawlers and search engines about all the pages on your site, when they were last updated and how frequently they change. This is information a crawler can't reliably discover just by following links in content.
A Brief History
Perhaps not so surprisingly, we have Google to thank for starting the concept of sitemap files back in mid-2005.
In November 2006, Yahoo and Microsoft joined Google in support of the standard with the schema "Sitemap 0.9".
Not long after that, they jointly announced support for a non-standard directive in "Robots.txt" files, allowing a site to point to where its sitemaps are located.
For example, here is dev.to's robots file pointing to the location of the sitemap:
# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
# User-agent: *
# Disallow: /
Sitemap: https://thepracticaldev.s3.amazonaws.com/sitemaps/sitemap.xml.gz
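As a rough sketch of how a crawler might pick this directive up, here's a small Python example that scans a robots.txt body for "Sitemap:" lines. The content is the dev.to example above embedded as a string rather than fetched live; real crawlers do considerably more validation than this.

```python
# A copy of the dev.to robots.txt example, embedded as a string.
robots_txt = """\
# To ban all spiders from the entire site uncomment the next two lines:
# User-agent: *
# Disallow: /
Sitemap: https://thepracticaldev.s3.amazonaws.com/sitemaps/sitemap.xml.gz
"""

def find_sitemaps(robots_body: str) -> list[str]:
    """Return the URLs declared via the non-standard Sitemap directive."""
    sitemaps = []
    for line in robots_body.splitlines():
        # The directive name is treated as case-insensitive by major crawlers.
        if line.lower().startswith("sitemap:"):
            # maxsplit=1 keeps the colons inside the URL intact.
            sitemaps.append(line.split(":", 1)[1].strip())
    return sitemaps

print(find_sitemaps(robots_txt))
```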
The Format
There are 3 flavours of sitemaps: XML, TXT and RSS.
XML Sitemaps
These are most likely the only form of sitemap you will ever actually work with, and XML is the core format defined in the specification. That said, not all XML sitemaps are the same: there are two different types.
Normal Sitemap File
You have a number of <url> tags which must each contain a <loc> tag and can optionally include <lastmod>, <changefreq> and <priority> tags.
The <loc> tag is simply the absolute URL of the page on your site.
The <lastmod> tag indicates the "freshness" of a page. While a crawler might prioritise based on this value, I wouldn't recommend constantly bumping it to the current date to try and game the system.
The <changefreq> tag is only a hint for crawlers; don't expect that setting it to "hourly" will make web crawlers instantly crawl your site more often.
The <priority> tag doesn't define how important that page is compared to other websites, but how important it is relative to your site's other pages for the crawler to even crawl it. This has a default value of "0.5" when not set.
Example XML Sitemap from the specification:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2005-01-02</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
<changefreq>weekly</changefreq>
</url>
<url>
<loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
<lastmod>2004-12-23</lastmod>
<changefreq>weekly</changefreq>
</url>
<url>
<loc>http://www.example.com/catalog?item=74&amp;desc=vacation_newfoundland</loc>
<lastmod>2004-12-23T18:00:15+00:00</lastmod>
<priority>0.3</priority>
</url>
<url>
<loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
<lastmod>2004-11-23</lastmod>
</url>
</urlset>
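To give a feel for how a consumer reads these tags, here's a minimal Python sketch that parses a sitemap with the standard library and fills in the "0.5" default for <priority>. The XML is a trimmed, illustrative version of the spec example above (no XML declaration, so it can be parsed directly from a string; the second <url> deliberately omits <priority>).

```python
import xml.etree.ElementTree as ET

# Trimmed version of the spec example; the second <url> omits <priority>.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-02</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>"""

# Every tag in a sitemap lives in this namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return one dict per <url>, with <priority> defaulting to "0.5"."""
    entries = []
    for url in ET.fromstring(xml_text).findall("sm:url", NS):
        entry = {}
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            element = url.find(f"sm:{tag}", NS)
            if element is not None:
                entry[tag] = element.text
        # Apply the spec's default when <priority> is absent.
        entry.setdefault("priority", "0.5")
        entries.append(entry)
    return entries

for entry in parse_sitemap(SITEMAP_XML):
    print(entry)
```

Note that the XML parser unescapes `&amp;` back to `&`, so the extracted URLs come out exactly as a crawler would fetch them.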
Sitemap Index File
According to the standard, a normal sitemap file is limited to 50,000 URLs and a maximum size of 50MB. While I don't know how strictly search engines enforce those limits, they did give rise to the sitemap index file.
These files look a lot like a normal sitemap file but just point to other sitemaps. You have a number of <sitemap> tags which contain a required <loc> tag and an optional <lastmod> tag.
Example Index sitemap from the specification:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.example.com/sitemap1.xml.gz</loc>
<lastmod>2004-10-01T18:23:17+00:00</lastmod>
</sitemap>
<sitemap>
<loc>http://www.example.com/sitemap2.xml.gz</loc>
<lastmod>2004-01-01</lastmod>
</sitemap>
</sitemapindex>
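A crawler can tell the two XML types apart by the root tag: <urlset> holds page URLs while <sitemapindex> points at other sitemap files. Here's a small Python sketch of that check, using a trimmed version of the spec's index example:

```python
import xml.etree.ElementTree as ET

# ElementTree's fully-qualified tag prefix for the sitemap namespace.
SM_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Trimmed version of the spec's index example.
SITEMAP_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2004-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
  </sitemap>
</sitemapindex>"""

def child_sitemaps(xml_text):
    """Return the <loc> of each child sitemap, or [] for a normal sitemap."""
    root = ET.fromstring(xml_text)
    # Only index files have <sitemapindex> as their root element.
    if root.tag != f"{SM_NS}sitemapindex":
        return []
    return [loc.text for loc in root.iter(f"{SM_NS}loc")]

print(child_sitemaps(SITEMAP_INDEX))
```

A real crawler would then fetch (and, for `.gz` locations, decompress) each child sitemap and parse it as a normal sitemap file.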
TXT Sitemaps
This type of sitemap drops a lot of the functionality you would find in an XML sitemap, like the last modified date or how frequently a page is updated.
The format is just each URL you want indexed on its own line, with no other data at all.
http://www.example.com/
http://www.example.com/catalog?item=12&desc=vacation_hawaii
http://www.example.com/catalog?item=73&desc=vacation_new_zealand
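Parsing this format is about as simple as it gets; a Python sketch, using the example list above:

```python
# The TXT sitemap example from above, embedded as a string.
TXT_SITEMAP = """http://www.example.com/
http://www.example.com/catalog?item=12&desc=vacation_hawaii
http://www.example.com/catalog?item=73&desc=vacation_new_zealand
"""

def parse_txt_sitemap(body):
    # Each non-empty line is an absolute URL; there is nothing else to parse.
    return [line.strip() for line in body.splitlines() if line.strip()]

print(parse_txt_sitemap(TXT_SITEMAP))
```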
RSS Sitemaps
While not as limited as TXT sitemaps, RSS sitemaps have their own issues, like only providing information on recent URLs.
You would use the <link> tag to define the URL you want indexed and the <pubDate> tag to define when it was last modified.
The future of Sitemaps
The main specification for sitemaps hasn't changed but there are some additional types of sitemaps being developed like video sitemaps, image sitemaps and special Google News sitemaps.
Google have also announced support for multilingual sitemaps, where one can define the language of each URL.
The support around these additional sitemap types isn't as widespread as the main XML sitemap though that may change in the future.
I wrote a thing...
My last few articles have really come about because of my work building libraries and tools that solve problems for me, and this one is no exception.
TurnerSoftware / SitemapTools
A sitemap (sitemap.xml) querying and parsing library for .NET
Key features
- Parses both XML sitemaps and sitemap index files
- Handles GZ-compressed XML sitemaps
- Supports TXT sitemaps
Licensing and Support
Sitemap Tools is licensed under the MIT license. It is free to use in personal and commercial projects.
There are support plans available that cover all active Turner Software OSS projects. Support plans provide private email support, expert usage advice for our projects, priority bug fixes and more. These support plans help fund our OSS commitments to provide better software for everyone.
Notes
- Does not enforce sitemap standards as described at sitemaps.org
- Does not validate the sitemaps
- Does not support RSS sitemaps
Example
using TurnerSoftware.SitemapTools;
var sitemapQuery = new SitemapQuery();
var sitemapEntries = await sitemapQuery.GetAllSitemapsForDomainAsync("example.org");
I had a need to actually parse sitemap files for a project I am working on and struggled to find any existing .NET library to do so. The latest version of my library builds upon my own "Robots.txt" parsing library (for sitemap file discovery) and supports XML sitemaps (both normal and index files) as well as TXT sitemaps.
This library and my "Robots.txt" parsing library actually build toward a third library which I will be writing an article about in the future.
More Information
- sitemaps.org: The official site for the format
- "Sitemaps" on Wikipedia: Covers additional details about sitemaps and any extended functionality.
Top comments (2)
Hi James Turner,
I need to crawl web pages using your repo. I am using .NET Core 3.0 and getting an error, unable to get the sitemaps of the site:
{System.Linq.EmptyPartition}
It's probably best to raise a new issue on GitHub regarding any issues you encounter. When you do that, providing more information, like whether an exception was thrown, what methods you called and any details about the sitemap file itself, will go a long way toward helping me diagnose and fix the issue.