Our ability to search online and get relevant results is quite a technical achievement, especially at the scale that search engines need to work at. They need to build large indexes of websites and content so they can process our queries and bring us to the content we are after.
I previously talked about the "Robots.txt" file and the relationship between web crawlers and the website operator. That is one piece of the puzzle: it helps the web crawler and the website operator communicate about what shouldn't be indexed on a site.
Another piece of the puzzle is sitemaps: files that help you tell web crawlers and search engines about all the pages on your site, when they were last updated and how frequently they change. This is information a crawler couldn't gather just by following links in content.
Maybe not so surprisingly, we have Google to thank for starting the concept of sitemap files back in mid-2005.
In November 2006, Yahoo and Microsoft joined Google in support of the standard with the schema "Sitemap 0.9".
Not long after that, they jointly announced support for a non-standard feature in "Robots.txt" files, allowing them to point to where the sitemaps for a website can be located.
For example, here is dev.to's robots file pointing to the location of the sitemap:
# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
# User-agent: *
# Disallow: /
Sitemap: https://thepracticaldev.s3.amazonaws.com/sitemaps/sitemap.xml.gz
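As a minimal sketch of how that discovery step works, the snippet below scans a robots.txt file (already fetched as text) for "Sitemap:" directives. Only that one directive is handled; a real robots.txt parser does much more.

```python
def find_sitemaps(robots_txt: str) -> list[str]:
    """Return the URLs listed in any "Sitemap:" lines of a robots.txt."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # The directive name is case-insensitive and the value follows
        # the first colon, e.g. "Sitemap: https://example.com/sitemap.xml".
        key, _, value = line.strip().partition(":")
        if key.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    return sitemaps
```

Run over the dev.to robots file above, this would return the single gzipped sitemap URL; the commented-out lines are skipped naturally because their "key" starts with "#".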
There are 3 flavours of sitemaps: XML, TXT and RSS.
XML sitemaps are most likely the only form of sitemap you will ever actually work with, and they are the core format defined in the specification. That said, not all XML sitemaps are the same, as there are two different types.
You have a number of
<url> tags which must each have a
<loc> tag and can optionally have
<lastmod>, <changefreq> and <priority> tags. The
<loc> tag is simply the absolute URL of the page on your site.
The <lastmod> tag helps indicate the "freshness" of a page. While a crawler might prioritise pages based on this value, I wouldn't recommend constantly setting the last modified date to the current date to try and game the system.
The <changefreq> tag is only a guideline for crawlers; don't think setting it to "hourly" will make web crawlers instantly crawl your site more often.
The <priority> tag doesn't define how important a page is compared to other websites; it tells the web crawler how important crawling that page is relative to the rest of your site. It has a default value of "0.5" when not set.
Example XML Sitemap from the specification:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-02</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
      <changefreq>weekly</changefreq>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
      <lastmod>2004-12-23</lastmod>
      <changefreq>weekly</changefreq>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=74&amp;desc=vacation_newfoundland</loc>
      <lastmod>2004-12-23T18:00:15+00:00</lastmod>
      <priority>0.3</priority>
   </url>
   <url>
      <loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
      <lastmod>2004-11-23</lastmod>
   </url>
</urlset>
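To make the structure concrete, here is a hedged sketch of reading those <url> entries with Python's standard library (not the C# library discussed later). The tag names are namespaced under the Sitemap 0.9 schema, which is the main trap when parsing these files.

```python
import xml.etree.ElementTree as ET

# Every tag in a sitemap lives under this namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text: str) -> list[dict]:
    """Return one dict per <url> entry; only <loc> is guaranteed present."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entry = {"loc": url.findtext("sm:loc", namespaces=NS)}
        # <lastmod>, <changefreq> and <priority> are all optional.
        for tag in ("lastmod", "changefreq", "priority"):
            value = url.findtext(f"sm:{tag}", namespaces=NS)
            if value is not None:
                entry[tag] = value
        entries.append(entry)
    return entries
```

Run over the spec example above, this would return five entries, each carrying only the optional tags that were actually present.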
According to the standard, a single sitemap file is limited to 50,000 URLs and a maximum size of 50MB. While I don't necessarily believe those limits are still strictly enforced, they did give rise to the sitemap index file.
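As a quick sketch of why index files exist: once a site has more than 50,000 URLs, it has to split them across several sitemap files and list those files in an index. The chunking itself is trivial (the 50MB size limit is ignored here for simplicity, and the URLs are made up):

```python
def chunk_urls(urls: list[str], limit: int = 50_000) -> list[list[str]]:
    """Split a flat URL list into sitemap-sized chunks."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]

urls = [f"http://www.example.com/page/{n}" for n in range(120_000)]
chunks = chunk_urls(urls)
print(len(chunks))  # 3 sitemap files: 50,000 + 50,000 + 20,000 URLs
```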
These files look a lot like a normal sitemap file but just point to other sitemaps. You have a number of
<sitemap> tags which contain a required
<loc> tag and an optional
<lastmod> tag.
Example Index sitemap from the specification:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2004-01-01</lastmod>
   </sitemap>
</sitemapindex>
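A crawler handling a sitemap URL therefore has to check which of the two types it fetched. One way to sketch that, assuming Python's standard XML module: look at the root tag (<sitemapindex> versus <urlset>) and, for an index, collect the child sitemap locations so they can each be fetched and parsed in turn.

```python
import xml.etree.ElementTree as ET

# ElementTree reports namespaced tags in "{namespace}tag" form.
SM_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def child_sitemaps(xml_text: str) -> list[str]:
    """Return child sitemap URLs for an index file, or [] for a plain sitemap."""
    root = ET.fromstring(xml_text)
    if root.tag != SM_NS + "sitemapindex":
        return []  # an ordinary <urlset> sitemap has no child sitemaps
    return [sm.findtext(SM_NS + "loc") for sm in root.iter(SM_NS + "sitemap")]
```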
This type of sitemap removes a lot of the functionality that you would find in an XML sitemap, like the last modified date or how frequently a page is updated.
This format simply has each URL you want indexed on a new line, with no other data.
http://www.example.com/
http://www.example.com/catalog?item=12&desc=vacation_hawaii
http://www.example.com/catalog?item=73&desc=vacation_new_zealand
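Parsing this flavour is about as simple as it gets; a sketch is little more than splitting lines and skipping any blanks:

```python
def parse_txt_sitemap(text: str) -> list[str]:
    """Return the absolute URLs listed one-per-line in a TXT sitemap."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```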
While not as limited as TXT sitemaps, RSS sitemaps have their own issues, like only providing information on recent URLs.
You would use the
<link> tag to define the URL you want indexed and
<pubDate> to define when it was last modified.
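As a sketch of the idea, a plain RSS 2.0 feed keeps its entries in un-namespaced <item> elements, so pulling out each <link> and <pubDate> pair is straightforward (the feed in the test is made up for illustration):

```python
import xml.etree.ElementTree as ET

def parse_rss_sitemap(xml_text: str) -> list[tuple[str, str]]:
    """Return (link, pubDate) pairs for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("link"), item.findtext("pubDate"))
            for item in root.iter("item")]
```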
Google have also announced support for multilingual sitemaps, where one can define the language of each URL.
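Based on Google's documentation, this works by adding xhtml:link alternate entries under each <url>, one per language variant. A sketch of what such an entry might look like (the URLs and languages here are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <url>
      <loc>http://www.example.com/english/</loc>
      <xhtml:link rel="alternate" hreflang="de" href="http://www.example.com/deutsch/"/>
      <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/english/"/>
   </url>
</urlset>
```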
The support around these additional sitemap types isn't as widespread as the main XML sitemap though that may change in the future.
My last few articles have really come about because of my work building libraries and tools that solve problems for me, and this one is no exception.
A sitemap (sitemap.xml) querying and parsing library in C#
- Parses both XML sitemaps and sitemap index files
- Handles GZ-compressed XML sitemaps
- Supports TXT sitemaps
- Does not enforce sitemap standards as described at sitemaps.org
- Does not validate the sitemaps
- Does not support RSS sitemaps
I had a need to actually parse sitemap files for a project I am working on and struggled to find any existing .NET library to do so. The latest version of my library builds upon my own "Robots.txt" parsing library (for sitemap file discovery) and supports XML sitemaps (both normal and index files) as well as TXT sitemaps.
This library and my "Robots.txt" parsing library actually build toward a third library which I will be writing an article about in the future.