Robots Exclusion Tools

A "robots.txt" parsing and querying library in C#, closely following the NoRobots RFC and other details on robotstxt.org.

Features

Load Robots by string, by URI (Async) or by streams (Async)
Supports multiple user-agents and "*"
Supports Allow and Disallow
Supports Crawl-delay entries
Supports Sitemap entries
Supports wildcard paths (*) as well as must-end-with declarations ($)
Built-in "robots.txt" tokenization system (allowing extension to support other custom fields)
Built-in "robots.txt" validator (for validating a tokenized file)
Dedicated parser for the data from <meta name="robots" /> tag and the X-Robots-Tag header
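
To illustrate the wildcard (*) and must-end-with ($) support listed above, here is a minimal Python sketch of one way such path patterns could be matched. It is not the library's implementation or API: the names `path_pattern_to_regex` and `path_matches` are hypothetical, and the sketch assumes that `*` matches any run of characters, that a trailing `$` anchors the end of the path, and that a pattern without `$` is a prefix match.

```python
import re


def path_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex (hypothetical helper)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then rejoin the pieces with '.*' for the wildcards.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # Without '$' the pattern only has to match the start of the path.
    return re.compile("^" + body + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """Return True when the path is covered by the pattern."""
    return path_pattern_to_regex(pattern).match(path) is not None
```

With these assumptions, `path_matches("/private/*.html$", "/private/a/b.html")` is true, while the same pattern does not match `/private/a.html?x=1` because of the end anchor.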

NoRobots RFC Compatibility

This library attempts to stick closely to the rules defined in the RFC document, including:

Global/any user-agent when none is explicitly defined (Section 3.2.1 of RFC)
Field names (e.g. "User-agent") are character restricted (Section 3.3)
Allow/disallow rules are evaluated in order of occurrence (Section 3.2.2)
Loading by URI applies default rules based on access to "robots.txt"…
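
The order-of-occurrence rule can be sketched as a first-match scan over the rules for a user-agent. The Python snippet below is an illustration only, not the library's code: the `Rule` tuple and `is_allowed` are hypothetical names, matching is by plain path prefix, and skipping rules with an empty path is a simplifying assumption.

```python
from typing import Iterable, Tuple

# (field, path prefix), e.g. ("disallow", "/private/"). Hypothetical shape.
Rule = Tuple[str, str]


def is_allowed(rules: Iterable[Rule], path: str) -> bool:
    """Apply allow/disallow rules in the order they appear; the first match wins."""
    for field, prefix in rules:
        if not prefix:
            # Assumption: an empty path matches nothing in this sketch.
            continue
        if path.startswith(prefix):
            return field.lower() == "allow"
    # No rule matched: the path is treated as allowed.
    return True
```

Because the first match wins, rule order matters: with an Allow for `/private/public/` listed before a Disallow for `/private/`, the path `/private/public/page` is allowed, but with the two rules swapped it is not.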

Top comments (3)

Jamie • Jan 28 '19

How do you feel about the robots HTTP header?

For those who don't know, it's a header which you can include in a page response to tell a web crawler what it's permitted to do with the page. It's not a replacement for robots.txt, and (just like the robots.txt file) the web search companies don't have to support it.

An example of the robots header would be something like:

X-Robots-Tag: noarchive, nosnippet

This instructs a web crawler which finds the page that it is not permitted to archive the page or provide snippets from it (in search results).
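
As a rough sketch of how a crawler might read that header value, the Python snippet below splits it into a set of directives. The function name `parse_robots_directives` is hypothetical, and the sketch assumes a plain comma-separated list; it does not handle a user-agent prefix or directives that carry their own values.

```python
from typing import Set


def parse_robots_directives(header_value: str) -> Set[str]:
    """Split an X-Robots-Tag value into lowercase directive names."""
    return {
        part.strip().lower()
        for part in header_value.split(",")
        if part.strip()  # ignore empty pieces such as a trailing comma
    }


directives = parse_robots_directives("noarchive, nosnippet")
may_archive = "noarchive" not in directives
may_show_snippet = "nosnippet" not in directives
```

For the example header above, both `may_archive` and `may_show_snippet` come out false.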

James Turner • Jan 28 '19

I'm a bit torn by the robots header. On one hand, it allows really fine control on a per-page basis. On the other hand, you have to make a request to the page to find out whether you are allowed to keep the data or not, which feels like a waste of bandwidth.

I mean, you could do a HEAD request to find out but then you might end up with two HTTP requests just to get content in an "allowed" scenario.

That said, I do see value in the header. I'm actually building my own web crawler (which I will do another post about in the future) and I want to add support for the header.
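
The HEAD-then-GET trade-off described above can be sketched as follows. This is a hypothetical Python illustration using only the standard library, not code from the crawler mentioned in the comment: `fetch_if_archivable` and its `opener` parameter are made-up names, only the first X-Robots-Tag header is read, and only the `noarchive` directive is checked.

```python
import urllib.request
from typing import Callable, Optional


def fetch_if_archivable(
    url: str, opener: Callable = urllib.request.urlopen
) -> Optional[bytes]:
    """Return the page body, or None when the header forbids archiving."""
    # First request: HEAD, so only the headers are transferred.
    head = urllib.request.Request(url, method="HEAD")
    with opener(head) as response:
        header = response.headers.get("X-Robots-Tag", "")
    directives = {d.strip().lower() for d in header.split(",")}
    if "noarchive" in directives:
        return None  # one request spent, no body downloaded
    # Second request: GET. This is the extra round trip in the "allowed" case.
    get = urllib.request.Request(url, method="GET")
    with opener(get) as response:
        return response.read()
```

The `opener` parameter exists only so the function can be exercised without network access. A disallowed page costs one small request, while an allowed page costs two.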

Ben Halpern • Jan 7 '19

Nice overview, tool looks great.

DEV Community

No Robots Allowed
