With the rise of AI web crawlers, many sites are looking for ways to control how their content is used for AI training. While robots.txt has long been the standard for traditional crawlers, LLMs.txt is gaining adoption as a way to express AI-specific directives.
What is LLMs.txt?
LLMs.txt is a proposed standard (similar to robots.txt) that lets website owners specify:
- Whether AI models can train on their content
- Which parts of the site are allowed/disallowed for training
- Attribution requirements
- Rate limiting for AI crawlers
Quick Implementation Guide
Add an LLMs.txt file to your root directory:
# Allow training but require attribution
Allow: /blog/*
Attribution: Required
Company: YourCompany
# Disallow training on specific sections
Disallow: /private/*
Disallow: /premium/*
# Rate limiting
Crawling-Rate: 10r/m
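Once the file is live, it's worth confirming it is actually reachable at your site root. Here is a minimal sketch in Python (standard library only), assuming your domain is example.com and the file is served at /llms.txt — both placeholders:

# Minimal sketch: confirm llms.txt is reachable at the site root.
# "https://example.com" is a placeholder; replace it with your own domain.
from urllib.request import urlopen

BASE_URL = "https://example.com"

with urlopen(f"{BASE_URL}/llms.txt", timeout=10) as resp:
    # A 200 status and a text content type suggest the file is being served correctly.
    print(resp.status, resp.headers.get("Content-Type"))
    print(resp.read().decode("utf-8"))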
Real-World Examples
I looked at how major tech companies implement LLMs.txt; you can browse a directory of live files at https://llmstxt.site/. Here are some interesting patterns I found:
- Most companies allow training on public blog content
- Documentation is commonly restricted
- Premium content is usually disallowed
Best Practices
- Start with a clear default policy, then carve out exceptions for specific sections
- Be explicit about attribution requirements so reuse terms are unambiguous
- Consider rate limits if AI crawlers put noticeable load on your site
- Review the file periodically as your site and policies evolve (a small checker sketch follows below)
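Since the format is still an informal proposal, a small script can help with that periodic review. The sketch below (Python, standard library only) parses the directive style used in this post — the directive names are taken from my example above, not a finalized spec — and flags anything missing. The URL is a placeholder:

# Sketch of a checker for the directive-style llms.txt format used in this post.
# Directive names (Allow, Disallow, Attribution, Crawling-Rate) follow the example
# above and are an assumption, not a finalized specification.
from urllib.request import urlopen

EXPECTED = ["Allow", "Disallow", "Attribution", "Crawling-Rate"]

def parse_llms_txt(text):
    """Collect directive values keyed by name, skipping comments and blank lines."""
    directives = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        directives.setdefault(key.strip(), []).append(value.strip())
    return directives

def review(url):
    with urlopen(url, timeout=10) as resp:
        directives = parse_llms_txt(resp.read().decode("utf-8"))
    for name in EXPECTED:
        values = directives.get(name)
        print(f"{name}: {', '.join(values) if values else 'MISSING'}")

review("https://example.com/llms.txt")  # placeholder URL; point this at your own file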
Getting Started
Just create an LLMs.txt file in your site's root directory.
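If your site is produced by a build script or static-site generator, you could also emit the file at build time so the policy lives in version control. A minimal sketch, assuming a public/ output directory and reusing the placeholder policy from earlier (adjust both for your setup):

# Minimal sketch: write llms.txt at build time so the policy is versioned.
# The "public" output directory and the policy values below are placeholders.
from pathlib import Path

POLICY = """\
# Allow training but require attribution
Allow: /blog/*
Attribution: Required
Company: YourCompany

# Disallow training on specific sections
Disallow: /private/*
Disallow: /premium/*

# Rate limiting
Crawling-Rate: 10r/m
"""

output = Path("public") / "llms.txt"
output.parent.mkdir(parents=True, exist_ok=True)
output.write_text(POLICY, encoding="utf-8")
print(f"Wrote {output}")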
Here is my llms.txt: https://gleam.so/llms.txt
What are your thoughts on LLMs.txt? Are you planning to implement it on your sites?