How to Read a robots.txt File: Rules & Examples

A robots.txt file is a plain-text file at the root of a website (always at /robots.txt) that tells automated crawlers which parts of the site they are asked not to access. Reading it is simple once you know the five directives: User-agent says who a rule applies to, Disallow and Allow say which paths are off-limits or permitted, Crawl-delay asks bots to slow down, and Sitemap points to the site's URL list. This guide walks through each, reads a real-world example line by line, explains what robots.txt does and does not mean for scrapers, and shows how to check it in code.

The file implements the Robots Exclusion Protocol, formalized as RFC 9309. If you build or run crawlers, reading robots.txt should be the first thing you do before requesting a site; it is a core part of responsible web crawling.

The robots.txt Directives

User-agent: names the crawler a block of rules applies to. A value of * means "all crawlers"; a specific name (like Googlebot) targets just that one.
Disallow: a path prefix the crawler is asked not to request. Disallow: /admin/ blocks everything under /admin/. An empty Disallow means nothing is blocked.
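Allow: a path that is permitted even inside a disallowed directory, used to carve out exceptions. When an Allow and a Disallow rule both match a URL, RFC 9309 gives precedence to the longer (more specific) path.
Crawl-delay: a requested number of seconds to wait between requests. It is a non-standard extension that RFC 9309 does not define, so some crawlers honor it and others (Googlebot among them) ignore it.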
Sitemap: the absolute URL of an XML sitemap listing the site's pages. This one is a gift to scrapers — it hands you the URL list directly.
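
To make the directive list concrete, here is a minimal sketch of how a file could be split into per-crawler groups. The names Group and parse_robots and the SAMPLE text are illustrative assumptions, not part of any library, and the sketch skips edge cases that a production parser would handle.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Group:
    """One User-agent block: who it names and what it asks of them."""
    user_agents: List[str] = field(default_factory=list)
    rules: List[Tuple[str, str]] = field(default_factory=list)  # (directive, path)
    crawl_delay: Optional[float] = None


def parse_robots(text: str) -> Tuple[List[Group], List[str]]:
    """Split robots.txt text into groups plus a list of sitemap URLs."""
    groups: List[Group] = []
    sitemaps: List[str] = []
    current: Optional[Group] = None
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue  # blank or malformed line
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "sitemap":
            sitemaps.append(value)  # applies site-wide, outside any group
            continue
        if key == "user-agent":
            # Consecutive User-agent lines share a single group
            if current is None or not last_was_agent:
                current = Group()
                groups.append(current)
            current.user_agents.append(value)
            last_was_agent = True
            continue
        last_was_agent = False
        if current is None:
            continue  # a rule before any User-agent line is ignored
        if key in ("allow", "disallow"):
            current.rules.append((key, value))
        elif key == "crawl-delay":
            try:
                current.crawl_delay = float(value)
            except ValueError:
                pass  # ignore a malformed delay


    return groups, sitemaps


# Hypothetical sample file used only for this illustration
SAMPLE = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5

User-agent: BadBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

if __name__ == "__main__":
    parsed_groups, parsed_sitemaps = parse_robots(SAMPLE)
    for group in parsed_groups:
        print(group)
    print(parsed_sitemaps)
```

Keeping Sitemap outside the groups mirrors how the file is read: it is a site-wide pointer, while every other directive belongs to the User-agent block above it.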

Two wildcards refine path matching: * matches any sequence of characters, and $ anchors the end of a URL. So Disallow: /*.pdf$ blocks every URL ending in .pdf.
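
The wildcard rules can be sketched in a few lines of Python. This is a minimal illustration, assuming the precedence rule from RFC 9309 (the longest matching pattern wins, and Allow wins a tie); pattern_to_regex and is_allowed are hypothetical helper names, and percent-encoding normalization is left out.

```python
import re
from typing import List, Tuple


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then let each '*' match any characters
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path: str, rules: List[Tuple[str, str]]) -> bool:
    """rules is a list of ("allow" | "disallow", pattern) pairs."""
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty Disallow restricts nothing
        # re.match anchors at the start, so patterns behave as prefixes
        if pattern_to_regex(pattern).match(path):
            allow = kind == "allow"
            # Longest pattern wins; on equal length, Allow wins
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow


RULES = [
    ("disallow", "/*.pdf$"),
    ("disallow", "/admin/"),
    ("allow", "/admin/public/"),
]

if __name__ == "__main__":
    for candidate in ("/docs/report.pdf", "/docs/report.pdf?page=2",
                      "/admin/users", "/admin/public/help"):
        print(candidate, is_allowed(candidate, RULES))
```

Note that /docs/report.pdf?page=2 is not caught by /*.pdf$, because $ requires the URL to end right after .pdf.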

Reading a robots.txt Example Line by Line

Here is a typical robots.txt:

User-agent: *
Disallow: /admin/
Disallow: /cart
Allow: /products/
Crawl-delay: 5

User-agent: BadBot
Disallow: /

Sitemap: https://example.com/sitemap.xml

How to read it:

The first block applies to all crawlers (User-agent: *). They are asked not to request anything under /admin/ or any path starting with /cart (as a prefix, that also covers /cart/checkout and even /cartoons), they are explicitly allowed under /products/, and they are asked to wait 5 seconds between requests.
The second block targets a specific bot named BadBot and disallows the entire site (Disallow: /). A crawler identifying as BadBot is asked to stay out completely.
The final line publishes the sitemap location, which any crawler can use to find the site's pages efficiently.

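A crawler follows the most specific User-agent block that names it (names are matched case-insensitively); if none names it, the * block applies. Blocks do not combine: a crawler that matches the BadBot block obeys only the BadBot rules, not the * rules as well.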
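
This reading can be checked offline with Python's standard-library parser. The sketch below feeds the example file to RobotFileParser.parse() as in-memory lines, so nothing is fetched; MyResearchBot is a made-up agent name and check is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /cart
Allow: /products/
Crawl-delay: 5

User-agent: BadBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(EXAMPLE.splitlines())  # parse in-memory text; no network request


def check(agent: str, path: str) -> bool:
    """Ask whether `agent` may fetch `path` on the example host."""
    return rp.can_fetch(agent, "https://example.com" + path)


# A crawler with no block of its own falls back to the * block
generic = {p: check("MyResearchBot", p)
           for p in ("/products/123", "/admin/users", "/cart", "/about")}

# BadBot has its own block, so only "Disallow: /" applies to it
badbot = {p: check("BadBot", p) for p in ("/products/123", "/about")}

if __name__ == "__main__":
    print(generic)
    print(badbot)
    print(rp.crawl_delay("MyResearchBot"), rp.crawl_delay("BadBot"))
```

The Crawl-delay of 5 belongs to the * block only, so it is not reported for BadBot, whose block does not set one.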

What robots.txt Means for Scrapers

This is where people get confused, so be precise about it:

It is a request, not a wall. robots.txt does not technically block anything — it is an instruction that well-behaved crawlers choose to follow. Nothing stops a client from ignoring it.
It is not a law by itself, but ignoring it can intersect with a site's Terms of Service and with legal exposure depending on jurisdiction and what you collect. Treat it as a clear signal of the site owner's wishes.
Reputable crawlers respect it. Search engines and responsible bots honor robots.txt. Google documents exactly how it interprets the file.
It is not security. Listing a path in Disallow does not hide it — if anything, it advertises which paths the owner considers sensitive. Never rely on robots.txt to protect private data.

The responsible default for any scraping project: read robots.txt first, avoid disallowed paths (especially anything resembling private or account areas), respect crawl-delay, and use the published sitemap. This is part of the broader practice of scraping considerately, and it applies just as much to AI scraping as to traditional crawling.

How to Check robots.txt in Python

Python's standard library can parse robots.txt and answer "am I allowed to fetch this URL?" for you:

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# The results below assume the site serves the example file shown earlier
ua = "MyResearchBot"
print(rp.can_fetch(ua, "https://example.com/products/123"))  # True
print(rp.can_fetch(ua, "https://example.com/admin/users"))   # False
print(rp.crawl_delay(ua))                                     # 5, or None if no Crawl-delay applies

Wiring can_fetch() into your crawler before each request keeps you within the file's rules automatically. You can also read the Sitemap lines (rp.site_maps() in Python 3.8 and later) to seed your URL frontier.
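
One way to do that wiring is a small gate object that every request passes through. This is a sketch under stated assumptions, not a definitive implementation: PoliteGate and crawl are hypothetical names, the 1-second fallback delay is an arbitrary choice, and fetch stands in for whatever HTTP call your crawler makes.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Decides whether and when a crawler may request a URL."""

    def __init__(self, parser, user_agent, default_delay=1.0):
        self.parser = parser
        self.user_agent = user_agent
        # Use the site's Crawl-delay when present, else a fallback (an assumption)
        delay = parser.crawl_delay(user_agent)
        self.delay = float(delay) if delay is not None else default_delay
        self._last_request = None

    def allowed(self, url):
        return self.parser.can_fetch(self.user_agent, url)

    def wait(self, sleep=time.sleep, clock=time.monotonic):
        """Sleep just long enough to keep `delay` seconds between requests."""
        now = clock()
        if self._last_request is not None:
            remaining = self.delay - (now - self._last_request)
            if remaining > 0:
                sleep(remaining)
                now = clock()
        self._last_request = now


def crawl(urls, gate, fetch):
    """Call fetch(url) for each permitted URL, pausing between requests."""
    fetched, skipped = [], []
    for url in urls:
        if not gate.allowed(url):
            skipped.append(url)  # disallowed by robots.txt: never requested
            continue
        gate.wait()
        fetch(url)
        fetched.append(url)
    return fetched, skipped


if __name__ == "__main__":
    rp = RobotFileParser()
    # In-memory rules for illustration; a real crawler would call rp.read()
    rp.parse(["User-agent: *", "Disallow: /admin/",
              "Sitemap: https://example.com/sitemap.xml"])
    gate = PoliteGate(rp, "MyResearchBot")
    frontier = rp.site_maps() or []  # site_maps() needs Python 3.8+
    done, skipped = crawl(frontier + ["https://example.com/admin/users"],
                          gate, fetch=print)
    print("skipped:", skipped)
```

Passing fetch in as a parameter keeps the robots.txt check and the delay in one place, so no code path can reach the network without going through the gate.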

Common Misconceptions

"robots.txt hides pages from users." No — it only addresses automated crawlers. Anyone with the URL can still open the page in a browser.
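"A disallowed page can't appear in Google." It still can, if other pages link to it: Disallow controls crawling, not indexing. Keeping a page out of search results takes a noindex directive, which a crawler can only see if it is allowed to fetch the page.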
"robots.txt blocks scrapers technically." It does not. It is an honor-system request; enforcement is social and legal, not technical.
"One robots.txt covers all my subdomains." No — robots.txt is per host. shop.example.com needs its own file separate from example.com.

Frequently Asked Questions

How do I read a robots.txt file?

Open the site's domain followed by /robots.txt in a browser. Read it block by block: User-agent names who each block applies to, Disallow lists paths bots are asked to avoid, Allow carves out exceptions, Crawl-delay requests a pause between requests, and Sitemap gives the URL list. Match your crawler to the most specific User-agent block, or the * block if none names it.

Where is the robots.txt file located?

Always at the root of a host, at /robots.txt — for example https://example.com/robots.txt. It is per host, so subdomains and different protocols can each have their own file. If a site has no robots.txt, crawlers generally treat that as no restrictions.
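
Because the file is per host, a crawler has to work out which robots.txt governs each URL it wants to fetch. A minimal sketch using only the standard library (robots_url_for is a hypothetical helper name):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL for page_url (same scheme, host, and port)."""
    parts = urlsplit(page_url)
    # Keep scheme and netloc; replace the path and drop query and fragment
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url_for("https://shop.example.com/cart?item=1"))
    print(robots_url_for("http://example.com:8080/a/b"))
```

A crawler that visits several hosts would typically cache one parsed file per result of this function rather than per page.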

Does robots.txt apply to web scrapers?

It applies to all automated crawlers, scrapers included, but only as a request — it does not technically block anything. Reputable scrapers honor it; ignoring it can intersect with a site's Terms of Service and, depending on jurisdiction and the data involved, legal risk. The responsible default is to read and respect it.

What does Disallow mean in robots.txt?

Disallow names a path prefix the crawler is asked not to request. Disallow: /admin/ asks bots to avoid everything under /admin/. An empty Disallow value means nothing is restricted for that User-agent block.

Is it illegal to ignore robots.txt?

robots.txt is not itself a law, so ignoring it is not automatically illegal. However, it can be relevant to legal questions around Terms of Service and unauthorized access, and it signals the owner's wishes. Treat it as a clear boundary and consult a lawyer for high-stakes or ambiguous scraping.

Can I parse robots.txt automatically?

Yes. Python's urllib.robotparser reads a robots.txt file, answers can_fetch(user_agent, url) for each URL, and exposes crawl_delay() and, in Python 3.8 and later, site_maps(). Wiring it into your crawler before each request enforces the file's rules automatically, and the Sitemap lines let you discover URLs.

Conclusion

Reading a robots.txt file is straightforward: find it at /robots.txt, match your crawler to the right User-agent block, and follow the Disallow, Allow, and Crawl-delay lines while using the Sitemap to find pages. Just remember what it is — a request that responsible crawlers honor, not a technical block and not a security measure. Reading and respecting it is the baseline of ethical scraping.
