A robots.txt file is a plain-text file at the root of a website (always at /robots.txt) that tells automated crawlers which parts of the site they are asked not to access. Reading it is simple once you know the five directives: User-agent says who a rule applies to, Disallow and Allow say which paths are off-limits or permitted, Crawl-delay asks bots to slow down, and Sitemap points to the site's URL list. This guide walks through each, reads a real-world example line by line, explains what robots.txt does and does not mean for scrapers, and shows how to check it in code.
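The file implements the Robots Exclusion Protocol, standardized as RFC 9309. The RFC itself covers User-agent, Allow, and Disallow; Crawl-delay and Sitemap are widely used extensions, and some crawlers ignore Crawl-delay. If you build or run crawlers, read robots.txt before you request anything else from a site; it is a core part of responsible web crawling.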
Two wildcards refine path matching: * matches any sequence of characters, and $ anchors the end of a URL. So Disallow: /*.pdf$ blocks every URL ending in .pdf.
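To make the two wildcards concrete, here is a minimal sketch that translates a robots.txt path pattern into a regular expression. The helper names pattern_to_regex and path_matches are made up for this illustration and do not come from any library; the sketch also ignores percent-encoding details that a production matcher would need to handle.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the end
    of the URL. Everything else is treated as a literal character.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then rejoin with '.*' where '*' appeared.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def path_matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the path, which gives the usual
    # prefix semantics when the pattern has no trailing '$'.
    return pattern_to_regex(pattern).match(path) is not None

print(path_matches("/*.pdf$", "/docs/report.pdf"))         # True
print(path_matches("/*.pdf$", "/docs/report.pdf?page=2"))  # False
print(path_matches("/admin/", "/admin/users"))             # True
```

The second call returns False because $ requires the URL to end in .pdf, and the query string comes after it.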
Here is a typical robots.txt:
User-agent: *
Disallow: /admin/
Disallow: /cart
Allow: /products/
Crawl-delay: 5
User-agent: BadBot
Disallow: /
Sitemap: https://example.com/sitemap.xml
How to read it:
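User-agent: * opens a block for every crawler that no other block names. Within it, Disallow: /admin/ and Disallow: /cart ask crawlers to stay out of any path that starts with those prefixes, Allow: /products/ explicitly permits product pages, and Crawl-delay: 5 requests a five-second pause between requests.
User-agent: BadBot opens a second block, where Disallow: / asks that one crawler to stay out of the entire site.
Sitemap: https://example.com/sitemap.xml belongs to no block and points every crawler to the site's URL list.
A crawler matches itself to the most specific User-agent block that names it; if none names it, the * block applies.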
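The block-matching rule can be sketched in a few lines of Python. This is a simplified illustration, not a complete RFC 9309 parser: parse_groups and select_group are hypothetical helpers, the substring comparison is an assumption made for brevity, and the sample text is the example file from above with blank lines added between blocks.

```python
from typing import Optional

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart
Allow: /products/
Crawl-delay: 5

User-agent: BadBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

def parse_groups(text: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased User-agent token to its (directive, value) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    current: list[str] = []   # agents named by the block being read
    reading_agents = False    # True while consecutive User-agent lines continue
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current = []                 # a new block starts here
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
            reading_agents = True
        elif field in ("allow", "disallow", "crawl-delay"):
            reading_agents = False
            for agent in current:
                groups[agent].append((field, value))
        # Other fields are skipped; Sitemap is not tied to any block.
    return groups

def select_group(groups: dict, crawler_name: str) -> Optional[str]:
    """Return the key of the block that applies to crawler_name."""
    name = crawler_name.lower()
    named = [a for a in groups if a and a != "*" and a in name]
    if named:
        return max(named, key=len)  # longest token as a rough "most specific"
    return "*" if "*" in groups else None

groups = parse_groups(ROBOTS_TXT)
print(select_group(groups, "BadBot/1.0"))     # badbot
print(select_group(groups, "MyResearchBot"))  # *
```

BadBot gets its own block, so the * block's rules do not apply to it at all; every other crawler falls through to the * block.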
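This is where people get confused, so be precise about it: robots.txt is a request that reputable crawlers honor. It is not a technical block, not a security measure, and not in itself a law.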
The responsible default for any scraping project: read robots.txt first, avoid disallowed paths (especially anything resembling private or account areas), respect crawl-delay, and use the published sitemap. This is part of the broader practice of scraping considerately, and it applies just as much to AI scraping as to traditional crawling.
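As a rough illustration of that default, the sketch below filters a URL list through the robots.txt rules and pauses between requests. The function polite_crawl and its fetch, sleep, and default_delay parameters are assumptions made for this example; the rules are parsed from an in-memory copy of the example file so that nothing touches the network.

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart
Allow: /products/
Crawl-delay: 5
"""

def polite_crawl(rp, user_agent, urls, fetch, sleep=time.sleep, default_delay=1.0):
    """Fetch only permitted URLs, pausing between consecutive requests."""
    # Use the site's Crawl-delay when present, otherwise a fallback pause.
    delay = rp.crawl_delay(user_agent) or default_delay
    fetched = []
    for url in urls:
        if not rp.can_fetch(user_agent, url):
            continue            # disallowed: skip it, never request it
        if fetched:
            sleep(delay)        # wait before every request after the first
        fetch(url)
        fetched.append(url)
    return fetched

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse in-memory lines, no download

# Demo with stand-ins for the real fetch and sleep functions.
pauses = []
done = polite_crawl(
    rp,
    "MyResearchBot",
    [
        "https://example.com/products/1",
        "https://example.com/admin/users",
        "https://example.com/products/2",
    ],
    fetch=lambda url: None,
    sleep=pauses.append,
)
print(done)
print(pauses)
```

In a real crawler, fetch would be your HTTP client call and sleep would stay at its time.sleep default.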
Python's standard library can parse robots.txt and answer "am I allowed to fetch this URL?" for you:
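from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # downloads and parses the file (makes a network request)

ua = "MyResearchBot"  # your crawler's User-agent token
# Expected results assume the site serves the example file shown earlier.
print(rp.can_fetch(ua, "https://example.com/products/123"))  # True
print(rp.can_fetch(ua, "https://example.com/admin/users"))   # False
print(rp.crawl_delay(ua))  # 5, or None if no Crawl-delay applies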
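Calling can_fetch() before each request keeps your crawler within the file's rules without manual checks. One caveat: urllib.robotparser has historically applied rules in file order using plain path prefixes, so verify how your Python version handles the * and $ wildcards before relying on it for files that use them. You can also read the Sitemap lines to seed your URL frontier.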
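For the Sitemap part, a minimal sketch might look like the following. It relies on RobotFileParser.site_maps(), which is available in Python 3.8 and later; the helper urls_from_sitemap and the sample sitemap document are assumptions made for illustration, and a real crawler would download the sitemap rather than use an inline string.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

# Hypothetical sitemap content, inlined so the example runs offline.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/1</loc></url>
  <url><loc>https://example.com/products/2</loc></url>
</urlset>
"""

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(xml_text: str) -> list[str]:
    """Collect the <loc> values from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when the file lists no sitemaps.
sitemap_urls = rp.site_maps() or []
print(sitemap_urls)

# Seed the frontier with page URLs, still filtered through can_fetch().
frontier = [u for u in urls_from_sitemap(SAMPLE_SITEMAP) if rp.can_fetch("MyResearchBot", u)]
print(frontier)
```

Filtering sitemap URLs through can_fetch() is a deliberate choice here: a sitemap lists pages the site wants discovered, but the robots.txt rules still decide what your crawler should request.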
Open the site's domain followed by /robots.txt in a browser. Read it block by block: User-agent names who each block applies to, Disallow lists paths bots are asked to avoid, Allow carves out exceptions, Crawl-delay requests a pause between requests, and Sitemap gives the URL list. Match your crawler to the most specific User-agent block, or the * block if none names it.
A robots.txt file always sits at the root of a host, at /robots.txt, for example https://example.com/robots.txt. The file is per host, so each subdomain and each protocol can have its own. If a site has no robots.txt, crawlers generally treat that as no restrictions.
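Because the file is per host and per protocol, a crawler needs to derive the right robots.txt URL for each page it plans to fetch. A small sketch, where robots_url is a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    Keeps the scheme and host (including any port) and replaces
    the path, query, and fragment.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/products/123?ref=home"))
# https://example.com/robots.txt
print(robots_url("https://shop.example.com/cart"))
# https://shop.example.com/robots.txt
print(robots_url("http://example.com:8080/admin/"))
# http://example.com:8080/robots.txt
```

A crawler that spans several hosts would typically cache one parsed file per scheme-and-host pair rather than re-downloading it for every URL.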
robots.txt applies to all automated crawlers, scrapers included, but only as a request; it does not technically block anything. Reputable scrapers honor it. Ignoring it can conflict with a site's Terms of Service and, depending on jurisdiction and the data involved, carry legal risk. The responsible default is to read and respect it.
Disallow names a path prefix the crawler is asked not to request. Disallow: /admin/ asks bots to avoid everything under /admin/. An empty Disallow value means nothing is restricted for that User-agent block.
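The prefix behavior and the empty-value case can be checked with Python's urllib.robotparser, parsing in-memory text so that no request is made. The helper parser_for and the AnyBot user agent are made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

def parser_for(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text without downloading anything."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp

restricted = parser_for("User-agent: *\nDisallow: /admin/\n")
unrestricted = parser_for("User-agent: *\nDisallow:\n")

# Everything under the /admin/ prefix is off-limits.
print(restricted.can_fetch("AnyBot", "https://example.com/admin/users"))    # False
# A path that merely looks similar does not share the prefix.
print(restricted.can_fetch("AnyBot", "https://example.com/administrator"))  # True
# An empty Disallow restricts nothing.
print(unrestricted.can_fetch("AnyBot", "https://example.com/admin/users"))  # True
```

Note the trailing slash: Disallow: /admin/ does not cover /administrator, whereas Disallow: /admin would.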
robots.txt is not itself a law, so ignoring it is not automatically illegal. However, it can be relevant to legal questions around Terms of Service and unauthorized access, and it signals the owner's wishes. Treat it as a clear boundary and consult a lawyer for high-stakes or ambiguous scraping.
You can check robots.txt in code. Python's urllib.robotparser reads a robots.txt file, answers can_fetch(user_agent, url) for each URL, and exposes crawl_delay(). Calling it before each request applies the file's rules automatically, and you can read the Sitemap lines to discover URLs.
Reading a robots.txt file is straightforward: find it at /robots.txt, match your crawler to the right User-agent block, and follow the Disallow, Allow, and Crawl-delay lines while using the Sitemap to find pages. Just remember what it is — a request that responsible crawlers honor, not a technical block and not a security measure. Reading and respecting it is the baseline of ethical scraping.