Quick verdict: Web crawling is the process of automatically discovering and indexing URLs across the web; it's what Googlebot does to map the accessible pages of the internet. Web scraping is extracting specific data from those pages. Crawlers find pages; scrapers extract data. Most production tools do both: crawl to discover URLs, then scrape to extract data from each. For large-scale crawling, rotating residential proxies are effectively required to stay under per-IP rate limits.
This guide covers how crawling works at the protocol level, the difference between crawling and scraping, how Googlebot crawls 1B+ pages per day, how to build a Python crawler, and why proxies are essential at scale.
A web crawler (also called a "spider" or "bot") is software that:

1. Starts from a list of seed URLs
2. Fetches each page and extracts its `<a href>` links
3. Adds newly discovered links to a queue, then repeats

The crawler doesn't necessarily extract any specific data; its job is to discover and inventory URLs. What you DO with each URL (extract the title, save the HTML, follow it further) determines whether you're also scraping.
| | Web Crawling | Web Scraping |
|---|---|---|
| Goal | Discover URLs | Extract specific data |
| Scope | Many sites, broad | Specific pages, narrow |
| Output | URL index / list | Structured data (CSV, DB) |
| Examples | Googlebot, archive.org | Price monitor, news aggregator |
| Politeness | Follow robots.txt, crawl-delay | Often more aggressive |
Most production scrapers do both. Take a price-monitoring tool that tracks 100 retailers: the crawler discovers product URLs from each retailer's category pages, then the scraper extracts price and inventory from each product URL, as sketched below.
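To make the split concrete, here is a minimal sketch of the scrape step that would run on each discovered product URL. The selectors (`h1`, `.price`) are hypothetical placeholders; every retailer needs its own.

```python
# Minimal scrape step: runs on each product URL the crawler discovered.
# The selectors below are hypothetical; real sites need per-site selectors
# (and often JavaScript rendering).
import requests
from bs4 import BeautifulSoup

def scrape_product(url):
    r = requests.get(url, timeout=10)
    soup = BeautifulSoup(r.text, "lxml")
    title = soup.select_one("h1")       # hypothetical selector
    price = soup.select_one(".price")   # hypothetical selector
    return {
        "url": url,
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }
```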
At Google's scale, crawling is engineered for politeness (never overloading any single host) and prioritization (recrawling important, fast-changing pages more often).

Total scale: ~1 billion pages crawled per day, with well over 100 trillion URLs known.
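One concrete piece of that politeness is per-host rate limiting. A minimal sketch follows; the one-second delay is an assumed default, not Google's actual setting:

```python
# Per-host politeness sketch: track when each host was last fetched and
# sleep out the remainder of a fixed delay before hitting it again.
# The 1-second delay is an assumption; real crawlers also honor
# robots.txt crawl-delay directives and adapt to server response times.
import time
from urllib.parse import urlparse

CRAWL_DELAY = 1.0   # seconds between requests to the same host (assumed)
_last_hit = {}      # host -> timestamp of the most recent request

def polite_wait(url):
    host = urlparse(url).netloc
    remaining = CRAWL_DELAY - (time.time() - _last_hit.get(host, 0.0))
    if remaining > 0:
        time.sleep(remaining)
    _last_hit[host] = time.time()
```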
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser
from collections import deque

PROXY = "http://USER:[email protected]:8080"
proxies = {"http": PROXY, "https": PROXY}  # route both schemes through the proxy

def can_crawl(url, user_agent="MyCrawler/1.0"):
    """Check the site's robots.txt before fetching. A production crawler
    would cache the parser per host instead of re-fetching every time."""
    parsed = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        rp.read()
    except Exception:
        return True  # if robots.txt is unreachable, assume allowed
    return rp.can_fetch(user_agent, url)

def crawl(seed_url, max_pages=100, same_domain=True):
    """Breadth-first crawl: pop a URL, fetch it, enqueue every new link."""
    visited = set()
    queue = deque([seed_url])
    seed_domain = urlparse(seed_url).netloc

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited or not can_crawl(url):
            continue
        try:
            r = requests.get(url, proxies=proxies, timeout=10)
            visited.add(url)
            print(f"Crawled: {url} [{r.status_code}]")
        except requests.RequestException:
            continue  # skip URLs that time out or error

        # Discovery step: resolve relative hrefs against the current page
        soup = BeautifulSoup(r.text, "lxml")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if same_domain and urlparse(link).netloc != seed_domain:
                continue
            if link not in visited:
                queue.append(link)

    return visited

urls = crawl("https://example.com", max_pages=500)
print(f"Discovered {len(urls)} URLs")
```
Why are proxies essential at scale? Three reasons:

1. **Per-IP rate limits.** Most sites throttle or block a single IP after a burst of requests; rotating proxies spread the load across many IPs.
2. **IP bans.** One datacenter IP fetching thousands of pages is easy to fingerprint and ban outright; residential IPs blend in with ordinary traffic.
3. **Geo-targeted content.** Pages that vary by region can only be crawled as a local user sees them through an IP in that region.

Whichever setup you run, identify your crawler with a descriptive User-Agent (e.g. `MyResearchCrawler/1.0 (+https://my-site.com/about)`) so site operators can contact you.
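A minimal way to put a proxy pool to work is to rotate through it per request. The pool entries below are placeholders for whatever provider you use:

```python
# Round-robin proxy rotation: each request goes out through the next proxy
# in the pool, so no single IP absorbs the whole crawl.
# The proxy URLs are placeholders.
from itertools import cycle
import requests

PROXY_POOL = cycle([
    "http://USER:[email protected]:8080",
    "http://USER:[email protected]:8080",
    "http://USER:[email protected]:8080",
])

def fetch(url):
    proxy = next(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
```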