spyderproxy

What Is Web Crawling? (vs Web Scraping)

Daniel K. | Published Wed May 06 2026

Quick verdict: Web crawling is the process of automatically discovering and indexing URLs across the web — that's what Googlebot does to map every accessible page on the internet. Web scraping is extracting specific data from pages. Crawlers find pages; scrapers extract data. Most production tools do both: crawl to discover URLs, scrape to extract from each. For large-scale crawling, residential proxies are required to avoid per-IP rate limits.

This guide covers how crawling works at the protocol level, the difference between crawling and scraping, how Googlebot crawls 1B+ pages per day, how to build a Python crawler, and why proxies are essential at scale.

What Is Web Crawling, Exactly?

A web crawler (also called a "spider" or "bot") is software that:

  1. Starts with a seed URL (or list of URLs)
  2. Fetches the page
  3. Parses the HTML to find <a href> links
  4. Adds new URLs to a queue
  5. Repeats — pulling from the queue, fetching, parsing

The crawler doesn't necessarily extract any specific data — its job is to discover and inventory URLs. What you DO with each URL (extract title, save HTML, follow further) determines whether you're also scraping.

Crawling vs Scraping: The Distinction

              Web Crawling                      Web Scraping
Goal          Discover URLs                     Extract specific data
Scope         Many sites, broad                 Specific pages, narrow
Output        URL index / list                  Structured data (CSV, DB)
Examples      Googlebot, archive.org            Price monitor, news aggregator
Politeness    Follow robots.txt, crawl-delay    Often more aggressive

Most production scrapers do both. For a price-monitoring tool that tracks 100 retailers: the crawler discovers product URLs from each retailer's category pages, then the scraper extracts price/inventory from each product URL.
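
Here's a minimal sketch of that two-phase pattern, using the same requests + BeautifulSoup stack as the crawler below. The retailer URL and the CSS selectors (a.product-link, span.price) are placeholders; swap in the target site's real markup.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Phase 1 (crawl): discover product URLs from a category page (placeholder URL)
category_url = "https://shop.example.com/laptops"
soup = BeautifulSoup(requests.get(category_url, timeout=10).text, "lxml")
product_urls = {urljoin(category_url, a["href"])
                for a in soup.select("a.product-link")}   # placeholder selector

# Phase 2 (scrape): extract one specific field from each discovered URL
for url in product_urls:
    page = BeautifulSoup(requests.get(url, timeout=10).text, "lxml")
    price = page.select_one("span.price")                  # placeholder selector
    if price:
        print(url, price.get_text(strip=True))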

How Googlebot Crawls the Web

At Google's scale, crawling is engineered for politeness and prioritization:

  1. Seed list: sitemaps + manually submitted URLs + links from already-crawled pages.
  2. Distributed queue: URLs sharded across thousands of crawl servers, each handling a slice.
  3. Politeness: respect robots.txt, honor Crawl-Delay, back off on 429 responses (see the snippet below).
  4. Prioritization: high-PageRank pages are re-crawled often; low-PageRank pages rarely.
  5. Freshness budget: news sites and high-traffic pages re-crawled hourly; static pages re-crawled monthly.
  6. JavaScript rendering: Googlebot has two-stage indexing — initial HTML crawl, then a delayed render in Chromium for JS-heavy sites.

Total scale: ~1 billion pages crawled per day, ~100 trillion URLs indexed.
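
If you want your own crawler to follow the same politeness rules, Python's standard library already parses the relevant robots.txt directives. A minimal sketch with urllib.robotparser (the example.com URLs are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder site
rp.read()

agent = "MyCrawler/1.0"
print(rp.can_fetch(agent, "https://example.com/some/page"))  # True/False per robots.txt rules
print(rp.crawl_delay(agent))    # Crawl-delay in seconds, or None if not set
print(rp.request_rate(agent))   # RequestRate(requests, seconds), or None if not set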

Building a Basic Python Crawler

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, urldefrag
from urllib.robotparser import RobotFileParser
from collections import deque

# Placeholder credentials and gateway host; route both http and https through the proxy
PROXY = "http://USER:PASS@PROXY_HOST:8080"
proxies = {"http": PROXY, "https": PROXY}

USER_AGENT = "MyCrawler/1.0"
HEADERS = {"User-Agent": USER_AGENT}

def can_crawl(url, user_agent=USER_AGENT):
    # Check robots.txt on the URL's host (re-fetched per URL here; cache per host in production)
    parsed = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        rp.read()
    except Exception:
        return True  # If robots.txt is unreachable, assume crawling is allowed
    return rp.can_fetch(user_agent, url)

def crawl(seed_url, max_pages=100, same_domain=True):
    visited = set()            # URLs already fetched
    seen = {seed_url}          # URLs already queued, so duplicates never enter the queue
    queue = deque([seed_url])  # BFS frontier
    seed_domain = urlparse(seed_url).netloc

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited or not can_crawl(url):
            continue
        try:
            r = requests.get(url, headers=HEADERS, proxies=proxies, timeout=10)
            visited.add(url)
            print(f"Crawled: {url} [{r.status_code}]")
        except requests.RequestException:
            continue  # Network error: skip this URL and move on

        # Parse the page and enqueue newly discovered links
        soup = BeautifulSoup(r.text, "lxml")
        for a in soup.find_all("a", href=True):
            link, _ = urldefrag(urljoin(url, a["href"]))  # resolve relative URLs, drop #fragments
            parsed_link = urlparse(link)
            if parsed_link.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, etc.
            if same_domain and parsed_link.netloc != seed_domain:
                continue
            if link not in seen:
                seen.add(link)
                queue.append(link)

        time.sleep(1)  # politeness: roughly one request per second to a single domain

    return visited

urls = crawl("https://example.com", max_pages=500)
print(f"Discovered {len(urls)} URLs")

Why Crawlers Need Proxies

Three reasons:

  1. Per-IP rate limits. Most sites rate-limit at 1-2 req/s per IP, so crawling 100K pages from a single IP would take 14+ hours at best. Rotating residential IPs lets you crawl in parallel (see the sketch after this list).
  2. IP type filtering. Sites with anti-bot defenses serve different content (or blocked content) to datacenter IPs. Residential proxies see the same pages real users see.
  3. Geographic content. Some sites serve different HTML based on the visitor's country. Crawling from multiple countries via residential proxies in each country reveals geo-variant content.
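
A minimal sketch of what that looks like in practice, assuming a gateway-style rotating proxy that assigns a new residential exit IP per connection; the gateway hostname, port, credentials, and URL list are placeholders:

import requests
from concurrent.futures import ThreadPoolExecutor

# Placeholder gateway: a rotating residential proxy assigns a new exit IP per connection
ROTATING_PROXY = "http://USER:PASS@ROTATING_GATEWAY:8000"
proxies = {"http": ROTATING_PROXY, "https": ROTATING_PROXY}

def fetch(url):
    try:
        r = requests.get(url, proxies=proxies, timeout=10)
        return url, r.status_code
    except requests.RequestException as exc:
        return url, f"error: {exc}"

urls = [f"https://example.com/page/{i}" for i in range(1, 101)]  # placeholder URL list

# Each worker's requests exit from different IPs, so the per-IP 1-2 req/s cap
# no longer serializes the whole crawl
with ThreadPoolExecutor(max_workers=20) as pool:
    for url, status in pool.map(fetch, urls):
        print(url, status)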

Ethics & Compliance

  • Respect robots.txt. Even if not legally required (in most jurisdictions), it's the universal politeness protocol.
  • Honor Crawl-Delay. If robots.txt specifies Crawl-Delay: 5, wait 5 seconds between requests.
  • Identify yourself. Use a meaningful User-Agent (e.g., MyResearchCrawler/1.0 (+https://my-site.com/about)) so site operators can contact you.
  • Don't overload servers. Limit concurrent connections to a single domain (1-2 max). Use exponential backoff on 429 responses (sketched after this list).
  • Don't crawl behind login walls without authorization. CFAA exposure in the US, GDPR exposure in the EU.
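
A minimal sketch combining the identification and back-off points: a descriptive User-Agent plus exponential backoff on HTTP 429, honoring Retry-After when the server sends a numeric value. The crawler name and contact URL are the placeholders from above.

import time
import requests

# Placeholder contact URL in the User-Agent so site operators can reach you
HEADERS = {"User-Agent": "MyResearchCrawler/1.0 (+https://my-site.com/about)"}

def polite_get(url, max_retries=5):
    # GET with exponential backoff on 429; honors Retry-After when it's a number of seconds
    delay = 1.0
    for _ in range(max_retries):
        r = requests.get(url, headers=HEADERS, timeout=10)
        if r.status_code != 429:
            return r
        retry_after = r.headers.get("Retry-After", "")
        wait = float(retry_after) if retry_after.isdigit() else delay
        time.sleep(wait)
        delay *= 2  # double the wait each time the server pushes back
    return r

resp = polite_get("https://example.com/some/page")   # placeholder URL
print(resp.status_code)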