spyderproxy

What Is Web Crawling? (vs Web Scraping)

Daniel K. | Published Wed May 06 2026

Quick verdict: Web crawling is the process of automatically discovering and indexing URLs across the web — that's what Googlebot does to map every accessible page on the internet. Web scraping is extracting specific data from pages. Crawlers find pages; scrapers extract data. Most production tools do both: crawl to discover URLs, scrape to extract from each. For large-scale crawling, residential proxies are required to avoid per-IP rate limits.

This guide covers how crawling works at the protocol level, the difference between crawling and scraping, how Googlebot crawls 1B+ pages per day, how to build a Python crawler, and why proxies are essential at scale.

What Is Web Crawling, Exactly?

A web crawler (also called a "spider" or "bot") is software that:

  1. Starts with a seed URL (or list of URLs)
  2. Fetches the page
  3. Parses the HTML to find <a href> links
  4. Adds new URLs to a queue
  5. Repeats — pulling from the queue, fetching, parsing

The crawler doesn't necessarily extract any specific data — its job is to discover and inventory URLs. What you DO with each URL (extract title, save HTML, follow further) determines whether you're also scraping.

Crawling vs Scraping: The Distinction

              Web Crawling                      Web Scraping
Goal          Discover URLs                     Extract specific data
Scope         Many sites, broad                 Specific pages, narrow
Output        URL index / list                  Structured data (CSV, DB)
Examples      Googlebot, archive.org            Price monitor, news aggregator
Politeness    Follow robots.txt, crawl-delay    Often more aggressive

Most production scrapers do both. For a price-monitoring tool that tracks 100 retailers: the crawler discovers product URLs from each retailer's category pages, then the scraper extracts price/inventory from each product URL.
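
Here's a minimal sketch of that two-phase pattern, using the same requests + BeautifulSoup stack as the crawler below. The retailer URL and the CSS selectors (a.product-link, span.price) are placeholders; swap in the target site's real markup.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Phase 1 (crawl): discover product URLs from a category page (placeholder URL)
category_url = "https://shop.example.com/laptops"
soup = BeautifulSoup(requests.get(category_url, timeout=10).text, "lxml")
product_urls = {urljoin(category_url, a["href"])
                for a in soup.select("a.product-link")}   # placeholder selector

# Phase 2 (scrape): extract one specific field from each discovered URL
for url in product_urls:
    page = BeautifulSoup(requests.get(url, timeout=10).text, "lxml")
    price = page.select_one("span.price")                  # placeholder selector
    if price:
        print(url, price.get_text(strip=True))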

How Googlebot Crawls the Web

At Google's scale, crawling is engineered for politeness and prioritization:

  1. Seed list: sitemaps + manually submitted URLs + links from already-crawled pages.
  2. Distributed queue: URLs sharded across thousands of crawl servers, each handling a slice.
  3. Politeness: respect robots.txt, honor Crawl-Delay, back off on 429 responses (see the snippet below).
  4. Prioritization: high-PageRank pages are re-crawled often; low-PageRank pages rarely.
  5. Freshness budget: news sites and high-traffic pages re-crawled hourly; static pages re-crawled monthly.
  6. JavaScript rendering: Googlebot has two-stage indexing — initial HTML crawl, then a delayed render in Chromium for JS-heavy sites.

Total scale: ~1 billion pages crawled per day, ~100 trillion URLs indexed.
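
If you want your own crawler to follow the same politeness rules, Python's standard library already parses the relevant robots.txt directives. A minimal sketch with urllib.robotparser (the example.com URLs are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder site
rp.read()

agent = "MyCrawler/1.0"
print(rp.can_fetch(agent, "https://example.com/some/page"))  # True/False per robots.txt rules
print(rp.crawl_delay(agent))    # Crawl-delay in seconds, or None if not set
print(rp.request_rate(agent))   # RequestRate(requests, seconds), or None if not set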

Building a Basic Python Crawler

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, urldefrag
from urllib.robotparser import RobotFileParser
from collections import deque

# Placeholder credentials and gateway host; route both http and https through the proxy
PROXY = "http://USER:PASS@PROXY_HOST:8080"
proxies = {"http": PROXY, "https": PROXY}

USER_AGENT = "MyCrawler/1.0"
HEADERS = {"User-Agent": USER_AGENT}

def can_crawl(url, user_agent=USER_AGENT):
    # Check robots.txt on the URL's host (re-fetched per URL here; cache per host in production)
    parsed = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        rp.read()
    except Exception:
        return True  # If robots.txt is unreachable, assume crawling is allowed
    return rp.can_fetch(user_agent, url)

def crawl(seed_url, max_pages=100, same_domain=True):
    visited = set()            # URLs already fetched
    seen = {seed_url}          # URLs already queued, so duplicates never enter the queue
    queue = deque([seed_url])  # BFS frontier
    seed_domain = urlparse(seed_url).netloc

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited or not can_crawl(url):
            continue
        try:
            r = requests.get(url, headers=HEADERS, proxies=proxies, timeout=10)
            visited.add(url)
            print(f"Crawled: {url} [{r.status_code}]")
        except requests.RequestException:
            continue  # Network error: skip this URL and move on

        # Parse the page and enqueue newly discovered links
        soup = BeautifulSoup(r.text, "lxml")
        for a in soup.find_all("a", href=True):
            link, _ = urldefrag(urljoin(url, a["href"]))  # resolve relative URLs, drop #fragments
            parsed_link = urlparse(link)
            if parsed_link.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, etc.
            if same_domain and parsed_link.netloc != seed_domain:
                continue
            if link not in seen:
                seen.add(link)
                queue.append(link)

        time.sleep(1)  # politeness: roughly one request per second to a single domain

    return visited

urls = crawl("https://example.com", max_pages=500)
print(f"Discovered {len(urls)} URLs")

Why Crawlers Need Proxies

Three reasons:

  1. Per-IP rate limits. Most sites rate-limit at 1-2 req/s per IP, so crawling 100K pages from a single IP would take 14+ hours at best. Rotating residential IPs lets you crawl in parallel (see the sketch after this list).
  2. IP type filtering. Sites with anti-bot defenses serve different content (or blocked content) to datacenter IPs. Residential proxies see the same pages real users see.
  3. Geographic content. Some sites serve different HTML based on the visitor's country. Crawling from multiple countries via residential proxies in each country reveals geo-variant content.
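
A minimal sketch of what that looks like in practice, assuming a gateway-style rotating proxy that assigns a new residential exit IP per connection; the gateway hostname, port, credentials, and URL list are placeholders:

import requests
from concurrent.futures import ThreadPoolExecutor

# Placeholder gateway: a rotating residential proxy assigns a new exit IP per connection
ROTATING_PROXY = "http://USER:PASS@ROTATING_GATEWAY:8000"
proxies = {"http": ROTATING_PROXY, "https": ROTATING_PROXY}

def fetch(url):
    try:
        r = requests.get(url, proxies=proxies, timeout=10)
        return url, r.status_code
    except requests.RequestException as exc:
        return url, f"error: {exc}"

urls = [f"https://example.com/page/{i}" for i in range(1, 101)]  # placeholder URL list

# Each worker's requests exit from different IPs, so the per-IP 1-2 req/s cap
# no longer serializes the whole crawl
with ThreadPoolExecutor(max_workers=20) as pool:
    for url, status in pool.map(fetch, urls):
        print(url, status)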

Ethics & Compliance

  • Respect robots.txt. Even if not legally required (in most jurisdictions), it's the universal politeness protocol.
  • Honor Crawl-Delay. If robots.txt specifies Crawl-Delay: 5, wait 5 seconds between requests.
  • Identify yourself. Use a meaningful User-Agent (e.g., MyResearchCrawler/1.0 (+https://my-site.com/about)) so site operators can contact you.
  • Don't overload servers. Limit concurrent connections to a single domain (1-2 max). Use exponential backoff on 429 responses (sketched after this list).
  • Don't crawl behind login walls without authorization. CFAA exposure in the US, GDPR exposure in the EU.
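
A minimal sketch combining the identification and back-off points: a descriptive User-Agent plus exponential backoff on HTTP 429, honoring Retry-After when the server sends a numeric value. The crawler name and contact URL are the placeholders from above.

import time
import requests

# Placeholder contact URL in the User-Agent so site operators can reach you
HEADERS = {"User-Agent": "MyResearchCrawler/1.0 (+https://my-site.com/about)"}

def polite_get(url, max_retries=5):
    # GET with exponential backoff on 429; honors Retry-After when it's a number of seconds
    delay = 1.0
    for _ in range(max_retries):
        r = requests.get(url, headers=HEADERS, timeout=10)
        if r.status_code != 429:
            return r
        retry_after = r.headers.get("Retry-After", "")
        wait = float(retry_after) if retry_after.isdigit() else delay
        time.sleep(wait)
        delay *= 2  # double the wait each time the server pushes back
    return r

resp = polite_get("https://example.com/some/page")   # placeholder URL
print(resp.status_code)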