How to Scrape Amazon Without Getting Blocked: Definitive Guide

SpyderProxy Team | Published 2026-03-30

Amazon is the largest e-commerce marketplace on the planet, with over 350 million products listed across dozens of regional storefronts. For businesses that depend on competitive pricing intelligence, product research, or market analysis, Amazon's product data is an invaluable resource.

But here is the problem: Amazon invests heavily in anti-scraping technology. Naive scraping attempts get blocked within minutes, sometimes seconds. IP bans, CAPTCHAs, and behavioral analysis systems are all designed to stop automated access in its tracks.

This guide covers everything you need to know about scraping Amazon reliably in 2026, from understanding why you get blocked in the first place to building a production-grade scraper that uses proxy rotation, anti-detection techniques, and intelligent request management to gather data at scale.

Why Scrape Amazon?

Before diving into the technical details, it is worth understanding why so many businesses scrape Amazon data in the first place. The use cases are diverse, but they all share one common thread: data-driven decision making.

Price Monitoring and Dynamic Pricing

Pricing on Amazon changes constantly. Sellers adjust prices multiple times per day based on competition, demand, and inventory levels. If you sell on Amazon (or compete against Amazon sellers), automated price monitoring lets you:

  • Track competitor prices across thousands of SKUs in real time
  • Trigger automatic repricing based on market conditions
  • Identify pricing trends over time to inform inventory purchasing
  • Detect MAP (Minimum Advertised Price) violations by unauthorized resellers

Without automated scraping, keeping tabs on even a few hundred products becomes a full-time job.

Competitor Analysis

Understanding what your competitors are doing on Amazon provides a serious strategic advantage. Scraping enables you to monitor:

  • New product launches and catalog expansion
  • Changes to product titles, bullet points, and descriptions (listing optimization signals)
  • Seller ratings and review velocity
  • Advertising placement and sponsored product activity
  • Best Seller Rank (BSR) changes over time

Product Research and Market Validation

For brands looking to launch new products, Amazon data is one of the best sources for market validation. You can analyze:

  • Demand signals through BSR and review counts
  • Price points that the market will bear
  • Gaps in existing product offerings based on negative reviews
  • Keyword search volume based on Amazon autocomplete data

Review Analysis and Sentiment Tracking

Customer reviews on Amazon represent millions of unfiltered opinions about products in every category imaginable. Scraping reviews enables:

  • Sentiment analysis at scale across product categories
  • Feature-level feedback extraction (what do customers love or hate?)
  • Quality monitoring for your own products
  • Competitive benchmarking based on customer satisfaction

Why Amazon Blocks Scrapers

Amazon does not block scrapers out of spite. There are legitimate technical and business reasons behind their anti-bot systems. Understanding these reasons helps you build scrapers that avoid triggering detection in the first place.

Rate-Based Detection

The simplest detection method is rate limiting. Amazon monitors the number of requests coming from each IP address. When a single IP sends hundreds of requests per minute, far exceeding what any human user would generate, it gets flagged automatically.

Rate-based detection looks at:

  • Requests per minute/hour from a single IP
  • Request patterns such as perfectly uniform intervals between requests
  • Total volume of pages accessed in a session
  • Request velocity or sudden spikes in traffic from a source
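One way to see why perfectly uniform intervals stand out is to compare the spread of request timestamps a naive loop produces against a jittered one. A stdlib-only sketch (the coefficient-of-variation metric and thresholds here are illustrative, not Amazon's actual detection rule):

```python
import random
import statistics

def interval_cv(timestamps):
    """Coefficient of variation of the gaps between requests.
    Human traffic is bursty (CV well above 0); naive bots are metronomic (CV near 0)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return statistics.stdev(gaps) / statistics.mean(gaps)

random.seed(42)

# Naive scraper: one request every 2 seconds, exactly
uniform = [i * 2.0 for i in range(50)]

# Jittered scraper: randomized delay between requests
jittered = []
t = 0.0
for _ in range(50):
    t += random.uniform(1.0, 5.0)
    jittered.append(t)

print(f"uniform CV:  {interval_cv(uniform):.3f}")   # 0.000 -- trivially flaggable
print(f"jittered CV: {interval_cv(jittered):.3f}")  # well above zero
```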

Browser Fingerprinting

Modern anti-bot systems go far beyond IP-based detection. Amazon uses browser fingerprinting to identify automated traffic by analyzing dozens of technical signals:

  • TLS fingerprint (JA3/JA4): The way your HTTP client performs the TLS handshake creates a unique signature. Python's requests library has a very different TLS fingerprint than Chrome.
  • HTTP/2 fingerprint: The order and values of HTTP/2 settings frames reveal what client is making the request.
  • Header order and values: Browsers send headers in a specific order. Scrapers often get this wrong.
  • JavaScript environment: When Amazon serves JavaScript challenges, they check for properties that only real browsers have (WebGL renderer, canvas fingerprint, audio context, etc.).

CAPTCHAs and Challenge Pages

When Amazon suspects automated traffic but is not certain, it serves a CAPTCHA challenge page. You will recognize this as a page asking you to solve an image puzzle or type characters from a distorted image. These challenges are designed to be easy for humans and difficult for bots.

Amazon typically serves CAPTCHAs when:

  • An IP address has a borderline reputation score
  • Request patterns are slightly suspicious but not conclusive
  • The user agent or headers do not perfectly match a known browser
  • You are accessing pages that are commonly targeted by bots (best seller lists, deal pages)
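A scraper should treat challenge pages as a distinct outcome rather than parsing them as product data. A minimal stdlib check (the marker strings below are common in Amazon's challenge HTML, but treat them as heuristics, not an exhaustive list):

```python
def looks_like_challenge(html: str) -> bool:
    """Heuristic: detect Amazon CAPTCHA / robot-check pages by common markers."""
    markers = (
        "captcha",
        "enter the characters you see below",
        "to discuss automated access",
    )
    lowered = html.lower()
    return any(m in lowered for m in markers)

# A challenge is a signal to back off and rotate IPs, not a parse failure.
assert looks_like_challenge("<title>Robot Check</title> ... captcha ...")
assert not looks_like_challenge("<span id='productTitle'>Widget</span>")
```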

IP Reputation Systems

Amazon maintains databases of IP reputation scores, both internally and through third-party services. Certain IP ranges are pre-flagged as high-risk:

  • Data center IPs from providers like AWS, Google Cloud, DigitalOcean, and Hetzner are almost always flagged immediately. Amazon knows that real shoppers do not browse from data centers.
  • Previously abused IPs that have been associated with scraping or spam carry a negative reputation.
  • VPN exit nodes are cataloged and treated with suspicion.
  • Shared hosting IPs, which host many different automated scripts, are likewise treated as high-risk.

This is precisely why proxy selection matters so much for Amazon scraping, and why residential proxies are the standard recommendation for this use case.

The Role of Proxies in Amazon Scraping

A proxy server acts as an intermediary between your scraper and Amazon. Instead of Amazon seeing your real IP address, it sees the IP of the proxy. This is foundational to any serious Amazon scraping operation for three reasons.

IP Rotation Prevents Rate Limiting

By rotating through a pool of proxy IPs, you distribute your requests across many different addresses. If you have access to a pool of 10,000 residential IPs, each IP only needs to handle a small fraction of your total request volume. Amazon sees what appears to be thousands of individual users browsing normally.
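Back-of-the-envelope sizing shows why pool size matters. With hypothetical numbers (100,000 requests per day spread over a 10,000-IP pool), each IP carries a trivially human-looking load:

```python
def requests_per_ip_per_hour(total_daily_requests: int, pool_size: int) -> float:
    """Average hourly request load landing on each proxy IP."""
    return total_daily_requests / pool_size / 24

load = requests_per_ip_per_hour(100_000, 10_000)
print(f"{load:.2f} requests per IP per hour")  # 0.42 -- indistinguishable from casual browsing
```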

Residential IPs Avoid Reputation Flags

Residential proxies route your traffic through real consumer IP addresses assigned by ISPs. To Amazon's detection systems, requests from residential IPs look identical to requests from genuine shoppers. This is the single biggest advantage over data center proxies, which are flagged on sight.

Geo-Targeting Enables Localized Data Collection

Amazon operates separate storefronts for different countries: amazon.com, amazon.co.uk, amazon.de, amazon.co.jp, amazon.in, and many more. Product availability, pricing, and reviews vary by region. By using proxies located in specific countries, you can access each storefront as a local user would, collecting accurate localized data.

For example, scraping amazon.de with a US-based IP may return different results than scraping it with a German residential IP. Geo-targeting ensures data accuracy.

Step-by-Step: Setting Up Amazon Scraping with Proxies

This section walks through the practical setup process from proxy selection to working code.

Step 1: Choosing the Right Proxy Type

Not all proxies are created equal. Here is a breakdown of the main types and their suitability for Amazon scraping:

Proxy Type          | Success Rate on Amazon | Cost        | Speed    | Recommendation
--------------------|------------------------|-------------|----------|--------------------------
Data center proxies | Very low (10-20%)      | Low         | Fast     | Not recommended
Residential proxies | High (85-95%)          | Medium      | Medium   | Strongly recommended
ISP proxies         | High (80-90%)          | Medium-High | Fast     | Good alternative
Mobile proxies      | Very high (90-98%)     | High        | Variable | Best for hardest targets
Residential proxies are the standard choice for Amazon scraping. They offer the best balance of success rate, cost, and scalability. Data center proxies are essentially useless against Amazon's detection systems in 2026 because their IP ranges are well-known and immediately flagged.

When evaluating a residential proxy provider, look for:

  • Large IP pool size (millions of IPs across many countries)
  • Granular geo-targeting (country, state, and city level)
  • Rotation options (automatic rotation per request or sticky sessions)
  • Low latency and high uptime
  • Pay-per-GB pricing so you only pay for what you use

SpyderProxy offers residential proxies with a pool of over 10 million IPs across 195+ countries, with both rotating and sticky session options that are well-suited for Amazon scraping at any scale.

Step 2: Python Environment Setup

Set up a clean Python environment for your scraper:

# Create a virtual environment
python -m venv amazon-scraper
source amazon-scraper/bin/activate  # On Windows: amazon-scraper\Scripts\activate

# Install dependencies
pip install requests beautifulsoup4 lxml fake-useragent selenium

Step 3: Basic Proxy Configuration

Here is how to configure proxy rotation with a residential proxy service. Most providers, including SpyderProxy, support HTTP/HTTPS proxy protocols with authentication:

import requests
from bs4 import BeautifulSoup

# SpyderProxy configuration
PROXY_HOST = "geo.spyderproxy.com"
PROXY_PORT = "11200"
PROXY_USER = "your_username"
PROXY_PASS = "your_password"

proxy_url = f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"

proxies = {
    "http": proxy_url,
    "https": proxy_url,
}

# Make a request through the proxy
response = requests.get(
    "https://www.amazon.com/dp/B09V3KXJPB",
    proxies=proxies,
    timeout=30,
)

print(f"Status: {response.status_code}")
print(f"Page length: {len(response.text)} characters")

Step 4: Proxy Rotation Configuration

For large-scale scraping, you need to rotate IPs automatically. Most residential proxy providers handle rotation server-side. With SpyderProxy, you can force a new IP on each request by appending a session identifier:

import random
import string
import requests

PROXY_HOST = "geo.spyderproxy.com"
PROXY_PORT = "11200"
PROXY_USER = "your_username"
PROXY_PASS = "your_password"

def get_rotating_proxy():
    """Generate a proxy URL with a random session ID to force IP rotation."""
    session_id = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    proxy_url = (
        f"http://{PROXY_USER}-session-{session_id}:{PROXY_PASS}"
        f"@{PROXY_HOST}:{PROXY_PORT}"
    )
    return {"http": proxy_url, "https": proxy_url}

def get_sticky_proxy(session_id: str):
    """Use the same IP for multiple requests within a session."""
    proxy_url = (
        f"http://{PROXY_USER}-session-{session_id}:{PROXY_PASS}"
        f"@{PROXY_HOST}:{PROXY_PORT}"
    )
    return {"http": proxy_url, "https": proxy_url}

Use rotating proxies for independent page fetches (product pages, search results). Use sticky sessions when you need to maintain state across multiple requests, such as navigating pagination or following a sequence of pages that Amazon expects to come from the same user.
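The difference between the two modes can be sketched by re-inlining the gateway URL construction from above with placeholder credentials: calling the rotating helper twice yields different session IDs (and thus different IPs), while the sticky helper pins one ID across a whole paginated crawl.

```python
import random
import string

# Placeholder credentials and gateway, matching the format used above
PROXY_USER, PROXY_PASS = "your_username", "your_password"
PROXY_HOST, PROXY_PORT = "geo.spyderproxy.com", "11200"

def proxy_for_session(session_id: str) -> str:
    """Build a gateway URL bound to a specific session ID."""
    return (
        f"http://{PROXY_USER}-session-{session_id}:{PROXY_PASS}"
        f"@{PROXY_HOST}:{PROXY_PORT}"
    )

def new_session_id() -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=8))

# Rotating: a fresh session ID (and therefore a fresh IP) per independent fetch
product_urls = [proxy_for_session(new_session_id()) for _ in range(2)]

# Sticky: one session ID reused across a paginated crawl
sid = new_session_id()
pagination_urls = [proxy_for_session(sid) for _ in range(2)]

assert product_urls[0] != product_urls[1]        # rotation forces new IPs
assert pagination_urls[0] == pagination_urls[1]  # sticky keeps the same IP
```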

Step 5: Country Targeting for Localized Amazon Data

To scrape regional Amazon stores with accurate localized data, specify the target country in your proxy configuration:

def get_geo_targeted_proxy(country_code: str):
    """
    Route requests through a proxy in a specific country.

    Supported codes: us, gb, de, fr, jp, in, ca, au, etc.
    """
    proxy_url = (
        f"http://{PROXY_USER}-country-{country_code}:{PROXY_PASS}"
        f"@{PROXY_HOST}:{PROXY_PORT}"
    )
    return {"http": proxy_url, "https": proxy_url}

# Scrape Amazon Germany with a German residential IP
de_proxies = get_geo_targeted_proxy("de")
response = requests.get("https://www.amazon.de/dp/B09V3KXJPB", proxies=de_proxies, timeout=30)

# Scrape Amazon UK with a British residential IP
gb_proxies = get_geo_targeted_proxy("gb")
response = requests.get("https://www.amazon.co.uk/dp/B09V3KXJPB", proxies=gb_proxies, timeout=30)

# Scrape Amazon Japan with a Japanese residential IP
jp_proxies = get_geo_targeted_proxy("jp")
response = requests.get("https://www.amazon.co.jp/dp/B09V3KXJPB", proxies=jp_proxies, timeout=30)

This matters because Amazon tailors product availability, pricing, shipping options, and even which sellers are shown based on the geographic location of the visitor.

Advanced Anti-Detection Techniques

Using proxies is necessary but not sufficient. To maintain high success rates, you need to make your scraper's traffic indistinguishable from a real browser. Here are the techniques that matter most.

User-Agent Rotation

Every HTTP request includes a User-Agent header that identifies the client. Sending the same User-Agent string on every request is a clear signal of automation. Rotate through realistic, up-to-date User-Agent strings:

from fake_useragent import UserAgent
import random

ua = UserAgent(browsers=["chrome", "edge"])

def get_realistic_headers():
    """Generate headers that closely mimic a real browser."""
    user_agent = ua.random

    # Determine browser type from UA string for consistent headers
    is_chrome = "Chrome" in user_agent and "Edg" not in user_agent

    headers = {
        "User-Agent": user_agent,
        "Accept": (
            "text/html,application/xhtml+xml,application/xml;"
            "q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8"
        ),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
    }

    if is_chrome:
        headers["sec-ch-ua"] = (
            '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"'
        )
        headers["sec-ch-ua-mobile"] = "?0"
        headers["sec-ch-ua-platform"] = '"Windows"'

    return headers

Request Timing Randomization

Humans do not browse at perfectly regular intervals. Adding randomized delays between requests is critical:

import time
import random

def human_delay(min_seconds=1.5, max_seconds=5.0):
    """Simulate human-like browsing delays."""
    # Use a log-normal distribution for more realistic timing
    delay = random.lognormvariate(0.5, 0.5)
    delay = max(min_seconds, min(delay, max_seconds))
    time.sleep(delay)

def cautious_delay():
    """Longer delay for use after receiving a warning signal."""
    time.sleep(random.uniform(10, 30))

Do not underestimate the importance of this. Many scrapers that use good proxies still get blocked because their request timing is unnaturally uniform.

Header Fingerprint Matching

Amazon inspects the full set of HTTP headers, not just the User-Agent. The order of headers matters, and inconsistencies between the User-Agent and other headers (like sec-ch-ua) will raise flags.

Key principles:

  • Header order should match the claimed browser. Chrome, Firefox, and Edge each send headers in a different order.
  • Include all headers a real browser would send. Missing Sec-Fetch-* headers on a Chrome User-Agent is suspicious.
  • Stay consistent within a session. If you start with a Chrome UA, do not switch to Firefox mid-session.
  • Include a realistic Referer header when navigating between pages. If you are visiting a product page, the Referer should be a search results page or Amazon's homepage.

JavaScript Rendering with Headless Browsers

Some Amazon pages require JavaScript execution to load product data. Amazon also uses JavaScript-based fingerprinting to detect bots. When you encounter pages that return incomplete data or challenge pages, switch to a headless browser:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def create_stealth_driver(proxy_url: str | None = None):
    """Create a Selenium WebDriver with anti-detection measures.

    Note: Chrome's --proxy-server flag ignores embedded credentials.
    Authenticate by whitelisting your IP with your proxy provider, or
    use selenium-wire for username/password proxy authentication.
    """
    options = Options()

    # Core stealth settings
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--disable-infobars")
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--lang=en-US")

    # Proxy configuration
    if proxy_url:
        options.add_argument(f"--proxy-server={proxy_url}")

    # Disable automation flags
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)

    driver = webdriver.Chrome(options=options)

    # Override navigator.webdriver property
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {
            "source": """
                Object.defineProperty(navigator, 'webdriver', {
                    get: () => undefined
                });

                // Override chrome runtime
                window.chrome = { runtime: {} };

                // Override permissions
                const originalQuery = window.navigator.permissions.query;
                window.navigator.permissions.query = (parameters) => (
                    parameters.name === 'notifications'
                        ? Promise.resolve({ state: Notification.permission })
                        : originalQuery(parameters)
                );
            """
        },
    )

    return driver

Session Management for Multi-Page Scraping

When scraping related pages (such as all reviews for a product, or paginated search results), you should maintain session consistency. This means using the same IP, cookies, and headers across related requests:

import requests

class AmazonSession:
    """Manage a consistent session for multi-page scraping."""

    def __init__(self, proxy_session_id: str, country: str = "us"):
        self.session = requests.Session()
        self.proxy_session_id = proxy_session_id

        # Set sticky proxy for this session
        proxy_url = (
            f"http://{PROXY_USER}-session-{proxy_session_id}"
            f"-country-{country}:{PROXY_PASS}"
            f"@{PROXY_HOST}:{PROXY_PORT}"
        )
        self.session.proxies = {"http": proxy_url, "https": proxy_url}
        self.session.headers.update(get_realistic_headers())

    def get_product_page(self, asin: str):
        """Fetch a product page."""
        url = f"https://www.amazon.com/dp/{asin}"
        human_delay()
        return self.session.get(url, timeout=30)

    def get_reviews(self, asin: str, page: int = 1):
        """Fetch product reviews with proper Referer."""
        self.session.headers["Referer"] = f"https://www.amazon.com/dp/{asin}"
        url = (
            f"https://www.amazon.com/product-reviews/{asin}"
            f"?pageNumber={page}"
        )
        human_delay()
        return self.session.get(url, timeout=30)

    def close(self):
        self.session.close()

Code Examples

Here are two complete, working examples: one using requests + BeautifulSoup for lightweight scraping, and one using Selenium for JavaScript-heavy pages.

Example 1: Product Data Scraper (Requests + BeautifulSoup)

"""
Amazon product scraper using requests and BeautifulSoup.
Extracts product title, price, rating, and review count.
"""

import requests
from bs4 import BeautifulSoup
import random
import string
import time
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Proxy configuration
PROXY_HOST = "geo.spyderproxy.com"
PROXY_PORT = "11200"
PROXY_USER = "your_username"
PROXY_PASS = "your_password"

def get_proxy():
    """Get a rotating proxy with a random session ID."""
    sid = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    url = (
        f"http://{PROXY_USER}-session-{sid}:{PROXY_PASS}"
        f"@{PROXY_HOST}:{PROXY_PORT}"
    )
    return {"http": url, "https": url}

def get_headers():
    """Return realistic browser headers."""
    user_agents = [
        (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
            " (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
        ),
        (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
            " (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
        ),
        (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
            " (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0"
        ),
    ]
    return {
        "User-Agent": random.choice(user_agents),
        "Accept": (
            "text/html,application/xhtml+xml,application/xml;"
            "q=0.9,image/avif,image/webp,*/*;q=0.8"
        ),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
    }

def parse_product_page(html: str) -> dict:
    """Extract product data from an Amazon product page."""
    soup = BeautifulSoup(html, "lxml")
    data = {}

    # Product title
    title_el = soup.find("span", {"id": "productTitle"})
    data["title"] = title_el.get_text(strip=True) if title_el else None

    # Price
    price_el = soup.find("span", {"class": "a-price-whole"})
    price_frac = soup.find("span", {"class": "a-price-fraction"})
    if price_el:
        whole = price_el.get_text(strip=True).replace(".", "").replace(",", "")
        fraction = price_frac.get_text(strip=True) if price_frac else "00"
        data["price"] = f"{whole}.{fraction}"
    else:
        data["price"] = None

    # Rating
    rating_el = soup.find("span", {"class": "a-icon-alt"})
    if rating_el and "out of" in rating_el.get_text():
        data["rating"] = rating_el.get_text(strip=True)
    else:
        data["rating"] = None

    # Review count
    review_el = soup.find("span", {"id": "acrCustomerReviewCount"})
    data["review_count"] = review_el.get_text(strip=True) if review_el else None

    # Availability
    avail_el = soup.find("div", {"id": "availability"})
    data["availability"] = avail_el.get_text(strip=True) if avail_el else None

    # ASIN
    asin_el = soup.find("input", {"id": "ASIN"})
    data["asin"] = asin_el["value"] if asin_el else None

    return data

def scrape_product(asin: str, max_retries: int = 3) -> dict:
    """Scrape a single Amazon product page with retry logic."""
    url = f"https://www.amazon.com/dp/{asin}"

    for attempt in range(max_retries):
        try:
            proxy = get_proxy()
            headers = get_headers()

            response = requests.get(
                url, headers=headers, proxies=proxy, timeout=30
            )

            if response.status_code == 200:
                if "captcha" in response.text.lower():
                    logger.warning(
                        f"CAPTCHA detected on attempt {attempt + 1} for {asin}"
                    )
                    time.sleep(random.uniform(5, 15))
                    continue

                product_data = parse_product_page(response.text)
                product_data["url"] = url
                product_data["status"] = "success"
                logger.info(f"Successfully scraped {asin}: {product_data['title']}")
                return product_data

            elif response.status_code == 503:
                logger.warning(f"503 response for {asin}, retrying...")
                time.sleep(random.uniform(3, 10))

            elif response.status_code == 404:
                logger.info(f"Product {asin} not found (404)")
                return {"asin": asin, "status": "not_found"}

            else:
                logger.warning(
                    f"Status {response.status_code} for {asin}"
                    f" on attempt {attempt + 1}"
                )

        except requests.exceptions.RequestException as e:
            logger.error(f"Request error for {asin}: {e}")
            time.sleep(random.uniform(2, 5))

    return {"asin": asin, "status": "failed"}

def scrape_multiple_products(asins: list) -> list:
    """Scrape multiple products with delays between requests."""
    results = []
    for i, asin in enumerate(asins):
        logger.info(f"Scraping product {i + 1}/{len(asins)}: {asin}")
        result = scrape_product(asin)
        results.append(result)

        # Random delay between products
        if i < len(asins) - 1:
            delay = random.uniform(2.0, 6.0)
            time.sleep(delay)

    return results

# Usage
if __name__ == "__main__":
    asins_to_scrape = [
        "B09V3KXJPB",
        "B0BSHF7WHW",
        "B0D5926JJH",
    ]

    results = scrape_multiple_products(asins_to_scrape)

    # Save results
    with open("amazon_products.json", "w") as f:
        json.dump(results, f, indent=2)

    # Summary
    successful = sum(1 for r in results if r.get("status") == "success")
    print(f"\nScraped {successful}/{len(results)} products successfully")

Example 2: Selenium-Based Scraper for JavaScript-Heavy Pages

"""
Amazon scraper using Selenium for pages that require JavaScript rendering.
Handles dynamic content loading, infinite scroll, and CAPTCHA detection.
"""

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import time
import random
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# SpyderProxy configuration
PROXY_HOST = "geo.spyderproxy.com"
PROXY_PORT = "11200"
PROXY_USER = "your_username"
PROXY_PASS = "your_password"

def create_driver():
    """Create a stealth Chrome driver routed through SpyderProxy.

    Chrome's --proxy-server flag does not accept embedded credentials;
    authenticate by whitelisting your machine's IP with the provider,
    or use selenium-wire for username/password authentication.
    """
    options = Options()
    options.add_argument(f"--proxy-server=http://{PROXY_HOST}:{PROXY_PORT}")
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--disable-infobars")
    options.add_argument("--lang=en-US")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)

    driver = webdriver.Chrome(options=options)

    # Remove webdriver flag
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {
            "source": """
                Object.defineProperty(navigator, 'webdriver', {
                    get: () => undefined
                });
            """
        },
    )

    return driver

def simulate_human_browsing(driver):
    """Simulate human-like scrolling and mouse movement."""
    # Scroll down gradually
    total_height = driver.execute_script("return document.body.scrollHeight")
    current_position = 0

    while current_position < total_height * 0.7:
        scroll_amount = random.randint(200, 500)
        current_position += scroll_amount
        driver.execute_script(f"window.scrollTo(0, {current_position});")
        time.sleep(random.uniform(0.3, 1.0))

def scrape_search_results(query: str, max_pages: int = 3) -> list:
    """Scrape Amazon search results for a given query."""
    driver = create_driver()
    all_products = []

    try:
        for page in range(1, max_pages + 1):
            url = (
                f"https://www.amazon.com/s?k={query.replace(' ', '+')}"
                f"&page={page}"
            )
            logger.info(f"Scraping search page {page}: {url}")
            driver.get(url)

            # Wait for search results to load
            try:
                WebDriverWait(driver, 15).until(
                    EC.presence_of_element_located(
                        (By.CSS_SELECTOR, "[data-component-type='s-search-result']")
                    )
                )
            except TimeoutException:
                logger.warning(f"Timeout waiting for results on page {page}")
                # Check for CAPTCHA
                if "captcha" in driver.page_source.lower():
                    logger.error("CAPTCHA detected. Stopping.")
                    break
                continue

            # Simulate human browsing
            simulate_human_browsing(driver)

            # Parse results
            soup = BeautifulSoup(driver.page_source, "lxml")
            results = soup.find_all(
                "div", {"data-component-type": "s-search-result"}
            )

            for result in results:
                product = {}
                product["asin"] = result.get("data-asin", "")

                # Title
                title_el = result.find(
                    "span", {"class": "a-text-normal"}
                )
                product["title"] = (
                    title_el.get_text(strip=True) if title_el else None
                )

                # Price
                price_whole = result.find("span", {"class": "a-price-whole"})
                price_frac = result.find("span", {"class": "a-price-fraction"})
                if price_whole:
                    whole = price_whole.get_text(strip=True).replace(".", "")
                    frac = (
                        price_frac.get_text(strip=True) if price_frac else "00"
                    )
                    product["price"] = f"${whole}.{frac}"
                else:
                    product["price"] = None

                # Rating
                rating_el = result.find("span", {"class": "a-icon-alt"})
                product["rating"] = (
                    rating_el.get_text(strip=True) if rating_el else None
                )

                # Review count
                review_link = result.select_one(
                    "a.a-link-normal.s-underline-text"
                )
                product["reviews"] = (
                    review_link.get_text(strip=True) if review_link else None
                )

                if product["asin"]:
                    all_products.append(product)

            logger.info(
                f"Page {page}: found {len(results)} products"
                f" ({len(all_products)} total)"
            )

            # Delay between pages
            if page < max_pages:
                time.sleep(random.uniform(3, 8))

    finally:
        driver.quit()

    return all_products

# Usage
if __name__ == "__main__":
    products = scrape_search_results("wireless headphones", max_pages=3)

    with open("search_results.json", "w") as f:
        json.dump(products, f, indent=2)

    print(f"Scraped {len(products)} products total")

Scaling Your Amazon Scraper

Once your scraper works reliably for small batches, you will eventually need to scale it to handle thousands or millions of product pages. Here is how to approach that.

Concurrent Requests with Thread Pools

Use Python's concurrent.futures module to run multiple requests in parallel while respecting rate limits:

from concurrent.futures import ThreadPoolExecutor, as_completed
import time
import random

def scrape_with_concurrency(
    asins: list, max_workers: int = 5, delay_range: tuple = (1.0, 3.0)
) -> list:
    """
    Scrape multiple ASINs concurrently.

    Keep max_workers moderate (5-10) to avoid detection.
    Higher concurrency is possible with a larger proxy pool.
    """
    results = []

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_asin = {}
        for asin in asins:
            # Stagger submission to avoid burst patterns
            time.sleep(random.uniform(*delay_range) / max_workers)
            future = executor.submit(scrape_product, asin)
            future_to_asin[future] = asin

        for future in as_completed(future_to_asin):
            asin = future_to_asin[future]
            try:
                result = future.result()
                results.append(result)
            except Exception as e:
                logger.error(f"Error scraping {asin}: {e}")
                results.append({"asin": asin, "status": "error", "error": str(e)})

    return results

Building a Data Pipeline

For production scraping operations, you need a proper data pipeline:

  1. Job queue: Use Redis or a database to manage a queue of ASINs or URLs to scrape. This allows you to resume after failures, prioritize certain products, and distribute work across multiple machines.
  2. Rate limiter: Implement a global rate limiter that caps total requests per second across all workers. This prevents overwhelming your proxy pool or triggering Amazon's aggregate rate limits.
  3. Result storage: Store scraped data in a structured format. For most use cases, a PostgreSQL database or structured JSON/CSV files work well. Include metadata like scrape timestamp, proxy country used, and HTTP status code.
  4. Monitoring and alerting: Track your success rate, CAPTCHA rate, and error rate in real time. A sudden spike in CAPTCHAs or 503 responses means you need to back off or adjust your approach.
  5. Scheduling: Use a task scheduler like Celery or APScheduler to run scraping jobs at regular intervals for ongoing price monitoring.
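The global rate limiter mentioned above can be sketched as a token bucket shared across worker threads. The class name and parameters below are illustrative, not from any particular library:

```python
import threading
import time

class GlobalRateLimiter:
    """Token-bucket limiter shared by all worker threads.

    rate:  maximum requests per second across the whole scraper.
    burst: how many requests may fire back-to-back after idle time.
    """

    def __init__(self, rate: float, burst: int = 5):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill tokens based on elapsed time, capped at capacity
                self.tokens = min(
                    self.capacity, self.tokens + (now - self.updated) * self.rate
                )
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            # Sleep outside the lock so other threads can refill/consume
            time.sleep(wait)

# Each worker calls limiter.acquire() before sending a request
limiter = GlobalRateLimiter(rate=2.0, burst=3)
```

Because the lock and token count live in one object, the cap holds across all threads of a worker process; for multi-machine deployments you would move the same logic into Redis.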

Optimizing Proxy Usage and Cost

Proxy bandwidth is typically the largest cost in a scraping operation. Optimize it by:

  • Stripping images and media: When using headless browsers, block image and media loading to reduce bandwidth consumption significantly.
  • Caching static content: Cache Amazon's CSS, JavaScript, and font files locally instead of downloading them on every request.
  • Targeting specific data: If you only need the price, consider whether a lighter page variant (such as the mobile site) exposes it, rather than downloading the full desktop product page.
  • Compressing responses: Ensure your requests include Accept-Encoding: gzip, deflate, br so Amazon sends compressed responses.

Legal and Ethical Considerations

Web scraping exists in a complex legal landscape. While we cannot provide legal advice, here are the key considerations you should be aware of.

Respect robots.txt

Amazon's robots.txt file specifies which paths automated bots are allowed and disallowed from accessing. While robots.txt is not legally binding in all jurisdictions, respecting it demonstrates good faith and ethical intent.

Terms of Service

Amazon's Terms of Service restrict automated access to the site. Violating the ToS can result in account bans and, in theory, legal action. Whether a ToS violation gives rise to an enforceable legal claim varies by jurisdiction and is an evolving area of law.

Rate Limiting and Server Impact

Regardless of legality, sending an excessive volume of requests that impacts Amazon's infrastructure is irresponsible. Always implement rate limiting in your scraper to avoid causing harm.

Data Usage

How you use scraped data matters. Using publicly available product data for competitive analysis is very different from scraping personal information or copyrighted content. Consider data protection regulations like GDPR if you are collecting any personal data.

Best Practices

  • Scrape only the data you actually need
  • Implement reasonable rate limits
  • Do not scrape personal information such as reviewer identities
  • Store and handle data responsibly
  • Consult with a legal professional if your use case involves significant commercial scale
  • Be aware of regulations in your jurisdiction, including the CFAA in the United States and equivalent laws elsewhere

Common Errors and How to Fix Them

HTTP 503 Service Unavailable

Cause: Amazon's anti-bot system has flagged your request. This is the most common scraping error.

Fix:

  • Switch to residential proxies if using data center IPs
  • Slow down your request rate
  • Rotate User-Agent strings
  • Add missing browser headers
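These fixes can be combined into a retry loop that backs off exponentially and rotates to a fresh proxy on each attempt. The `fetch_fn` parameter and the response shape are placeholders for your own request code:

```python
import random
import time

def fetch_with_backoff(url, fetch_fn, proxies, max_retries=4):
    """Retry 503 responses with exponential backoff and a fresh proxy.

    fetch_fn(url, proxy) stands in for your own request function; it is
    assumed to return an object with a .status_code attribute.
    """
    for attempt in range(max_retries + 1):
        proxy = random.choice(proxies)  # rotate to a new IP every attempt
        resp = fetch_fn(url, proxy)
        if resp.status_code != 503:
            return resp
        if attempt < max_retries:
            # 2s, 4s, 8s, ... plus jitter so retries don't look mechanical
            time.sleep((2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError(f"Still blocked after {max_retries} retries: {url}")
```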

CAPTCHA / Challenge Page Returned

Cause: Your request is suspicious but not conclusively automated. Amazon is asking for human verification.

Fix:

  • Use higher-quality residential proxies
  • Ensure your TLS fingerprint matches a real browser (consider using curl_cffi or tls-client instead of requests)
  • Add randomized delays between requests
  • Reduce concurrency
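Before applying any of these fixes, the scraper has to recognize the challenge page in the first place, since it is returned with a normal 200 status. A minimal heuristic, using marker strings that currently appear on Amazon's robot-check page (they may change without notice):

```python
def looks_like_captcha(html: str) -> bool:
    """Heuristic check for Amazon's robot-check interstitial.

    Marker strings are taken from the challenge page as observed at
    the time of writing; they are not guaranteed to stay stable.
    """
    markers = (
        "Enter the characters you see below",
        "api-services-support@amazon.com",
        "/errors/validateCaptcha",
    )
    return any(marker in html for marker in markers)
```

Call it on every response body and count hits; a rising CAPTCHA rate is your signal to slow down or swap proxies, as discussed in the monitoring section above.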

HTTP 403 Forbidden

Cause: Your IP or request has been explicitly blocked.

Fix:

  • Rotate to a new IP immediately
  • Check if your proxy provider's IPs are burned in Amazon's database
  • Verify your headers are complete and consistent
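One way to keep headers "complete and consistent" is to centralize header construction in a single helper so every request sends the same set. The values below are typical desktop-Chrome defaults, shown as examples rather than a guaranteed fingerprint:

```python
def browser_headers(user_agent: str) -> dict:
    """Build a consistent set of headers mimicking a desktop browser.

    Keep every value consistent with the User-Agent you send; mixing a
    Chrome UA with Firefox-style Accept headers is itself a red flag.
    """
    return {
        "User-Agent": user_agent,
        "Accept": (
            "text/html,application/xhtml+xml,application/xml;q=0.9,"
            "image/avif,image/webp,*/*;q=0.8"
        ),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }
```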

Empty or Partial Page Content

Cause: Amazon served a JavaScript-dependent page and your scraper does not execute JavaScript.

Fix:

  • Switch to Selenium or Playwright for JavaScript rendering
  • Check if the data is available through an alternative endpoint or in a