Amazon is the largest e-commerce marketplace on the planet, with over 350 million products listed across dozens of regional storefronts. For businesses that depend on competitive pricing intelligence, product research, or market analysis, Amazon's product data is an invaluable resource.
But here is the problem: Amazon invests heavily in anti-scraping technology. Naive scraping attempts get blocked within minutes, sometimes seconds. IP bans, CAPTCHAs, and behavioral analysis systems are all designed to stop automated access in its tracks.
This guide covers everything you need to know about scraping Amazon reliably in 2026, from understanding why you get blocked in the first place to building a production-grade scraper that uses proxy rotation, anti-detection techniques, and intelligent request management to gather data at scale.
Before diving into the technical details, it is worth understanding why so many businesses scrape Amazon data in the first place. The use cases are diverse, but they all share one common thread: data-driven decision making.
Pricing on Amazon changes constantly. Sellers adjust prices multiple times per day based on competition, demand, and inventory levels. If you sell on Amazon (or compete against Amazon sellers), automated price monitoring lets you react to those moves as they happen.
Without automated scraping, keeping tabs on even a few hundred products becomes a full-time job.
Understanding what your competitors are doing on Amazon provides a serious strategic advantage. Scraping lets you monitor their listings, pricing moves, and customer feedback continuously.
For brands looking to launch new products, Amazon data is one of the best sources for market validation, letting you gauge demand and competition in a niche before committing to inventory.
Customer reviews on Amazon represent millions of unfiltered opinions about products in every category imaginable, making scraped reviews a rich input for sentiment analysis and product improvement.
Amazon does not block scrapers out of spite. There are legitimate technical and business reasons behind their anti-bot systems. Understanding these reasons helps you build scrapers that avoid triggering detection in the first place.
The simplest detection method is rate limiting. Amazon monitors the number of requests coming from each IP address. When a single IP sends hundreds of requests per minute, far exceeding what any human user would generate, it gets flagged automatically.
Rate-based detection looks at request frequency, volume, and timing patterns from each IP address.
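On the scraper's side, the defensive counterpart is throttling yourself before Amazon does it for you. Here is a minimal token-bucket limiter sketch (the rate numbers are illustrative, not a known safe threshold):

```python
import time

class TokenBucket:
    """Client-side rate limiter: at most `rate` requests per second on
    average, with short bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate                  # tokens replenished per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(
                self.capacity, self.tokens + (now - self.last) * self.rate
            )
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the next token to accrue
            time.sleep((1 - self.tokens) / self.rate)

# Illustrative: average one request every 5 seconds, bursts of up to 2
limiter = TokenBucket(rate=0.2, capacity=2)
```

Calling `limiter.acquire()` before each fetch keeps a single IP's request rate inside a human-plausible envelope; with rotating proxies, you would keep one bucket per proxy session rather than one global bucket.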
Modern anti-bot systems go far beyond IP-based detection. Amazon uses browser fingerprinting to identify automated traffic by analyzing dozens of technical signals. The TLS handshake alone is revealing: Python's requests library has a very different TLS fingerprint than Chrome.

When Amazon suspects automated traffic but is not certain, it serves a CAPTCHA challenge page. You will recognize this as a page asking you to solve an image puzzle or type characters from a distorted image. These challenges are designed to be easy for humans and difficult for bots.
Amazon typically serves CAPTCHAs when a request pattern looks suspicious but is not conclusively automated.
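Before parsing any response, it helps to check whether you received a challenge page instead of content. Here is a small heuristic (the marker strings are assumptions based on Amazon's typical challenge page, not a stable contract):

```python
# Markers commonly seen on Amazon's challenge page (assumptions, not a spec)
CAPTCHA_MARKERS = (
    "validatecaptcha",                            # challenge form action
    "type the characters you see in this image",
    "enter the characters you see below",
)

def looks_like_captcha(html: str) -> bool:
    """Heuristically detect an Amazon CAPTCHA challenge page."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```

Treating a detected challenge as a retry signal (new IP, longer delay) works far better than trying to solve it programmatically.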
Amazon maintains databases of IP reputation scores, both internally and through third-party services. Certain IP ranges are pre-flagged as high-risk, with data center and hosting-provider ranges chief among them.
This is precisely why proxy selection matters so much for Amazon scraping, and why residential proxies are the standard recommendation for this use case.
A proxy server acts as an intermediary between your scraper and Amazon. Instead of Amazon seeing your real IP address, it sees the IP of the proxy. This is foundational to any serious Amazon scraping operation for three reasons.
By rotating through a pool of proxy IPs, you distribute your requests across many different addresses. If you have access to a pool of 10,000 residential IPs, each IP only needs to handle a small fraction of your total request volume. Amazon sees what appears to be thousands of individual users browsing normally.
Residential proxies route your traffic through real consumer IP addresses assigned by ISPs. To Amazon's detection systems, requests from residential IPs look identical to requests from genuine shoppers. This is the single biggest advantage over data center proxies, which are flagged on sight.
Amazon operates separate storefronts for different countries: amazon.com, amazon.co.uk, amazon.de, amazon.co.jp, amazon.in, and many more. Product availability, pricing, and reviews vary by region. By using proxies located in specific countries, you can access each storefront as a local user would, collecting accurate localized data.
For example, scraping amazon.de with a US-based IP may return different results than scraping it with a German residential IP. Geo-targeting ensures data accuracy.
This section walks through the practical setup process from proxy selection to working code.
Not all proxies are created equal. Here is a breakdown of the main types and their suitability for Amazon scraping:
| Proxy Type | Success Rate on Amazon | Cost | Speed | Recommendation |
|---|---|---|---|---|
| Data center proxies | Very low (10-20%) | Low | Fast | Not recommended |
| Residential proxies | High (85-95%) | Medium | Medium | Strongly recommended |
| ISP proxies | High (80-90%) | Medium-High | Fast | Good alternative |
| Mobile proxies | Very high (90-98%) | High | Variable | Best for hardest targets |
Residential proxies are the standard choice for Amazon scraping. They offer the best balance of success rate, cost, and scalability. Data center proxies are essentially useless against Amazon's detection systems in 2026 because their IP ranges are well-known and immediately flagged.
When evaluating a residential proxy provider, look for a large IP pool, broad country coverage, flexible rotation and sticky-session controls, and transparent bandwidth pricing.
SpyderProxy offers residential proxies with a pool of over 10 million IPs across 195+ countries, with both rotating and sticky session options that are well-suited for Amazon scraping at any scale.
Set up a clean Python environment for your scraper:
# Create a virtual environment
python -m venv amazon-scraper
source amazon-scraper/bin/activate # On Windows: amazon-scraper\Scripts\activate
# Install dependencies
pip install requests beautifulsoup4 lxml fake-useragent selenium
Here is how to configure proxy rotation with a residential proxy service. Most providers, including SpyderProxy, support HTTP/HTTPS proxy protocols with authentication:
import requests
from bs4 import BeautifulSoup
# SpyderProxy configuration
PROXY_HOST = "geo.spyderproxy.com"
PROXY_PORT = "11200"
PROXY_USER = "your_username"
PROXY_PASS = "your_password"
proxy_url = f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"
proxies = {
"http": proxy_url,
"https": proxy_url,
}
# Make a request through the proxy
response = requests.get(
"https://www.amazon.com/dp/B09V3KXJPB",
proxies=proxies,
timeout=30,
)
print(f"Status: {response.status_code}")
print(f"Page length: {len(response.text)} characters")
For large-scale scraping, you need to rotate IPs automatically. Most residential proxy providers handle rotation server-side. With SpyderProxy, you can force a new IP on each request by appending a session identifier:
import random
import string
import requests
PROXY_HOST = "geo.spyderproxy.com"
PROXY_PORT = "11200"
PROXY_USER = "your_username"
PROXY_PASS = "your_password"
def get_rotating_proxy():
"""Generate a proxy URL with a random session ID to force IP rotation."""
session_id = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
proxy_url = (
f"http://{PROXY_USER}-session-{session_id}:{PROXY_PASS}"
f"@{PROXY_HOST}:{PROXY_PORT}"
)
return {"http": proxy_url, "https": proxy_url}
def get_sticky_proxy(session_id: str):
"""Use the same IP for multiple requests within a session."""
proxy_url = (
f"http://{PROXY_USER}-session-{session_id}:{PROXY_PASS}"
f"@{PROXY_HOST}:{PROXY_PORT}"
)
return {"http": proxy_url, "https": proxy_url}
Use rotating proxies for independent page fetches (product pages, search results). Use sticky sessions when you need to maintain state across multiple requests, such as navigating pagination or following a sequence of pages that Amazon expects to come from the same user.
To scrape regional Amazon stores with accurate localized data, specify the target country in your proxy configuration:
def get_geo_targeted_proxy(country_code: str):
"""
Route requests through a proxy in a specific country.
Supported codes: us, gb, de, fr, jp, in, ca, au, etc.
"""
proxy_url = (
f"http://{PROXY_USER}-country-{country_code}:{PROXY_PASS}"
f"@{PROXY_HOST}:{PROXY_PORT}"
)
return {"http": proxy_url, "https": proxy_url}
# Scrape Amazon Germany with a German residential IP
de_proxies = get_geo_targeted_proxy("de")
response = requests.get(
    "https://www.amazon.de/dp/B09V3KXJPB", proxies=de_proxies, timeout=30
)

# Scrape Amazon UK with a British residential IP
gb_proxies = get_geo_targeted_proxy("gb")
response = requests.get(
    "https://www.amazon.co.uk/dp/B09V3KXJPB", proxies=gb_proxies, timeout=30
)

# Scrape Amazon Japan with a Japanese residential IP
jp_proxies = get_geo_targeted_proxy("jp")
response = requests.get(
    "https://www.amazon.co.jp/dp/B09V3KXJPB", proxies=jp_proxies, timeout=30
)
This matters because Amazon tailors product availability, pricing, shipping options, and even which sellers are shown based on the geographic location of the visitor.
Using proxies is necessary but not sufficient. To maintain high success rates, you need to make your scraper's traffic indistinguishable from a real browser. Here are the techniques that matter most.
Every HTTP request includes a User-Agent header that identifies the client. Sending the same User-Agent string on every request is a clear signal of automation. Rotate through realistic, up-to-date User-Agent strings:
from fake_useragent import UserAgent
import random
ua = UserAgent(browsers=["Chrome", "Edge"])
def get_realistic_headers():
"""Generate headers that closely mimic a real browser."""
user_agent = ua.random
# Determine browser type from UA string for consistent headers
is_chrome = "Chrome" in user_agent and "Edg" not in user_agent
headers = {
"User-Agent": user_agent,
"Accept": (
"text/html,application/xhtml+xml,application/xml;"
"q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8"
),
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Cache-Control": "max-age=0",
}
if is_chrome:
headers["sec-ch-ua"] = (
'"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"'
)
headers["sec-ch-ua-mobile"] = "?0"
headers["sec-ch-ua-platform"] = '"Windows"'
return headers
Humans do not browse at perfectly regular intervals. Adding randomized delays between requests is critical:
import time
import random
def human_delay(min_seconds=1.5, max_seconds=5.0):
"""Simulate human-like browsing delays."""
# Use a log-normal distribution for more realistic timing
delay = random.lognormvariate(0.5, 0.5)
delay = max(min_seconds, min(delay, max_seconds))
time.sleep(delay)
def cautious_delay():
"""Longer delay for use after receiving a warning signal."""
time.sleep(random.uniform(10, 30))
Do not underestimate the importance of this. Many scrapers that use good proxies still get blocked because their request timing is unnaturally uniform.
Amazon inspects the full set of HTTP headers, not just the User-Agent. The order of headers matters, and inconsistencies between the User-Agent and other headers (like sec-ch-ua) will raise flags.
Key principles: keep every header consistent with the browser your User-Agent claims to be. Omitting the Sec-Fetch-* headers on a Chrome User-Agent, for example, is suspicious.

Some Amazon pages require JavaScript execution to load product data. Amazon also uses JavaScript-based fingerprinting to detect bots. When you encounter pages that return incomplete data or challenge pages, switch to a headless browser:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import random
def create_stealth_driver(proxy_url: str = None):
"""Create a Selenium WebDriver with anti-detection measures."""
options = Options()
# Core stealth settings
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--disable-infobars")
options.add_argument("--window-size=1920,1080")
options.add_argument("--lang=en-US")
    # Proxy configuration (host:port only; Chrome does not accept embedded
    # credentials in --proxy-server, so authenticate via IP allowlisting
    # with your provider or a proxy-auth extension)
    if proxy_url:
        options.add_argument(f"--proxy-server={proxy_url}")
# Disable automation flags
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
driver = webdriver.Chrome(options=options)
# Override navigator.webdriver property
driver.execute_cdp_cmd(
"Page.addScriptToEvaluateOnNewDocument",
{
"source": """
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
});
// Override chrome runtime
window.chrome = { runtime: {} };
// Override permissions
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: originalQuery(parameters)
);
"""
},
)
return driver
When scraping related pages (such as all reviews for a product, or paginated search results), you should maintain session consistency. This means using the same IP, cookies, and headers across related requests:
import requests
class AmazonSession:
"""Manage a consistent session for multi-page scraping."""
def __init__(self, proxy_session_id: str, country: str = "us"):
self.session = requests.Session()
self.proxy_session_id = proxy_session_id
# Set sticky proxy for this session
proxy_url = (
f"http://{PROXY_USER}-session-{proxy_session_id}"
f"-country-{country}:{PROXY_PASS}"
f"@{PROXY_HOST}:{PROXY_PORT}"
)
self.session.proxies = {"http": proxy_url, "https": proxy_url}
self.session.headers.update(get_realistic_headers())
def get_product_page(self, asin: str):
"""Fetch a product page."""
url = f"https://www.amazon.com/dp/{asin}"
human_delay()
return self.session.get(url, timeout=30)
def get_reviews(self, asin: str, page: int = 1):
"""Fetch product reviews with proper Referer."""
self.session.headers["Referer"] = f"https://www.amazon.com/dp/{asin}"
url = (
f"https://www.amazon.com/product-reviews/{asin}"
f"?pageNumber={page}"
)
human_delay()
return self.session.get(url, timeout=30)
def close(self):
self.session.close()
Here are two complete, working examples: one using requests + BeautifulSoup for lightweight scraping, and one using Selenium for JavaScript-heavy pages.
"""
Amazon product scraper using requests and BeautifulSoup.
Extracts product title, price, rating, and review count.
"""
import requests
from bs4 import BeautifulSoup
import random
import string
import time
import json
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Proxy configuration
PROXY_HOST = "geo.spyderproxy.com"
PROXY_PORT = "11200"
PROXY_USER = "your_username"
PROXY_PASS = "your_password"
def get_proxy():
"""Get a rotating proxy with a random session ID."""
sid = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
url = (
f"http://{PROXY_USER}-session-{sid}:{PROXY_PASS}"
f"@{PROXY_HOST}:{PROXY_PORT}"
)
return {"http": url, "https": url}
def get_headers():
"""Return realistic browser headers."""
user_agents = [
(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
" (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
),
(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
" (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
),
(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
" (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0"
),
]
return {
"User-Agent": random.choice(user_agents),
"Accept": (
"text/html,application/xhtml+xml,application/xml;"
"q=0.9,image/avif,image/webp,*/*;q=0.8"
),
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
}
def parse_product_page(html: str) -> dict:
"""Extract product data from an Amazon product page."""
soup = BeautifulSoup(html, "lxml")
data = {}
# Product title
title_el = soup.find("span", {"id": "productTitle"})
data["title"] = title_el.get_text(strip=True) if title_el else None
# Price
price_el = soup.find("span", {"class": "a-price-whole"})
price_frac = soup.find("span", {"class": "a-price-fraction"})
if price_el:
whole = price_el.get_text(strip=True).replace(".", "").replace(",", "")
fraction = price_frac.get_text(strip=True) if price_frac else "00"
data["price"] = f"{whole}.{fraction}"
else:
data["price"] = None
# Rating
rating_el = soup.find("span", {"class": "a-icon-alt"})
if rating_el and "out of" in rating_el.get_text():
data["rating"] = rating_el.get_text(strip=True)
else:
data["rating"] = None
# Review count
review_el = soup.find("span", {"id": "acrCustomerReviewCount"})
data["review_count"] = review_el.get_text(strip=True) if review_el else None
# Availability
avail_el = soup.find("div", {"id": "availability"})
data["availability"] = avail_el.get_text(strip=True) if avail_el else None
# ASIN
asin_el = soup.find("input", {"id": "ASIN"})
data["asin"] = asin_el["value"] if asin_el else None
return data
def scrape_product(asin: str, max_retries: int = 3) -> dict:
"""Scrape a single Amazon product page with retry logic."""
url = f"https://www.amazon.com/dp/{asin}"
for attempt in range(max_retries):
try:
proxy = get_proxy()
headers = get_headers()
response = requests.get(
url, headers=headers, proxies=proxy, timeout=30
)
if response.status_code == 200:
if "captcha" in response.text.lower():
logger.warning(
f"CAPTCHA detected on attempt {attempt + 1} for {asin}"
)
time.sleep(random.uniform(5, 15))
continue
product_data = parse_product_page(response.text)
product_data["url"] = url
product_data["status"] = "success"
logger.info(f"Successfully scraped {asin}: {product_data['title']}")
return product_data
elif response.status_code == 503:
logger.warning(f"503 response for {asin}, retrying...")
time.sleep(random.uniform(3, 10))
elif response.status_code == 404:
logger.info(f"Product {asin} not found (404)")
return {"asin": asin, "status": "not_found"}
else:
logger.warning(
f"Status {response.status_code} for {asin}"
f" on attempt {attempt + 1}"
)
except requests.exceptions.RequestException as e:
logger.error(f"Request error for {asin}: {e}")
time.sleep(random.uniform(2, 5))
return {"asin": asin, "status": "failed"}
def scrape_multiple_products(asins: list) -> list:
"""Scrape multiple products with delays between requests."""
results = []
for i, asin in enumerate(asins):
logger.info(f"Scraping product {i + 1}/{len(asins)}: {asin}")
result = scrape_product(asin)
results.append(result)
# Random delay between products
if i < len(asins) - 1:
delay = random.uniform(2.0, 6.0)
time.sleep(delay)
return results
# Usage
if __name__ == "__main__":
asins_to_scrape = [
"B09V3KXJPB",
"B0BSHF7WHW",
"B0D5926JJH",
]
results = scrape_multiple_products(asins_to_scrape)
# Save results
with open("amazon_products.json", "w") as f:
json.dump(results, f, indent=2)
# Summary
successful = sum(1 for r in results if r.get("status") == "success")
print(f"\nScraped {successful}/{len(results)} products successfully")
"""
Amazon scraper using Selenium for pages that require JavaScript rendering.
Handles dynamic content loading, infinite scroll, and CAPTCHA detection.
"""
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import time
import random
import json
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# SpyderProxy configuration
PROXY_HOST = "geo.spyderproxy.com"
PROXY_PORT = "11200"
PROXY_USER = "your_username"
PROXY_PASS = "your_password"
def create_driver(country: str = "us"):
"""Create a stealth Chrome driver routed through SpyderProxy."""
    # Chrome ignores credentials embedded in --proxy-server, so authenticate
    # via IP allowlisting with your provider (or a proxy-auth extension /
    # selenium-wire); username-based session and geo flags are unavailable here
    proxy_addr = f"{PROXY_HOST}:{PROXY_PORT}"
options = Options()
options.add_argument(f"--proxy-server=http://{proxy_addr}")
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--window-size=1920,1080")
options.add_argument("--disable-infobars")
options.add_argument("--lang=en-US")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
driver = webdriver.Chrome(options=options)
# Remove webdriver flag
driver.execute_cdp_cmd(
"Page.addScriptToEvaluateOnNewDocument",
{
"source": """
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
});
"""
},
)
return driver
def simulate_human_browsing(driver):
"""Simulate human-like scrolling and mouse movement."""
# Scroll down gradually
total_height = driver.execute_script("return document.body.scrollHeight")
current_position = 0
while current_position < total_height * 0.7:
scroll_amount = random.randint(200, 500)
current_position += scroll_amount
driver.execute_script(f"window.scrollTo(0, {current_position});")
time.sleep(random.uniform(0.3, 1.0))
def scrape_search_results(query: str, max_pages: int = 3) -> list:
"""Scrape Amazon search results for a given query."""
driver = create_driver()
all_products = []
try:
for page in range(1, max_pages + 1):
url = (
f"https://www.amazon.com/s?k={query.replace(' ', '+')}"
f"&page={page}"
)
logger.info(f"Scraping search page {page}: {url}")
driver.get(url)
# Wait for search results to load
try:
WebDriverWait(driver, 15).until(
EC.presence_of_element_located(
(By.CSS_SELECTOR, "[data-component-type='s-search-result']")
)
)
except TimeoutException:
logger.warning(f"Timeout waiting for results on page {page}")
# Check for CAPTCHA
if "captcha" in driver.page_source.lower():
logger.error("CAPTCHA detected. Stopping.")
break
continue
# Simulate human browsing
simulate_human_browsing(driver)
# Parse results
soup = BeautifulSoup(driver.page_source, "lxml")
results = soup.find_all(
"div", {"data-component-type": "s-search-result"}
)
for result in results:
product = {}
product["asin"] = result.get("data-asin", "")
# Title
title_el = result.find(
"span", {"class": "a-text-normal"}
)
product["title"] = (
title_el.get_text(strip=True) if title_el else None
)
# Price
price_whole = result.find("span", {"class": "a-price-whole"})
price_frac = result.find("span", {"class": "a-price-fraction"})
if price_whole:
whole = price_whole.get_text(strip=True).replace(".", "")
frac = (
price_frac.get_text(strip=True) if price_frac else "00"
)
product["price"] = f"${whole}.{frac}"
else:
product["price"] = None
# Rating
rating_el = result.find("span", {"class": "a-icon-alt"})
product["rating"] = (
rating_el.get_text(strip=True) if rating_el else None
)
# Review count
                # find() with a multi-class string requires an exact attribute
                # match; a CSS selector matches both classes in any combination
                review_link = result.select_one(
                    "a.a-link-normal.s-underline-text"
                )
product["reviews"] = (
review_link.get_text(strip=True) if review_link else None
)
if product["asin"]:
all_products.append(product)
logger.info(
f"Page {page}: found {len(results)} products"
f" ({len(all_products)} total)"
)
# Delay between pages
if page < max_pages:
time.sleep(random.uniform(3, 8))
finally:
driver.quit()
return all_products
# Usage
if __name__ == "__main__":
products = scrape_search_results("wireless headphones", max_pages=3)
with open("search_results.json", "w") as f:
json.dump(products, f, indent=2)
print(f"Scraped {len(products)} products total")
Once your scraper works reliably for small batches, you will eventually need to scale it to handle thousands or millions of product pages. Here is how to approach that.
Use Python's concurrent.futures module to run multiple requests in parallel while respecting rate limits:
from concurrent.futures import ThreadPoolExecutor, as_completed
import time
import random
def scrape_with_concurrency(
asins: list, max_workers: int = 5, delay_range: tuple = (1.0, 3.0)
) -> list:
"""
Scrape multiple ASINs concurrently.
Keep max_workers moderate (5-10) to avoid detection.
Higher concurrency is possible with a larger proxy pool.
"""
results = []
with ThreadPoolExecutor(max_workers=max_workers) as executor:
future_to_asin = {}
for asin in asins:
# Stagger submission to avoid burst patterns
time.sleep(random.uniform(*delay_range) / max_workers)
future = executor.submit(scrape_product, asin)
future_to_asin[future] = asin
for future in as_completed(future_to_asin):
asin = future_to_asin[future]
try:
result = future.result()
results.append(result)
except Exception as e:
logger.error(f"Error scraping {asin}: {e}")
results.append({"asin": asin, "status": "error", "error": str(e)})
return results
For production scraping operations, you need a proper data pipeline that moves results reliably from fetch to parse to storage and tolerates interruptions.
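Here is a minimal sketch of such a pipeline, using SQLite for storage so that re-runs skip recently scraped items (the table name and freshness window are illustrative choices, not part of any standard):

```python
import json
import sqlite3
import time

def init_db(path: str = "amazon_data.db") -> sqlite3.Connection:
    """Open the database and create the products table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS products (
               asin TEXT PRIMARY KEY,
               data TEXT NOT NULL,
               scraped_at REAL NOT NULL
           )"""
    )
    return conn

def store_result(conn: sqlite3.Connection, result: dict) -> None:
    """Upsert one scrape result, keyed by ASIN, with a timestamp."""
    conn.execute(
        "INSERT OR REPLACE INTO products (asin, data, scraped_at)"
        " VALUES (?, ?, ?)",
        (result["asin"], json.dumps(result), time.time()),
    )
    conn.commit()

def pending_asins(conn, asins: list, max_age_hours: float = 24.0) -> list:
    """Filter out ASINs that already have a fresh row, so a crashed or
    repeated run only re-fetches what it actually needs."""
    cutoff = time.time() - max_age_hours * 3600
    rows = conn.execute(
        "SELECT asin FROM products WHERE scraped_at > ?", (cutoff,)
    ).fetchall()
    fresh = {row[0] for row in rows}
    return [a for a in asins if a not in fresh]
```

Each run computes `pending_asins` first, scrapes only those, and calls `store_result` as responses arrive, so an interruption costs at most the in-flight page.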
Proxy bandwidth is typically the largest cost in a scraping operation. One easy optimization: setting Accept-Encoding: gzip, deflate, br ensures Amazon sends compressed responses rather than plain text.

Web scraping exists in a complex legal landscape. While we cannot provide legal advice, here are the key considerations you should be aware of.
Amazon's robots.txt file specifies which paths automated bots are allowed and disallowed from accessing. While robots.txt is not legally binding in all jurisdictions, respecting it demonstrates good faith and ethical intent.
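You can check paths against robots.txt rules with Python's standard library. Here is a sketch; the rules below are a made-up sample, not Amazon's actual file, so fetch the live https://www.amazon.com/robots.txt in practice:

```python
from urllib.robotparser import RobotFileParser

# Illustrative sample rules -- NOT Amazon's real robots.txt
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /gp/cart
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

def is_allowed(path: str, user_agent: str = "*") -> bool:
    """Check whether the parsed robots.txt rules permit fetching `path`."""
    return parser.can_fetch(user_agent, f"https://www.amazon.com{path}")
```

In production you would call `parser.set_url(...)` and `parser.read()` against the live file, and re-fetch it periodically since the rules change.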
Amazon's Terms of Service restrict automated access to the site. Violating ToS can result in account bans and, in theory, legal action. Whether ToS violations constitute a legal claim varies by jurisdiction and is an evolving area of law.
Regardless of legality, sending an excessive volume of requests that impacts Amazon's infrastructure is irresponsible. Always implement rate limiting in your scraper to avoid causing harm.
How you use scraped data matters. Using publicly available product data for competitive analysis is very different from scraping personal information or copyrighted content. Consider data protection regulations like GDPR if you are collecting any personal data.
A 503 Service Unavailable response means Amazon's anti-bot system has flagged your request. This is the most common scraping error.
Fix: back off with increasing delays and retry each attempt from a fresh proxy session, as the retry logic in the full example above does.
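A concrete way to space the retries is exponential backoff with full jitter; pair each retry with a new proxy session ID so the next attempt arrives from a different IP. A sketch (the base and cap values are illustrative):

```python
import random

def backoff_delay(attempt: int, base: float = 3.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: the ceiling doubles each
    attempt (3s, 6s, 12s, ...) but never exceeds `cap` seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

# attempt 0 -> up to 3s, attempt 3 -> up to 24s, attempt 6+ -> capped at 60s
```

Full jitter (a uniform draw rather than a fixed doubling) also prevents many workers from retrying in lockstep, which would itself look like a bot burst.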
A CAPTCHA page means your request is suspicious but not conclusively automated, and Amazon is asking for human verification.
Fix: slow your request rate, rotate to a new IP, and use a library with a browser-like TLS fingerprint (curl_cffi or tls-client instead of requests).
A 403 Forbidden response means your IP or request has been explicitly blocked.
Fix: retire the blocked IP, switch to a new residential session, and reduce your per-IP request rate before resuming.
Empty or missing data usually means Amazon served a JavaScript-dependent page and your scraper does not execute JavaScript.
Fix: check whether the data is embedded in a script tag within the initial HTML; if it is not, render the page with a headless browser.
Connection timeouts usually mean the proxy server is slow or unresponsive.
Fix: increase the request timeout (the examples above use 30 seconds) and fail over to a new proxy session when a timeout occurs.
If you are redirected to a sign-in page, Amazon requires authentication for the content you are trying to access.
Fix: prefer publicly accessible pages; scraping from behind a logged-in account ties the activity to that account and greatly increases ban risk, so it is generally not recommended.
The legality of web scraping depends on your jurisdiction, the type of data you collect, and how you use it. Scraping publicly available data for competitive intelligence is generally treated differently than scraping personal information. The legal landscape continues to evolve, with court rulings in various jurisdictions addressing different aspects of automated data collection. Consult a legal professional for guidance on your specific use case.
Technically you can try, but free proxies have extremely low success rates against Amazon. Free proxy lists consist primarily of data center IPs that are already flagged, shared among thousands of users, and unreliable. For any serious scraping operation, paid residential proxies are a requirement.
There is no magic number, but a general guideline is to keep individual IP request rates below 1 request every 5-10 seconds. With a large enough pool of rotating residential IPs, your aggregate throughput can be much higher because each IP stays under the threshold. SpyderProxy's residential pool lets you maintain high aggregate throughput while keeping per-IP rates safe.
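The arithmetic behind that guideline is simple enough to sketch; the 7.5-second default below is just the midpoint of the 5-10 second range mentioned above:

```python
import math

def required_pool_size(target_rps: float, per_ip_interval_s: float = 7.5) -> int:
    """IPs needed so each one sends at most one request every
    `per_ip_interval_s` seconds while the pool sustains `target_rps`
    requests per second overall."""
    return math.ceil(target_rps * per_ip_interval_s)

# Sustaining 10 requests/second overall needs ceil(10 * 7.5) = 75 rotating IPs
```

The inverse reading is just as useful: a pool of a given size caps the aggregate throughput you can sustain without pushing any single IP over the threshold.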
For most Amazon product pages, HTTP requests with requests or httpx are sufficient and significantly faster. You should only use headless browsers (Selenium, Playwright) when you encounter JavaScript-rendered content that is not available in the initial HTML response, or when you need to solve JavaScript challenges. Start with simple HTTP requests and escalate to headless browsers only when needed.
Residential proxies are the best choice for Amazon scraping in 2026. They use real consumer IP addresses that Amazon's systems treat as legitimate traffic. Data center proxies are detected almost immediately, and while mobile proxies have slightly higher success rates than residential, they are typically more expensive and not necessary for Amazon specifically.
Use geo-targeted proxies that route your requests through IPs in the target country. Scraping amazon.de with a German IP, amazon.co.jp with a Japanese IP, and so on, ensures you receive the same localized content that real users in those countries see. This is important for accurate pricing and availability data. SpyderProxy supports geo-targeting at the country level for 195+ countries, making it straightforward to collect localized data from any Amazon storefront.
Costs depend on volume and proxy quality. Residential proxy bandwidth typically runs between $2 and $15 per GB depending on the provider and plan. A typical Amazon product page weighs 300-500 KB, so scraping 10,000 products costs roughly 3-5 GB of bandwidth. The exact cost depends on your proxy provider's pricing tiers, whether you are loading images, and how many retries you need.
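Those numbers translate into a simple estimator. All defaults below are the illustrative figures from the paragraph above (400 KB is the midpoint of the 300-500 KB range, $8/GB a mid-range residential rate, and the 20% retry overhead is an assumption); plug in your provider's actual pricing:

```python
def estimate_bandwidth_cost(
    num_pages: int,
    avg_page_kb: float = 400.0,   # midpoint of the 300-500 KB range
    price_per_gb: float = 8.0,    # illustrative mid-range residential rate
    retry_overhead: float = 1.2,  # assume ~20% of fetches are retried
) -> tuple:
    """Return (estimated GB, estimated cost in USD) for a scraping batch."""
    gb = num_pages * avg_page_kb * retry_overhead / (1024 * 1024)
    return round(gb, 2), round(gb * price_per_gb, 2)

# 10,000 product pages at these defaults: about 4.6 GB and roughly $37
```

Blocking image and media requests (easy in a headless browser, automatic with plain HTTP requests) is the single biggest lever for pulling the per-page figure down.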
Yes, Amazon can detect default Selenium and Playwright configurations through JavaScript fingerprinting. Automated browsers expose properties like navigator.webdriver, specific window.chrome properties, and other markers that anti-bot systems check. You need to use stealth plugins and configurations (as shown in the code examples above) to mask these signals. Even with stealth measures, headless browser traffic is generally easier to fingerprint than well-configured HTTP requests.
Scraping Amazon reliably in 2026 requires a multi-layered approach. Proxies alone are not enough, and anti-detection tricks alone are not enough. You need both working together: high-quality residential proxies to handle IP reputation and rate limiting, combined with careful attention to browser fingerprinting, request timing, and session management.
To recap the essential components: high-quality residential proxies with rotation, realistic browser fingerprints and headers, human-like request timing, session consistency for multi-page flows, and robust retry logic.
If you are building or scaling an Amazon scraping operation and need a proxy provider that is built for this workload, SpyderProxy offers residential proxies with 10M+ IPs, granular country targeting, flexible rotation options, and pay-per-GB pricing that keeps costs predictable as you scale. Start with a trial to test your success rates before committing to a plan.
This guide is provided for educational purposes. Always ensure your scraping activities comply with applicable laws and terms of service in your jurisdiction.