Scraping Walmart means programmatically collecting data from Walmart.com — product titles, prices, availability, seller IDs, image URLs, ratings, reviews, and category hierarchies — by issuing HTTP requests to Walmart pages and parsing the returned HTML and JSON. Walmart is the second-largest e-commerce dataset in the US after Amazon, carrying over 400 million products across 35 categories in 2026. That makes it a top target for price intelligence, brand protection, arbitrage research, and catalog enrichment teams.
Walmart is also one of the most aggressively defended retail sites on the public web. It uses PerimeterX (now HUMAN Security Bot Defender) as its primary anti-bot layer, combined with Akamai at the edge and behavioral scoring on top. Naive scrapers get challenged within the first 10-20 requests. This guide walks through the actual 2026 pipeline: which proxy type works, how PerimeterX detects you, how to either bypass or solve its challenges, and realistic rate limits that keep block rate under 5%.
Scraping publicly visible pages on Walmart.com is generally legal in most jurisdictions, based on the same precedent line (hiQ Labs v. LinkedIn in the US) that protects scraping of public web content. You are accessing pages that any browser can load without authentication. What you do with the scraped data afterward — resale of a derivative dataset, republication, using it in a commercial product — may involve additional considerations around copyright and the Computer Fraud and Abuse Act if you cross boundaries like bypassing a paywall or a technical access control you have no right to circumvent.
Walmart's Terms of Use prohibit automated access. Violating the ToU is generally a civil matter — Walmart can block your IPs or accounts — not a criminal one. Keep scraping unauthenticated, respect robots.txt where it matters to you, avoid scraping checkout flows, and don't scrape personal data. If your use case is commercial price intelligence or brand protection on products you sell, that is an extremely well-established use case that operates daily at enterprise scale.
Walmart offers an official Marketplace API and Affiliate API, but both are gated. The Marketplace API is only available to approved third-party sellers and returns data about your own listings. The Affiliate API (Impact Radius/Impact.com-powered) is for partners publishing Walmart product links and limits data depth. Neither API gives open access to the full public catalog.
For anyone doing competitive pricing, arbitrage research, brand protection, or public catalog enrichment, scraping is the practical path. The data you can extract from Walmart's public pages — pricing, availability, reviews, Q&A, seller metadata, specifications — is not accessible through any open API.
Walmart product pages and search pages are information-dense. The fields most commonly extracted in production include the USItemID, title, brand, current price and was-price, availability status, seller ID, ratings, review counts, image URLs, specifications, and category breadcrumbs.
For price monitoring workflows, the critical fields are USItemID, price, was-price, seller ID, and availability. Those five data points, collected hourly or daily across a product list, drive competitive intelligence, MAP (Minimum Advertised Price) monitoring, and arbitrage scouting.
Walmart.com's primary anti-bot vendor in 2026 is HUMAN Security (formerly PerimeterX, still commonly referenced by that name in developer circles). HUMAN's Bot Defender sits in front of Walmart's origin servers and scores every incoming request on a stack of signals.
On first visit, PerimeterX drops a _px3 cookie (and related _pxhd, _pxvid cookies). These cookies encode a device token derived from extensive browser fingerprinting: Canvas fingerprint, WebGL renderer, installed fonts, timezone, navigator properties, WebRTC IPs, battery status, plugins, screen dimensions, audio fingerprint, and more. The token is re-verified on subsequent requests by posting a sensor payload to https://collector-pxajd1m4oa.px-cloud.net/api/v2/collector (the endpoint hostname varies per customer).
When PerimeterX can't verify the device token or the score drops below a threshold, Walmart serves its signature interstitial: a captcha that requires the user to press and hold a button. Internally this is a proof-of-work + mouse-movement verification. Bots without real mouse dynamics fail this challenge; so do HTTP-only clients, because the challenge requires JavaScript execution.
PerimeterX and Akamai both read your TLS JA3/JA4 hash and HTTP/2 SETTINGS frame order. Python's requests library, plain urllib, and older versions of aiohttp all have easily identifiable fingerprints that no real browser emits. Mismatches trigger immediate challenges.
Request cadence (too regular = bot), navigation paths (skipping category pages = bot), referer consistency (no referer = bot), and session longevity all feed into a running score. Scrapers that fire sequential product-page requests with no browsing context look nothing like a real shopper and are flagged within 10-20 requests.
Walmart maintains its own IP reputation feed plus consumes third-party feeds. Any IP from AWS, GCP, Azure, DigitalOcean, Hetzner, or common VPN providers is pre-flagged. Datacenter IPs are effectively blocked from the start — not because of request volume, but because the IP belongs to an infrastructure provider.
Proxy choice is the single biggest determinant of success on Walmart. Because PerimeterX weights IP reputation heavily, the proxy must look like a real consumer ISP endpoint — anything else fails before PerimeterX even evaluates the browser fingerprint.
Residential proxies route traffic through real consumer home connections on Comcast, Spectrum, Verizon, AT&T, Charter, and similar ISPs. PerimeterX treats these IPs as ordinary shoppers because that is what they are.
SpyderProxy's Budget Residential plan at $1.75/GB works for most Walmart workloads with a 10M+ residential IP pool. For higher-volume scraping or workloads that push detection, Premium Residential at $2.75/GB with 130M+ IPs provides better pool freshness — the IPs you get have seen less scraping traffic and rank higher in HUMAN's reputation scoring.
If residential proxies start to show sustained block rates despite a clean stack, LTE proxies at $2/IP are the next escalation. Mobile carriers use CGNAT — one public IP per thousands of real phone users — making IP-level banning commercially unattractive for the target. Mobile IPs have the highest trust scores of any proxy type on PerimeterX-defended sites.
Datacenter proxies are detected by HUMAN within one or two requests. They are useful for other sites but not for Walmart. ISP proxies ($3.90/day) can work for small-volume manual rechecks, but at scale they are worse than rotating residential because the same IP making many requests accumulates reputation debt faster than a rotating pool.
There are three practical paths to defeating the PerimeterX / HUMAN Bot Defender layer. Pick one based on your volume, latency tolerance, and engineering effort.
Run a real Chromium browser via Playwright or Puppeteer with a stealth plugin (e.g. playwright-extra + puppeteer-extra-plugin-stealth). The browser natively executes PerimeterX's sensor JavaScript, produces a legitimate _px3 cookie, and passes the Press & Hold challenge when it appears (either by letting a human solve it occasionally or by integrating a solver service for the press-and-hold interaction).
Pros: highest success rate, handles DOM changes automatically, works for pages that require JavaScript rendering. Cons: 10-20x higher cost per request (both in proxy bandwidth — 2-5 MB per full page load — and in CPU/memory). Use this path for low-to-mid volume, say under 50,000 pages per day.
Use curl_cffi in Python, which wraps libcurl with real browser TLS fingerprints. Set impersonate="chrome124" or a current Chrome version. Combine with a real-looking header set (20+ headers in correct order, Accept-Language, Sec-CH-UA-* client hints, valid Referer), rotating residential proxy per request, and a pre-warmed _px3 cookie extracted from a periodic browser session.
This path is cheapest at scale (around 200-400 KB per page vs 2-5 MB for browser automation) but requires you to periodically refresh the _px3 cookie because it ages out. Run a headless Chrome every 30-60 minutes to harvest a fresh cookie, then reuse it across many HTTP-only requests. Works well for 100k-1M+ pages per day.
Services like Capsolver, 2Captcha, NopeCHA, and ScrapFly offer hosted PerimeterX solvers. You send the target URL and your proxy credentials; they return a validated _px3 cookie (or a full fetched response body) that passes PerimeterX scoring. Pricing ranges from $0.50 to $2 per 1,000 solves.
Pros: zero stealth engineering on your side, rapid to integrate. Cons: cost scales linearly, vendor lock-in, solver quality drifts when HUMAN ships detection updates and the solver hasn't caught up. Recommended for teams that want time-to-data more than cost efficiency, or as a fallback for the 5-10% of requests that your primary stack can't solve.
| Path | Proxy | Success Rate | Cost per 1k pages | Best For |
|---|---|---|---|---|
| Browser + stealth | Residential | 85-95% | $3.50-10.00 | Low-to-mid volume, any page |
| HTTP + curl_cffi + warm cookie | Residential | 75-90% | $0.40-0.80 | High volume, product pages |
| Hosted solver (Capsolver etc.) | Your residential or theirs | 90-95% | $1.50-3.00 | Fallback / no-engineering teams |
| Datacenter proxy only | Datacenter | 0-10% | n/a | Do not use on Walmart |
Walmart doesn't publish a scraper rate limit; the effective limit is what PerimeterX tolerates before raising your risk score. In practice in 2026, that means keeping each request inside a valid _px3 cookie context and adding random jitter (0.5-2.5 seconds) to every request interval. Deterministic timing is a bot fingerprint on its own; HUMAN's behavioral scoring catches regular request intervals even when every other signal is clean.
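As a minimal sketch, the jitter recommendation can be wrapped in a small helper; the 3-second base interval and 0.5-2.5 second jitter bounds here are illustrative defaults, not Walmart-specific constants.

```python
import random
import time


def jittered_delay(base: float = 3.0,
                   jitter: tuple[float, float] = (0.5, 2.5)) -> float:
    """Base interval plus random jitter, so request timing never looks clocked."""
    return base + random.uniform(*jitter)


# Usage between requests:
#   time.sleep(jittered_delay())
```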
A working Walmart product-page scraper using curl_cffi for TLS fingerprint spoofing, rotating residential proxies, and a pre-warmed _px3 cookie. The cookie-refresh loop is omitted for brevity but runs every 30-60 minutes from a separate Playwright worker.
```python
import json
import random
import time

from bs4 import BeautifulSoup
from curl_cffi import requests

# Residential proxy template with a per-request session ID.
# Replace PASSWORD and PROXY_GATEWAY with your provider's credentials.
PROXY = "http://username-session-{sid}:PASSWORD@PROXY_GATEWAY:7777"

# _px3 cookie harvested by a separate Playwright session; rotate hourly.
PX_COOKIE = "REPLACE_WITH_FRESH_PX3_VALUE"


def fetch_product(usitemid, session_id=None):
    """Fetch a product page through a rotating residential session."""
    sid = session_id or random.randint(10000, 99999)
    proxy = PROXY.format(sid=sid)
    url = f"https://www.walmart.com/ip/{usitemid}"
    r = requests.get(
        url,
        impersonate="chrome124",  # real Chrome TLS fingerprint
        proxies={"http": proxy, "https": proxy},
        cookies={"_px3": PX_COOKIE},
        timeout=30,
    )
    r.raise_for_status()
    return r.text


def parse_product(html):
    """Read product fields from Walmart's embedded __NEXT_DATA__ JSON blob."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script", id="__NEXT_DATA__")
    if not script:
        return None
    blob = json.loads(script.string)
    product = blob["props"]["pageProps"]["initialData"]["data"]["product"]
    price_info = product.get("priceInfo") or {}
    return {
        "usItemId": product.get("usItemId"),
        "title": product.get("name"),
        "brand": product.get("brand"),
        "price": (price_info.get("currentPrice") or {}).get("price"),
        "was_price": (price_info.get("wasPrice") or {}).get("price"),
        "in_stock": product.get("availabilityStatus") == "IN_STOCK",
        "seller_id": product.get("sellerId"),
        "rating": product.get("averageRating"),
        "review_count": product.get("numberOfReviews"),
    }


if __name__ == "__main__":
    # Example product IDs
    for item_id in ["1234567890", "0987654321"]:
        html = fetch_product(item_id)
        print(parse_product(html))
        time.sleep(random.uniform(3.0, 5.0))
```
Key details: impersonate="chrome124" spoofs the Chrome 124 TLS fingerprint; the _px3 cookie must come from a real browser session harvested within the last ~30 minutes; the parser reads Walmart's __NEXT_DATA__ JSON blob rather than fragile DOM selectors — this is the single most stable source of product data. See our guide on rotating proxies with Python requests for deeper session-management patterns.
PerimeterX returns HTTP 200 with a challenge page, not a 4xx. Detect it by checking if the response HTML contains _pxCaptcha, px-captcha, or the telltale hostname captcha.px-cdn.net. When detected, mark that proxy session as burned, rotate to a new session ID, refresh the _px3 cookie, and retry.
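That detection rule reduces to a substring check over the response body, using the markers listed above:

```python
# Markers that appear in PerimeterX challenge pages (per the text above).
CHALLENGE_MARKERS = ("_pxCaptcha", "px-captcha", "captcha.px-cdn.net")


def is_px_challenge(html: str) -> bool:
    """True if the HTTP-200 response body is actually a PerimeterX challenge."""
    return any(marker in html for marker in CHALLENGE_MARKERS)
```

On a hit, treat the proxy session as burned: rotate the session ID, refresh the _px3 cookie, and retry the URL.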
Back off with exponential delay (start at 60s, double on each consecutive 429). If one IP consistently 429s, the subnet might be partially burned — rotate through a wider pool or temporarily move to mobile proxies until reputation recovers.
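The backoff schedule described above as a helper; the one-hour ceiling is our addition, not an observed Walmart limit.

```python
def backoff_delay(consecutive_429s: int,
                  base: float = 60.0,
                  cap: float = 3600.0) -> float:
    """Exponential backoff: 60s, 120s, 240s, ... capped at one hour."""
    return min(base * (2 ** consecutive_429s), cap)
```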
Walmart occasionally A/B-tests DOM changes. Always validate your parser against a known-good item ID on startup and when block rate rises. When __NEXT_DATA__ shape changes, it is a Walmart change, not a PerimeterX challenge — update your parser, not your proxy stack.
PerimeterX can also degrade your experience rather than block outright — returning older cached prices, omitting seller data, or truncating reviews. Validate scraped fields against expected types and ranges; if price is missing or zero on items that should have prices, treat as a soft throttle and re-fetch from a different session.
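A minimal validity check along these lines, using the field names from the parser example earlier (the specific checks are illustrative; extend them to your schema):

```python
def looks_soft_throttled(row: dict) -> bool:
    """Flag rows where fields that should be populated came back missing or zeroed."""
    price = row.get("price")
    if price is None or (isinstance(price, (int, float)) and price <= 0):
        return True
    if row.get("seller_id") is None:
        return True
    return False
```

Rows that trip this check go back into the queue for a re-fetch from a different session.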
Run a small pool (3-10) of Playwright-based cookie harvesters. Each loads Walmart.com, waits for the _px3 cookie to set, extracts it, and publishes to a shared Redis key. HTTP workers pull the current cookie from Redis on every request. Rotate harvesters every 30-60 minutes.
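A minimal sketch of the publish/fetch handoff, written against any client exposing set/get (redis-py's Redis client fits). The key name and the 45-minute freshness window are our choices within the 30-60 minute rotation described above.

```python
import json
import time

COOKIE_KEY = "walmart:px3"   # shared key name (our choice)
COOKIE_TTL = 45 * 60         # seconds; inside the 30-60 min rotation window


def publish_cookie(client, px3_value: str) -> None:
    """Harvester side: store the cookie together with its harvest time."""
    payload = json.dumps({"px3": px3_value, "ts": time.time()})
    client.set(COOKIE_KEY, payload)


def fetch_cookie(client, max_age: float = COOKIE_TTL):
    """Worker side: return the cookie if fresh enough, else None."""
    raw = client.get(COOKIE_KEY)
    if raw is None:
        return None
    data = json.loads(raw)
    if time.time() - data["ts"] > max_age:
        return None  # stale; wait for the next harvest
    return data["px3"]
```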
Pin each proxy session ID to its harvested _px3 cookie — don't mix cookies across proxy IPs or PerimeterX scoring degrades quickly. One IP, one fingerprint, one cookie, one session's worth of requests, then rotate.
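The pinning rule can be made explicit with a small session object; the 150-request budget is illustrative, not a Walmart-specific threshold.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PinnedSession:
    """One proxy session ID pinned to one _px3 cookie for a bounded request budget."""
    px3: str
    sid: int = field(default_factory=lambda: random.randint(10000, 99999))
    budget: int = 150   # requests per session before rotating (illustrative)
    used: int = 0

    def take(self) -> bool:
        """Consume one request from the budget; False means rotate the session."""
        if self.used >= self.budget:
            return False
        self.used += 1
        return True
```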
Decouple URL generation, fetching, parsing, and storage via a queue (Redis streams, SQS, RabbitMQ). Workers pull from the queue, fetch through their pinned proxy session, push parsed data to a write queue, and failed URLs into a retry queue with exponential backoff.
Track block rate (challenges / total requests), parse rate (parse successes / fetch successes), and cost per 1k pages. A block rate over 10% usually means your cookie harvester is stale, your User-Agent pool is fingerprinted, or HUMAN shipped a detection update.
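A sketch of that metric bookkeeping, with the 10% block-rate alarm taken from the text above:

```python
from dataclasses import dataclass


@dataclass
class ScrapeStats:
    requests: int = 0    # total fetch attempts
    challenges: int = 0  # PerimeterX challenge pages seen
    fetch_ok: int = 0    # fetches that returned real content
    parse_ok: int = 0    # fetches that also parsed cleanly

    @property
    def block_rate(self) -> float:
        return self.challenges / self.requests if self.requests else 0.0

    @property
    def parse_rate(self) -> float:
        return self.parse_ok / self.fetch_ok if self.fetch_ok else 0.0

    def healthy(self, max_block_rate: float = 0.10) -> bool:
        """Over 10% blocked usually means a stale cookie or a detection update."""
        return self.block_rate <= max_block_rate
```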
Walmart prices change throughout the day but most items are stable hour-to-hour. Hash the core fields (price, availability, seller) per product per fetch and skip writes when unchanged. This keeps your dataset clean and storage costs low.
Scraping publicly visible pages on Walmart.com is generally legal in most jurisdictions, based on the same precedents (hiQ Labs v. LinkedIn in the US) that protect public web scraping. Walmart's Terms of Use prohibit automated access, so scraping violates ToU and can lead to IP blocks, but typically carries no criminal risk for publicly-accessible data.
PerimeterX (now HUMAN Security Bot Defender) is the primary anti-bot layer Walmart uses. It drops a _px3 cookie on your first visit, fingerprints your browser via Canvas, WebGL, fonts, and sensor-data posts, and serves a "Press & Hold" captcha when scoring drops below a threshold. Any HTTP-only scraper without TLS fingerprint spoofing and a valid _px3 cookie is flagged within 10-20 requests.
Rotating residential proxies are required. SpyderProxy Budget Residential at $1.75/GB with a 10M+ residential pool handles most Walmart workloads. For higher-volume scraping where detection is tight, upgrade to Premium Residential at $2.75/GB with 130M+ IPs, or escalate to LTE mobile proxies at $2/IP for the hardest cases. Datacenter proxies do not work on Walmart.
Three practical paths: (1) run a real browser via Playwright or Puppeteer with a stealth plugin — highest success rate but 10-20x cost per request; (2) use curl_cffi with a warmed _px3 cookie harvested from a periodic browser session — cheapest at scale; (3) use a hosted PerimeterX solver service like Capsolver, 2Captcha, or ScrapFly at $0.50-$2 per 1,000 solves as a fallback.
With an HTTP-based stack (curl_cffi + rotating residential + warmed cookie), 30,000-150,000 Walmart pages per day per machine is achievable at sub-5% block rates. With full browser automation via Playwright, 5,000-20,000 pages per day per machine due to 10-20x higher cost per request.
Walmart embeds full product data in a JSON blob inside a script tag with id="__NEXT_DATA__" on every product page. Parse that JSON rather than fragile DOM selectors — the JSON shape is far more stable than the rendered markup and gives you richer fields (seller IDs, inventory, review metadata, category breadcrumbs).
Yes. Walmart serves different prices, delivery options, and store-pickup availability per ZIP code. Set the ZIP via the location-setting endpoint or by carrying a pre-set location cookie across your session. Pair with city-level residential proxies matching the ZIP's state or metro for the most accurate pricing — otherwise Walmart may serve a different region's default prices. See our guide on best USA proxies for state-level targeting.