
How to Scrape Walmart (2026): Bypass PerimeterX & Extract Product Data

Apr 17, 2026 · By Daniel K. · 12 min read

Scraping Walmart means programmatically collecting data from Walmart.com — product titles, prices, availability, seller IDs, image URLs, ratings, reviews, and category hierarchies — by issuing HTTP requests to Walmart pages and parsing the returned HTML and JSON. Walmart is the second-largest e-commerce dataset in the US after Amazon, carrying over 400 million products across 35 categories in 2026. That makes it a top target for price intelligence, brand protection, arbitrage research, and catalog enrichment teams.

Walmart is also one of the most aggressively defended retail sites on the public web. It uses PerimeterX (now HUMAN Security Bot Defender) as its primary anti-bot layer, combined with Akamai at the edge and behavioral scoring on top. Naive scrapers get challenged within the first 10-20 requests. This guide walks through the actual 2026 pipeline: which proxy type works, how PerimeterX detects you, how to either bypass or solve its challenges, and realistic rate limits that keep block rate under 5%.

Scraping publicly visible pages on Walmart.com is generally legal in most jurisdictions, based on the same precedent line (hiQ Labs v. LinkedIn in the US) that protects scraping of public web content. You are accessing pages that any browser can load without authentication. What you do with the scraped data afterward — resale of a derivative dataset, republication, using it in a commercial product — may involve additional considerations around copyright and the Computer Fraud and Abuse Act if you cross boundaries like bypassing a paywall or a technical access control you have no right to circumvent.

Walmart's Terms of Use prohibit automated access. Violating the ToU is generally a civil matter — Walmart can block your IPs or accounts — not a criminal one. Keep scraping unauthenticated, respect robots.txt where it matters to you, avoid scraping checkout flows, and don't scrape personal data. If your use case is commercial price intelligence or brand protection on products you sell, that is an extremely well-established use case that operates daily at enterprise scale.

Walmart API vs Web Scraping

Walmart offers an official Marketplace API and Affiliate API, but both are gated. The Marketplace API is only available to approved third-party sellers and returns data about your own listings. The Affiliate API (Impact Radius/Impact.com-powered) is for partners publishing Walmart product links and limits data depth. Neither API gives open access to the full public catalog.

For anyone doing competitive pricing, arbitrage research, brand protection, or public catalog enrichment, scraping is the practical path. The data you can extract from Walmart's public pages — pricing, availability, reviews, Q&A, seller metadata, specifications — is not accessible through any open API.

What Data You Can Scrape from Walmart

Walmart product pages and search pages are information-dense. The fields most commonly extracted in production:

  • Product core: title, product ID (USItemID), UPC, brand, model, price, was-price, discount percentage, in-stock status.
  • Seller data: sold-by (Walmart vs. third-party marketplace seller), seller ID, seller rating, seller reviews count.
  • Inventory signals: in-stock, low-stock warnings ("Only 3 left"), delivery availability by ZIP, store pickup availability.
  • Reviews and ratings: average rating, number of reviews, full review text, reviewer metadata, verified-purchase flag, helpful votes.
  • Q&A section: customer questions and answers.
  • Specifications: structured attribute/value pairs (brand, model number, dimensions, weight, materials).
  • Images: gallery image URLs, alt text.
  • Category hierarchy: breadcrumb path from homepage to product.
  • Related products: "Similar items", "Customers also viewed", "Sponsored" recommendations.

For price monitoring workflows, the critical fields are USItemID, price, was-price, seller ID, and availability. Those five fields, collected hourly or daily across a product list, drive competitive intelligence, MAP (Minimum Advertised Price) monitoring, and arbitrage scouting.

Why Walmart Blocks Scrapers (The PerimeterX Layer)

Walmart.com's primary anti-bot vendor in 2026 is HUMAN Security (formerly PerimeterX, still commonly referenced by that name in developer circles). HUMAN's Bot Defender sits in front of Walmart's origin servers and scores every incoming request on a stack of signals.

The _px Cookie and Sensor Data

On first visit, PerimeterX drops a _px3 cookie (and related _pxhd, _pxvid cookies). These cookies encode a device token derived from extensive browser fingerprinting: Canvas fingerprint, WebGL renderer, installed fonts, timezone, navigator properties, WebRTC IPs, battery status, plugins, screen dimensions, audio fingerprint, and more. The token is re-verified on subsequent requests by posting a sensor payload to https://collector-pxajd1m4oa.px-cloud.net/api/v2/collector (the endpoint hostname varies per customer).

The "Press & Hold" Challenge

When PerimeterX can't verify the device token or the score drops below a threshold, Walmart serves its signature interstitial: a captcha that requires the user to press and hold a button. Internally this is a proof-of-work + mouse-movement verification. Bots without real mouse dynamics fail this challenge; so do HTTP-only clients, because the challenge requires JavaScript execution.

TLS and HTTP/2 Fingerprints

PerimeterX and Akamai both read your TLS JA3/JA4 hash and HTTP/2 SETTINGS frame order. Python requests, plain urllib, and older versions of aiohttp all have easily identifiable fingerprints that no real browser emits. Mismatches trigger immediate challenges.

Behavioral Scoring

Request cadence (too regular = bot), navigation paths (skipping category pages = bot), referer consistency (no referer = bot), and session longevity all feed into a running score. Scrapers that fire sequential product-page requests with no browsing context look nothing like a real shopper and are flagged within 10-20 requests.

IP Reputation

Walmart maintains its own IP reputation feed plus consumes third-party feeds. Any IP from AWS, GCP, Azure, DigitalOcean, Hetzner, or common VPN providers is pre-flagged. Datacenter IPs are effectively blocked from the start — not because of request volume, but because the IP belongs to an infrastructure provider.

Best Proxies for Scraping Walmart

Proxy choice is the single biggest determinant of success on Walmart. Because PerimeterX weights IP reputation heavily, the proxy must look like a real consumer ISP endpoint — anything else fails before PerimeterX even evaluates the browser fingerprint.

Rotating Residential Proxies (Required)

Residential proxies route traffic through real consumer home connections on Comcast, Spectrum, Verizon, AT&T, Charter, and similar ISPs. PerimeterX treats these IPs as ordinary shoppers because that is what they are.

SpyderProxy's Budget Residential plan at $1.75/GB works for most Walmart workloads with a 10M+ residential IP pool. For higher-volume scraping or workloads that push detection, Premium Residential at $2.75/GB with 130M+ IPs provides better pool freshness — the IPs you get have seen less scraping traffic and rank higher in HUMAN's reputation scoring.

Mobile (LTE) Proxies for the Hardest Cases

If residential proxies start to show sustained block rates despite a clean stack, LTE proxies at $2/IP are the next escalation. Mobile carriers use CGNAT, so a single public IP is shared by thousands of real phone users, making IP-level banning commercially unattractive for the target. Mobile IPs have the highest trust scores of any proxy type on PerimeterX-defended sites.

Avoid Datacenter and Most ISP Proxies

Datacenter proxies are detected by HUMAN within one or two requests. They are useful for other sites but not for Walmart. ISP proxies ($3.90/day) can work for small-volume manual rechecks, but at scale they are worse than rotating residential because the same IP making many requests accumulates reputation debt faster than a rotating pool.

How to Bypass PerimeterX on Walmart (2026)

There are three practical paths to defeating the PerimeterX / HUMAN Bot Defender layer. Pick one based on your volume, latency tolerance, and engineering effort.

Path 1: Browser Automation with Stealth (Most Reliable)

Run a real Chromium browser via Playwright or Puppeteer with a stealth plugin (e.g. playwright-extra + puppeteer-extra-plugin-stealth). The browser natively executes PerimeterX's sensor JavaScript, produces a legitimate _px3 cookie, and passes the Press & Hold challenge when it appears (either by letting a human solve it occasionally or by integrating a solver service for the press-and-hold interaction).

Pros: highest success rate, handles DOM changes automatically, works for pages that require JavaScript rendering. Cons: 10-20x higher cost per request (both in proxy bandwidth — 2-5 MB per full page load — and in CPU/memory). Use this path for low-to-mid volume, say under 50,000 pages per day.
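A minimal sketch of this path using Python's synchronous Playwright API. The proxy host, wait strategy, and lack of stealth hardening are illustrative assumptions, not a production recipe:

```python
def extract_px3(cookies):
    """Return the _px3 value from a Playwright cookie list, or None."""
    for c in cookies:
        if c.get("name") == "_px3":
            return c.get("value")
    return None

def fetch_with_browser(url, proxy_server=None):
    # Imported inside the function so the module loads without Playwright installed.
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={"server": proxy_server} if proxy_server else None,
        )
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let the PX sensor JS execute
        html = page.content()
        px3 = extract_px3(page.context.cookies())
        browser.close()
        return html, px3
```

In production you would layer a stealth plugin on top, as described above; vanilla headless Chromium alone is unlikely to score well against Bot Defender.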

Path 2: HTTP with TLS Fingerprint Spoofing (Cheapest at Scale)

Use curl_cffi in Python, which wraps libcurl with real browser TLS fingerprints. Set impersonate="chrome124" or a current Chrome version. Combine with a real-looking header set (20+ headers in correct order, Accept-Language, Sec-CH-UA-* client hints, valid Referer), rotating residential proxy per request, and a pre-warmed _px3 cookie extracted from a periodic browser session.

This path is cheapest at scale (around 200-400 KB per page vs 2-5 MB for browser automation) but requires you to periodically refresh the _px3 cookie because it ages out. Run a headless Chrome every 30-60 minutes to harvest a fresh cookie, then reuse it across many HTTP-only requests. Works well for 100k-1M+ pages per day.

Path 3: Hosted PerimeterX Solver Services

Services like Capsolver, 2Captcha, NopeCHA, and ScrapFly offer hosted PerimeterX solvers. You send the target URL and your proxy credentials; they return a validated _px3 cookie (or a full fetched response body) that passes PerimeterX scoring. Pricing ranges from $0.50 to $2 per 1,000 solves.

Pros: zero stealth engineering on your side, rapid integration. Cons: cost scales linearly, vendor lock-in, and solver quality drifts when HUMAN ships detection updates the solver hasn't caught up with. Recommended for teams that prioritize time-to-data over cost efficiency, or as a fallback for the 5-10% of requests your primary stack can't solve.

Realistic Success Rates by Path

| Path                           | Proxy                      | Success rate | Cost per 1k pages | Best for                        |
|--------------------------------|----------------------------|--------------|-------------------|---------------------------------|
| Browser + stealth              | Residential                | 85-95%       | $3.50-10.00       | Low-to-mid volume, any page     |
| HTTP + curl_cffi + warm cookie | Residential                | 75-90%       | $0.40-0.80        | High volume, product pages      |
| Hosted solver (Capsolver etc.) | Your residential or theirs | 90-95%       | $1.50-3.00        | Fallback / no-engineering teams |
| Datacenter proxy only          | Datacenter                 | 0-10%        | n/a               | Do not use on Walmart           |

Realistic Rate Limits for Walmart

Walmart doesn't publish a scraper rate limit; the effective limit is what PerimeterX tolerates before raising your risk score. Observed working parameters in 2026:

  • Per residential IP: 1 request every 3-5 seconds. Faster than that and PerimeterX flags even legitimate-looking traffic.
  • Per residential session: 15-40 sequential product pages before rotating to a new IP. Longer sessions accumulate too many "shopping" signals.
  • Concurrency: 20-80 concurrent workers, each with its own session ID on the proxy and its own _px3 cookie context.
  • Daily volume: 30,000-150,000 pages per day per machine with a clean HTTP-based stack; 5,000-20,000 with browser automation.

Add random jitter (0.5-2.5 seconds) to each request interval. Deterministic timing is a bot fingerprint on its own — HUMAN's behavioral scoring catches regular request intervals even when every other signal is clean.
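The pacing above reduces to a small delay generator. The 3-5 second base interval and 0.5-2.5 second jitter are the figures from the list; adjust them to your observed block rate:

```python
import random

BASE_MIN, BASE_MAX = 3.0, 5.0      # per-IP request interval (seconds)
JITTER_MIN, JITTER_MAX = 0.5, 2.5  # added jitter per request

def next_delay():
    """Base interval plus random jitter; deterministic timing is itself a bot signal."""
    return random.uniform(BASE_MIN, BASE_MAX) + random.uniform(JITTER_MIN, JITTER_MAX)
```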

Minimal Walmart Scraper (HTTP Path, 2026)

A working Walmart product-page scraper using curl_cffi for TLS fingerprint spoofing, rotating residential proxies, and a pre-warmed _px3 cookie. The cookie-refresh loop is omitted for brevity but runs every 30-60 minutes from a separate Playwright worker.

import random, time, json
from curl_cffi import requests
from bs4 import BeautifulSoup

# Residential gateway credentials; replace the placeholders with your own.
PROXY = "http://username-session-{sid}:PASSWORD@REPLACE_WITH_PROXY_HOST:7777"

# _px3 cookie harvested by a separate Playwright session; rotate hourly.
PX_COOKIE = "REPLACE_WITH_FRESH_PX3_VALUE"

def fetch_product(usitemid, session_id=None):
    sid = session_id or random.randint(10000, 99999)
    proxy = PROXY.format(sid=sid)
    url = f"https://www.walmart.com/ip/{usitemid}"
    cookies = {"_px3": PX_COOKIE}
    r = requests.get(
        url,
        impersonate="chrome124",
        proxies={"http": proxy, "https": proxy},
        cookies=cookies,
        timeout=30,
    )
    r.raise_for_status()
    # PerimeterX serves its challenge page with HTTP 200, so check the body too.
    if "px-captcha" in r.text or "_pxCaptcha" in r.text:
        raise RuntimeError(f"PerimeterX challenge on session {sid}; rotate and refresh _px3")
    return r.text

def parse_product(html):
    soup = BeautifulSoup(html, "html.parser")
    # Walmart embeds product data in a __NEXT_DATA__ JSON blob.
    script = soup.find("script", id="__NEXT_DATA__")
    if not script:
        return None
    blob = json.loads(script.string)
    product = blob["props"]["pageProps"]["initialData"]["data"]["product"]
    return {
        "usItemId":    product.get("usItemId"),
        "title":       product.get("name"),
        "brand":       product.get("brand"),
        "price":       product["priceInfo"]["currentPrice"]["price"],
        "was_price":   product["priceInfo"].get("wasPrice", {}).get("price"),
        "in_stock":    product["availabilityStatus"] == "IN_STOCK",
        "seller_id":   product.get("sellerId"),
        "rating":      product.get("averageRating"),
        "review_count": product.get("numberOfReviews"),
    }

if __name__ == "__main__":
    # Example product IDs
    for item_id in ["1234567890", "0987654321"]:
        html = fetch_product(item_id)
        row = parse_product(html)
        print(row)
        time.sleep(random.uniform(3.0, 5.0))

Key details: impersonate="chrome124" spoofs the Chrome 124 TLS fingerprint; the _px3 cookie must come from a real browser session harvested within the last ~30 minutes; the parser reads Walmart's __NEXT_DATA__ JSON blob rather than fragile DOM selectors — this is the single most stable source of product data. See our guide on rotating proxies with Python requests for deeper session-management patterns.

Handling PerimeterX Challenges and Failures

Press & Hold Interstitial (HTTP 200 + Challenge HTML)

PerimeterX returns HTTP 200 with a challenge page, not a 4xx. Detect it by checking if the response HTML contains _pxCaptcha, px-captcha, or the telltale hostname captcha.px-cdn.net. When detected, mark that proxy session as burned, rotate to a new session ID, refresh the _px3 cookie, and retry.
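The marker check is small enough to inline into any fetch wrapper; the strings are the ones listed above:

```python
# Marker strings that appear in PerimeterX challenge HTML but not in product pages.
PX_MARKERS = ("_pxCaptcha", "px-captcha", "captcha.px-cdn.net")

def is_px_challenge(html: str) -> bool:
    """True if an HTTP-200 response body is actually a PerimeterX challenge page."""
    return any(marker in html for marker in PX_MARKERS)
```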

HTTP 429 Rate Limit

Back off with exponential delay (start at 60s, double on each consecutive 429). If one IP consistently 429s, the subnet might be partially burned — rotate through a wider pool or temporarily move to mobile proxies until reputation recovers.
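The backoff schedule can be sketched as a generator; `fetch` here is any callable returning a `(status, body)` tuple, and the cap of 960 seconds is an illustrative assumption:

```python
import time

def backoff_delays(start=60, factor=2, max_delay=960):
    """Yield the delay schedule for consecutive 429s: 60, 120, 240, ... capped."""
    delay = start
    while True:
        yield min(delay, max_delay)
        delay *= factor

def fetch_with_backoff(fetch, url, max_attempts=5):
    """Retry on 429 with exponential backoff; return the first non-429 response."""
    delays = backoff_delays()
    for _ in range(max_attempts):
        status, body = fetch(url)
        if status != 429:
            return status, body
        time.sleep(next(delays))
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts: {url}")
```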

Empty or Partial __NEXT_DATA__

Walmart occasionally A/B-tests DOM changes. Always validate your parser against a known-good item ID on startup and whenever block rate rises. When the __NEXT_DATA__ shape changes, it is a Walmart frontend change, not a PerimeterX challenge: update your parser, not your proxy stack.

Soft Throttles

PerimeterX can also degrade your experience rather than block outright — returning older cached prices, omitting seller data, or truncating reviews. Validate scraped fields against expected types and ranges; if price is missing or zero on items that should have prices, treat as a soft throttle and re-fetch from a different session.
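A heuristic validator along these lines catches the missing-or-zero-price case; the field names match the parse_product output earlier in this guide, and the checks themselves are illustrative:

```python
def looks_throttled(row):
    """Heuristic soft-throttle check: required fields present and sane."""
    if row is None:
        return True
    price = row.get("price")
    if price is None or not isinstance(price, (int, float)) or price <= 0:
        return True
    if row.get("usItemId") is None or row.get("title") is None:
        return True
    return False
```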

Scaling Walmart Scraping to Production

1. Cookie-Harvester Workers

Run a small pool (3-10) of Playwright-based cookie harvesters. Each loads Walmart.com, waits for the _px3 cookie to set, extracts it, and publishes to a shared Redis key. HTTP workers pull the current cookie from Redis on every request. Rotate harvesters every 30-60 minutes.
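A sketch of the shared-cookie plumbing with a redis-py-style client; the key name `walmart:px3` and the 30-minute cutoff are assumptions:

```python
import json, time

MAX_COOKIE_AGE = 30 * 60  # seconds; harvesters rotate every 30-60 minutes

def cookie_is_stale(harvested_at, now=None, max_age=MAX_COOKIE_AGE):
    """True once the harvested cookie has aged past the cutoff."""
    current = now if now is not None else time.time()
    return (current - harvested_at) >= max_age

def publish_cookie(redis_client, px3, key="walmart:px3"):
    """Harvester side: publish a fresh _px3 plus its timestamp to the shared key."""
    redis_client.set(key, json.dumps({"px3": px3, "harvested_at": time.time()}))

def current_cookie(redis_client, key="walmart:px3"):
    """Worker side: pull the current cookie, or None if missing or stale."""
    raw = redis_client.get(key)
    if raw is None:
        return None
    doc = json.loads(raw)
    return None if cookie_is_stale(doc["harvested_at"]) else doc["px3"]
```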

2. Session Pinning

Pin each proxy session ID to its harvested _px3 cookie — don't mix cookies across proxy IPs or PerimeterX scoring degrades quickly. One IP, one fingerprint, one cookie, one session's worth of requests, then rotate.
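One way to represent the pinning in code; the 30-request budget is an assumed value inside the 15-40 page window observed earlier:

```python
import random
from dataclasses import dataclass

MAX_REQUESTS_PER_SESSION = 30  # within the observed 15-40 page window

@dataclass
class PinnedSession:
    """One proxy session ID pinned to one _px3 cookie for its whole lifetime."""
    sid: int
    px3: str
    requests_made: int = 0

    def exhausted(self):
        return self.requests_made >= MAX_REQUESTS_PER_SESSION

def new_session(px3):
    """Start a fresh session: new proxy session ID, freshly harvested cookie."""
    return PinnedSession(sid=random.randint(10000, 99999), px3=px3)
```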

3. Queue Architecture

Decouple URL generation, fetching, parsing, and storage via a queue (Redis streams, SQS, RabbitMQ). Workers pull from the queue, fetch through their pinned proxy session, push parsed data to a write queue, and failed URLs into a retry queue with exponential backoff.

4. Monitoring

Track block rate (challenges / total requests), parse rate (parse successes / fetch successes), and cost per 1k pages. A block rate over 10% usually means your cookie harvester is stale, your User-Agent pool is fingerprinted, or HUMAN shipped a detection update.
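A minimal way to track these ratios, assuming the counters are incremented from your fetch and parse workers:

```python
class ScrapeMetrics:
    """Running counters for block rate, parse rate, and a simple health check."""

    def __init__(self):
        self.requests = 0    # total requests sent
        self.challenges = 0  # PerimeterX challenges served
        self.fetch_ok = 0    # successful fetches
        self.parse_ok = 0    # successful parses

    def block_rate(self):
        return self.challenges / self.requests if self.requests else 0.0

    def parse_rate(self):
        return self.parse_ok / self.fetch_ok if self.fetch_ok else 0.0

    def healthy(self, max_block_rate=0.10):
        # Over 10% block rate: stale cookies, burned UA pool, or a HUMAN update.
        return self.block_rate() <= max_block_rate
```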

5. Dedup and Storage

Walmart prices change throughout the day but most items are stable hour-to-hour. Hash the core fields (price, availability, seller) per product per fetch and skip writes when unchanged. This keeps your dataset clean and storage costs low.
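A sketch of the hashing step; the field set and the usItemId key mirror the parser earlier in this guide, and `last_hashes` can be any dict-like store (in-memory, Redis hash, etc.):

```python
import hashlib, json

CORE_FIELDS = ("price", "in_stock", "seller_id")  # fields whose change matters

def core_hash(row):
    """Stable hash over the volatile fields of one parsed product row."""
    core = {k: row.get(k) for k in CORE_FIELDS}
    return hashlib.sha256(json.dumps(core, sort_keys=True).encode()).hexdigest()

def should_write(row, last_hashes):
    """Skip the DB write when the core fields are unchanged since last fetch."""
    h = core_hash(row)
    key = row.get("usItemId")
    if last_hashes.get(key) == h:
        return False
    last_hashes[key] = h
    return True
```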

Frequently Asked Questions

Is it legal to scrape Walmart?

Scraping publicly visible pages on Walmart.com is generally legal in most jurisdictions, based on the same precedents (hiQ Labs v. LinkedIn in the US) that protect public web scraping. Walmart's Terms of Use prohibit automated access, so scraping violates ToU and can lead to IP blocks, but typically carries no criminal risk for publicly-accessible data.

What is PerimeterX and why does it block my scraper?

PerimeterX (now HUMAN Security Bot Defender) is the primary anti-bot layer Walmart uses. It drops a _px3 cookie on your first visit, fingerprints your browser via Canvas, WebGL, fonts, and sensor-data posts, and serves a "Press & Hold" captcha when scoring drops below a threshold. Any HTTP-only scraper without TLS fingerprint spoofing and a valid _px3 cookie is flagged within 10-20 requests.

What's the best proxy for scraping Walmart?

Rotating residential proxies are required. SpyderProxy Budget Residential at $1.75/GB with a 10M+ residential pool handles most Walmart workloads. For higher-volume scraping where detection is tight, upgrade to Premium Residential at $2.75/GB with 130M+ IPs, or escalate to LTE mobile proxies at $2/IP for the hardest cases. Datacenter proxies do not work on Walmart.

How do I bypass Walmart's Press & Hold captcha?

Three practical paths: (1) run a real browser via Playwright or Puppeteer with a stealth plugin — highest success rate but 10-20x cost per request; (2) use curl_cffi with a warmed _px3 cookie harvested from a periodic browser session — cheapest at scale; (3) use a hosted PerimeterX solver service like Capsolver, 2Captcha, or ScrapFly at $0.50-$2 per 1,000 solves as a fallback.

How many Walmart pages can I scrape per day?

With an HTTP-based stack (curl_cffi + rotating residential + warmed cookie), 30,000-150,000 Walmart pages per day per machine is achievable at sub-5% block rates. With full browser automation via Playwright, 5,000-20,000 pages per day per machine due to 10-20x higher cost per request.

Where does Walmart store product data on the page?

Walmart embeds full product data in a JSON blob inside a script tag with id="__NEXT_DATA__" on every product page. Parse that JSON rather than fragile DOM selectors — the JSON shape is far more stable than the rendered markup and gives you richer fields (seller IDs, inventory, review metadata, category breadcrumbs).

Can I scrape Walmart prices by ZIP code?

Yes. Walmart serves different prices, delivery options, and store-pickup availability per ZIP code. Set the ZIP via the location-setting endpoint or by carrying a pre-set location cookie across your session. Pair with city-level residential proxies matching the ZIP's state or metro for the most accurate pricing — otherwise Walmart may serve a different region's default prices. See our guide on best USA proxies for state-level targeting.

Scrape Walmart Past PerimeterX

SpyderProxy rotating residential proxies start at $1.75/GB with 10M+ clean IPs and auto-rotation. The proxy layer is the foundation of any working Walmart scraper — get it right from the start.