spyderproxy

Data Quality Assurance for Scraped Data (2026 Guide)

Daniel K. | Published Sun May 10 2026

Quick verdict: Data quality assurance (DQA) for web-scraped data has 5 dimensions: completeness, accuracy, consistency, timeliness, and validity. The biggest source of bad data is anti-bot fakeouts — pages that return 200 but carry junk content (CAPTCHA pages, "we noticed you are a bot," empty results). The biggest defense is monitoring page structure: if the scraper expects 50 fields and only finds 12, flag the row. Proxy quality matters too — flagged IPs get served fake or empty pages with a 200 OK; residential pools have a much lower poison rate.

The 5 Dimensions of Data Quality

Dimension    | Question                         | Example check
Completeness | Are all expected fields present? | price field non-null on 99%+ of records
Accuracy     | Do values match reality?         | price within 50% of last week's value
Consistency  | Do related fields agree?         | SKU + product_name are stable; price changes
Timeliness   | Is the data fresh?               | scraped within last 24h
Validity     | Are formats correct?             | price is numeric, dates parse, URLs resolve

The #1 Threat: Anti-Bot Fakeouts

Modern anti-bot systems often do not return 403 or 503 anymore. Instead they return 200 OK with:

  • A CAPTCHA page that looks superficially like the real page
  • An empty listing page ("no results found") when the site has thousands
  • Decoy data with subtly wrong values (some retailers do this to detect scrapers)
  • A "rate limited" message hidden inside a normal-looking layout

Your HTTP-status check passes. Your scraper extracts what it can. Your downstream dashboard now shows wrong numbers.

Defenses:

  1. Check for specific content markers (the page MUST contain a known string)
  2. Check field counts — if you usually get 50 product cards and get 0, alert
  3. Check field value distributions — if all prices are suddenly "$0," alert
  4. Run a sample of scrapes through a real browser comparison to catch drift
  5. Use clean residential proxies — flagged IP pools return decoy data more often
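
A minimal sketch of the first three defenses, assuming BeautifulSoup and hypothetical selectors (".product-card", ".price") and marker text for the target site:

from bs4 import BeautifulSoup

KNOWN_MARKER = "Add to cart"   # a string the real page always contains (assumption)
MIN_EXPECTED_CARDS = 10        # tune per site

def looks_like_fakeout(html: str) -> list[str]:
    """Return a list of reasons this page looks like an anti-bot fakeout."""
    problems = []
    soup = BeautifulSoup(html, "html.parser")
    if KNOWN_MARKER not in html:
        problems.append("known content marker missing")
    cards = soup.select(".product-card")
    if len(cards) < MIN_EXPECTED_CARDS:
        problems.append(f"only {len(cards)} product cards (expected >= {MIN_EXPECTED_CARDS})")
    prices = [p.get_text(strip=True) for p in soup.select(".price")]
    if prices and all(p in ("$0", "$0.00") for p in prices):
        problems.append("all prices are $0")
    return problems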

Completeness Checks

Required fields should never be null. For each row:

import pandas as pd

REQUIRED = ["sku", "title", "price", "currency", "url"]

def completeness_score(df):
    """Return the non-null fraction of each required column (0.0 if the column is missing)."""
    out = {}
    for col in REQUIRED:
        if col in df.columns:
            out[col] = df[col].notna().mean()
        else:
            out[col] = 0.0
    return out

# Alert if any required field drops below 95% completeness.
# `df` is the scraped batch; alert() is whatever notification hook you use.
for col, score in completeness_score(df).items():
    if score < 0.95:
        alert(f"{col} only {score:.1%} complete")

For optional fields, track the rate over time. A sudden drop from 80% to 20% usually means the site changed structure.
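
A sketch of that drift check, assuming you persist baseline fill rates somewhere and that alert() is your notification hook:

def optional_field_drift(df, baseline_rates, drop_threshold=0.5):
    """Alert when an optional field's fill rate falls sharply below its stored baseline."""
    for col, baseline in baseline_rates.items():
        if col not in df.columns or baseline == 0:
            continue
        current = df[col].notna().mean()
        # e.g. 0.80 -> 0.20 is a 75% relative drop, well past a 50% threshold
        if (baseline - current) / baseline > drop_threshold:
            alert(f"{col} fill rate fell from {baseline:.0%} to {current:.0%}")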

Accuracy Checks

Accuracy is the hardest dimension to verify because "correct" depends on ground truth you rarely have. Practical approximations:

  • Range bounds: a SKU's price should not change by >50% day-over-day (unless flash sales are expected).
  • Format consistency: if 99% of prices have 2 decimal places, the 1% with no decimals are suspicious.
  • Cross-source validation: the same product on multiple retailers should not have wildly different prices.
  • Spot checks: sample 100 records/day and manually verify against the live page.

The range-bounds check translates into a simple day-over-day comparison:

def detect_price_outliers(df_today, df_yesterday):
    """Flag SKUs with >50% price change versus yesterday (assumes yesterday's price is non-zero)."""
    merged = df_today.merge(df_yesterday, on="sku", suffixes=("_t", "_y"))
    merged["pct_change"] = (merged["price_t"] - merged["price_y"]) / merged["price_y"]
    return merged[merged["pct_change"].abs() > 0.5]

Consistency

Some fields must agree. Example: if you scrape product pages and search-result pages, the same SKU should have the same title in both:

def title_consistency(df1, df2):
    """Return rows where the same SKU has different titles in the two sources."""
    merged = df1.merge(df2, on="sku", suffixes=("_search", "_product"))
    mismatched = merged[merged["title_search"].str.lower() != merged["title_product"].str.lower()]
    return mismatched

Other consistency checks: currency symbol matches currency code, stated availability matches inventory count, breadcrumb path matches category field.
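
As one concrete example, the symbol-to-code check is a few lines (the price_raw column and the symbol map are schema assumptions):

SYMBOL_TO_CODE = {"$": "USD", "€": "EUR", "£": "GBP"}

def currency_consistency(df):
    """Return rows whose leading currency symbol disagrees with the ISO currency code."""
    def row_ok(row):
        expected = SYMBOL_TO_CODE.get(str(row["price_raw"])[:1])
        return expected is None or expected == row["currency"]
    return df[~df.apply(row_ok, axis=1)]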

Timeliness

Stale data is wrong data. Track scraped_at per row:

from datetime import datetime

import pandas as pd

def staleness_alert(df, max_age_hours=24):
    """Alert if any rows were scraped more than max_age_hours ago (scraped_at assumed to be naive UTC)."""
    now = datetime.utcnow()
    df["age_hours"] = (now - pd.to_datetime(df["scraped_at"])).dt.total_seconds() / 3600
    stale = df[df["age_hours"] > max_age_hours]
    if len(stale) > 0:
        alert(f"{len(stale)} records older than {max_age_hours}h")

For real-time price monitoring, hours matter. For monthly market research, days. Set thresholds per dataset.
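
One way to encode that is a per-dataset threshold map fed into staleness_alert (the dataset names and the datasets mapping are illustrative):

MAX_AGE_HOURS = {
    "realtime_prices": 2,         # hours matter
    "monthly_research": 30 * 24,  # days are fine
}

for name, df in datasets.items():  # `datasets`: a hypothetical {name: DataFrame} mapping
    staleness_alert(df, max_age_hours=MAX_AGE_HOURS.get(name, 24))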

Validity (Format Checks)

import re

def validate_row(row):
    errors = []
    if not isinstance(row["price"], (int, float)) or row["price"] <= 0:
        errors.append("price not a positive number")
    if not re.match(r"^https?://", str(row["url"])):
        errors.append("url not http(s)")
    if not re.match(r"^[A-Z]{3}$", str(row["currency"])):
        errors.append("currency not ISO 4217")
    return errors

Apply per row, flag failures, quarantine for manual review.
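
A minimal batch wrapper for that, assuming the DataFrame columns from the earlier snippets and a CSV file as the quarantine target:

errors = df.apply(validate_row, axis=1)           # Series of per-row error lists
quarantine = df[errors.apply(len) > 0]
clean = df[errors.apply(len) == 0]
quarantine.to_csv("quarantine.csv", index=False)  # park failed rows for manual review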

How Proxy Quality Affects Data Quality

The chain: bad proxy → flagged by site → site returns decoy/empty/CAPTCHA → scraper "succeeds" with bad data → bad data in dashboard.

Proxy type           | Decoy/empty rate (typical)  | Cost per clean GB
Free public proxies  | 40-80%                      | Free (but unusable)
Datacenter           | 20-50% on protected sites   | Very low (but for protected sites: high effective cost)
Budget Residential   | 5-15%                       | $1.75/GB
Premium Residential  | 1-5%                        | $2.75/GB
LTE Mobile           | <1%                         | $2/IP unlimited

For data-quality-sensitive workloads, the higher per-GB cost of clean proxies is offset by drastically fewer poisoned records downstream. A 10% decoy rate means 10% of your dashboard is wrong — far more expensive than the GB markup.
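
To make "effective cost" concrete, divide the sticker price by the clean fraction. Using the table's figures (and ignoring the much larger downstream cost of wrong dashboard numbers):

def effective_cost_per_clean_gb(price_per_gb, decoy_rate):
    return price_per_gb / (1 - decoy_rate)

print(effective_cost_per_clean_gb(1.75, 0.15))  # budget residential, pessimistic end: ~$2.06
print(effective_cost_per_clean_gb(2.75, 0.05))  # premium residential: ~$2.89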

Setting Up Continuous Monitoring

  1. Per-scrape: validate each row against your schema, drop or quarantine failures.
  2. Per-batch: compute completeness/accuracy metrics, alert on threshold breaches.
  3. Daily: compare today's aggregate stats to last 7 days; flag deviations.
  4. Weekly: spot-check 100 random rows against live pages; track drift.
  5. Quarterly: re-review schema against site changes — new fields, renamed fields, deprecated structures.
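
Step 3 can be as simple as comparing today's aggregate metrics to a trailing 7-day mean (the history frame and metric names are hypothetical; one row of metrics per day is assumed):

import pandas as pd

def daily_drift(history: pd.DataFrame, today: dict, tolerance=0.25):
    """Return metrics that deviate more than `tolerance` from the trailing 7-day mean."""
    baseline = history.tail(7).mean(numeric_only=True)
    flagged = {}
    for metric, value in today.items():
        base = baseline.get(metric)
        if base and abs(value - base) / base > tolerance:
            flagged[metric] = (base, value)
    return flagged  # e.g. {"row_count": (50231.0, 31002)}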

Tools

  • Great Expectations — Python framework for data validation with declarative expectations.
  • Deequ — AWS/Spark-based, profile + validate large datasets.
  • dbt tests — SQL-based validation if your data lives in a warehouse.
  • pandera — lightweight Python schema validation for pandas DataFrames.
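
As a small example, the validity checks above map onto a pandera schema roughly like this (column names match the earlier snippets; treat it as a sketch and check the current pandera API):

import pandera as pa

schema = pa.DataFrameSchema({
    "sku": pa.Column(str, nullable=False),
    "price": pa.Column(float, pa.Check.ge(0), nullable=False),
    "currency": pa.Column(str, pa.Check.str_matches(r"^[A-Z]{3}$")),
    "url": pa.Column(str, pa.Check.str_matches(r"^https?://")),
})

validated = schema.validate(df, lazy=True)  # lazy=True collects all failures before raising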

Related: Data quality metrics, Cost of poor data quality, Python requests retry.