spyderproxy

Data Quality Assurance for Scraped Data (2026 Guide)

Daniel K. | Published Sun May 10 2026

Quick verdict: Data quality assurance (DQA) for web-scraped data has 5 dimensions: completeness, accuracy, consistency, timeliness, and validity. The biggest source of bad data is anti-bot fakeouts — pages that return 200 but carry junk content (CAPTCHA pages, "we noticed you are a bot," empty results). The biggest defense is monitoring page structure: if the scraper expects 50 fields and only finds 12, flag the row. Proxy quality matters too — flagged IPs get served fake or empty pages with a 200 OK; residential pools have a much lower poison rate.

The 5 Dimensions of Data Quality

Dimension    | Question                         | Example check
Completeness | Are all expected fields present? | price field non-null on 99%+ of records
Accuracy     | Do values match reality?         | price within 50% of last week's value
Consistency  | Do related fields agree?         | SKU + product_name are stable; price changes
Timeliness   | Is the data fresh?               | scraped within last 24h
Validity     | Are formats correct?             | price is numeric, dates parse, URLs resolve

The #1 Threat: Anti-Bot Fakeouts

Modern anti-bot systems often do not return 403 or 503 anymore. Instead they return 200 OK with:

  • A CAPTCHA page that looks superficially like the real page
  • An empty listing page ("no results found") when the site has thousands
  • Decoy data with subtly wrong values (some retailers do this to detect scrapers)
  • A "rate limited" message hidden inside a normal-looking layout

Your HTTP-status check passes. Your scraper extracts what it can. Your downstream dashboard now shows wrong numbers.

Defenses:

  1. Check for specific content markers (the page MUST contain a known string)
  2. Check field counts — if you usually get 50 product cards and get 0, alert
  3. Check field value distributions — if all prices are suddenly "$0," alert
  4. Run a sample of scrapes through a real browser comparison to catch drift
  5. Use clean residential proxies — flagged IP pools return decoy data more often
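
A minimal sketch of the first three defenses, assuming BeautifulSoup and hypothetical selectors (".product-card", ".price") and marker text for the target site:

from bs4 import BeautifulSoup

KNOWN_MARKER = "Add to cart"   # a string the real page always contains (assumption)
MIN_EXPECTED_CARDS = 10        # tune per site

def looks_like_fakeout(html: str) -> list[str]:
    """Return a list of reasons this page looks like an anti-bot fakeout."""
    problems = []
    soup = BeautifulSoup(html, "html.parser")
    if KNOWN_MARKER not in html:
        problems.append("known content marker missing")
    cards = soup.select(".product-card")
    if len(cards) < MIN_EXPECTED_CARDS:
        problems.append(f"only {len(cards)} product cards (expected >= {MIN_EXPECTED_CARDS})")
    prices = [p.get_text(strip=True) for p in soup.select(".price")]
    if prices and all(p in ("$0", "$0.00") for p in prices):
        problems.append("all prices are $0")
    return problems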

Completeness Checks

Required fields should never be null. For each row:

import pandas as pd

REQUIRED = ["sku", "title", "price", "currency", "url"]

def completeness_score(df):
    """Return the non-null fraction of each required column (0.0 if the column is missing)."""
    out = {}
    for col in REQUIRED:
        if col in df.columns:
            out[col] = df[col].notna().mean()
        else:
            out[col] = 0.0
    return out

# Alert if any required field drops below 95% completeness.
# `df` is the scraped batch; alert() is whatever notification hook you use.
for col, score in completeness_score(df).items():
    if score < 0.95:
        alert(f"{col} only {score:.1%} complete")

For optional fields, track the rate over time. A sudden drop from 80% to 20% usually means the site changed structure.
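
A sketch of that drift check, assuming you persist baseline fill rates somewhere and that alert() is your notification hook:

def optional_field_drift(df, baseline_rates, drop_threshold=0.5):
    """Alert when an optional field's fill rate falls sharply below its stored baseline."""
    for col, baseline in baseline_rates.items():
        if col not in df.columns or baseline == 0:
            continue
        current = df[col].notna().mean()
        # e.g. 0.80 -> 0.20 is a 75% relative drop, well past a 50% threshold
        if (baseline - current) / baseline > drop_threshold:
            alert(f"{col} fill rate fell from {baseline:.0%} to {current:.0%}")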

Accuracy Checks

Accuracy is the hardest dimension to verify because "correct" depends on ground truth you rarely have. Practical approximations:

  • Range bounds: a SKU's price should not change by >50% day-over-day (unless flash sales are expected).
  • Format consistency: if 99% of prices have 2 decimal places, the 1% with no decimals are suspicious.
  • Cross-source validation: the same product on multiple retailers should not have wildly different prices.
  • Spot checks: sample 100 records/day and manually verify against the live page.

The range-bounds check translates into a simple day-over-day comparison:

def detect_price_outliers(df_today, df_yesterday):
    """Flag SKUs with >50% price change versus yesterday (assumes yesterday's price is non-zero)."""
    merged = df_today.merge(df_yesterday, on="sku", suffixes=("_t", "_y"))
    merged["pct_change"] = (merged["price_t"] - merged["price_y"]) / merged["price_y"]
    return merged[merged["pct_change"].abs() > 0.5]

Consistency

Some fields must agree. Example: if you scrape product pages and search-result pages, the same SKU should have the same title in both:

def title_consistency(df1, df2):
    """Return rows where the same SKU has different titles in the two sources."""
    merged = df1.merge(df2, on="sku", suffixes=("_search", "_product"))
    mismatched = merged[merged["title_search"].str.lower() != merged["title_product"].str.lower()]
    return mismatched

Other consistency checks: currency symbol matches currency code, stated availability matches inventory count, breadcrumb path matches category field.
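
As one concrete example, the symbol-to-code check is a few lines (the price_raw column and the symbol map are schema assumptions):

SYMBOL_TO_CODE = {"$": "USD", "€": "EUR", "£": "GBP"}

def currency_consistency(df):
    """Return rows whose leading currency symbol disagrees with the ISO currency code."""
    def row_ok(row):
        expected = SYMBOL_TO_CODE.get(str(row["price_raw"])[:1])
        return expected is None or expected == row["currency"]
    return df[~df.apply(row_ok, axis=1)]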

Timeliness

Stale data is wrong data. Track scraped_at per row:

from datetime import datetime

import pandas as pd

def staleness_alert(df, max_age_hours=24):
    """Alert if any rows were scraped more than max_age_hours ago (scraped_at assumed to be naive UTC)."""
    now = datetime.utcnow()
    df["age_hours"] = (now - pd.to_datetime(df["scraped_at"])).dt.total_seconds() / 3600
    stale = df[df["age_hours"] > max_age_hours]
    if len(stale) > 0:
        alert(f"{len(stale)} records older than {max_age_hours}h")

For real-time price monitoring, hours matter. For monthly market research, days. Set thresholds per dataset.
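
One way to encode that is a per-dataset threshold map fed into staleness_alert (the dataset names and the datasets mapping are illustrative):

MAX_AGE_HOURS = {
    "realtime_prices": 2,         # hours matter
    "monthly_research": 30 * 24,  # days are fine
}

for name, df in datasets.items():  # `datasets`: a hypothetical {name: DataFrame} mapping
    staleness_alert(df, max_age_hours=MAX_AGE_HOURS.get(name, 24))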

Validity (Format Checks)

import re

def validate_row(row):
    errors = []
    if not isinstance(row["price"], (int, float)) or row["price"] <= 0:
        errors.append("price not a positive number")
    if not re.match(r"^https?://", str(row["url"])):
        errors.append("url not http(s)")
    if not re.match(r"^[A-Z]{3}$", str(row["currency"])):
        errors.append("currency not ISO 4217")
    return errors

Apply per row, flag failures, quarantine for manual review.
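
A minimal batch wrapper for that, assuming the DataFrame columns from the earlier snippets and a CSV file as the quarantine target:

errors = df.apply(validate_row, axis=1)           # Series of per-row error lists
quarantine = df[errors.apply(len) > 0]
clean = df[errors.apply(len) == 0]
quarantine.to_csv("quarantine.csv", index=False)  # park failed rows for manual review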

How Proxy Quality Affects Data Quality

The chain: bad proxy → flagged by site → site returns decoy/empty/CAPTCHA → scraper "succeeds" with bad data → bad data in dashboard.

Proxy type           | Decoy/empty rate (typical)  | Cost per clean GB
Free public proxies  | 40-80%                      | Free (but unusable)
Datacenter           | 20-50% on protected sites   | Very low (but for protected sites: high effective cost)
Budget Residential   | 5-15%                       | $1.75/GB
Premium Residential  | 1-5%                        | $2.75/GB
LTE Mobile           | <1%                         | $2/IP unlimited

For data-quality-sensitive workloads, the higher per-GB cost of clean proxies is offset by drastically fewer poisoned records downstream. A 10% decoy rate means 10% of your dashboard is wrong — far more expensive than the GB markup.
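
To make "effective cost" concrete, divide the sticker price by the clean fraction. Using the table's figures (and ignoring the much larger downstream cost of wrong dashboard numbers):

def effective_cost_per_clean_gb(price_per_gb, decoy_rate):
    return price_per_gb / (1 - decoy_rate)

print(effective_cost_per_clean_gb(1.75, 0.15))  # budget residential, pessimistic end: ~$2.06
print(effective_cost_per_clean_gb(2.75, 0.05))  # premium residential: ~$2.89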

Setting Up Continuous Monitoring

  1. Per-scrape: validate each row against your schema, drop or quarantine failures.
  2. Per-batch: compute completeness/accuracy metrics, alert on threshold breaches.
  3. Daily: compare today's aggregate stats to last 7 days; flag deviations.
  4. Weekly: spot-check 100 random rows against live pages; track drift.
  5. Quarterly: re-review schema against site changes — new fields, renamed fields, deprecated structures.
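
Step 3 can be as simple as comparing today's aggregate metrics to a trailing 7-day mean (the history frame and metric names are hypothetical; one row of metrics per day is assumed):

import pandas as pd

def daily_drift(history: pd.DataFrame, today: dict, tolerance=0.25):
    """Return metrics that deviate more than `tolerance` from the trailing 7-day mean."""
    baseline = history.tail(7).mean(numeric_only=True)
    flagged = {}
    for metric, value in today.items():
        base = baseline.get(metric)
        if base and abs(value - base) / base > tolerance:
            flagged[metric] = (base, value)
    return flagged  # e.g. {"row_count": (50231.0, 31002)}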

Tools

  • Great Expectations — Python framework for data validation with declarative expectations.
  • Deequ — AWS/Spark-based, profile + validate large datasets.
  • dbt tests — SQL-based validation if your data lives in a warehouse.
  • pandera — lightweight Python schema validation for pandas DataFrames.
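
As a small example, the validity checks above map onto a pandera schema roughly like this (column names match the earlier snippets; treat it as a sketch and check the current pandera API):

import pandera as pa

schema = pa.DataFrameSchema({
    "sku": pa.Column(str, nullable=False),
    "price": pa.Column(float, pa.Check.ge(0), nullable=False),
    "currency": pa.Column(str, pa.Check.str_matches(r"^[A-Z]{3}$")),
    "url": pa.Column(str, pa.Check.str_matches(r"^https?://")),
})

validated = schema.validate(df, lazy=True)  # lazy=True collects all failures before raising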

Related: Data quality metrics, Cost of poor data quality, Python requests retry.