spyderproxy

Data Quality Metrics: KPIs for Scraped Pipelines (2026)

Daniel K. | Published Sun May 10 2026

Quick verdict: The 8 metrics worth instrumenting on a scraping pipeline: completeness rate, accuracy rate, freshness lag, dedup rate, schema-drift count, decoy rate, error rate, and per-field coverage. Each has a formula and a sensible threshold. Track them daily; alert when any crosses its threshold. Without numbers, "data quality" is a vibes-based discussion.

1. Completeness Rate

Definition: percentage of required fields populated across all scraped rows.

Formula: sum(non-null required fields) / (rows × required field count)

REQUIRED = ["sku", "title", "price", "url"]

def completeness(df):
    # Fraction of required cells that are populated, across all rows
    total_cells = len(df) * len(REQUIRED)
    populated = sum(df[c].notna().sum() for c in REQUIRED)
    return populated / total_cells if total_cells else 0.0

Threshold: >99% for required fields. >80% for optional fields. Alert below.

2. Accuracy Rate

Definition: percentage of values that match a ground-truth source.

Measurement: this is the hardest metric to compute because ground truth is expensive. Three approximations:

  • Manual spot-check: sample 100 rows/day, verify against live pages; matches / 100
  • Cross-source: same SKU on multiple sites should agree within X%
  • Internal consistency: field values fit known patterns (currency = ISO 4217, date parses, URL resolves)

Threshold: >95% on spot checks. If lower, your proxies are returning decoy data or your selectors drifted.
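The internal-consistency approximation is the cheapest to automate. A minimal sketch, assuming a hypothetical row shape with price, currency, and url fields — the validators and field names are illustrative, not a fixed schema:

```python
import re
from urllib.parse import urlparse

# Hypothetical per-field validators: each returns True when the
# value fits the expected pattern.
VALIDATORS = {
    "price":    lambda v: bool(re.fullmatch(r"\d+(\.\d{1,2})?", str(v))),
    "currency": lambda v: bool(re.fullmatch(r"[A-Z]{3}", str(v))),  # ISO 4217 shape
    "url":      lambda v: urlparse(str(v)).scheme in ("http", "https"),
}

def consistency_rate(rows):
    # Fraction of all validator checks that pass across the batch
    checks = passes = 0
    for row in rows:
        for field, ok in VALIDATORS.items():
            if field in row:
                checks += 1
                passes += ok(row[field])
    return passes / checks if checks else 1.0
```

This catches decoy pages and selector drift cheaply: a scrape that fills the price field with "Add to cart" text fails the pattern check even though the field is non-null.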

3. Freshness Lag

Definition: time between data source update and your dataset reflecting it.

Formula: median(now() - row.scraped_at)

import pandas as pd

def freshness_lag(df):
    # Assumes scraped_at is stored in UTC; utc=True keeps both sides
    # timezone-aware so the subtraction does not raise
    scraped = pd.to_datetime(df["scraped_at"], utc=True)
    return (pd.Timestamp.now(tz="UTC") - scraped).median()

Threshold: depends on use case. Stock prices: <5 min. Product catalog: <24h. Market research: <7 days. Set explicitly per dataset.
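Those per-dataset thresholds can live in a small config rather than in people's heads. A sketch, with hypothetical dataset names mirroring the examples above:

```python
import pandas as pd

# Hypothetical dataset names; SLAs mirror the thresholds above
FRESHNESS_SLA = {
    "stock_prices": pd.Timedelta("5min"),
    "product_catalog": pd.Timedelta("24h"),
    "market_research": pd.Timedelta("7D"),
}

def freshness_ok(dataset, lag):
    # lag: a pd.Timedelta, e.g. the output of freshness_lag()
    return lag <= FRESHNESS_SLA[dataset]
```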

4. Dedup Rate

Definition: percentage of scraped rows that are duplicates of an existing record.

Formula: (rows - unique rows) / rows

Threshold: <1% on a well-tuned scraper. Higher usually means one of three things: the crawler is revisiting URLs unnecessarily (wasting proxy bandwidth), your duplicate-SKU detection has bugs, or the site genuinely has duplicate listings.
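A minimal sketch of the formula, assuming sku plus url together identify a record — substitute whatever key columns are unique in your schema:

```python
import pandas as pd

def dedup_rate(df, key_cols=("sku", "url")):
    # Rows that repeat an earlier row on the key columns count as duplicates.
    # key_cols is an assumption — use whatever uniquely identifies a record.
    if df.empty:
        return 0.0
    unique = df.drop_duplicates(subset=list(key_cols))
    return (len(df) - len(unique)) / len(df)
```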

5. Schema Drift Count

Definition: number of unexpected fields seen in the last 24h.

When a site changes its layout, your scraper may suddenly find a data-test-id="new-thing" attribute that did not exist before. Track new field names that appear in the scraped output:

seen = set(load_known_fields())    # field names observed on previous runs
today = set(df.columns)
new_fields = today - seen
if new_fields:
    drift_alert(new_fields)        # your alerting hook; also persist the new names

Threshold: 0 new fields per day in steady state. Any drift means the site changed — review the scraper.

6. Decoy Rate

Definition: percentage of "successful" scrapes (HTTP 200) that returned anti-bot decoy pages.

How to measure: after each scrape, run a sentinel check — the page must contain a specific known string (e.g., a brand name, a stable nav element). If missing, the scrape returned a decoy:

SENTINELS = ["Add to cart", "© Acme Corp 2026", "Privacy Policy"]

def is_decoy(html):
    return not any(s in html for s in SENTINELS)

Threshold: <1% on clean residential proxies. Higher rates indicate dirty proxy pool or aggressive anti-bot.

7. Error Rate

Definition: percentage of attempted scrapes that failed at the network or parsing layer.

Formula: (timeouts + 4xx + 5xx + parse errors) / total attempts

Track separately:

  • Connection timeouts (DNS, TCP, TLS handshake)
  • HTTP 4xx (auth/permission, often your IP's fault)
  • HTTP 5xx (server-side, retry will work)
  • Parser errors (HTML structure changed)

Threshold: <5% total error rate. >10% means switch proxies or fix the scraper.
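The four buckets above can be tallied from your scraper's result log. A sketch, assuming a hypothetical record shape of {"status": int | None, "error": str | None} — adapt the keys to whatever your scraper actually emits:

```python
from collections import Counter

def error_breakdown(attempts):
    # attempts: list of dicts like {"status": 200, "error": None} —
    # a hypothetical log shape, not a fixed API
    counts = Counter()
    for a in attempts:
        status = a.get("status") or 0
        if a.get("error") in ("timeout", "dns", "tls"):
            counts["connection"] += 1       # network layer: DNS/TCP/TLS
        elif a.get("error") == "parse":
            counts["parse"] += 1            # HTML structure changed
        elif 400 <= status < 500:
            counts["4xx"] += 1              # often your IP's fault
        elif status >= 500:
            counts["5xx"] += 1              # server-side, retryable
    total = len(attempts)
    overall = sum(counts.values()) / total if total else 0.0
    return dict(counts), overall
```

Keeping the buckets separate matters: a spike in 4xx points at the proxy pool, while a spike in parse errors points at the scraper itself.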

8. Per-Field Coverage

Definition: populated rate for each individual field over time.

def per_field_coverage(df):
    # Populated rate (0..1) for every column
    return df.notna().mean().to_dict()

Useful for catching field-specific drift — e.g., the discount_price field was 80% populated last week, suddenly 0% today (site changed where discounts render).

Threshold: per-field; alert on >10% drop from rolling 7-day average.
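The drop check itself is a dict comparison. A minimal sketch, assuming you persist yesterday's rolling averages alongside today's per_field_coverage output:

```python
def coverage_drops(current, rolling_avg, max_drop=0.10):
    # current / rolling_avg: dicts of field -> populated rate,
    # e.g. the output of per_field_coverage(). Returns fields whose
    # rate fell by more than max_drop versus the rolling average.
    return {
        field: rolling_avg[field] - current.get(field, 0.0)
        for field in rolling_avg
        if rolling_avg[field] - current.get(field, 0.0) > max_drop
    }
```

A field missing entirely from today's batch counts as a 0% rate, so a column the site removed outright is flagged the same way as one that went sparse.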

A Sample Dashboard Layout

Metric                    Current   7d avg    Threshold   Status
Completeness (required)   99.4%     99.6%     >99%        OK
Accuracy (spot check)     96/100    97/100    >95         OK
Freshness lag (median)    2.1h      1.9h      <24h        OK
Dedup rate                0.7%      0.4%      <1%         OK
Schema drift              0 new     0         0           OK
Decoy rate                3.2%      1.1%      <2%         WARN
Error rate                4.8%      3.2%      <5%         OK

The decoy-rate warning above would prompt an immediate investigation — probably proxy pool contamination.
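The Status column can be computed mechanically rather than eyeballed. A sketch, assuming each threshold is stored as a bound plus a direction — the "max"/"min" labels are illustrative, not a standard:

```python
def status(value, threshold, direction):
    # direction "max": metric must stay below threshold (e.g. error rate);
    # direction "min": metric must stay above it (e.g. completeness)
    ok = value < threshold if direction == "max" else value > threshold
    return "OK" if ok else "WARN"
```

For example, a 3.2% decoy rate against a 2% ceiling yields WARN, while 99.4% completeness against a 99% floor yields OK.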

How Proxy Choice Moves These Numbers

Switching from datacenter to Premium Residential typically moves: decoy rate from ~25% to ~3%, error rate from ~12% to ~4%, accuracy from ~85% to ~96%. The clean-proxy premium ($2.75/GB vs $0.50/GB datacenter) is dwarfed by the data quality gain when downstream decisions depend on the data.

Related: Data quality assurance, Cost of poor data quality, Data extraction tools.