Quick verdict: The 8 metrics worth instrumenting on a scraping pipeline: completeness rate, accuracy rate, freshness lag, dedup rate, schema-drift count, decoy rate, error rate, and per-field coverage. Each has a formula and a sensible threshold. Track them daily; alert when any crosses its threshold. Without numbers, "data quality" is a vibes-based discussion.
## 1. Completeness rate

Definition: percentage of required fields populated across all scraped rows.
Formula: sum(non-null required fields) / (rows × required field count)
```python
import pandas as pd

REQUIRED = ["sku", "title", "price", "url"]

def completeness(df: pd.DataFrame) -> float:
    # Fraction of required cells that are non-null across all rows
    total_cells = len(df) * len(REQUIRED)
    populated = sum(df[c].notna().sum() for c in REQUIRED)
    return populated / total_cells
```

Threshold: >99% for required fields. >80% for optional fields. Alert below.
## 2. Accuracy rate

Definition: percentage of values that match a ground-truth source.
Formula: the hardest metric to compute because ground truth is expensive, so you approximate it. The simplest approximation is a manual spot check: sample 100 random rows, verify each against the live source, and score matches / 100.

Threshold: >95% on spot checks. If lower, your proxies are returning decoy data or your selectors drifted.
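A minimal sketch of such a spot-check harness. The `fetch_live_value` helper is hypothetical; how you re-fetch and parse a single field for verification depends on your stack:

```python
def spot_check_accuracy(df, field: str, sample_size: int = 100) -> float:
    # Verify a random sample of scraped values against freshly fetched ones
    sample = df.sample(min(sample_size, len(df)))
    matches = sum(
        fetch_live_value(row["url"], field) == row[field]  # hypothetical helper
        for _, row in sample.iterrows()
    )
    return matches / len(sample)
```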
## 3. Freshness lag

Definition: time between data source update and your dataset reflecting it.
Formula: median(now() - row.scraped_at)
```python
import pandas as pd

def freshness_lag(df: pd.DataFrame) -> pd.Timedelta:
    # Median row age: now (UTC) minus each row's scrape timestamp
    return (pd.Timestamp.now(tz="UTC") - pd.to_datetime(df["scraped_at"], utc=True)).median()
```

Threshold: depends on use case. Stock prices: <5 min. Product catalog: <24h. Market research: <7 days. Set explicitly per dataset.
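One way to make those thresholds explicit is a per-dataset config, as in this sketch (the dataset names are illustrative):

```python
from datetime import timedelta

FRESHNESS_THRESHOLDS = {  # illustrative dataset names
    "stock_prices": timedelta(minutes=5),
    "product_catalog": timedelta(hours=24),
    "market_research": timedelta(days=7),
}
```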
## 4. Dedup rate

Definition: percentage of scraped rows that are duplicates of an existing record.
Formula: (rows - unique rows) / rows
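A one-liner with pandas, assuming `sku` is the natural key (substitute whatever uniquely identifies a record in your dataset):

```python
import pandas as pd

def dedup_rate(df: pd.DataFrame, key: str = "sku") -> float:
    # Fraction of rows that duplicate an earlier row on the key column
    return float(df.duplicated(subset=[key]).mean())
```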
Threshold: <1% on a well-tuned scraper. A higher rate means one of three things: your crawler is revisiting URLs unnecessarily (wasting proxy bandwidth), your duplicate-SKU detection has a bug, or the site has actual duplicate listings.
## 5. Schema-drift count

Definition: number of unexpected fields seen in the last 24h.
When a site changes its layout, your scraper may suddenly find a `data-test-id="new-thing"` attribute that did not exist before. Track new field names that appear in the scraped output:
```python
seen = set(load_known_fields())   # field names recorded on previous runs
today = set(df.columns)
new = today - seen                # fields appearing for the first time
if new:
    drift_alert(new)
```

Threshold: 0 new fields per day in steady state. Any drift means the site changed — review the scraper.
Definition: percentage of "successful" scrapes (HTTP 200) that returned anti-bot decoy pages.
How to measure: after each scrape, run a sentinel check — the page must contain a specific known string (e.g., a brand name, a stable nav element). If missing, the scrape returned a decoy:
```python
SENTINELS = ["Add to cart", "© Acme Corp 2026", "Privacy Policy"]

def is_decoy(html: str) -> bool:
    # If none of the sentinel strings appear, treat the page as a decoy
    return not any(s in html for s in SENTINELS)
```

Threshold: <1% on clean residential proxies. Higher rates indicate a dirty proxy pool or aggressive anti-bot defenses.
## 7. Error rate

Definition: percentage of attempted scrapes that failed at the network or parsing layer.
Formula: (timeouts + 4xx + 5xx + parse errors) / total attempts
Track separately: timeouts, 4xx responses, 5xx responses, and parse errors; each points to a different failure mode.
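A minimal tally, assuming each attempt is logged with one outcome label (the label names here are illustrative, not a fixed schema):

```python
from collections import Counter

FAILURE_LABELS = {"timeout", "4xx", "5xx", "parse_error"}  # illustrative names

def error_rate(outcomes: list[str]) -> float:
    # outcomes holds one label per attempt, e.g. "ok", "timeout", "4xx"
    counts = Counter(outcomes)
    failures = sum(counts[label] for label in FAILURE_LABELS)
    return failures / len(outcomes)
```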
Threshold: <5% total error rate. >10% means switch proxies or fix the scraper.
## 8. Per-field coverage

Definition: populated rate for each individual field over time.
```python
import pandas as pd

def per_field_coverage(df: pd.DataFrame) -> dict[str, float]:
    # Populated (non-null) fraction for every column
    return df.notna().mean().to_dict()
```

Useful for catching field-specific drift — e.g., the `discount_price` field was 80% populated last week, suddenly 0% today (the site changed where discounts render).
Threshold: per-field; alert on >10% drop from rolling 7-day average.
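A sketch of that alert rule, assuming you persist each field's rolling 7-day average coverage (both dicts map field name to populated rate):

```python
def coverage_alerts(today: dict[str, float],
                    rolling_7d: dict[str, float],
                    max_drop: float = 0.10) -> list[str]:
    # Fields whose populated rate fell more than max_drop below the 7-day average
    return [field for field, avg in rolling_7d.items()
            if avg - today.get(field, 0.0) > max_drop]
```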
## Example dashboard

| Metric | Current | 7d avg | Threshold | Status |
|---|---|---|---|---|
| Completeness (required) | 99.4% | 99.6% | >99% | OK |
| Accuracy (spot check) | 96/100 | 97/100 | >95 | OK |
| Freshness lag (median) | 2.1h | 1.9h | <24h | OK |
| Dedup rate | 0.7% | 0.4% | <1% | OK |
| Schema drift | 0 new | 0 | 0 | OK |
| Decoy rate | 3.2% | 1.1% | <2% | WARN |
| Error rate | 4.8% | 3.2% | <5% | OK |
The decoy-rate warning above would prompt an immediate investigation — probably proxy pool contamination.
Switching from datacenter to Premium Residential typically moves decoy rate from ~25% to ~3%, error rate from ~12% to ~4%, and accuracy from ~85% to ~96%. The clean-proxy premium ($2.75/GB vs $0.50/GB datacenter) is dwarfed by the data quality gain when downstream decisions depend on the data.
Related: Data quality assurance, Cost of poor data quality, Data extraction tools.