Quick verdict: The 8 metrics worth instrumenting on a scraping pipeline: completeness rate, accuracy rate, freshness lag, dedup rate, schema-drift count, decoy rate, error rate, and per-field coverage. Each has a formula and a sensible threshold. Track them daily; alert when any crosses its threshold. Without numbers, "data quality" is a vibes-based discussion.
## 1. Completeness rate

Definition: percentage of required fields populated across all scraped rows.
Formula: sum(non-null required fields) / (rows × required field count)
```python
import pandas as pd

REQUIRED = ["sku", "title", "price", "url"]

def completeness(df: pd.DataFrame) -> float:
    # Fraction of required cells that are non-null across all rows
    total_cells = len(df) * len(REQUIRED)
    populated = sum(df[c].notna().sum() for c in REQUIRED)
    return populated / total_cells
```

Threshold: >99% for required fields. >80% for optional fields. Alert below.
## 2. Accuracy rate

Definition: percentage of values that match a ground-truth source.
Formula: the hardest metric to compute because ground truth is expensive, so you approximate it. The simplest approximation is a manual spot check: sample 100 random rows, verify each against the live source, and score matches / 100.

Threshold: >95% on spot checks. If lower, your proxies are returning decoy data or your selectors drifted.
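A minimal sketch of such a spot-check harness. The `fetch_live_value` helper is hypothetical; how you re-fetch and parse a single field for verification depends on your stack:

```python
def spot_check_accuracy(df, field: str, sample_size: int = 100) -> float:
    # Verify a random sample of scraped values against freshly fetched ones
    sample = df.sample(min(sample_size, len(df)))
    matches = sum(
        fetch_live_value(row["url"], field) == row[field]  # hypothetical helper
        for _, row in sample.iterrows()
    )
    return matches / len(sample)
```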
## 3. Freshness lag

Definition: time between data source update and your dataset reflecting it.
Formula: median(now() - row.scraped_at)
```python
import pandas as pd

def freshness_lag(df: pd.DataFrame) -> pd.Timedelta:
    # Median row age: now (UTC) minus each row's scrape timestamp
    return (pd.Timestamp.now(tz="UTC") - pd.to_datetime(df["scraped_at"], utc=True)).median()
```

Threshold: depends on use case. Stock prices: <5 min. Product catalog: <24h. Market research: <7 days. Set explicitly per dataset.
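One way to make those thresholds explicit is a per-dataset config, as in this sketch (the dataset names are illustrative):

```python
from datetime import timedelta

FRESHNESS_THRESHOLDS = {  # illustrative dataset names
    "stock_prices": timedelta(minutes=5),
    "product_catalog": timedelta(hours=24),
    "market_research": timedelta(days=7),
}
```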
## 4. Dedup rate

Definition: percentage of scraped rows that are duplicates of an existing record.
Formula: (rows - unique rows) / rows
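A one-liner with pandas, assuming `sku` is the natural key (substitute whatever uniquely identifies a record in your dataset):

```python
import pandas as pd

def dedup_rate(df: pd.DataFrame, key: str = "sku") -> float:
    # Fraction of rows that duplicate an earlier row on the key column
    return float(df.duplicated(subset=[key]).mean())
```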
Threshold: <1% on a well-tuned scraper. A higher rate means one of three things: your crawler is revisiting URLs unnecessarily (wasting proxy bandwidth), your duplicate-SKU detection has a bug, or the site has actual duplicate listings.
## 5. Schema-drift count

Definition: number of unexpected fields seen in the last 24h.
When a site changes its layout, your scraper may suddenly find a `data-test-id="new-thing"` attribute that did not exist before. Track new field names that appear in the scraped output:
```python
seen = set(load_known_fields())   # field names recorded on previous runs
today = set(df.columns)
new = today - seen                # fields appearing for the first time
if new:
    drift_alert(new)
```

Threshold: 0 new fields per day in steady state. Any drift means the site changed — review the scraper.
Definition: percentage of "successful" scrapes (HTTP 200) that returned anti-bot decoy pages.
How to measure: after each scrape, run a sentinel check — the page must contain a specific known string (e.g., a brand name, a stable nav element). If missing, the scrape returned a decoy:
```python
SENTINELS = ["Add to cart", "© Acme Corp 2026", "Privacy Policy"]

def is_decoy(html: str) -> bool:
    # If none of the sentinel strings appear, treat the page as a decoy
    return not any(s in html for s in SENTINELS)
```

Threshold: <1% on clean residential proxies. Higher rates indicate a dirty proxy pool or aggressive anti-bot defenses.
## 7. Error rate

Definition: percentage of attempted scrapes that failed at the network or parsing layer.
Formula: (timeouts + 4xx + 5xx + parse errors) / total attempts
Track separately: timeouts, 4xx responses, 5xx responses, and parse errors; each points to a different failure mode.
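A minimal tally, assuming each attempt is logged with one outcome label (the label names here are illustrative, not a fixed schema):

```python
from collections import Counter

FAILURE_LABELS = {"timeout", "4xx", "5xx", "parse_error"}  # illustrative names

def error_rate(outcomes: list[str]) -> float:
    # outcomes holds one label per attempt, e.g. "ok", "timeout", "4xx"
    counts = Counter(outcomes)
    failures = sum(counts[label] for label in FAILURE_LABELS)
    return failures / len(outcomes)
```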
Threshold: <5% total error rate. >10% means switch proxies or fix the scraper.
## 8. Per-field coverage

Definition: populated rate for each individual field over time.
```python
import pandas as pd

def per_field_coverage(df: pd.DataFrame) -> dict[str, float]:
    # Populated (non-null) fraction for every column
    return df.notna().mean().to_dict()
```

Useful for catching field-specific drift — e.g., the `discount_price` field was 80% populated last week, suddenly 0% today (the site changed where discounts render).
Threshold: per-field; alert on >10% drop from rolling 7-day average.
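A sketch of that alert rule, assuming you persist each field's rolling 7-day average coverage (both dicts map field name to populated rate):

```python
def coverage_alerts(today: dict[str, float],
                    rolling_7d: dict[str, float],
                    max_drop: float = 0.10) -> list[str]:
    # Fields whose populated rate fell more than max_drop below the 7-day average
    return [field for field, avg in rolling_7d.items()
            if avg - today.get(field, 0.0) > max_drop]
```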
## Example dashboard

| Metric | Current | 7d avg | Threshold | Status |
|---|---|---|---|---|
| Completeness (required) | 99.4% | 99.6% | >99% | OK |
| Accuracy (spot check) | 96/100 | 97/100 | >95 | OK |
| Freshness lag (median) | 2.1h | 1.9h | <24h | OK |
| Dedup rate | 0.7% | 0.4% | <1% | OK |
| Schema drift | 0 new | 0 | 0 | OK |
| Decoy rate | 3.2% | 1.1% | <2% | WARN |
| Error rate | 4.8% | 3.2% | <5% | OK |
The decoy-rate warning above would prompt an immediate investigation — probably proxy pool contamination.
Switching from datacenter to Premium Residential typically moves decoy rate from ~25% to ~3%, error rate from ~12% to ~4%, and accuracy from ~85% to ~96%. The clean-proxy premium ($2.75/GB vs $0.50/GB datacenter) is dwarfed by the data quality gain when downstream decisions depend on the data.
Related: Data quality assurance, Cost of poor data quality, Data extraction tools.