Quick verdict: Data quality assurance (DQA) for web-scraped data has five dimensions: completeness, accuracy, consistency, timeliness, and validity. The biggest source of bad data is anti-bot fakeouts: pages that return 200 OK but carry junk content (CAPTCHA pages, "we noticed you are a bot" interstitials, empty results). The biggest defense is monitoring page structure: if the scraper expects 50 fields and only finds 12, flag the row. Proxy quality matters too. Flagged IPs get served fake or empty pages with a 200 OK; residential pools have a much lower poison rate.
| Dimension | Question | Example check |
|---|---|---|
| Completeness | Are all expected fields present? | price field non-null on 99%+ of records |
| Accuracy | Do values match reality? | price within 50% of last week's value |
| Consistency | Do related fields agree? | same SKU keeps the same title; only price should change |
| Timeliness | Is the data fresh? | scraped within last 24h |
| Validity | Are formats correct? | price is numeric, dates parse, URLs resolve |
Modern anti-bot systems often no longer return 403 or 503. Instead they return 200 OK with a CAPTCHA interstitial, a "we noticed you are a bot" notice, a decoy page, or an empty result set.
Your HTTP-status check passes. Your scraper extracts what it can. Your downstream dashboard now shows wrong numbers.
Defenses: validate content, not status codes. Check for decoy markers and expected page structure before parsing; the dimension-level checks below catch what slips through.
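A minimal sketch of such a pre-parse check, assuming BeautifulSoup is available; the decoy markers, CSS selectors, and threshold are illustrative and need tuning per target site:

```python
from bs4 import BeautifulSoup

# Strings that commonly appear on CAPTCHA/decoy pages (illustrative, tune per site)
DECOY_MARKERS = ["captcha", "unusual traffic", "verify you are a human"]

# Selectors for fields the real page is expected to contain (illustrative)
EXPECTED_SELECTORS = [".product-title", ".price", ".sku", ".availability"]

def looks_like_decoy(html, min_fields=3):
    """Return True if a 200 OK response body looks like a CAPTCHA/decoy page."""
    text = html.lower()
    if any(marker in text for marker in DECOY_MARKERS):
        return True
    soup = BeautifulSoup(html, "html.parser")
    found = sum(1 for sel in EXPECTED_SELECTORS if soup.select_one(sel) is not None)
    return found < min_fields  # structure check: too few expected fields present

# Usage: drop or retry the fetch instead of parsing a decoy
# if looks_like_decoy(response.text): retry_with_new_ip(url)  # retry helper is hypothetical
```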
Required fields should never be null. Compute a per-field completeness score over each scraped batch:
```python
import pandas as pd

REQUIRED = ["sku", "title", "price", "currency", "url"]

def completeness_score(df):
    out = {}
    for col in REQUIRED:
        if col in df.columns:
            out[col] = df[col].notna().mean()
        else:
            out[col] = 0.0
    return out

# Alert if any required field drops below 95% completeness
# (df is the freshly scraped batch; alert() is your notification hook)
for col, score in completeness_score(df).items():
    if score < 0.95:
        alert(f"{col} only {score:.1%} complete")
```

For optional fields, track the rate over time. A sudden drop from 80% to 20% means the site changed structure.
Accuracy is the hardest dimension to verify because "correct" depends on ground truth you usually don't have. A practical approximation: compare against the previous scrape and flag implausible jumps:
```python
def detect_price_outliers(df_today, df_yesterday):
    """Flag SKUs with >50% price change."""
    merged = df_today.merge(df_yesterday, on="sku", suffixes=("_t", "_y"))
    merged["pct_change"] = (merged["price_t"] - merged["price_y"]) / merged["price_y"]
    return merged[merged["pct_change"].abs() > 0.5]
```

Some fields must agree. Example: if you scrape product pages and search-result pages, the same SKU should have the same title in both:
```python
def title_consistency(df1, df2):
    merged = df1.merge(df2, on="sku", suffixes=("_search", "_product"))
    mismatched = merged[merged["title_search"].str.lower() != merged["title_product"].str.lower()]
    return mismatched
```

Other consistency checks: currency symbol matches currency code, stated availability matches inventory count, breadcrumb path matches category field.
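For the first of those, a minimal sketch; the symbol-to-code mapping and the currency_symbol column are assumptions (not part of the required fields above), and the mapping is not exhaustive:

```python
# Illustrative mapping; "$" alone is ambiguous (USD/CAD/AUD), so extend per market
SYMBOL_TO_CODE = {"€": "EUR", "£": "GBP", "¥": "JPY", "$": "USD"}

def currency_mismatches(df):
    """Rows where the scraped currency symbol disagrees with the ISO code."""
    expected = df["currency_symbol"].map(SYMBOL_TO_CODE)  # assumed column
    return df[expected.notna() & (expected != df["currency"])]
```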
Stale data is wrong data. Track scraped_at per row:
```python
from datetime import datetime

def staleness_alert(df, max_age_hours=24):
    now = datetime.utcnow()
    df["age_hours"] = (now - pd.to_datetime(df["scraped_at"])).dt.total_seconds() / 3600
    stale = df[df["age_hours"] > max_age_hours]
    if len(stale) > 0:
        alert(f"{len(stale)} records older than {max_age_hours}h")
```

For real-time price monitoring, hours matter. For monthly market research, days. Set thresholds per dataset.
```python
import re

def validate_row(row):
    errors = []
    if not isinstance(row["price"], (int, float)) or row["price"] < 0:
        errors.append("price missing, non-numeric, or negative")
    if not re.match(r"^https?://", str(row["url"])):
        errors.append("url not http(s)")
    if not re.match(r"^[A-Z]{3}$", str(row["currency"])):
        errors.append("currency not ISO 4217")
    return errors
```

Apply per row, flag failures, quarantine for manual review.
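A minimal sketch of that step, reusing validate_row; the quarantine CSV destination is illustrative:

```python
import pandas as pd

def split_valid_invalid(df, quarantine_path="quarantine.csv"):
    """Keep clean rows; write failing rows plus their errors out for manual review."""
    error_lists = [validate_row(row) for _, row in df.iterrows()]
    bad_mask = pd.Series([len(e) > 0 for e in error_lists], index=df.index)
    quarantined = df[bad_mask].copy()
    quarantined["errors"] = ["; ".join(e) for e in error_lists if e]
    quarantined.to_csv(quarantine_path, index=False)
    return df[~bad_mask]
```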
The chain: bad proxy → flagged by site → site returns decoy/empty/CAPTCHA → scraper "succeeds" with bad data → bad data in dashboard.
| Proxy type | Decoy/empty rate (typical) | Cost per clean GB |
|---|---|---|
| Free public proxies | 40-80% | Free (but unusable) |
| Datacenter | 20-50% on protected sites | Very low (but for protected sites: high effective cost) |
| Budget residential | 5-15% | $1.75/GB |
| Premium residential | 1-5% | $2.75/GB |
| LTE mobile | <1% | $2/IP unlimited |
For data-quality-sensitive workloads, the higher per-GB cost of clean proxies is offset by drastically fewer false positives downstream. A 10% decoy rate means 10% of your dashboard is wrong — far more expensive than the GB markup.
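A back-of-the-envelope sketch of that trade-off, treating the table's prices as nominal per-GB rates and using mid-range decoy rates (illustrative figures, not quotes):

```python
def effective_cost_per_clean_gb(price_per_gb, decoy_rate):
    """Nominal price divided by the fraction of requests that return clean data."""
    return price_per_gb / (1 - decoy_rate)

print(effective_cost_per_clean_gb(1.75, 0.10))  # budget residential: ~$1.94 per clean GB
print(effective_cost_per_clean_gb(2.75, 0.03))  # premium residential: ~$2.84 per clean GB
# This ignores the downstream cost of the decoy rows that still reach the dashboard.
```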
Related: Data quality metrics, Cost of poor data quality, Python requests retry.