
Data Aggregation: From Web Scraping to Decision Data (2026)

Alex R. | Published Sun May 10 2026

Quick verdict: Data aggregation is the process of pulling data from many sources, normalizing it to a common schema, deduplicating, transforming, and outputting a decision-ready dataset. For web-scraped sources, the pipeline has five stages: collect → normalize → dedupe → transform → output. Each stage has infrastructure decisions: which proxies (collection), which storage (lake vs warehouse), which framework (Spark, dbt, Airflow). Cost at scale: ~$500-2,000/month for 100M-row aggregations.

What Data Aggregation Actually Means

Two definitions get conflated:

  1. Statistical aggregation — computing sum/avg/count over a dataset. The SQL GROUP BY sense.
  2. Source aggregation — pulling data from many sources into one place. The "Bloomberg Terminal" sense.

This guide covers the second — the engineering pipeline for combining web-scraped sources into a single, queryable dataset. The first is what you do AFTER aggregation.

The 5-Stage Pipeline

Stage | Purpose | Output
1. Collect | Scrape sources | Raw HTML/JSON per source
2. Normalize | Parse to common schema | Structured records
3. Dedupe | Identify duplicates across sources | Unique records
4. Transform | Enrich, compute, validate | Decision-ready data
5. Output | Load to warehouse / API / file | Consumable dataset

Stage 1: Collect

Web sources need scrapers; structured sources need API clients. For aggregation, you usually have a mix of both. Key decisions:

  • Proxy type per source: public APIs need no proxy; protected sites need residential; geo-locked sources need geo-specific residential
  • Schedule: stock prices = every minute; product catalogs = daily; static reference data = monthly
  • Storage: raw HTML in object storage (S3, GCS), structured JSON in a queue or staging table

For mixed scraping workloads, SpyderProxy's pricing favors aggregation:

Workload type | Best proxy | Cost basis
High-volume product catalogs | Budget Residential | $1.75/GB
Protected sites (Cloudflare, DataDome) | Premium Residential | $2.75/GB
Static reference data | ISP / Static Residential | $3.90/day
Mass-volume datacenter targets | Static Datacenter | $1.50/proxy/month
Account-based or LTE-only sources | LTE Mobile | $2/IP unlimited
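
Those per-source decisions (proxy type, schedule, raw storage) are easiest to keep track of as explicit config. A minimal sketch, assuming a hypothetical SOURCES dict; the source names, proxy labels, and bucket paths are illustrative, not SpyderProxy API values:

# Per-source collection config: proxy tier, schedule, and raw landing zone.
# All names and paths below are illustrative assumptions.
SOURCES = {
    "catalog_site_a": {
        "kind": "scrape",
        "proxy": "budget_residential",   # high-volume catalog pages
        "schedule": "daily",
        "raw_store": "s3://agg-raw/catalog_site_a/",
    },
    "protected_site_b": {
        "kind": "scrape",
        "proxy": "premium_residential",  # Cloudflare / DataDome in front
        "schedule": "daily",
        "raw_store": "s3://agg-raw/protected_site_b/",
    },
    "prices_api_c": {
        "kind": "api",
        "proxy": None,                   # public API, no proxy needed
        "schedule": "every_minute",      # stock-price-style freshness
        "raw_store": "s3://agg-raw/prices_api_c/",
    },
}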

Stage 2: Normalize

Different sources represent the same concept differently. A "smartphone" might be:

  • Source A: {"name": "iPhone 15 Pro Max 256GB", "color": "natural titanium"}
  • Source B: {"product": "Apple iPhone 15 Pro Max - 256 GB - Natural Titanium"}
  • Source C: {"title": "iPhone 15 Pro Max 256gb Natural Titanium Unlocked"}

Normalize to: {"brand": "Apple", "model": "iPhone 15 Pro Max", "storage_gb": 256, "color": "Natural Titanium"}

Techniques:

  • Regex extraction for predictable patterns (sizes, colors, model numbers)
  • LLM-based extraction for unstructured text (e.g., GPT-4 with structured output)
  • Embedding similarity for category matching ("phone" / "cell phone" / "mobile" → same)
  • Reference data lookups (brand from name, country from city)
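
A minimal sketch combining the first and last techniques, regex extraction plus a reference-data lookup. The brand set, field names, and the normalize_product function are assumptions for illustration; LLM and embedding fallbacks are not shown:

import re

KNOWN_BRANDS = {"apple", "samsung", "google"}  # tiny reference-data lookup (illustrative)

def normalize_product(raw_title: str) -> dict:
    """Parse one raw product title into the common schema."""
    title = raw_title.strip()
    # Regex extraction for a predictable pattern: storage size like "256GB" or "256 GB".
    storage = re.search(r"(\d+)\s*gb\b", title, flags=re.IGNORECASE)
    # Reference-data lookup: treat the first token as a brand candidate.
    first_word = title.split()[0].lower()
    brand = first_word.capitalize() if first_word in KNOWN_BRANDS else None
    return {
        "brand": brand,
        "title": title,
        "storage_gb": int(storage.group(1)) if storage else None,
    }

print(normalize_product("Apple iPhone 15 Pro Max - 256 GB - Natural Titanium"))
# -> {'brand': 'Apple', 'title': '...', 'storage_gb': 256}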

Stage 3: Deduplicate

The same record appearing in multiple sources should collapse to one. This is the hardest stage because identifiers rarely match cleanly across sources. Strategies:

  • Exact match on a canonical key (ISBN, UPC, ASIN) — ideal but rare in scraped data
  • Composite key like (brand, model, storage_gb, color) after normalization
  • Fuzzy matching on normalized titles using Levenshtein, Jaccard, or sentence embeddings
  • Manual review for ambiguous cases (recommended: send unmatched pairs to a review queue)

A minimal fuzzy-match check with rapidfuzz:
from rapidfuzz import fuzz

def is_match(a, b, threshold=85):
    """Two normalized records likely the same."""
    title_score = fuzz.token_set_ratio(a["title"], b["title"])
    brand_match = a["brand"].lower() == b["brand"].lower()
    return brand_match and title_score >= threshold

For scale, block by brand first (only compare records that share a brand); this cuts the comparison count from O(n^2) to O(brands * avg_per_brand^2), as in the sketch below.
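
A sketch of that blocking step, reusing is_match from the snippet above; records are assumed to follow the normalized schema from Stage 2:

from collections import defaultdict
from itertools import combinations

def find_duplicate_pairs(records):
    """Compare only records that share a brand instead of all n^2 pairs."""
    blocks = defaultdict(list)
    for rec in records:
        if rec["brand"]:                      # unbranded records go to manual review instead
            blocks[rec["brand"].lower()].append(rec)
    pairs = []
    for block in blocks.values():
        # Pairwise comparison stays inside each (much smaller) brand block.
        for a, b in combinations(block, 2):
            if is_match(a, b):
                pairs.append((a, b))
    return pairs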

Stage 4: Transform

Compute the derived fields decisions actually need. Common transforms:

  • Currency conversion to a single base (USD)
  • Price comparison: min/median/max across sources
  • Availability rollup: "in stock somewhere" = OR across sources
  • Trend computation: 7-day moving average, price velocity
  • Source ranking: which source has the lowest price, fastest shipping, etc.

Tools for this stage:

  • dbt — SQL-based transforms in a warehouse
  • Apache Spark — for very large datasets that do not fit a single warehouse
  • Plain Python + pandas — for <10M rows on a single machine
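
At the pandas end of that range, the price-comparison and availability rollups above collapse into a single groupby. A sketch with illustrative column names:

import pandas as pd

# One row per (product, source) pair coming out of the dedupe stage.
offers = pd.DataFrame({
    "product_id": ["p1", "p1", "p2"],
    "source":     ["site_a", "site_b", "site_a"],
    "price_usd":  [999.0, 949.0, 499.0],   # already converted to the USD base
    "in_stock":   [True, False, True],
})

summary = offers.groupby("product_id").agg(
    min_price=("price_usd", "min"),
    median_price=("price_usd", "median"),
    max_price=("price_usd", "max"),
    in_stock_somewhere=("in_stock", "any"),   # availability rollup: OR across sources
    source_count=("source", "nunique"),
)
print(summary)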

Stage 5: Output

Three common consumption patterns:

  • Warehouse table — BI tools (Tableau, Looker) query directly. Best for analyst-driven exploration.
  • REST API — products consume via API. Best for embedding in apps.
  • File export — CSV/Parquet on S3. Best for periodic batch consumers.
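
For the file-export path, a minimal sketch continuing from the summary frame above; the bucket path and partition key are assumptions, and writing to S3 needs pyarrow plus s3fs installed:

# Publish the decision-ready table as Parquet for periodic batch consumers.
summary.to_parquet(
    "s3://agg-output/product_summary/snapshot_date=2026-05-10/data.parquet",
    index=True,   # keep product_id (the groupby index) readable on load
)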

Orchestrating It All

Each stage's output feeds the next stage's input; the orchestrator runs the stages on a schedule with retry logic.
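
A framework-agnostic sketch of that loop; in practice Airflow or Prefect supplies the scheduling, retries, and observability, and the stage functions below are placeholders:

import time

# Placeholder stage functions; the real implementations are the stages above.
def collect(sources):   return {"raw": len(sources)}
def normalize(raw):     return {"records": raw}
def dedupe(records):    return {"unique": records}
def transform(unique):  return {"enriched": unique}
def output(enriched):   return {"published": True}

def run_with_retries(stage_fn, stage_input, retries=3, delay_s=60):
    """Run one stage, retrying on failure before failing the whole run."""
    for attempt in range(1, retries + 1):
        try:
            return stage_fn(stage_input)
        except Exception as exc:
            if attempt == retries:
                raise
            print(f"{stage_fn.__name__} failed ({exc!r}), retry {attempt}/{retries}")
            time.sleep(delay_s)

def run_pipeline(sources):
    """Each stage's output is the next stage's input."""
    raw      = run_with_retries(collect, sources)      # 1. collect
    records  = run_with_retries(normalize, raw)        # 2. normalize
    unique   = run_with_retries(dedupe, records)       # 3. dedupe
    enriched = run_with_retries(transform, unique)     # 4. transform
    return run_with_retries(output, enriched)          # 5. output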

Real-World Cost Math

Aggregating 100M product records monthly from 10 sources:

Stage | Cost driver | Estimated monthly
Collect (residential proxies) | ~50 GB at $2.75/GB | $140
Collect (compute) | Scraper VMs (3 workers) | $150
Storage (raw + structured) | S3 + warehouse | $80
Transform (warehouse compute) | BigQuery / Snowflake | $300
Orchestration (Airflow) | Hosted Airflow / Prefect | $200
Total | | ~$870/mo

For 1B records: roughly 3-5x ($2.5-4.5K/month). The proxy and warehouse compute scale linearly; orchestration is roughly fixed.
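
That scaling claim is easy to sanity-check with a toy cost model using the unit costs from the table above. The 1B-record inputs (GB scraped, warehouse spend, worker count) are assumptions chosen to illustrate the linear-plus-fixed structure, and the storage line is held flat for simplicity:

def monthly_cost_usd(gb_scraped, warehouse_usd, workers=3):
    """Proxies, scraper VMs, and warehouse compute scale with volume; orchestration is fixed."""
    proxies       = gb_scraped * 2.75   # premium residential, $/GB
    scraper_vms   = workers * 50        # ~$50 per scraper VM per month
    storage       = 80                  # S3 + warehouse storage
    orchestration = 200                 # hosted Airflow / Prefect
    return proxies + scraper_vms + storage + warehouse_usd + orchestration

print(monthly_cost_usd(gb_scraped=50,  warehouse_usd=300))              # ~870   (100M records)
print(monthly_cost_usd(gb_scraped=500, warehouse_usd=1500, workers=9))  # ~3,600 (1B records)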

Don't Forget Quality Checks

Aggregated data without quality checks is worse than no data — bad decisions look authoritative because they came from a "data pipeline." Bake checks into every stage boundary.
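
Which checks matter depends on the dataset, but here is a sketch of the kind of gate worth running between transform and output; the column names and thresholds are assumptions:

def quality_gate(df, previous_row_count):
    """Fail the run instead of publishing suspect data (df is the aggregated pandas frame)."""
    issues = []
    # Volume: a big drop usually means a scraper or parser broke upstream, not a real change.
    if len(df) < 0.5 * previous_row_count:
        issues.append(f"row count fell to {len(df)} from {previous_row_count}")
    # Completeness: key fields should rarely be null after normalization.
    null_rate = df["brand"].isna().mean()
    if null_rate > 0.05:
        issues.append(f"brand null rate {null_rate:.1%} exceeds 5%")
    # Sanity: non-positive prices are parsing bugs, not bargains.
    if (df["min_price"] <= 0).any():
        issues.append("non-positive prices present")
    if issues:
        raise ValueError("quality gate failed: " + "; ".join(issues))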

Related: Data quality assurance, Data extraction tools, Web scraping for e-commerce.