
Data Aggregation: From Web Scraping to Decision Data (2026)

Alex R. | Published Sun May 10 2026

Quick verdict: Data aggregation is the process of pulling data from many sources, normalizing it to a common schema, deduplicating, transforming, and outputting a decision-ready dataset. For web-scraped sources, the pipeline has five stages: collect → normalize → dedupe → transform → output. Each stage has infrastructure decisions: which proxies (collection), which storage (lake vs warehouse), which framework (Spark, dbt, Airflow). Cost at scale: ~$500-2,000/month for 100M-row aggregations.

What Data Aggregation Actually Means

Two definitions get conflated:

  1. Statistical aggregation — computing sum/avg/count over a dataset. The SQL GROUP BY sense.
  2. Source aggregation — pulling data from many sources into one place. The "Bloomberg Terminal" sense.

This guide covers the second — the engineering pipeline for combining web-scraped sources into a single, queryable dataset. The first is what you do AFTER aggregation.

The 5-Stage Pipeline

Stage | Purpose | Output
1. Collect | Scrape sources | Raw HTML/JSON per source
2. Normalize | Parse to common schema | Structured records
3. Dedupe | Identify duplicates across sources | Unique records
4. Transform | Enrich, compute, validate | Decision-ready data
5. Output | Load to warehouse / API / file | Consumable dataset

Stage 1: Collect

Web sources need scrapers; structured sources need API clients. For aggregation, you usually have a mix of both. Key decisions:

  • Proxy type per source: public APIs need no proxy; protected sites need residential; geo-locked sources need geo-specific residential
  • Schedule: stock prices = every minute; product catalogs = daily; static reference data = monthly
  • Storage: raw HTML in object storage (S3, GCS), structured JSON in a queue or staging table

For mixed scraping workloads, SpyderProxy's pricing favors aggregation:

Workload type | Best proxy | Cost basis
High-volume product catalogs | Budget Residential | $1.75/GB
Protected sites (Cloudflare, DataDome) | Premium Residential | $2.75/GB
Static reference data | ISP / Static Residential | $3.90/day
Mass-volume datacenter targets | Static Datacenter | $1.50/proxy/month
Account-based or LTE-only sources | LTE Mobile | $2/IP unlimited
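
Those per-source decisions (proxy type, schedule, raw storage) are easiest to keep track of as explicit config. A minimal sketch, assuming a hypothetical SOURCES dict; the source names, proxy labels, and bucket paths are illustrative, not SpyderProxy API values:

# Per-source collection config: proxy tier, schedule, and raw landing zone.
# All names and paths below are illustrative assumptions.
SOURCES = {
    "catalog_site_a": {
        "kind": "scrape",
        "proxy": "budget_residential",   # high-volume catalog pages
        "schedule": "daily",
        "raw_store": "s3://agg-raw/catalog_site_a/",
    },
    "protected_site_b": {
        "kind": "scrape",
        "proxy": "premium_residential",  # Cloudflare / DataDome in front
        "schedule": "daily",
        "raw_store": "s3://agg-raw/protected_site_b/",
    },
    "prices_api_c": {
        "kind": "api",
        "proxy": None,                   # public API, no proxy needed
        "schedule": "every_minute",      # stock-price-style freshness
        "raw_store": "s3://agg-raw/prices_api_c/",
    },
}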

Stage 2: Normalize

Different sources represent the same concept differently. A "smartphone" might be:

  • Source A: {"name": "iPhone 15 Pro Max 256GB", "color": "natural titanium"}
  • Source B: {"product": "Apple iPhone 15 Pro Max - 256 GB - Natural Titanium"}
  • Source C: {"title": "iPhone 15 Pro Max 256gb Natural Titanium Unlocked"}

Normalize to: {"brand": "Apple", "model": "iPhone 15 Pro Max", "storage_gb": 256, "color": "Natural Titanium"}

Techniques:

  • Regex extraction for predictable patterns (sizes, colors, model numbers)
  • LLM-based extraction for unstructured text (e.g., GPT-4 with structured output)
  • Embedding similarity for category matching ("phone" / "cell phone" / "mobile" → same)
  • Reference data lookups (brand from name, country from city)
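
A minimal sketch combining the first and last techniques, regex extraction plus a reference-data lookup. The brand set, field names, and the normalize_product function are assumptions for illustration; LLM and embedding fallbacks are not shown:

import re

KNOWN_BRANDS = {"apple", "samsung", "google"}  # tiny reference-data lookup (illustrative)

def normalize_product(raw_title: str) -> dict:
    """Parse one raw product title into the common schema."""
    title = raw_title.strip()
    # Regex extraction for a predictable pattern: storage size like "256GB" or "256 GB".
    storage = re.search(r"(\d+)\s*gb\b", title, flags=re.IGNORECASE)
    # Reference-data lookup: treat the first token as a brand candidate.
    first_word = title.split()[0].lower()
    brand = first_word.capitalize() if first_word in KNOWN_BRANDS else None
    return {
        "brand": brand,
        "title": title,
        "storage_gb": int(storage.group(1)) if storage else None,
    }

print(normalize_product("Apple iPhone 15 Pro Max - 256 GB - Natural Titanium"))
# -> {'brand': 'Apple', 'title': '...', 'storage_gb': 256}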

Stage 3: Deduplicate

The same record appearing in multiple sources should collapse to one. This is the hardest stage because identifiers rarely match cleanly across sources. Strategies:

  • Exact match on a canonical key (ISBN, UPC, ASIN) — ideal but rare in scraped data
  • Composite key like (brand, model, storage_gb, color) after normalization
  • Fuzzy matching on normalized titles using Levenshtein, Jaccard, or sentence embeddings
  • Manual review for ambiguous cases (recommended: send unmatched pairs to a review queue)

A minimal fuzzy-match check with rapidfuzz:
from rapidfuzz import fuzz

def is_match(a, b, threshold=85):
    """Two normalized records likely the same."""
    title_score = fuzz.token_set_ratio(a["title"], b["title"])
    brand_match = a["brand"].lower() == b["brand"].lower()
    return brand_match and title_score >= threshold

For scale, block by brand first (only compare records that share a brand); this cuts the comparison count from O(n^2) to O(brands * avg_per_brand^2), as in the sketch below.
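
A sketch of that blocking step, reusing is_match from the snippet above; records are assumed to follow the normalized schema from Stage 2:

from collections import defaultdict
from itertools import combinations

def find_duplicate_pairs(records):
    """Compare only records that share a brand instead of all n^2 pairs."""
    blocks = defaultdict(list)
    for rec in records:
        if rec["brand"]:                      # unbranded records go to manual review instead
            blocks[rec["brand"].lower()].append(rec)
    pairs = []
    for block in blocks.values():
        # Pairwise comparison stays inside each (much smaller) brand block.
        for a, b in combinations(block, 2):
            if is_match(a, b):
                pairs.append((a, b))
    return pairs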

Stage 4: Transform

Compute the derived fields decisions actually need. Common transforms:

  • Currency conversion to a single base (USD)
  • Price comparison: min/median/max across sources
  • Availability rollup: "in stock somewhere" = OR across sources
  • Trend computation: 7-day moving average, price velocity
  • Source ranking: which source has the lowest price, fastest shipping, etc.

Tools for this stage:

  • dbt — SQL-based transforms in a warehouse
  • Apache Spark — for very large datasets that do not fit a single warehouse
  • Plain Python + pandas — for <10M rows on a single machine
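
At the pandas end of that range, the price-comparison and availability rollups above collapse into a single groupby. A sketch with illustrative column names:

import pandas as pd

# One row per (product, source) pair coming out of the dedupe stage.
offers = pd.DataFrame({
    "product_id": ["p1", "p1", "p2"],
    "source":     ["site_a", "site_b", "site_a"],
    "price_usd":  [999.0, 949.0, 499.0],   # already converted to the USD base
    "in_stock":   [True, False, True],
})

summary = offers.groupby("product_id").agg(
    min_price=("price_usd", "min"),
    median_price=("price_usd", "median"),
    max_price=("price_usd", "max"),
    in_stock_somewhere=("in_stock", "any"),   # availability rollup: OR across sources
    source_count=("source", "nunique"),
)
print(summary)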

Stage 5: Output

Three common consumption patterns:

  • Warehouse table — BI tools (Tableau, Looker) query directly. Best for analyst-driven exploration.
  • REST API — products consume via API. Best for embedding in apps.
  • File export — CSV/Parquet on S3. Best for periodic batch consumers.
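
For the file-export path, a minimal sketch continuing from the summary frame above; the bucket path and partition key are assumptions, and writing to S3 needs pyarrow plus s3fs installed:

# Publish the decision-ready table as Parquet for periodic batch consumers.
summary.to_parquet(
    "s3://agg-output/product_summary/snapshot_date=2026-05-10/data.parquet",
    index=True,   # keep product_id (the groupby index) readable on load
)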

Orchestrating It All

Each stage's output feeds the next stage's input; the orchestrator runs the stages on a schedule with retry logic.
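
A framework-agnostic sketch of that loop; in practice Airflow or Prefect supplies the scheduling, retries, and observability, and the stage functions below are placeholders:

import time

# Placeholder stage functions; the real implementations are the stages above.
def collect(sources):   return {"raw": len(sources)}
def normalize(raw):     return {"records": raw}
def dedupe(records):    return {"unique": records}
def transform(unique):  return {"enriched": unique}
def output(enriched):   return {"published": True}

def run_with_retries(stage_fn, stage_input, retries=3, delay_s=60):
    """Run one stage, retrying on failure before failing the whole run."""
    for attempt in range(1, retries + 1):
        try:
            return stage_fn(stage_input)
        except Exception as exc:
            if attempt == retries:
                raise
            print(f"{stage_fn.__name__} failed ({exc!r}), retry {attempt}/{retries}")
            time.sleep(delay_s)

def run_pipeline(sources):
    """Each stage's output is the next stage's input."""
    raw      = run_with_retries(collect, sources)      # 1. collect
    records  = run_with_retries(normalize, raw)        # 2. normalize
    unique   = run_with_retries(dedupe, records)       # 3. dedupe
    enriched = run_with_retries(transform, unique)     # 4. transform
    return run_with_retries(output, enriched)          # 5. output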

Real-World Cost Math

Aggregating 100M product records monthly from 10 sources:

Stage | Cost driver | Estimated monthly
Collect (residential proxies) | ~50 GB at $2.75/GB | $140
Collect (compute) | Scraper VMs (3 workers) | $150
Storage (raw + structured) | S3 + warehouse | $80
Transform (warehouse compute) | BigQuery / Snowflake | $300
Orchestration (Airflow) | Hosted Airflow / Prefect | $200
Total | | ~$870/mo

For 1B records: roughly 3-5x ($2.5-4.5K/month). The proxy and warehouse compute scale linearly; orchestration is roughly fixed.
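
That scaling claim is easy to sanity-check with a toy cost model using the unit costs from the table above. The 1B-record inputs (GB scraped, warehouse spend, worker count) are assumptions chosen to illustrate the linear-plus-fixed structure, and the storage line is held flat for simplicity:

def monthly_cost_usd(gb_scraped, warehouse_usd, workers=3):
    """Proxies, scraper VMs, and warehouse compute scale with volume; orchestration is fixed."""
    proxies       = gb_scraped * 2.75   # premium residential, $/GB
    scraper_vms   = workers * 50        # ~$50 per scraper VM per month
    storage       = 80                  # S3 + warehouse storage
    orchestration = 200                 # hosted Airflow / Prefect
    return proxies + scraper_vms + storage + warehouse_usd + orchestration

print(monthly_cost_usd(gb_scraped=50,  warehouse_usd=300))              # ~870   (100M records)
print(monthly_cost_usd(gb_scraped=500, warehouse_usd=1500, workers=9))  # ~3,600 (1B records)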

Don't Forget Quality Checks

Aggregated data without quality checks is worse than no data — bad decisions look authoritative because they came from a "data pipeline." Bake checks into every stage boundary.
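
Which checks matter depends on the dataset, but here is a sketch of the kind of gate worth running between transform and output; the column names and thresholds are assumptions:

def quality_gate(df, previous_row_count):
    """Fail the run instead of publishing suspect data (df is the aggregated pandas frame)."""
    issues = []
    # Volume: a big drop usually means a scraper or parser broke upstream, not a real change.
    if len(df) < 0.5 * previous_row_count:
        issues.append(f"row count fell to {len(df)} from {previous_row_count}")
    # Completeness: key fields should rarely be null after normalization.
    null_rate = df["brand"].isna().mean()
    if null_rate > 0.05:
        issues.append(f"brand null rate {null_rate:.1%} exceeds 5%")
    # Sanity: non-positive prices are parsing bugs, not bargains.
    if (df["min_price"] <= 0).any():
        issues.append("non-positive prices present")
    if issues:
        raise ValueError("quality gate failed: " + "; ".join(issues))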

Related: Data quality assurance, Data extraction tools, Web scraping for e-commerce.