Quick verdict: Data aggregation is the process of pulling data from many sources, normalizing it to a common schema, deduplicating, transforming, and outputting a decision-ready dataset. For web-scraped sources, the pipeline has five stages: collect → normalize → dedupe → transform → output. Each stage has infrastructure decisions: which proxies (collection), which storage (lake vs warehouse), which framework (Spark, dbt, Airflow). Cost at scale: ~$500-2,000/month for 100M-row aggregations.
Two definitions get conflated:

- The analytics sense: summarizing rows into totals and averages, i.e. the SQL GROUP BY sense.
- The engineering sense: combining data from multiple sources into one dataset.

This guide covers the second: the engineering pipeline for combining web-scraped sources into a single, queryable dataset. The first is what you do AFTER aggregation.
| Stage | Purpose | Output |
|---|---|---|
| 1. Collect | Scrape sources | Raw HTML/JSON per source |
| 2. Normalize | Parse to common schema | Structured records |
| 3. Dedupe | Identify duplicates across sources | Unique records |
| 4. Transform | Enrich, compute, validate | Decision-ready data |
| 5. Output | Load to warehouse / API / file | Consumable dataset |
Web sources need scrapers; structured sources need API clients. For aggregation, you usually have a mix of both. The key infrastructure decision at this stage is proxy selection per source.
For mixed scraping workloads, SpyderProxy's pricing favors aggregation:
| Workload type | Best proxy | Cost basis |
|---|---|---|
| High-volume product catalogs | Budget Residential | $1.75/GB |
| Protected sites (Cloudflare, DataDome) | Premium Residential | $2.75/GB |
| Static reference data | ISP / Static Residential | $3.90/day |
| Mass-volume datacenter targets | Static Datacenter | $1.50/proxy/month |
| Account-based or LTE-only sources | LTE Mobile | $2/IP unlimited |
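To make the collection stage concrete, here is a minimal sketch: each source is fetched through a rotating proxy gateway and the raw payload is stored untouched for later normalization. The proxy URL, source list, and file layout are illustrative assumptions, not actual SpyderProxy endpoints.

```python
import pathlib
import requests

# Placeholder proxy gateway and sources; swap in real credentials and targets.
PROXY = {
    "http": "http://user:pass@gate.example-proxy.com:8000",
    "https": "http://user:pass@gate.example-proxy.com:8000",
}

SOURCES = [
    {"name": "shop_a", "url": "https://example-shop-a.com/products?page=1", "kind": "html"},
    {"name": "shop_b", "url": "https://example-shop-b.com/api/v1/products", "kind": "json"},
]

def collect(raw_dir="raw"):
    """Fetch each source through the proxy and store the raw payload per source."""
    out = pathlib.Path(raw_dir)
    out.mkdir(exist_ok=True)
    for src in SOURCES:
        resp = requests.get(src["url"], proxies=PROXY, timeout=30)
        resp.raise_for_status()
        suffix = "json" if src["kind"] == "json" else "html"
        (out / f"{src['name']}.{suffix}").write_text(resp.text, encoding="utf-8")
```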
Different sources represent the same concept differently. A "smartphone" might be:
{"name": "iPhone 15 Pro Max 256GB", "color": "natural titanium"}{"product": "Apple iPhone 15 Pro Max - 256 GB - Natural Titanium"}{"title": "iPhone 15 Pro Max 256gb Natural Titanium Unlocked"}Normalize to: {"brand": "Apple", "model": "iPhone 15 Pro Max", "storage_gb": 256, "color": "Natural Titanium"}
Techniques:
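Typical techniques include regex extraction for structured attributes (storage, color), lookup tables for brand aliases, and case normalization. A minimal Python sketch of that idea; the regexes, alias table, and field names are illustrative, not a complete parser:

```python
import re

# Map common brand aliases to a canonical brand name (illustrative, not exhaustive).
BRAND_ALIASES = {"apple": "Apple", "iphone": "Apple", "samsung": "Samsung"}

def normalize_title(raw_title: str) -> dict:
    """Parse a raw product title into the shared schema (best-effort sketch)."""
    title = raw_title.strip()

    # Storage: "256GB", "256 GB", "256gb" -> 256
    storage = None
    m = re.search(r"(\d+)\s*gb", title, re.IGNORECASE)
    if m:
        storage = int(m.group(1))

    # Brand: first alias found in the title wins.
    brand = None
    lowered = title.lower()
    for alias, canonical in BRAND_ALIASES.items():
        if alias in lowered:
            brand = canonical
            break

    return {"brand": brand, "storage_gb": storage, "raw_title": title}

# normalize_title("Apple iPhone 15 Pro Max - 256 GB - Natural Titanium")
# -> {"brand": "Apple", "storage_gb": 256, "raw_title": "Apple iPhone 15 Pro Max - 256 GB - Natural Titanium"}
```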
The same record from multiple sources should collapse to one. Hardest step because identifiers rarely match cleanly. Strategies:
- Exact match on a composite key: (brand, model, storage_gb, color) after normalization.
- Fuzzy match on titles when the composite key disagrees or is incomplete:

```python
from rapidfuzz import fuzz

def is_match(a, b, threshold=85):
    """Return True if two normalized records are likely the same product."""
    title_score = fuzz.token_set_ratio(a["title"], b["title"])
    brand_match = a["brand"].lower() == b["brand"].lower()
    return brand_match and title_score >= threshold
```

For scale, block by brand first (only compare records with the same brand); this reduces O(n^2) comparisons to O(brands * avg_per_brand^2).
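A blocking pass might look like the following sketch: group records by brand, then run `is_match` only within each group. Helper and field names follow the snippet above.

```python
from collections import defaultdict
from itertools import combinations

def find_duplicate_pairs(records):
    """Group records by brand, then compare only within each group."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["brand"].lower()].append(rec)

    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if is_match(a, b):  # is_match from the snippet above
                pairs.append((a, b))
    return pairs
```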
Compute the derived fields that decisions actually need. Common transforms:
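For example, minimum price across sources, price per GB, and a validity flag. The field names and rules below are illustrative, not prescribed by this guide:

```python
def transform(records):
    """Add derived fields and validity flags to deduplicated records (illustrative rules)."""
    enriched = []
    for rec in records:
        prices = [offer["price"] for offer in rec.get("offers", []) if offer.get("price")]
        rec["min_price"] = min(prices) if prices else None
        rec["price_per_gb"] = (
            round(rec["min_price"] / rec["storage_gb"], 2)
            if rec["min_price"] and rec.get("storage_gb")
            else None
        )
        # Validity: require the fields downstream decisions depend on.
        rec["is_valid"] = all([rec.get("brand"), rec.get("model"), rec["min_price"]])
        enriched.append(rec)
    return enriched
```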
Tools for this stage: dbt for SQL-defined transforms, Spark for heavy distributed jobs, or plain warehouse SQL for simple pipelines.
Three common consumption patterns:

- Warehouse table (BigQuery, Snowflake) for BI tools and ad-hoc SQL.
- Internal API that serves the aggregated records to applications.
- File export (CSV or Parquet) for downstream consumers (sketch below).
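A file-export sketch for the third pattern, with pandas and Parquet as assumed (not mandated) choices:

```python
import pandas as pd

def export(records, path="aggregated_products.parquet"):
    """Write the final dataset as Parquet for downstream consumers."""
    df = pd.DataFrame(records)
    df.to_parquet(path, index=False)  # requires pyarrow or fastparquet
    return path
```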
Stage outputs feed stage inputs. The orchestrator runs them on schedule with retry logic:
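A minimal Airflow sketch of that chain; the DAG id, schedule, and retry settings are illustrative, and the stage callables are stubs to be replaced with the real functions:

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

# Stage callables are placeholders; wire in the real implementations.
def collect(): ...
def normalize(): ...
def dedupe(): ...
def transform(): ...
def output(): ...

default_args = {"retries": 3, "retry_delay": timedelta(minutes=15)}

with DAG(
    dag_id="aggregation_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    tasks = [
        PythonOperator(task_id=name, python_callable=fn)
        for name, fn in [
            ("collect", collect),
            ("normalize", normalize),
            ("dedupe", dedupe),
            ("transform", transform),
            ("output", output),
        ]
    ]
    # Chain stages sequentially: collect >> normalize >> dedupe >> transform >> output
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream
```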
Aggregating 100M product records monthly from 10 sources:
| Stage | Cost driver | Estimated monthly |
|---|---|---|
| Collect (residential proxies) | ~50 GB at $2.75/GB | $140 |
| Collect (compute) | Scraper VMs (3 workers) | $150 |
| Storage (raw + structured) | S3 + warehouse | $80 |
| Transform (warehouse compute) | BigQuery / Snowflake | $300 |
| Orchestration (Airflow) | Hosted Airflow / Prefect | $200 |
| Total | | ~$870/mo |
For 1B records: roughly 3-5x ($2.5-4.5K/month). The proxy and warehouse compute scale linearly; orchestration is roughly fixed.
Aggregated data without quality checks is worse than no data — bad decisions look authoritative because they came from a "data pipeline." Bake in:
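Exactly which checks matter depends on the dataset, but row-count deltas, null-rate thresholds on critical fields, and freshness are a reasonable minimum. A sketch of that kind of gate; thresholds and field names are illustrative:

```python
from datetime import datetime, timedelta

def quality_gate(records, previous_count):
    """Raise if the aggregated batch looks wrong; illustrative thresholds."""
    issues = []

    # Row-count sanity: a big drop usually means a broken scraper, not a smaller market.
    if previous_count and len(records) < 0.5 * previous_count:
        issues.append(f"row count dropped to {len(records)} from {previous_count}")

    # Null-rate threshold on a critical field.
    missing_price = sum(1 for r in records if r.get("min_price") is None)
    if records and missing_price / len(records) > 0.05:
        issues.append(f"{missing_price} records missing min_price")

    # Freshness: scraped_at is assumed to be a datetime set during collection.
    cutoff = datetime.utcnow() - timedelta(days=2)
    stale = sum(1 for r in records if r.get("scraped_at", cutoff) < cutoff)
    if stale:
        issues.append(f"{stale} stale records")

    if issues:
        raise ValueError("Quality gate failed: " + "; ".join(issues))
```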
Related: Data quality assurance, Data extraction tools, Web scraping for e-commerce.