Web Scraping Engineer · SpyderProxy
Alex R. is a Web Scraping Engineer at SpyderProxy specializing in Python, Playwright, and anti-bot evasion. She writes tutorials on scraping modern sites (Reddit, Instagram, Google, LinkedIn), proxy configuration for scrapers, and picking the right tooling (Puppeteer vs Selenium vs Playwright) for each job.
ChatGPT (and Claude, Gemini, etc.) help with web scraping in three distinct ways: (1) generating selectors and code from sample HTML, (2) acting as the parser itself via the API for structured extraction, and (3) building agentic browser flows for multi-step tasks. This guide shows the right prompt patterns, the API code for production use, and where vibe-coded scrapers break.
2026-05-18
Complete 2026 guide to CAPTCHA bypass — solver APIs (Capsolver, 2Captcha, NopeCHA), vision LLM solving (GPT-4V, Claude), Turnstile and hCaptcha workarounds, and the underrated technique of avoiding triggers entirely with clean residential proxies + realistic fingerprints. Costs, accuracy, and which method fits which CAPTCHA.
2026-05-18
The right proxy type for each social platform in 2026, tested against real account-management workloads. LTE mobile is the gold standard for Instagram and TikTok; static residential (ISP) works for desktop-managed Facebook and LinkedIn; rotating residential is for scraping public profiles, not for running accounts. Plus the warm-up rules that make any of this work.
2026-05-18
Rotating proxies change IP per request (or per sticky session) and pull from pools of millions — built for crawling, freshness, and avoiding rate limits. Static proxies stay on one IP — built for account management, identity continuity, and any flow that needs cookies, sessions, or warm-up to survive. This 2026 guide covers when each wins, the pricing tradeoffs, and how serious operators combine them.
2026-05-17
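As a rough illustration of the rotating-versus-static distinction in the entry above, the sketch below hands out a fresh proxy per request and pins one proxy to a session ID. The `ProxyPool` class, the session ID, and the proxy URLs are all hypothetical placeholders, not any provider's API.

```python
import itertools

# Hypothetical proxy endpoints, for illustration only.
PROXIES = [
    "http://proxy-1.example:8000",
    "http://proxy-2.example:8000",
    "http://proxy-3.example:8000",
]


class ProxyPool:
    """Toy pool: rotate per request, or pin one proxy per session ID."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)
        self._sticky = {}

    def next_proxy(self):
        # Rotating behavior: every call returns the next proxy in the pool.
        return next(self._cycle)

    def sticky_proxy(self, session_id):
        # Static/sticky behavior: the same session ID always gets the same proxy.
        if session_id not in self._sticky:
            self._sticky[session_id] = next(self._cycle)
        return self._sticky[session_id]


pool = ProxyPool(PROXIES)
per_request = [pool.next_proxy() for _ in range(4)]
account_first = pool.sticky_proxy("account-42")
account_again = pool.sticky_proxy("account-42")
```

Crawling traffic would use `next_proxy()`, while an account that needs identity continuity would use `sticky_proxy()` so its cookies and IP stay paired.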
AI scraping uses LLMs (or LLM-driven agents) to read a page and return structured JSON without you writing a single CSS selector or XPath. This guide explains how it works, the 2026 toolset (Firecrawl, ScrapeGraphAI, Browser-use, Reworkd, Stagehand), what it costs per page, and when traditional scraping still wins.
2026-05-16
Complete 2026 guide to scraping images from any website with Python: requests + BeautifulSoup for static pages, async httpx for volume, Playwright for lazy-loaded galleries. Includes srcset handling, deduplication, alt-text extraction, and proxy rotation.
2026-05-16
Five free tools to verify your browser fingerprint in 2026: CreepJS (most thorough), EFF Cover Your Tracks (most user-friendly), FingerprintJS demo (sees what commercial trackers see), AmIUnique (uniqueness ranking), SannySoft (bot detection). Each tool catches different issues — use them together for a complete privacy audit.
2026-05-11
CreepJS is the most thorough open-source browser fingerprinting test in 2026 — it detects more signals than FingerprintJS or Panopticlick, including anti-fingerprinting tools themselves. This guide walks through reading its output, identifying the signals that give you away, and using it to validate antidetect browser setups before deploying.
2026-05-10
Canvas fingerprinting is the technique of identifying browsers by rendering a hidden image and hashing the pixel output. Same browser version + same OS + same GPU = same hash; different rendering stack = different hash. This guide explains how it works, why it survives cookie clearing, and how scrapers and antidetect browsers spoof it.
2026-05-10
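The "same stack, same hash" idea in the canvas fingerprinting entry above can be sketched in a few lines of Python. This does not touch a real canvas: `render_pixels` is a hypothetical stand-in that derives deterministic bytes from a description of the rendering stack, and the fingerprint is simply a SHA-256 digest of those bytes.

```python
import hashlib


def render_pixels(browser: str, os_name: str, gpu: str) -> bytes:
    """Hypothetical stand-in for canvas rendering: deterministic bytes per stack."""
    stack = f"{browser}|{os_name}|{gpu}".encode("utf-8")
    # Fake "pixel" data that depends only on the rendering stack.
    return hashlib.sha256(stack).digest() * 4


def canvas_fingerprint(pixels: bytes) -> str:
    """Hash the pixel output, as a fingerprinting script would hash canvas data."""
    return hashlib.sha256(pixels).hexdigest()


same_a = canvas_fingerprint(render_pixels("Chrome 124", "Windows 11", "RTX 3060"))
same_b = canvas_fingerprint(render_pixels("Chrome 124", "Windows 11", "RTX 3060"))
other = canvas_fingerprint(render_pixels("Chrome 124", "macOS 14", "Apple M2"))
```

Because the hash depends only on the rendered output, clearing cookies changes nothing: `same_a` and `same_b` match, while `other` differs.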
Three Python libraries run headless browsers in 2026: Playwright (modern default), Selenium (legacy mainstay), and Pyppeteer (Python port of Puppeteer). This tutorial covers installation, a working scraper, proxy setup, stealth-mode for anti-bot evasion, and the practical pitfalls (memory, timeouts, async-vs-sync).
2026-05-10
Data extraction tools split into three categories: code-first frameworks (Scrapy, Playwright, Crawlee), managed scraping APIs (ScrapingBee, Bright Data Web Scraper API, Apify), and no-code GUI tools (Octoparse, ParseHub). This guide compares 10 of them on pricing, scalability, anti-bot handling, and the build-vs-buy decision.
2026-05-10
Data aggregation turns raw scraped data into decision-ready datasets. This guide walks through the five-stage pipeline (collect, normalize, dedupe, transform, output), the infrastructure decisions at each stage (which proxies, which storage, which transformation framework), and the math on how much it costs to run.
2026-05-10
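The five stages named in the data aggregation entry above (collect, normalize, dedupe, transform, output) can be sketched with the standard library alone. The records, field names, and the "lowest price per SKU" transform are invented for illustration; a real pipeline would collect from scrapers and write to real storage.

```python
import csv
import io


def collect():
    # Stand-in for scraped rows (hypothetical data).
    return [
        {"sku": "A1 ", "price": "10.00", "store": "shop-a"},
        {"sku": "a1", "price": "12.00", "store": "shop-b"},
        {"sku": "a1", "price": "12.00", "store": "shop-b"},  # exact duplicate
        {"sku": "B2", "price": "5.50", "store": "shop-a"},
    ]


def normalize(rows):
    # Consistent casing, trimmed whitespace, numeric prices.
    return [
        {"sku": r["sku"].strip().upper(), "price": float(r["price"]), "store": r["store"]}
        for r in rows
    ]


def dedupe(rows):
    seen, unique = set(), []
    for r in rows:
        key = (r["sku"], r["store"], r["price"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def transform(rows):
    # Aggregate to the lowest observed price per SKU.
    best = {}
    for r in rows:
        if r["sku"] not in best or r["price"] < best[r["sku"]]:
            best[r["sku"]] = r["price"]
    return [{"sku": sku, "min_price": price} for sku, price in sorted(best.items())]


def output(rows):
    # Write CSV to an in-memory buffer instead of a file.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["sku", "min_price"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


cleaned = dedupe(normalize(collect()))
summary = transform(cleaned)
csv_text = output(summary)
```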
cURL's default User-Agent is 'curl/X.Y.Z' — instantly recognizable as a bot. The -A flag sets a custom UA. This guide covers when spoofing matters, the current Chrome/Firefox/Safari strings, and why UA alone won't bypass modern bot detection (TLS fingerprinting matters more).
2026-05-10
cURL defaults to GET — no -X flag needed. This guide covers query string handling, headers, cookies, accept-encoding, response-only output, and the proxy patterns for routing GETs through residential or datacenter proxies. Six working examples for the patterns that actually come up in scripts.
2026-05-10
CSS selectors are the standard way to target elements in HTML — used by browsers, BeautifulSoup, Playwright, Cheerio, and every scraping framework. This cheat sheet covers all the syntax developers actually use day-to-day: selectors, combinators, pseudo-classes, attribute matchers, plus when XPath does it better.
2026-05-10
Python asyncio can make I/O-bound web scraping 10-100x faster than sending requests one at a time. This tutorial covers the core primitives (async/await, gather, semaphore), the two main HTTP clients (aiohttp and httpx), and the working pattern for scraping at scale with rotating residential proxies.
2026-05-10
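The primitives named in the asyncio entry above (async/await, gather, semaphore) fit together roughly as follows. This is a minimal sketch that simulates network latency with `asyncio.sleep` rather than calling aiohttp or httpx; the `fetch` and `scrape` helpers and the URLs are illustrative assumptions.

```python
import asyncio


async def fetch(url: str, sem: asyncio.Semaphore) -> str:
    # The semaphore caps how many fetches run at the same time.
    async with sem:
        await asyncio.sleep(0.01)  # simulated network latency
        return f"body of {url}"


async def scrape(urls: list[str], max_concurrency: int = 5) -> list[str]:
    sem = asyncio.Semaphore(max_concurrency)
    # gather schedules all coroutines and returns results in input order.
    return await asyncio.gather(*(fetch(u, sem) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(20)]
results = asyncio.run(scrape(urls))
```

In a real scraper the body of `fetch` would be an HTTP client call, and `max_concurrency` would be tuned to what the target site and the proxy pool tolerate.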
FlareSolverr is a proxy server that runs undetected-chromedriver behind the scenes to solve Cloudflare's modern challenges, including Turnstile. This guide covers Docker setup, the Python client pattern, proxy integration, common errors, and when to pick FlareSolverr over Cloudscraper or curl_cffi.
2026-05-10
Glassdoor protects its salary and review data with DataDome and a soft-login wall. Plain requests fails. This guide covers the architecture (when login is required, when it isn't), the proxy/browser setup that works, and a Playwright-based scraper for jobs, salaries, and visible reviews.
2026-05-10
Indeed is one of the largest job boards, with millions of postings. This guide covers the URL structure, the anti-bot challenges (Cloudflare + custom JS detection), the proxy setup that actually works (rotating residential), and a working Python scraper that pulls title, company, location, salary, and description.
2026-05-10
Pandas read_html turns any HTML table into a DataFrame in one line. It is the fastest way to scrape tabular data: stock tickers, sports stats, Wikipedia tables, financial reports. This guide covers the basics, common gotchas (multi-index headers, missing values), and the pattern for scraping protected sites via proxies.
2026-05-10
Python's requests library does not retry failed requests by default. The standard pattern uses urllib3's Retry class mounted via an HTTPAdapter on a Session. This guide covers exponential backoff, status_forcelist, retry budgets, and when to retry vs fail fast.
2026-05-10
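The retry pattern described in the entry above (urllib3's `Retry` mounted through an `HTTPAdapter` on a `Session`) looks roughly like this. The helper name `make_session`, the retry count, the backoff factor, and the status list are illustrative choices, not recommendations from the guide.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total: int = 3, backoff_factor: float = 0.5) -> requests.Session:
    """Build a Session that retries selected failures with exponential backoff."""
    retry = Retry(
        total=total,                                   # overall retry budget
        backoff_factor=backoff_factor,                 # sleep grows between attempts
        status_forcelist=[429, 500, 502, 503, 504],    # retry on these responses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Mount for both schemes so every request through this session is covered.
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = make_session()
```

Calls such as `session.get(url, timeout=10)` then go through the mounted adapter. By default `Retry` only retries idempotent methods, so POST requests are not retried on status codes unless that is configured explicitly.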
cURL has four ways to send a POST: -d for form-encoded data, --data-raw for raw bodies (JSON), --data-urlencode for safe URL encoding, and -F for multipart file uploads. This guide shows the exact flags, content-type behavior, and proxy setup for each pattern.
2026-05-10
DataDome is one of the toughest anti-bot services on the web (used by Reddit, Hermes, Allegro, RIU). It blocks at four layers: IP reputation, TLS fingerprint, browser fingerprint, and behavioral signals. This guide covers what beats it in 2026: residential or LTE mobile proxies, curl_cffi or undetected-chromedriver, and where you cannot avoid a CAPTCHA solver.
2026-05-10
Cloudscraper is a Python library that handles Cloudflare's basic JavaScript challenges so requests can fetch a page without spinning up a real browser. This tutorial covers when it works, when it does not, how to plug in residential proxies, and what to switch to (FlareSolverr or Playwright) for the harder Cloudflare modes.
2026-05-10
The 9 best proxy extensions for Chrome and Edge in 2026 ranked: FoxyProxy and SwitchyOmega for power users, Bright Data Proxy Manager for SaaS, and 6 more. Free vs paid, security audit (some inject ads or log traffic), and which extension fits each use case from casual to professional.
2026-05-06
Cheerio is a server-side jQuery-like HTML parser; Puppeteer is a headless Chrome controller. Cheerio is 50-100x faster for static HTML but can't run JavaScript. Puppeteer renders SPAs but uses 200x the memory. This guide covers when to pick which, performance benchmarks, and the hybrid pattern most production scrapers use.
2026-05-06
JavaScript has no native cURL. The three alternatives: built-in fetch (in browser and Node 18+), axios (cleaner API + interceptors), and child_process to invoke real curl from Node. This guide covers when each makes sense, working examples, and how to add proxy support to all three.
2026-05-06
HTTPX is a modern Python HTTP client with a largely requests-compatible API, plus HTTP/2 and async support. This guide covers when to migrate from requests, performance comparison vs requests and aiohttp, and 8 working examples for sync/async/streaming/proxies.
2026-05-06
Three ways to handle cookies in Python requests: Session for automatic cross-request cookie management, dict for simple one-off cookies, and CookieJar for fine-grained control. Code examples for login authentication, persistent sessions, cookie inspection, and using cookies through proxies.
2026-05-06
Google Finance never had a public API and the unofficial endpoints stopped working years ago. This guide compares the 7 working alternatives in 2026: yfinance (free, Python), Alpha Vantage, IEX Cloud, Polygon.io, Twelve Data, Tiingo, and the option of scraping Yahoo Finance directly with residential proxies. Pricing, rate limits, data coverage, and code examples for each.
2026-05-04
A beginner Python tutorial for scraping text from any website: requests + BeautifulSoup for static pages, Playwright for JavaScript-rendered pages, the .get_text() method with separator and strip arguments, whitespace cleaning, and 6 working examples — plus when to add residential proxies for scaled scraping.
2026-05-04
PyQuery brings jQuery-style selectors to Python HTML parsing. This guide covers when to pick PyQuery over BeautifulSoup or lxml, the API differences, performance comparison on real pages, and 8 working examples for scraping. PyQuery wins when you're already comfortable with jQuery and want the same syntax server-side.
2026-05-04
Parse a JSON string into Python objects with json.loads(). This guide covers the basics, the 5 errors that hit 99% of beginners (single quotes, trailing commas, NaN, comments, bytes), how to handle them, schema validation with pydantic or jsonschema, and 8 copy-paste examples.
2026-05-04
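For the `json.loads()` entry above, a minimal sketch of the happy path and two of the listed pitfalls (single quotes and trailing commas). The sample strings and the `try_parse` helper are made up for illustration.

```python
import json

# Valid JSON: double-quoted keys and strings, no trailing comma.
valid = '{"name": "widget", "price": 9.99, "tags": ["a", "b"]}'
data = json.loads(valid)


def try_parse(text: str):
    """Return the parsed object, or a short error description if parsing fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        return f"JSONDecodeError: {exc.msg}"


# Python-style single quotes are not valid JSON.
single_quotes = try_parse("{'name': 'widget'}")
# A trailing comma is also rejected by the standard parser.
trailing_comma = try_parse('{"name": "widget",}')
```

Catching `json.JSONDecodeError` specifically, rather than a bare `except`, keeps unrelated bugs from being mistaken for malformed input.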
Four ways to find HTML elements by class with BeautifulSoup: find_all(class_=), select() with CSS selectors, attrs={'class': ...}, and regex matching. Plus the gotchas around multi-class elements, dynamic Tailwind class names, and case sensitivity.
2026-05-01
Step-by-step Python tutorial for scraping Yellow Pages business listings: name, phone, address, ratings, hours, and category. Covers BeautifulSoup setup, pagination, residential proxy rotation, anti-bot tactics, CSV export, and the legal considerations for scraped business contact data.
2026-05-01
Crunchyroll's anime catalog varies by country — the US has more shows than the EU, and Japan has different ones again. To unblock the catalog you want, use a residential proxy in the right country. VPNs work less reliably than proxies because Crunchyroll's geo-block detection is tuned against shared VPN IP ranges.
2026-04-27
Use a proxy on Tinder to swipe in another city, avoid Tinder Passport fees, or run multiple accounts. Step-by-step iOS and Android setup, the 6 signals Tinder uses to detect suspicious sessions, and which proxy type works (mobile beats residential beats datacenter).
2026-04-27
VMs vs antidetect browsers for multi-accounting in 2026. Fingerprint isolation, cost, scalability, IP pairing, and which to pick for Amazon, eBay, Meta, and TikTok.
2026-04-25
How browser fingerprinting works: canvas, WebGL, fonts, TLS/JA4, user-agent entropy. Test your fingerprint and reduce uniqueness with proxies.
2026-04-24
WebRTC leaks expose your real IP behind a VPN or proxy. Learn how leaks happen, how to test, and how to block them on Chrome, Firefox, and Safari.
2026-04-24
Complete guide to what proxies are used for in 2026. The 15 most common proxy use cases — web scraping, SEO tracking, ad verification, sneaker bots, brand protection, AI training — and which proxy type fits each.
2026-04-21
What a DNS leak is, why it happens even on VPNs and proxies, how to test for one in under a minute, and the exact fixes for Windows, macOS, iOS, Android, routers, and browsers.
2026-04-20
Scrape Facebook Marketplace listings, prices, and sellers in 2026 — GraphQL endpoints, cookie / DTSG tokens, geo-rotation, and anti-bot countermeasures with working Python code.
2026-04-19
How to scrape TikTok in 2026 — web scraping vs mobile API, X-Bogus/msToken signatures, anti-bot detection, and which proxy type to pick for video, profile, and trend data without getting blocked.
2026-04-18
Step-by-step Python tutorial for scraping Reddit posts and comments. Covers JSON endpoints, the PRAW API, proxy rotation, rate-limit handling, and best practices to avoid bans.
2026-04-16
Step-by-step guide to configuring a proxy on macOS Sonoma, Sequoia, and Ventura. Covers system settings, Safari/Chrome/Firefox, terminal, SOCKS5, and troubleshooting.
2026-04-16
Step-by-step guide to configuring proxy settings on iPhone (iOS 17/18) and Android. Covers Wi-Fi proxy setup, apps, SOCKS5 configuration, and troubleshooting.
2026-04-10
Compare Puppeteer, Playwright, and Selenium for web scraping in 2026. Performance benchmarks, proxy support, browser coverage, and code examples to help you choose.
2026-04-10
Learn how to scrape websites behind login pages using Python. Covers session-based auth, Selenium browser login, Playwright, cookies, and proxy rotation for scale.
2026-04-10
Compare the 10 best proxy browsers and anti-detect browsers in 2026. Covers Multilogin, GoLogin, AdsPower, Dolphin Anty, Tor, Brave, and more — with proxy setup tips.
2026-04-09
Learn how to configure proxy settings in Chrome on Windows and macOS. Step-by-step instructions for system proxy, extensions, command-line flags, and SOCKS5 setup.
2026-04-09
Step-by-step guide to configuring proxy settings on Windows 11. Covers system settings, browser-specific setup, SOCKS5 configuration, and troubleshooting common issues.
2026-04-09
Complete guide to scraping Google SERPs with Python. Extract rankings, snippets, and People Also Ask data using rotating proxies to avoid blocks.
2026-04-07
Learn how to scrape Instagram data with Python. Extract profiles, posts, hashtags, and comments using proxies to avoid rate limits and IP bans.
2026-04-07
Learn how to scrape Zillow with Python using requests, BeautifulSoup, and Playwright. Complete guide with proxy rotation, anti-bot handling, and code examples.
2026-04-06
Integrate rotating proxies with Python requests. Code examples for proxy rotation, sticky sessions, SOCKS5, and async scraping.
2026-03-30
Complete guide to scraping Amazon in 2026. Learn proxy rotation, anti-detection techniques, Python code examples, and production-grade error handling.
2026-03-30
Getting the "There was a problem with the server [400]" error on YouTube? This step-by-step guide covers every fix for desktop browsers and mobile devices.
2026-03-27
Master cURL from basics to advanced proxy configuration. Learn installation, HTTP requests, authentication, proxy setup with residential and SOCKS5 proxies, and automation techniques.
2026-03-16
Learn how to diagnose and fix SSL certificate errors step by step. Covers expired certificates, name mismatches, self-signed certs, proxy-related SSL issues, and prevention strategies for 2026.
2026-03-16
Small online stores can use web scraping to track competitor prices, spot product trends, and gather customer insights. Learn how to scrape ethically with proxies, beginner-friendly tools, and actionable workflows.
2026-03-16
Learn how datacenter, residential, ISP, and mobile proxies enable large-scale data research. Compare proxy types, choose the right one for your project, and build reliable data collection pipelines.
2026-03-16
Master web scraping with proxies. Learn best practices for rotating IPs, handling rate limits, avoiding detection, and building reliable scraping pipelines.
2026-02-15