
Web Scraping With PHP + Proxies: 3-Step Guide (2026)

Daniel K. · Published May 16, 2026

Quick verdict: Modern PHP web scraping is a three-line problem: composer require guzzlehttp/guzzle symfony/dom-crawler symfony/css-selector, then fetch with Guzzle (which handles proxies, retries, and concurrency out of the box) and parse with Symfony's DomCrawler (CSS selectors or XPath). Goutte is deprecated as of Symfony 6 — use DomCrawler directly. PHP scrapers run anywhere PHP runs (cheap shared hosts, Lambda, your existing WordPress box), and they're fast enough for anything that isn't a Frontier-Lab crawl.

Why PHP for Scraping?

  • Already in your stack. If your app, WordPress site, or Magento store is PHP, you don't need a second runtime.
  • Cheap to host. $5/mo shared hosting runs a PHP scraper. A 24/7 Python worker costs $20+/mo on the cheapest VPS.
  • Fast enough. PHP 8.3 with Guzzle + cURL is within ~30% of Python httpx for typical workloads.
  • Sync model is easy. No async/await mental tax until you actually need concurrency.

Step 1: Install Dependencies

composer require guzzlehttp/guzzle symfony/dom-crawler symfony/css-selector

That's the whole toolkit. Optionally add league/csv for CSV export and monolog/monolog for logging.
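
Both are a single extra Composer command:

composer require league/csv monolog/monolog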

Step 2: Fetch With Guzzle (+ Proxy)

<?php
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client([
    'base_uri'    => 'https://example.com',
    'timeout'     => 20.0,
    'http_errors' => false,
    'headers'     => [
        'User-Agent'      => 'Mozilla/5.0 (compatible; PHPScraper/1.0)',
        'Accept'          => 'text/html,application/xhtml+xml',
        'Accept-Language' => 'en-US,en;q=0.9',
    ],
    'proxy' => 'http://USER:PASS@host:8000',
]);

$response = $client->get('/products');
$status   = $response->getStatusCode();
$html     = (string) $response->getBody();

echo "HTTP $status, " . strlen($html) . " bytes\n";

Guzzle's proxy option accepts a single URL or a protocol-keyed array if you want HTTP and HTTPS separately. For SOCKS5 use socks5://USER:PASS@host:port.
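
Here's the array form, sketched with placeholder credentials:

$client = new Client([
    'proxy' => [
        'http'  => 'http://USER:PASS@host:8000',  // proxy for plain-HTTP requests
        'https' => 'http://USER:PASS@host:8000',  // proxy for HTTPS requests
        'no'    => ['localhost', '.internal'],    // hosts to fetch directly
    ],
]);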

Step 3: Parse With Symfony DomCrawler

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html);

// CSS selectors (requires symfony/css-selector)
$titles = $crawler->filter('h2.product-title')->each(fn($node) => $node->text());
$prices = $crawler->filter('.price')->each(fn($node) => trim($node->text()));

// Link extraction
$links = $crawler->filter('a.product')->each(fn($node) => $node->attr('href'));

// XPath (no extra dependency)
$jsonld = $crawler->filterXPath("//script[@type='application/ld+json']")
                  ->each(fn($node) => $node->text());

foreach ($titles as $i => $title) {
    echo $title . ' — ' . ($prices[$i] ?? '?') . "\n";
}

Adding Pagination

$page = 1;
$all  = [];

while (true) {
    $resp = $client->get("/products?page=$page");
    if ($resp->getStatusCode() !== 200) break;

    $crawler = new Crawler((string) $resp->getBody());
    $items   = $crawler->filter('.product')->each(fn($n) => [
        'title' => $n->filter('h2')->text(),
        'price' => $n->filter('.price')->text(),
        'url'   => $n->filter('a')->attr('href'),
    ]);

    if (empty($items)) break;
    $all = array_merge($all, $items);
    $page++;
    usleep(500_000); // 500ms politeness delay
}

echo count($all) . " products scraped\n";

Concurrent Requests With Pool

Guzzle's Pool runs many requests in parallel without async/await:

use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$urls = ['/products?page=1', '/products?page=2', '/products?page=3', /* ... */];
$requests = function () use ($urls) {
    foreach ($urls as $u) yield new Request('GET', $u);
};

$results = [];
$pool = new Pool($client, $requests(), [
    'concurrency' => 10,
    'fulfilled'   => function ($response, $index) use (&$results, $urls) {
        $results[$urls[$index]] = (string) $response->getBody();
    },
    'rejected' => function ($reason, $index) use ($urls) {
        echo "FAIL $urls[$index]: $reason\n";
    },
]);

$pool->promise()->wait();
echo count($results) . " pages fetched concurrently\n";

Concurrency of 10–20 is the sweet spot on a Premium Residential gateway. Higher concurrency on shared hosting eats your PHP-FPM workers; raise pm.max_children if you go above 20.

Cookies + Sessions

use GuzzleHttp\Cookie\CookieJar;

$jar    = new CookieJar();
$client = new Client(['cookies' => $jar]);

// Login
$client->post('https://example.com/login', [
    'form_params' => ['user' => 'alice', 'pass' => 'secret'],
]);

// Subsequent requests reuse the session
$resp = $client->get('https://example.com/dashboard');
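
To persist the session across script runs, swap in Guzzle's FileCookieJar (the file path here is an assumption):

use GuzzleHttp\Cookie\FileCookieJar;

// true = also keep session cookies; the jar is written back to disk on shutdown
$jar    = new FileCookieJar(__DIR__ . '/cookies.json', true);
$client = new Client(['cookies' => $jar]);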

Handling Cloudflare + Anti-Bot

Guzzle is a pure HTTP client — it can't solve Cloudflare's JS challenges. Options:

  • Run FlareSolverr as a Docker sidecar; POST your URL to http://localhost:8191/v1 and Guzzle gets back the cleared HTML and cookies.
  • Use LTE Mobile proxies ($2/IP) — mobile IPs face the lightest Cloudflare scrutiny.
  • For targets behind Turnstile or PerimeterX, pair Guzzle with a captcha solver (see our comparison of captcha solvers).
// Calling FlareSolverr from PHP
$fs = new Client(['base_uri' => 'http://localhost:8191']);
$resp = $fs->post('/v1', [
    'json' => [
        'cmd'        => 'request.get',
        'url'        => 'https://target-with-cloudflare.com',
        'maxTimeout' => 60000,
    ],
]);
$data = json_decode((string) $resp->getBody(), true);
$html = $data['solution']['response'] ?? '';
$crawler = new Crawler($html);

Goutte: Deprecated, Use DomCrawler

If you've scraped with PHP before 2023 you probably used Goutte. The project has since been archived; its functionality lives on in symfony/browser-kit + symfony/dom-crawler, which its maintainers recommend using directly. The new pattern:

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

$browser = new HttpBrowser(HttpClient::create([
    'proxy' => 'http://USER:PASS@host:8000',
]));
$crawler = $browser->request('GET', 'https://example.com/products');
$titles  = $crawler->filter('h2.title')->each(fn($n) => $n->text());

HttpBrowser handles cookies, form submission, and link clicking like a tiny browser without the Playwright weight. For simple scrapes it's cleaner than raw Guzzle + Crawler.
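
For example, a login flow in HttpBrowser looks like this (the button label and field names are assumptions about the target form):

$crawler = $browser->request('GET', 'https://example.com/login');
$form    = $crawler->selectButton('Log in')->form(); // locate the form via its submit button
$browser->submit($form, ['user' => 'alice', 'pass' => 'secret']);

// HttpBrowser keeps cookies, so the session carries into later requests
$crawler = $browser->request('GET', 'https://example.com/dashboard');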

Saving Data: JSON + CSV

// JSON
file_put_contents('products.json', json_encode($all, JSON_PRETTY_PRINT));

// CSV with league/csv
use League\Csv\Writer;

$writer = Writer::createFromPath('products.csv', 'w+');
$writer->insertOne(['title', 'price', 'url']);
foreach ($all as $row) {
    $writer->insertOne([$row['title'], $row['price'], $row['url']]);
}

Production Tips

  • Use a queue. Symfony Messenger or Laravel Queues for background jobs — never scrape inside a web request.
  • Retry transient errors. Use Guzzle's retry middleware (Middleware::retry(...)) with exponential backoff on 429, 502, 503, and 504; see the sketch after this list.
  • Rotate User-Agent. Maintain a list of 10–20 modern UA strings and pick one per session (the sketch below does exactly this).
  • Cache responses. Symfony HttpClient has a built-in cache decorator; for Guzzle use kevinrob/guzzle-cache-middleware.
  • Respect robots.txt. Parse with spatie/robots-txt before the first request.
  • Memory. DomCrawler keeps the full DOM in memory. For pages over a few MB, use a streaming parser such as XMLReader instead of loading the whole document.
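
Here's a minimal sketch of the retry middleware with per-session User-Agent rotation (fill in your own UA list):

use GuzzleHttp\Client;
use GuzzleHttp\Exception\ConnectException;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;

$stack = HandlerStack::create();
$stack->push(Middleware::retry(
    // Decider: retry connection errors and 429/502/503/504, up to 3 times
    function ($retries, $request, $response = null, $exception = null) {
        if ($retries >= 3) return false;
        if ($exception instanceof ConnectException) return true;
        return $response && in_array($response->getStatusCode(), [429, 502, 503, 504], true);
    },
    // Delay in milliseconds: exponential backoff (1s, 2s, 4s)
    fn ($retries) => 1000 * (2 ** ($retries - 1))
));

$userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
    // ...add 10–20 real, modern UA strings
];

$client = new Client([
    'handler' => $stack,
    'headers' => ['User-Agent' => $userAgents[array_rand($userAgents)]], // one UA per session
]);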

Common Errors

  • cURL error 7 "Couldn't connect" — proxy unreachable. Check proxy URL and that your firewall allows outbound to the proxy port.
  • cURL error 28 "Operation timed out" — bump timeout. Default 0 = no timeout, which is a bug-magnet.
  • Empty $crawler results — content is JS-rendered. Switch to FlareSolverr or drive a real browser from PHP (e.g. symfony/panther).
  • Random 403s — missing or fake User-Agent. Use a real browser UA, add Accept-Language, and rotate proxies.

Related: Web scraping in C · FlareSolverr guide · curl vs wget · Rotating proxies in Python.