Quick verdict: Modern PHP web scraping is a three-package problem: composer require guzzlehttp/guzzle symfony/dom-crawler symfony/css-selector, then fetch with Guzzle (which handles proxies, concurrency, and retries via its bundled middleware) and parse with Symfony's DomCrawler (CSS selectors or XPath). Goutte is deprecated as of Symfony 6; use DomCrawler directly. PHP scrapers run anywhere PHP runs (cheap shared hosts, Lambda, your existing WordPress box), and they're fast enough for anything short of a frontier-lab-scale crawl.
composer require guzzlehttp/guzzle symfony/dom-crawler symfony/css-selector
That's the whole toolkit. Optionally add league/csv for CSV export and monolog/monolog for logging.
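If you do add Monolog, wiring it up takes two lines. A minimal sketch; the channel name and log path here are arbitrary:

use Monolog\Logger;
use Monolog\Handler\StreamHandler;

// One channel for the whole scraper, appending to a local file
$log = new Logger('scraper');
$log->pushHandler(new StreamHandler(__DIR__ . '/scraper.log'));

$log->info('Fetched page', ['status' => 200, 'bytes' => 5120]);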
<?php
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client([
    'base_uri' => 'https://example.com',
    'timeout' => 20.0,
    'http_errors' => false, // return 4xx/5xx responses instead of throwing
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (compatible; PHPScraper/1.0)',
        'Accept' => 'text/html,application/xhtml+xml',
        'Accept-Language' => 'en-US,en;q=0.9',
    ],
    'proxy' => 'http://USER:[email protected]:8000',
]);

$response = $client->get('/products');
$status = $response->getStatusCode();
$html = (string) $response->getBody();
echo "HTTP $status, " . strlen($html) . " bytes\n";
Guzzle's proxy option accepts a single URL or a protocol-keyed array if you want HTTP and HTTPS separately. For SOCKS5 use socks5://USER:PASS@host:port.
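A quick sketch of both forms; the gateway address mirrors the placeholder above and the SOCKS5 endpoint is hypothetical:

$client = new Client([
    'proxy' => [
        'http'  => 'http://USER:[email protected]:8000',  // for plain-HTTP targets
        'https' => 'http://USER:[email protected]:8000',  // for HTTPS targets
        'no'    => ['localhost'],                            // hosts that bypass the proxy
    ],
]);

// SOCKS5 instead (needs curl built with SOCKS support):
$socksClient = new Client(['proxy' => 'socks5://USER:[email protected]:1080']);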
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html);

// CSS selectors (requires symfony/css-selector)
$titles = $crawler->filter('h2.product-title')->each(fn($node) => $node->text());
$prices = $crawler->filter('.price')->each(fn($node) => trim($node->text()));

// Link extraction
$links = $crawler->filter('a.product')->each(fn($node) => $node->attr('href'));

// XPath (no extra dependency)
$jsonld = $crawler->filterXPath("//script[@type='application/ld+json']")
    ->each(fn($node) => $node->text());

foreach ($titles as $i => $title) {
    echo $title . ' — ' . ($prices[$i] ?? '?') . "\n";
}
$page = 1;
$all = [];
while (true) {
    $resp = $client->get("/products?page=$page");
    if ($resp->getStatusCode() !== 200) break;

    $crawler = new Crawler((string) $resp->getBody());
    $items = $crawler->filter('.product')->each(fn($n) => [
        'title' => $n->filter('h2')->text(),
        'price' => $n->filter('.price')->text(),
        'url'   => $n->filter('a')->attr('href'),
    ]);
    if (empty($items)) break;

    $all = array_merge($all, $items);
    $page++;
    usleep(500_000); // 500 ms politeness delay
}
echo count($all) . " products scraped\n";
Guzzle's Pool runs many requests concurrently over curl's multi handle, with no promise plumbing on your side:
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$urls = ['/products?page=1', '/products?page=2', '/products?page=3', /* ... */];

$requests = function () use ($urls) {
    foreach ($urls as $u) yield new Request('GET', $u);
};

$results = [];
$pool = new Pool($client, $requests(), [
    'concurrency' => 10,
    'fulfilled' => function ($response, $index) use (&$results, $urls) {
        $results[$urls[$index]] = (string) $response->getBody();
    },
    'rejected' => function ($reason, $index) use ($urls) {
        // $reason is typically a RequestException
        echo "FAIL {$urls[$index]}: {$reason->getMessage()}\n";
    },
]);
$pool->promise()->wait();

echo count($results) . " pages fetched concurrently\n";
Concurrency of 10–20 is the sweet spot on a Premium Residential gateway. On shared hosting, remember that a scrape running inside a web request pins a PHP-FPM worker for its entire duration; raise pm.max_children if several can run at once, or better, move the scraper to a CLI cron job.
use GuzzleHttp\Cookie\CookieJar;

$jar = new CookieJar();
$client = new Client(['cookies' => $jar]);

// Login
$client->post('https://example.com/login', [
    'form_params' => ['user' => 'alice', 'pass' => 'secret'],
]);

// Subsequent requests reuse the session
$resp = $client->get('https://example.com/dashboard');
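If the session should survive between script runs, Guzzle's FileCookieJar persists the jar to disk. A sketch; the file path is arbitrary:

use GuzzleHttp\Cookie\FileCookieJar;

// Saves cookies on shutdown and reloads them on the next run;
// the second argument also persists session cookies.
$jar = new FileCookieJar(__DIR__ . '/cookies.json', true);
$client = new Client(['cookies' => $jar]);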
Guzzle is a pure HTTP client; it can't solve Cloudflare's JS challenges. The usual option is FlareSolverr: run it as a local sidecar, POST the target URL to http://localhost:8191/v1, and Guzzle gets back the cleared HTML and cookies.
// Calling FlareSolverr from PHP
$fs = new Client(['base_uri' => 'http://localhost:8191']);
$resp = $fs->post('/v1', [
    'json' => [
        'cmd' => 'request.get',
        'url' => 'https://target-with-cloudflare.com',
        'maxTimeout' => 60000,
    ],
]);
$data = json_decode((string) $resp->getBody(), true);
$html = $data['solution']['response'] ?? '';
$crawler = new Crawler($html);
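To avoid round-tripping through FlareSolverr on every request, you can copy the solved cookies and user agent into a plain Guzzle client. A sketch, assuming the standard FlareSolverr response shape:

use GuzzleHttp\Cookie\CookieJar;

// Rebuild a Guzzle jar from FlareSolverr's clearance cookies
$pairs = [];
foreach ($data['solution']['cookies'] ?? [] as $c) {
    $pairs[$c['name']] = $c['value'];
}
$jar = CookieJar::fromArray($pairs, 'target-with-cloudflare.com');

$client = new Client([
    'cookies' => $jar,
    // The clearance is tied to the user agent that solved the challenge
    'headers' => ['User-Agent' => $data['solution']['userAgent'] ?? ''],
]);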
If you scraped with PHP before 2023, you probably used Goutte. As of Symfony 6, Goutte is archived and its functionality has been merged into symfony/browser-kit + symfony/dom-crawler. The new pattern:
// composer require symfony/browser-kit symfony/http-client
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

$browser = new HttpBrowser(HttpClient::create([
    'proxy' => 'http://USER:[email protected]:8000',
]));

$crawler = $browser->request('GET', 'https://example.com/products');
$titles = $crawler->filter('h2.title')->each(fn($n) => $n->text());
HttpBrowser handles cookies, form submission, and link clicking like a tiny browser without the Playwright weight. For simple scrapes it's cleaner than raw Guzzle + Crawler.
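Form submission, for example, is a couple of calls. A sketch, assuming a login form whose submit button is labeled 'Log in' and whose fields are named user and pass:

$crawler = $browser->request('GET', 'https://example.com/login');
$form = $crawler->selectButton('Log in')->form([
    'user' => 'alice',
    'pass' => 'secret',
]);
$crawler = $browser->submit($form);

// Cookies persist on the browser, so later requests stay logged in
$dashboard = $browser->request('GET', 'https://example.com/dashboard');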
// JSON
file_put_contents('products.json', json_encode($all, JSON_PRETTY_PRINT));

// CSV with league/csv
use League\Csv\Writer;

$writer = Writer::createFromPath('products.csv', 'w+');
$writer->insertOne(['title', 'price', 'url']);
foreach ($all as $row) {
    $writer->insertOne([$row['title'], $row['price'], $row['url']]);
}
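Production scrapers also want retries, which the notes below recommend. A minimal sketch of Guzzle's retry middleware, assuming Guzzle 7; the attempt cap and backoff curve are illustrative:

use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

$stack = HandlerStack::create();
$stack->push(Middleware::retry(
    // Decider: retry up to 3 times on connection errors or transient statuses
    function (int $retries, RequestInterface $req, ?ResponseInterface $res = null, ?\Throwable $e = null) {
        if ($retries >= 3) {
            return false;
        }
        if ($e !== null) {
            return true; // network-level failure, worth another try
        }
        return $res !== null && in_array($res->getStatusCode(), [429, 502, 503, 504], true);
    },
    // Delay: exponential backoff in milliseconds (1s, 2s, 4s)
    fn(int $retries) => 1000 * (2 ** ($retries - 1))
));

$client = new Client(['handler' => $stack, 'timeout' => 20.0]);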
Loose ends and common failure modes:
- Retries: wrap your handler stack with Middleware::retry(...) and exponential backoff, as sketched above. Retry on 429, 502, 503, 504.
- Cache responses during development with kevinrob/guzzle-cache-middleware.
- Be polite: check the rules with spatie/robots-txt before the first request.
- Huge XML sitemaps or feeds don't belong in DomCrawler; use simplexml streaming instead.
- Proxy connection failures: double-check the proxy URL and that your firewall allows outbound to the proxy port.
- Always set timeout. Default 0 = no timeout, which is a bug-magnet.
- Empty $crawler results usually mean the content is JS-rendered; switch to FlareSolverr or Playwright via the PHP node bindings.

Related: Web scraping in C · FlareSolverr guide · curl vs wget · Rotating proxies in Python.