Quick verdict: Modern PHP web scraping is a three-package problem: composer require guzzlehttp/guzzle symfony/dom-crawler symfony/css-selector, then fetch with Guzzle (which handles proxies, concurrency, and retries via its bundled middleware) and parse with Symfony's DomCrawler (CSS selectors or XPath). Goutte is deprecated as of Symfony 6; use DomCrawler directly. PHP scrapers run anywhere PHP runs (cheap shared hosts, Lambda, your existing WordPress box), and they're fast enough for anything short of a frontier-lab-scale crawl.
composer require guzzlehttp/guzzle symfony/dom-crawler symfony/css-selector
That's the whole toolkit. Optionally add league/csv for CSV export and monolog/monolog for logging.
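If you do add Monolog, wiring it up takes two lines. A minimal sketch; the channel name and log path here are arbitrary:

use Monolog\Logger;
use Monolog\Handler\StreamHandler;

// One channel for the whole scraper, appending to a local file
$log = new Logger('scraper');
$log->pushHandler(new StreamHandler(__DIR__ . '/scraper.log'));

$log->info('Fetched page', ['status' => 200, 'bytes' => 5120]);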
<?php
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client([
    'base_uri' => 'https://example.com',
    'timeout' => 20.0,
    'http_errors' => false, // return 4xx/5xx responses instead of throwing
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (compatible; PHPScraper/1.0)',
        'Accept' => 'text/html,application/xhtml+xml',
        'Accept-Language' => 'en-US,en;q=0.9',
    ],
    'proxy' => 'http://USER:[email protected]:8000',
]);

$response = $client->get('/products');
$status = $response->getStatusCode();
$html = (string) $response->getBody();
echo "HTTP $status, " . strlen($html) . " bytes\n";
Guzzle's proxy option accepts a single URL or a protocol-keyed array if you want HTTP and HTTPS separately. For SOCKS5 use socks5://USER:PASS@host:port.
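A quick sketch of both forms; the gateway address mirrors the placeholder above and the SOCKS5 endpoint is hypothetical:

$client = new Client([
    'proxy' => [
        'http'  => 'http://USER:[email protected]:8000',  // for plain-HTTP targets
        'https' => 'http://USER:[email protected]:8000',  // for HTTPS targets
        'no'    => ['localhost'],                            // hosts that bypass the proxy
    ],
]);

// SOCKS5 instead (needs curl built with SOCKS support):
$socksClient = new Client(['proxy' => 'socks5://USER:[email protected]:1080']);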
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html);

// CSS selectors (requires symfony/css-selector)
$titles = $crawler->filter('h2.product-title')->each(fn($node) => $node->text());
$prices = $crawler->filter('.price')->each(fn($node) => trim($node->text()));

// Link extraction
$links = $crawler->filter('a.product')->each(fn($node) => $node->attr('href'));

// XPath (no extra dependency)
$jsonld = $crawler->filterXPath("//script[@type='application/ld+json']")
    ->each(fn($node) => $node->text());

foreach ($titles as $i => $title) {
    echo $title . ' — ' . ($prices[$i] ?? '?') . "\n";
}
$page = 1;
$all = [];
while (true) {
    $resp = $client->get("/products?page=$page");
    if ($resp->getStatusCode() !== 200) break;

    $crawler = new Crawler((string) $resp->getBody());
    $items = $crawler->filter('.product')->each(fn($n) => [
        'title' => $n->filter('h2')->text(),
        'price' => $n->filter('.price')->text(),
        'url'   => $n->filter('a')->attr('href'),
    ]);
    if (empty($items)) break;

    $all = array_merge($all, $items);
    $page++;
    usleep(500_000); // 500 ms politeness delay
}
echo count($all) . " products scraped\n";
Guzzle's Pool runs many requests concurrently over curl's multi handle, with no promise plumbing on your side:
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$urls = ['/products?page=1', '/products?page=2', '/products?page=3', /* ... */];

$requests = function () use ($urls) {
    foreach ($urls as $u) yield new Request('GET', $u);
};

$results = [];
$pool = new Pool($client, $requests(), [
    'concurrency' => 10,
    'fulfilled' => function ($response, $index) use (&$results, $urls) {
        $results[$urls[$index]] = (string) $response->getBody();
    },
    'rejected' => function ($reason, $index) use ($urls) {
        // $reason is typically a RequestException
        echo "FAIL {$urls[$index]}: {$reason->getMessage()}\n";
    },
]);
$pool->promise()->wait();

echo count($results) . " pages fetched concurrently\n";
Concurrency of 10–20 is the sweet spot on a Premium Residential gateway. On shared hosting, remember that a scrape running inside a web request pins a PHP-FPM worker for its entire duration; raise pm.max_children if several can run at once, or better, move the scraper to a CLI cron job.
use GuzzleHttp\Cookie\CookieJar;

$jar = new CookieJar();
$client = new Client(['cookies' => $jar]);

// Login
$client->post('https://example.com/login', [
    'form_params' => ['user' => 'alice', 'pass' => 'secret'],
]);

// Subsequent requests reuse the session
$resp = $client->get('https://example.com/dashboard');
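If the session should survive between script runs, Guzzle's FileCookieJar persists the jar to disk. A sketch; the file path is arbitrary:

use GuzzleHttp\Cookie\FileCookieJar;

// Saves cookies on shutdown and reloads them on the next run;
// the second argument also persists session cookies.
$jar = new FileCookieJar(__DIR__ . '/cookies.json', true);
$client = new Client(['cookies' => $jar]);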
Guzzle is a pure HTTP client; it can't solve Cloudflare's JS challenges. The usual option is FlareSolverr: run it as a local sidecar, POST the target URL to http://localhost:8191/v1, and Guzzle gets back the cleared HTML and cookies.
// Calling FlareSolverr from PHP
$fs = new Client(['base_uri' => 'http://localhost:8191']);
$resp = $fs->post('/v1', [
    'json' => [
        'cmd' => 'request.get',
        'url' => 'https://target-with-cloudflare.com',
        'maxTimeout' => 60000,
    ],
]);
$data = json_decode((string) $resp->getBody(), true);
$html = $data['solution']['response'] ?? '';
$crawler = new Crawler($html);
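To avoid round-tripping through FlareSolverr on every request, you can copy the solved cookies and user agent into a plain Guzzle client. A sketch, assuming the standard FlareSolverr response shape:

use GuzzleHttp\Cookie\CookieJar;

// Rebuild a Guzzle jar from FlareSolverr's clearance cookies
$pairs = [];
foreach ($data['solution']['cookies'] ?? [] as $c) {
    $pairs[$c['name']] = $c['value'];
}
$jar = CookieJar::fromArray($pairs, 'target-with-cloudflare.com');

$client = new Client([
    'cookies' => $jar,
    // The clearance is tied to the user agent that solved the challenge
    'headers' => ['User-Agent' => $data['solution']['userAgent'] ?? ''],
]);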
If you scraped with PHP before 2023, you probably used Goutte. As of Symfony 6, Goutte is archived and its functionality has been merged into symfony/browser-kit + symfony/dom-crawler. The new pattern:
// composer require symfony/browser-kit symfony/http-client
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

$browser = new HttpBrowser(HttpClient::create([
    'proxy' => 'http://USER:[email protected]:8000',
]));

$crawler = $browser->request('GET', 'https://example.com/products');
$titles = $crawler->filter('h2.title')->each(fn($n) => $n->text());
HttpBrowser handles cookies, form submission, and link clicking like a tiny browser without the Playwright weight. For simple scrapes it's cleaner than raw Guzzle + Crawler.
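Form submission, for example, is a couple of calls. A sketch, assuming a login form whose submit button is labeled 'Log in' and whose fields are named user and pass:

$crawler = $browser->request('GET', 'https://example.com/login');
$form = $crawler->selectButton('Log in')->form([
    'user' => 'alice',
    'pass' => 'secret',
]);
$crawler = $browser->submit($form);

// Cookies persist on the browser, so later requests stay logged in
$dashboard = $browser->request('GET', 'https://example.com/dashboard');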
// JSON
file_put_contents('products.json', json_encode($all, JSON_PRETTY_PRINT));

// CSV with league/csv
use League\Csv\Writer;

$writer = Writer::createFromPath('products.csv', 'w+');
$writer->insertOne(['title', 'price', 'url']);
foreach ($all as $row) {
    $writer->insertOne([$row['title'], $row['price'], $row['url']]);
}
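Production scrapers also want retries, which the notes below recommend. A minimal sketch of Guzzle's retry middleware, assuming Guzzle 7; the attempt cap and backoff curve are illustrative:

use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

$stack = HandlerStack::create();
$stack->push(Middleware::retry(
    // Decider: retry up to 3 times on connection errors or transient statuses
    function (int $retries, RequestInterface $req, ?ResponseInterface $res = null, ?\Throwable $e = null) {
        if ($retries >= 3) {
            return false;
        }
        if ($e !== null) {
            return true; // network-level failure, worth another try
        }
        return $res !== null && in_array($res->getStatusCode(), [429, 502, 503, 504], true);
    },
    // Delay: exponential backoff in milliseconds (1s, 2s, 4s)
    fn(int $retries) => 1000 * (2 ** ($retries - 1))
));

$client = new Client(['handler' => $stack, 'timeout' => 20.0]);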
Loose ends and common failure modes:
- Retries: wrap your handler stack with Middleware::retry(...) and exponential backoff, as sketched above. Retry on 429, 502, 503, 504.
- Cache responses during development with kevinrob/guzzle-cache-middleware.
- Be polite: check the rules with spatie/robots-txt before the first request.
- Huge XML sitemaps or feeds don't belong in DomCrawler; use simplexml streaming instead.
- Proxy connection failures: double-check the proxy URL and that your firewall allows outbound to the proxy port.
- Always set timeout. Default 0 = no timeout, which is a bug-magnet.
- Empty $crawler results usually mean the content is JS-rendered; switch to FlareSolverr or Playwright via the PHP node bindings.

Related: Web scraping in C · FlareSolverr guide · curl vs wget · Rotating proxies in Python.