Quick verdict: PyQuery brings jQuery-style chaining to Python HTML parsing. Same use case as BeautifulSoup, different ergonomics. Pick PyQuery if you're porting a jQuery scraper or prefer chained selector syntax; pick BeautifulSoup for new Python-idiomatic code; pick raw lxml for maximum throughput. For typical scraping workloads, where network latency dominates, the parsing-speed differences between the three rarely matter.
This guide covers PyQuery's API, when to pick it over alternatives, performance benchmarks against BeautifulSoup and selectolax, and 8 working scraping examples.
```bash
pip install pyquery

# Linux: also need lxml's system dependencies
sudo apt install libxml2-dev libxslt-dev
```
```python
from pyquery import PyQuery as pq
import requests

r = requests.get("https://example.com")
doc = pq(r.text)

# All h2 elements
print(doc("h2").text())

# All links
for a in doc("a"):
    href = pq(a).attr("href")
    text = pq(a).text()
    print(text, "->", href)

# Chained selectors (jQuery-style)
doc(".article").find(".title").each(lambda i, el: print(pq(el).text()))
```
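The `href` values the loop yields are often relative. The stdlib `urllib.parse.urljoin` resolves them against the page URL before you follow or store them (the sample hrefs below are illustrative):

```python
from urllib.parse import urljoin

base = "https://example.com/blog/"
hrefs = ["/about", "post.html", "https://other.site/x"]  # typical mix scraped from a page

# urljoin resolves relative URLs against base and leaves absolute ones untouched
absolute = [urljoin(base, h) for h in hrefs]
# → ['https://example.com/about', 'https://example.com/blog/post.html', 'https://other.site/x']
```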
| Library | Syntax style | Speed (relative) | Best for |
|---|---|---|---|
| PyQuery | jQuery chaining | ~1.0× | Porting jQuery code, readable selector chains |
| BeautifulSoup + lxml | Pythonic methods | ~1.0× | New Python projects, default choice |
| lxml direct | XPath / CSS | ~2-3× | High-throughput scraping |
| selectolax | CSS selectors | ~5-10× | Maximum-throughput batch processing |
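The relative numbers above depend heavily on document size and selector mix, so it's worth measuring on your own pages. A minimal timing harness, stdlib-only; the commented usage assumes pyquery and bs4 are installed:

```python
import timeit

def bench(parse, html, number=50):
    """Total seconds to run parse(html) `number` times."""
    return timeit.timeit(lambda: parse(html), number=number)

# Usage sketch (assumes the libraries are installed):
# from pyquery import PyQuery as pq
# from bs4 import BeautifulSoup
# html = open("page.html").read()
# print("PyQuery:", bench(pq, html))
# print("BeautifulSoup+lxml:", bench(lambda h: BeautifulSoup(h, "lxml"), html))
```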
```python
doc = pq(html)

# Text of every matching title
titles = [pq(t).text() for t in doc("article h2.title")]

# Single attribute from the first match
img_src = doc("img.hero").attr("src")

# All link targets
all_links = [pq(a).attr("href") for a in doc("a")]
```
```python
# Both classes required
items = doc(".product.featured")

# Either class
items = doc(".product, .featured")

# Links to external sites (assuming example.com is our own domain)
external = doc("a[href^='http']").not_("a[href*='example.com']")

# Inputs of type "email"
emails = doc("input[type='email']")
```
```python
# PyQuery sits on lxml, so XPath runs on the underlying tree:
# every h2 inside a div with class 'main'
elems = doc.root.xpath("//div[@class='main']//h2")
```
```python
def process(i, el):
    e = pq(el)
    print(i, e.find(".title").text(), e.find(".price").text())

doc(".product-card").each(process)
```
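In practice you'll usually collect structured records rather than print. A small normalizing helper you could call from `process` — the names and the USD-style price format are our assumptions, not part of PyQuery:

```python
def parse_price(text):
    """Convert a scraped price string like '$1,299.00' to a float.
    Assumes USD-style formatting ('$' prefix, ',' thousands separators)."""
    return float(text.strip().lstrip("$").replace(",", ""))

def to_record(title, price_text):
    """Bundle one product card into a dict for later storage."""
    return {"title": title.strip(), "price": parse_price(price_text)}
```

Inside `process`, append `to_record(e.find(".title").text(), e.find(".price").text())` to a results list instead of printing.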
```python
# Add rel="nofollow" to every link
doc("a").attr("rel", "nofollow")

# Strip script tags, then serialize the modified document
doc("script").remove()
print(doc.outer_html())
```
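Attribute rewriting pairs well with href cleanup. A stdlib sketch — the helper name `strip_tracking` and its default parameter list are ours:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_tracking(href, banned=("utm_source", "utm_medium", "utm_campaign")):
    """Drop common tracking parameters from a URL's query string."""
    parts = urlsplit(href)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in banned]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Apply to every link, jQuery-style:
# doc("a").each(lambda i, el: pq(el).attr("href", strip_tracking(pq(el).attr("href") or "")))
```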
```python
proxies = {"https": "http://USER:[email protected]:8080"}
r = requests.get("https://target.com", proxies=proxies, timeout=20)

doc = pq(r.text)
items = [pq(x).text() for x in doc(".item-title")]
```
For scaled scraping behind anti-bot defenses, use a rotating residential proxy at the request layer. PyQuery doesn't care which proxy is in use; it simply parses whatever HTML it receives.
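A minimal round-robin rotation sketch at the request layer, using `itertools.cycle` — the proxy URLs are placeholders:

```python
import itertools

PROXY_POOL = [
    "http://USER:[email protected]:8080",
    "http://USER:[email protected]:8080",
]
_cycle = itertools.cycle(PROXY_POOL)

def next_proxies():
    """requests-style proxies dict, advancing through the pool on each call."""
    proxy = next(_cycle)
    return {"http": proxy, "https": proxy}

# r = requests.get("https://target.com", proxies=next_proxies(), timeout=20)
```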
Pick BeautifulSoup instead if: you're starting fresh in Python and want the most idiomatic syntax, you need its more lenient HTML parsing for malformed pages (html5lib parser), or you're following a tutorial that uses it.