spyderproxy

PyQuery Tutorial: HTML Parsing in Python

Alex R. | Published Mon May 04 2026

Quick verdict: PyQuery brings jQuery-style chaining to Python HTML parsing. Same use case as BeautifulSoup, different ergonomics. Pick PyQuery if you're porting a jQuery scraper or prefer chained selector syntax; pick BeautifulSoup for new Python-idiomatic code; pick raw lxml for max throughput. Performance is comparable across all three for typical scraping workloads.

This guide covers PyQuery's API, when to pick it over alternatives, how its speed compares with BeautifulSoup, lxml, and selectolax, and 8 working scraping examples.

Install

pip install pyquery

# Debian/Ubuntu: headers needed only if lxml has to build from source
sudo apt install libxml2-dev libxslt1-dev

Basic Usage

from pyquery import PyQuery as pq
import requests

r = requests.get("https://example.com")
doc = pq(r.text)

# All h2 elements
print(doc("h2").text())

# All links
for a in doc("a"):
    href = pq(a).attr("href")
    text = pq(a).text()
    print(text, "—>", href)

# Chained selectors (jQuery-style)
doc(".article").find(".title").each(lambda i, el: print(pq(el).text()))

PyQuery vs BeautifulSoup vs lxml

Library               Syntax style      Speed (relative)   Best for
PyQuery               jQuery chaining   ~1.0×              Porting jQuery code, readable selector chains
BeautifulSoup + lxml  Pythonic methods  ~1.0×              New Python projects, default choice
lxml direct           XPath / CSS       ~2-3×              High-throughput scraping
selectolax            CSS selectors     ~5-10×             Maximum-throughput batch processing

8 Working Examples

1. Extract all article titles

doc = pq(html)
titles = [pq(t).text() for t in doc("article h2.title")]

2. Get attribute value

img_src = doc("img.hero").attr("src")
all_links = [pq(a).attr("href") for a in doc("a")]

3. Multi-class selector

# Both classes required
items = doc(".product.featured")

# Either class
items = doc(".product, .featured")

4. Filter by attribute

# Links to external sites
external = doc("a[href^='http']").not_("a[href*='example.com']")

# Inputs of type "email"
emails = doc("input[type='email']")

5. XPath

# Every h2 inside a div with class 'main'. PyQuery has no .xpath()
# method; query the underlying lxml tree via doc.root instead.
elems = doc.root.xpath("//div[@class='main']//h2")
titles = [pq(e).text() for e in elems]

6. Iteration with .each()

def process(i, el):
    e = pq(el)
    print(i, e.find(".title").text(), e.find(".price").text())

doc(".product-card").each(process)

7. Modify HTML

doc("a").attr("rel", "nofollow")
doc("script").remove()
print(doc.outer_html())

8. Through a residential proxy

proxies = {"https": "http://USER:[email protected]:8080"}
r = requests.get("https://target.com", proxies=proxies, timeout=20)
doc = pq(r.text)
items = [pq(x).text() for x in doc(".item-title")]

For scaled scraping behind anti-bot defenses, use a rotating residential proxy at the request layer. PyQuery doesn't care which proxy is in use — it just parses what it receives.

When to Pick PyQuery

  • You're porting a jQuery-based scraper from Node or browser-side and want the same selector syntax.
  • Your team is more familiar with jQuery than with Python's iteration patterns.
  • You need both reading AND writing/modifying the DOM (HTML transformation pipelines).
  • You like chained method calls more than nested function calls.

Pick BeautifulSoup instead if: you're starting fresh in Python and want the most idiomatic syntax, you need its more lenient HTML parsing for malformed pages (html5lib parser), or you're following a tutorial that uses it.