Quick verdict: XPath wins over CSS selectors when you need text matching (contains(text(),...)), traversal up the DOM (parent::, ancestor::), or sibling logic (following-sibling::). CSS wins for class/id selection (.foo, #bar) and is faster in browsers. For scraping, learn both — you will use XPath ~30% of the time when CSS cannot do the job.
| Expression | Meaning |
|---|---|
| `/` | Root |
| `//` | Anywhere in the document |
| `.` | Current node |
| `..` | Parent |
| `*` | Any element |
| `@` | Attribute |
| `text()` | Text content of a node |
| `node()` | Any node (element + text + comment) |
# Every <a> tag anywhere
//a
# <a> tags inside <div class="results">
//div[@class="results"]//a
# Element with specific id
//*[@id="main"]
# Element with class containing "btn-primary"
//*[contains(@class, "btn-primary")]
# <a> with href attribute
//a[@href]
# <a> whose href starts with "/blog/"
//a[starts-with(@href, "/blog/")]
# <h1> with exact text
//h1[text()="Welcome"]
# <h1> containing "Welcome" (substring)
//h1[contains(text(), "Welcome")]
# Third <li> in any <ul>
//ul/li[3]
# Last <li>
//ul/li[last()]
# All but the first <li>
//ul/li[position() > 1]
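As a quick sanity check, the path expressions above can be run with Python's lxml against a made-up fragment (the markup and hrefs here are invented for illustration):

```python
from lxml import html

# Hypothetical fragment to exercise the path expressions above
doc = html.fromstring("""
<div class="results">
  <ul>
    <li><a href="/blog/a">A</a></li>
    <li><a href="/blog/b">B</a></li>
    <li><a href="/about">About</a></li>
  </ul>
</div>
""")

all_links = doc.xpath("//a")                                 # every <a> anywhere
blog = doc.xpath('//a[starts-with(@href, "/blog/")]/@href')  # href prefix test
third = doc.xpath("//ul/li[3]/a/text()")                     # third <li>
rest = doc.xpath("//ul/li[position() > 1]/a/text()")         # all but the first
print(len(all_links), blog, third, rest)
# 3 ['/blog/a', '/blog/b'] ['About'] ['B', 'About']
```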
Predicates (Filters in [])
Predicates filter nodes:
# Multiple conditions (AND)
//a[@href and @target="_blank"]
# OR
//a[@target="_blank" or @rel="noopener"]
# Negation
//a[not(@target="_blank")]
# Comparison
//tr[position() > 1]
//product[@price >= 100 and @price < 200]
# Text predicate
//button[text()="Submit"]
//div[contains(., "$99")] # . = string value of node (text + descendants)
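The position and string-value predicates can be verified the same way; this invented table fragment mirrors the examples above:

```python
from lxml import html

# Made-up table to exercise the predicates above
doc = html.fromstring("""
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>$99</td></tr>
  <tr><td>Gadget</td><td>$150</td></tr>
</table>
""")

data_rows = doc.xpath("//tr[position() > 1]")  # skip the header row
# "." is the node's string value: all descendant text joined together
hit = doc.xpath('//tr[contains(., "$99")]/td[1]/text()')
print(len(data_rows), hit)  # 2 ['Widget']
```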
Axes (Tree Traversal)
The killer XPath feature CSS does not have. Format: axis::node-test[predicate].
| Axis | Selects |
|---|---|
| `parent::` | Parent node |
| `ancestor::` | All ancestors |
| `ancestor-or-self::` | Self + all ancestors |
| `child::` | Direct children (default axis) |
| `descendant::` | All descendants |
| `descendant-or-self::` | Self + all descendants (this is what `//` means) |
| `following::` | Everything after current in document order |
| `following-sibling::` | Siblings after current |
| `preceding::` | Everything before current in document order |
| `preceding-sibling::` | Siblings before current |
| `self::` | The current node |
| `attribute::` | Attributes (shorthand: `@`) |
Examples:
# From a <span class="price">, get the parent <div>
//span[@class="price"]/parent::div
# From a <label>, get the next <input> (sibling)
//label[text()="Email"]/following-sibling::input[1]
# All <p> tags after the <h2 id="news">
//h2[@id="news"]/following-sibling::p
# The <table> ancestor of a <td>
//td[contains(text(), "Total")]/ancestor::table[1]
# Self with predicate (rare but legal)
//div[@class="card"]/self::*[contains(., "Sale")]

String & Number Functions
| Function | Use |
|---|---|
| `contains(s1, s2)` | True if s1 contains s2 |
| `starts-with(s1, s2)` | True if s1 starts with s2 |
| `ends-with(s1, s2)` | XPath 2.0+ only (Python lxml: no) |
| `normalize-space(s)` | Trim + collapse whitespace |
| `string-length(s)` | Length |
| `substring(s, start, len)` | 1-indexed substring |
| `substring-before(s, sep)` | Text before separator |
| `substring-after(s, sep)` | Text after separator |
| `translate(s, from, to)` | Character-by-character map (poor man's lowercase) |
| `lower-case(s)` | XPath 2.0+ only |
| `count(nodeset)` | Number of matched nodes |
| `position()` | Index of current node in match set |
| `last()` | Index of last node in match set |
Case-insensitive matching (XPath 1.0 lacks lower-case):
//a[contains(translate(text(), "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "submit")]
This is hideous, but it is the XPath 1.0 way. lxml in Python is built on libxml2, which implements XPath 1.0 only, so 2.0+ functions such as lower-case() and ends-with() are unavailable there.
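The translate() trick works as advertised in lxml; here is a minimal sketch against an invented pair of links:

```python
from lxml import html

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical links with inconsistent casing
doc = html.fromstring('<p><a href="/go">SUBMIT</a> <a href="/x">Cancel</a></p>')

# translate() lowercases the link text before the substring test
expr = f'//a[contains(translate(text(), "{UPPER}", "{LOWER}"), "submit")]/@href'
matches = doc.xpath(expr)
print(matches)  # ['/go']
```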
Real Scraping Examples
Extract all article titles where the publication date is 2026:
//article[contains(.//time/@datetime, "2026")]//h2/text()
Extract data-href from cards that contain a "Sold" badge:
//div[@class="card" and .//span[@class="badge sold"]]/@data-href
Get the price next to a label "Total:":
//*[text()="Total:"]/following-sibling::*[1]/text()
Get the value of the input that follows a label:
//label[text()="Email"]/following::input[1]/@value
(following:: is broader than following-sibling:: — it catches inputs in different parent nodes too.)
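The difference is easy to see in lxml with a made-up form where the label and input live in separate wrapper divs:

```python
from lxml import html

# Hypothetical form: label and input sit in different parents
doc = html.fromstring("""
<form>
  <div><label>Email</label></div>
  <div><input name="email" value="a@b.c"></div>
</form>
""")

# following-sibling:: stays inside the label's parent, so it finds nothing
sib = doc.xpath('//label[text()="Email"]/following-sibling::input')
# following:: walks everything after the label in document order,
# crossing parent boundaries
val = doc.xpath('//label[text()="Email"]/following::input[1]/@value')
print(sib, val)  # [] ['a@b.c']
```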
Python lxml Quickstart
from lxml import html
import requests

r = requests.get("https://example.com")
tree = html.fromstring(r.content)

# Single result
title = tree.xpath("//h1/text()")[0]
# Multiple results
links = tree.xpath("//a/@href")
# Element nodes (not just text)
cards = tree.xpath('//div[contains(@class, "card")]')
for c in cards:
    title = c.xpath(".//h2/text()")[0]  # NOTE the leading dot
    print(title)
Critical: when XPathing within an element, prefix with . to scope to the element. Without the dot, //h2 means "search the whole document," not "search inside this element."
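The trap is easy to demonstrate with two invented cards; note how the absolute path ignores the element it is called on:

```python
from lxml import html

# Two hypothetical cards to show the scoping difference
doc = html.fromstring("""
<div><div class="card"><h2>First</h2></div>
<div class="card"><h2>Second</h2></div></div>
""")

cards = doc.xpath('//div[@class="card"]')
unscoped = cards[1].xpath("//h2/text()")  # absolute: searches the whole document
scoped = cards[1].xpath(".//h2/text()")   # relative: inside this card only
print(unscoped, scoped)  # ['First', 'Second'] ['Second']
```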
XPath vs CSS Selectors
| Need | XPath | CSS |
|---|---|---|
| By id | `//*[@id="x"]` | `#x` |
| By class | `//*[contains(@class, "x")]` | `.x` |
| By attribute | `//a[@href]` | `a[href]` |
| Descendant | `//div//a` | `div a` |
| Direct child | `//div/a` | `div > a` |
| nth-child | `//ul/li[3]` | `ul li:nth-child(3)` |
| Adjacent sibling | `//h2/following-sibling::p[1]` | `h2 + p` |
| Text match | `//a[contains(text(), "Sign up")]` | (impossible) |
| Parent | `//span[@class="x"]/parent::div` | (impossible) |
| Ancestor | `//x/ancestor::form[1]` | (impossible) |
For scraping, expect to reach for XPath roughly 30-40% of the time, whenever text matching or upward traversal is needed.
Testing XPath in Browser DevTools
Open DevTools console:
$x("//h1/text()")
$x() is built into Chrome and Firefox DevTools. Returns matching nodes. Use it to iterate on selectors before pasting into your scraper.
Related: PyQuery (jQuery-like CSS scraping), Cheerio vs Puppeteer, Scrape text from any website.