Quick verdict: pd.read_html(url) returns a list of DataFrames — one per <table> on the page. It is the fastest path from "URL with a table" to "DataFrame I can analyze." Limitations: only finds <table> tags (not divs styled like tables), no JavaScript rendering, no auth handling, and the page must be reachable from your IP. For protected sites, fetch the HTML via requests with a proxy and pass the HTML string to read_html.
```bash
pip install pandas lxml
```

```python
import pandas as pd

tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue")
print(f"Found {len(tables)} tables")

df = tables[0]
print(df.head())
print(df.columns.tolist())
```

That fetches the Wikipedia article, parses every `<table>` tag, and returns a list of DataFrames. The first one holds the rankings; subsequent ones are infoboxes and sidebars.
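When a page has dozens of tables, a quick scan helps you spot the index you want. A minimal sketch, reusing the `tables` list from above:

```python
# Print each table's position, shape, and first few column names
for i, t in enumerate(tables):
    print(i, t.shape, list(t.columns)[:4])
```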
By default pandas tries lxml first and falls back to bs4 + html5lib if lxml fails or is missing. Install lxml for speed:
```bash
pip install lxml html5lib beautifulsoup4
```

Force a specific parser:
```python
tables = pd.read_html(url, flavor="bs4")
```

lxml is the fastest. html5lib is the slowest but most permissive (it handles broken HTML). For Wikipedia and most well-formed sites, lxml works. For ad-hoc scraped HTML, html5lib may be needed.
Pages usually have many tables. Filter to find yours:
```python
# By a string in the table
tables = pd.read_html(url, match="Revenue")

# By the index of the table on the page
tables = pd.read_html(url)
df = tables[2]  # third table
```

`match` is applied as a regex against each table's text content, so `match="Revenue"` returns only tables containing that word. Useful when the page changes layout and your hardcoded index breaks.
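Because `match` is a regex, you can also pass a compiled pattern. A sketch matching tables that mention either of two hypothetical keywords:

```python
import re

tables = pd.read_html(url, match=re.compile(r"Revenue|Sales"))
```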
Many tables have multi-row headers (e.g., grouped columns: "2025 | 2026" with sub-headers "Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4"). Tell pandas:
```python
df = pd.read_html(url, header=[0, 1])[0]
# df.columns is now a MultiIndex
```

`header=[0, 1]` means rows 0 and 1 are both headers. The result has a MultiIndex on columns:

```python
df["2026", "Q1"]  # access via tuple
```
European-formatted numbers (1.234,56 instead of the US-style 1,234.56) confuse pandas into reading them as strings:

```python
df = pd.read_html(url, thousands=".", decimal=",")[0]
```
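To see the effect in isolation, here is a self-contained sketch with a tiny hand-written table:

```python
from io import StringIO
import pandas as pd

html = "<table><tr><th>Revenue</th></tr><tr><td>1.234,56</td></tr></table>"
df = pd.read_html(StringIO(html), thousands=".", decimal=",")[0]
print(df["Revenue"].dtype)  # float64, parsed as 1234.56
```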
df["Revenue"] = df["Revenue"].str.replace(r"[$,]", "", regex=True).astype(float)Pandas reads empty cells as NaN by default. Customize what counts as missing:
Pandas reads empty cells as NaN by default. Customize what counts as missing:

```python
df = pd.read_html(url, na_values=["N/A", "—", "n/a", ""])
```

pd.read_html uses urllib internally and does not accept a proxy parameter. For proxied scraping, fetch the HTML separately with requests and pass the string:
```python
from io import StringIO

import pandas as pd
import requests

proxies = {
    "http": "http://USER:[email protected]:8000",
    "https": "http://USER:[email protected]:8000",
}
headers = {"User-Agent": "Mozilla/5.0 (compatible; Researcher/1.0)"}

r = requests.get(
    "https://finance.yahoo.com/quote/AAPL/financials",
    proxies=proxies,
    headers=headers,
    timeout=15,
)
r.raise_for_status()

# pandas 2.1+ deprecates passing literal HTML; wrap it in StringIO
tables = pd.read_html(StringIO(r.text))
income_statement = tables[0]
```

This pattern works for any protected or geo-locked source. For finance data specifically, see Google Finance API alternatives.
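If you fetch many pages this way, it is worth wrapping the fetch-then-parse steps in a small helper. A sketch under the assumptions above (`proxies` and `headers` already defined; the function name is mine):

```python
def tables_from(url: str) -> list[pd.DataFrame]:
    """Fetch url through the proxy and return every table on it."""
    r = requests.get(url, proxies=proxies, headers=headers, timeout=15)
    r.raise_for_status()
    return pd.read_html(StringIO(r.text))
```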
pd.read_html only sees the initial HTML response. If the table loads via JavaScript (React/Vue dashboards, infinite-scroll tables), read_html finds nothing. Use Playwright to render first:
```python
from io import StringIO

import pandas as pd
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://dashboard.example.com/data")
    page.wait_for_selector("table")  # block until the table has rendered
    html = page.content()
    browser.close()

tables = pd.read_html(StringIO(html))
```

For sites behind Cloudflare or DataDome, route Playwright through residential proxies. See Cloudscraper or DataDome bypass for the bypass setup.
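Playwright accepts the proxy at browser launch. A minimal sketch; the server address and credentials are placeholders:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(proxy={
        "server": "http://proxy.example.com:8000",
        "username": "USER",
        "password": "PASS",
    })
    # ...then the same page.goto / page.content flow as above
```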
Whatever the source, scraped tables usually need post-processing:

```python
# Strip unicode whitespace from headers and string cells
df.columns = df.columns.str.strip()
df = df.apply(lambda c: c.str.strip() if c.dtype == "object" else c)

# Drop fully-empty rows
df = df.dropna(how="all")

# Pin types
df["Revenue"] = pd.to_numeric(df["Revenue"], errors="coerce")

# Pivot or melt as needed
df_long = df.melt(id_vars="Company", var_name="Year", value_name="Revenue")
```

Common errors:

- `ValueError: No tables found` — the page has no `<table>` tags (probably JS-rendered). Use Playwright.
- `HTTPError: 403 Forbidden` — the site blocks the default urllib User-Agent. Fetch via requests with a real UA.
- `ImportError: lxml not found` — `pip install lxml`.
- Wrong table after a layout change — use `match="some text in the right table"` instead of an integer index.
- Garbled characters — pass `encoding="utf-8"`, or use requests with `r.encoding = "utf-8"; pd.read_html(r.text)`.

| Tool | When to use |
|---|---|
| pandas.read_html | Static HTML, simple tables, fastest path |
| PyQuery | Complex DOM queries, jQuery-like selectors |
| BeautifulSoup | Maximum control, custom parsing logic |
| Playwright + read_html | JS-rendered tables behind a real browser |
| scrapy | Crawling many pages with built-in pipelines |
Related: PyQuery tutorial, Python requests cookies, scrape text from any website.