
Pandas read_html: Scrape Tables Without BeautifulSoup

Alex R. | Published Sun May 10 2026

Quick verdict: pd.read_html(url) returns a list of DataFrames — one per <table> on the page. It is the fastest path from "URL with a table" to "DataFrame I can analyze." Limitations: only finds <table> tags (not divs styled like tables), no JavaScript rendering, no auth handling, and the page must be reachable from your IP. For protected sites, fetch the HTML via requests with a proxy and pass the HTML string to read_html.

First Example

pip install pandas lxml

import pandas as pd

tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue")
print(f"Found {len(tables)} tables")
df = tables[0]
print(df.head())
print(df.columns.tolist())

That fetches the Wikipedia article, parses every <table> tag, and returns a list of DataFrames. The first one has the rankings; subsequent ones are infoboxes/sidebars.
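When a page returns many tables, a quick loop over their shapes and leading columns makes it easy to spot the one you want (a small sketch; the exact output depends on the page):

# Preview each table's index, shape, and first few columns
for i, t in enumerate(tables):
    print(i, t.shape, t.columns.tolist()[:3])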

Parser Choice (lxml, html5lib, bs4)

Pandas tries lxml first and falls back to bs4 + html5lib if that fails. Install lxml for speed:

pip install lxml html5lib beautifulsoup4

Force a specific parser:

tables = pd.read_html(url, flavor="bs4")

lxml is fastest. html5lib is slowest but most permissive (handles broken HTML). For Wikipedia and most well-formed sites, lxml works. For ad-hoc scraped HTML, html5lib may be needed.
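To see the permissiveness difference in isolation, here is a minimal sketch parsing deliberately malformed HTML (the broken snippet is invented for the example):

from io import StringIO
import pandas as pd

# Malformed HTML: unclosed <td> and <tr> tags
broken = "<table><tr><td>a<td>b<tr><td>1<td>2</table>"

# html5lib repairs the structure the way a browser would
df = pd.read_html(StringIO(broken), flavor="html5lib")[0]
print(df)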

Picking the Right Table

Pages usually have many tables. Filter to find yours:

# By a string in the table
tables = pd.read_html(url, match="Revenue")

# By the index of the table on the page
tables = pd.read_html(url)
df = tables[2]  # third table

match uses a regex against the table's text content. match="Revenue" returns only tables containing that word. Useful when the page changes layout and your hardcoded index breaks.
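match also accepts a compiled regular expression, which is handy when the anchor text varies in case or wording (the pattern below is illustrative):

import re

# Match tables whose text contains "revenue" in any casing
tables = pd.read_html(url, match=re.compile(r"revenue", re.IGNORECASE))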

Many tables have multi-row headers (e.g., grouped columns: "2025 | 2026" with sub-headers "Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4"). Tell pandas:

df = pd.read_html(url, header=[0, 1])[0]
# df.columns is now a MultiIndex

header=[0, 1] means rows 0 and 1 are both headers. The result has a MultiIndex on columns:

df["2026", "Q1"]  # access via tuple

Thousands Separator + Decimals

European-formatted numbers (1.234,56 = 1,234.56 US-style) confuse pandas into reading them as strings:

df = pd.read_html(url, thousands=".", decimal=",")[0]

For currency and other units: pandas does not strip symbols like $ or embedded commas. Clean up after loading:

df["Revenue"] = df["Revenue"].str.replace(r"[$,]", "", regex=True).astype(float)

Missing Values

Pandas reads empty cells as NaN by default. Customize what counts as missing:

df = pd.read_html(url, na_values=["N/A", "—", "n/a", ""])[0]

Proxies: Pass HTML, Not URL

pd.read_html uses urllib internally and does not accept a proxy parameter. For proxied scraping, fetch the HTML separately with requests and pass the string:

import requests, pandas as pd
from io import StringIO

proxies = {
    "http":  "http://USER:[email protected]:8000",
    "https": "http://USER:[email protected]:8000",
}
headers = {"User-Agent": "Mozilla/5.0 (compatible; Researcher/1.0)"}

r = requests.get(
    "https://finance.yahoo.com/quote/AAPL/financials",
    proxies=proxies,
    headers=headers,
    timeout=15,
)
r.raise_for_status()

# pandas 2.1+ deprecates passing literal HTML strings; wrap in StringIO
tables = pd.read_html(StringIO(r.text))
income_statement = tables[0]

This pattern works for any protected/geo-locked source. For finance data specifically, see Google Finance API alternatives.

When Tables Are JavaScript-Rendered

pd.read_html only sees the initial HTML response. If the table loads via JavaScript (React/Vue dashboards, infinite-scroll tables), read_html finds nothing. Use Playwright to render first:

from playwright.sync_api import sync_playwright
import pandas as pd
from io import StringIO

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://dashboard.example.com/data")
    page.wait_for_selector("table")
    html = page.content()
    browser.close()

tables = pd.read_html(StringIO(html))

For sites behind Cloudflare or DataDome, route Playwright through residential proxies. See Cloudscraper or DataDome bypass for the bypass setup.
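Playwright takes proxy settings at browser launch, so the flow above only changes in one place. A minimal sketch with placeholder credentials:

with sync_playwright() as p:
    # Same flow as above, routed through a proxy (placeholder endpoint)
    browser = p.chromium.launch(proxy={
        "server": "http://proxy.example.com:8000",
        "username": "USER",
        "password": "PASS",
    })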

Common Post-Processing

# Strip unicode whitespace
df.columns = df.columns.str.strip()
df = df.apply(lambda c: c.str.strip() if c.dtype == "object" else c)

# Drop fully-empty rows
df = df.dropna(how="all")

# Pin types
df["Revenue"] = pd.to_numeric(df["Revenue"], errors="coerce")

# Pivot or melt as needed
df_long = df.melt(id_vars="Company", var_name="Year", value_name="Revenue")

Common Errors

  • ValueError: No tables found — the page has no <table> tags (probably JS-rendered). Use Playwright.
  • HTTPError: 403 Forbidden — the site blocks default urllib User-Agent. Fetch via requests with a real UA.
  • ImportError: lxml not found — pip install lxml.
  • Wrong table returned — use match="some text in the right table" instead of integer index.
  • Garbled characters — encoding issue. Pass encoding="utf-8" or use requests + r.encoding = "utf-8"; pd.read_html(StringIO(r.text)).

Alternatives to read_html

Tool | When to use
pandas.read_html | Static HTML, simple tables, fastest path
PyQuery | Complex DOM queries, jQuery-like selectors
BeautifulSoup | Maximum control, custom parsing logic
Playwright + read_html | JS-rendered tables behind a real browser
scrapy | Crawling many pages with built-in pipelines

Related: PyQuery tutorial, Python requests cookies, scrape text from any website.