
Pandas read_html: Scrape Tables Without BeautifulSoup

Alex R. | Published Sun May 10 2026

Quick verdict: pd.read_html(url) returns a list of DataFrames — one per <table> on the page. It is the fastest path from "URL with a table" to "DataFrame I can analyze." Limitations: only finds <table> tags (not divs styled like tables), no JavaScript rendering, no auth handling, and the page must be reachable from your IP. For protected sites, fetch the HTML via requests with a proxy and pass the HTML string to read_html.

First Example

pip install pandas lxml

import pandas as pd

tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue")
print(f"Found {len(tables)} tables")
df = tables[0]
print(df.head())
print(df.columns.tolist())

That fetches the Wikipedia article, parses every <table> tag, and returns a list of DataFrames. The first one has the rankings; subsequent ones are infoboxes/sidebars.
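When a page returns many tables, a quick loop over their shapes and leading columns makes it easy to spot the one you want (a small sketch; the exact output depends on the page):

# Preview each table's index, shape, and first few columns
for i, t in enumerate(tables):
    print(i, t.shape, t.columns.tolist()[:3])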

Parser Choice (lxml, html5lib, bs4)

Pandas tries lxml first and falls back to bs4 + html5lib if that fails. Install lxml for speed:

pip install lxml html5lib beautifulsoup4

Force a specific parser:

tables = pd.read_html(url, flavor="bs4")

lxml is fastest. html5lib is slowest but most permissive (handles broken HTML). For Wikipedia and most well-formed sites, lxml works. For ad-hoc scraped HTML, html5lib may be needed.
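To see the permissiveness difference in isolation, here is a minimal sketch parsing deliberately malformed HTML (the broken snippet is invented for the example):

from io import StringIO
import pandas as pd

# Malformed HTML: unclosed <td> and <tr> tags
broken = "<table><tr><td>a<td>b<tr><td>1<td>2</table>"

# html5lib repairs the structure the way a browser would
df = pd.read_html(StringIO(broken), flavor="html5lib")[0]
print(df)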

Picking the Right Table

Pages usually have many tables. Filter to find yours:

# By a string in the table
tables = pd.read_html(url, match="Revenue")

# By the index of the table on the page
tables = pd.read_html(url)
df = tables[2]  # third table

match uses a regex against the table's text content. match="Revenue" returns only tables containing that word. Useful when the page changes layout and your hardcoded index breaks.
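match also accepts a compiled regular expression, which is handy when the anchor text varies in case or wording (the pattern below is illustrative):

import re

# Match tables whose text contains "revenue" in any casing
tables = pd.read_html(url, match=re.compile(r"revenue", re.IGNORECASE))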

Many tables have multi-row headers (e.g., grouped columns: "2025 | 2026" with sub-headers "Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4"). Tell pandas:

df = pd.read_html(url, header=[0, 1])[0]
# df.columns is now a MultiIndex

header=[0, 1] means rows 0 and 1 are both headers. The result has a MultiIndex on columns:

df["2026", "Q1"]  # access via tuple

Thousands Separator + Decimals

European-formatted numbers (1.234,56 = 1,234.56 US-style) confuse pandas into reading them as strings:

df = pd.read_html(url, thousands=".", decimal=",")[0]

For currency and other units: pandas does not strip symbols like $ or embedded commas. Clean up after loading:

df["Revenue"] = df["Revenue"].str.replace(r"[$,]", "", regex=True).astype(float)

Missing Values

Pandas reads empty cells as NaN by default. Customize what counts as missing:

df = pd.read_html(url, na_values=["N/A", "—", "n/a", ""])[0]

Proxies: Pass HTML, Not URL

pd.read_html uses urllib internally and does not accept a proxy parameter. For proxied scraping, fetch the HTML separately with requests and pass the string:

import requests, pandas as pd
from io import StringIO

proxies = {
    "http":  "http://USER:[email protected]:8000",
    "https": "http://USER:[email protected]:8000",
}
headers = {"User-Agent": "Mozilla/5.0 (compatible; Researcher/1.0)"}

r = requests.get(
    "https://finance.yahoo.com/quote/AAPL/financials",
    proxies=proxies,
    headers=headers,
    timeout=15,
)
r.raise_for_status()

# pandas 2.1+ deprecates passing literal HTML strings; wrap in StringIO
tables = pd.read_html(StringIO(r.text))
income_statement = tables[0]

This pattern works for any protected/geo-locked source. For finance data specifically, see Google Finance API alternatives.

When Tables Are JavaScript-Rendered

pd.read_html only sees the initial HTML response. If the table loads via JavaScript (React/Vue dashboards, infinite-scroll tables), read_html finds nothing. Use Playwright to render first:

from playwright.sync_api import sync_playwright
import pandas as pd
from io import StringIO

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://dashboard.example.com/data")
    page.wait_for_selector("table")
    html = page.content()
    browser.close()

tables = pd.read_html(StringIO(html))

For sites behind Cloudflare or DataDome, route Playwright through residential proxies. See Cloudscraper or DataDome bypass for the bypass setup.
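Playwright takes proxy settings at browser launch, so the flow above only changes in one place. A minimal sketch with placeholder credentials:

with sync_playwright() as p:
    # Same flow as above, routed through a proxy (placeholder endpoint)
    browser = p.chromium.launch(proxy={
        "server": "http://proxy.example.com:8000",
        "username": "USER",
        "password": "PASS",
    })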

Common Post-Processing

# Strip unicode whitespace
df.columns = df.columns.str.strip()
df = df.apply(lambda c: c.str.strip() if c.dtype == "object" else c)

# Drop fully-empty rows
df = df.dropna(how="all")

# Pin types
df["Revenue"] = pd.to_numeric(df["Revenue"], errors="coerce")

# Pivot or melt as needed
df_long = df.melt(id_vars="Company", var_name="Year", value_name="Revenue")

Common Errors

  • ValueError: No tables found — the page has no <table> tags (probably JS-rendered). Use Playwright.
  • HTTPError: 403 Forbidden — the site blocks default urllib User-Agent. Fetch via requests with a real UA.
  • ImportError: lxml not found — pip install lxml.
  • Wrong table returned — use match="some text in the right table" instead of integer index.
  • Garbled characters — encoding issue. Pass encoding="utf-8" or use requests + r.encoding = "utf-8"; pd.read_html(StringIO(r.text)).

Alternatives to read_html

Tool | When to use
pandas.read_html | Static HTML, simple tables, fastest path
PyQuery | Complex DOM queries, jQuery-like selectors
BeautifulSoup | Maximum control, custom parsing logic
Playwright + read_html | JS-rendered tables behind a real browser
scrapy | Crawling many pages with built-in pipelines

Related: PyQuery tutorial, Python requests cookies, scrape text from any website.