Quick verdict: Use requests + BeautifulSoup for static pages, Playwright for JavaScript-rendered ones. The text extraction method is .get_text(separator=' ', strip=True) on a properly-scoped element selector — never on the whole document. Add residential proxies when your IP starts getting rate-limited.
This guide covers the install, the static-page case (95% of real-world tutorials), the JavaScript case (the other 5% that trips up beginners), how to clean whitespace and skip navigation, and 6 working examples for common patterns.
```bash
pip install requests beautifulsoup4 lxml
```
```python
import requests
from bs4 import BeautifulSoup

r = requests.get("https://en.wikipedia.org/wiki/Web_scraping")
soup = BeautifulSoup(r.text, "lxml")
print(soup.select_one("h1").get_text(strip=True))
```
That's the entire pipeline. Three more knobs you'll use:
- get_text(separator=' ') — joins inline text with a separator instead of running it together.
- get_text(strip=True) — strips leading/trailing whitespace from each text node.
- .select_one(css) vs .select(css) — first match vs all matches, using CSS selectors.

If requests.get() returns HTML but you can't find the content you see in your browser, the page is JavaScript-rendered. requests doesn't run JS. Use Playwright:
```bash
pip install playwright
playwright install chromium
```
```python
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa-app")
    page.wait_for_selector("article")  # wait for JS to render
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "lxml")
print(soup.select_one("article").get_text(separator=' ', strip=True))
```
Playwright is 10-100× slower than requests. Only use it when the page actually requires JS rendering.
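A quick way to apply that rule: fetch once with requests and check whether the selector you can see in the browser actually exists in the raw HTML. A minimal sketch — the helper name and sample HTML are illustrative, not from a real site:

```python
from bs4 import BeautifulSoup

def needs_playwright(html: str, selector: str) -> bool:
    # If a selector visible in the browser is absent from the raw HTML,
    # the content is injected by JavaScript after page load.
    return BeautifulSoup(html, "html.parser").select_one(selector) is None

static_page = "<html><body><article>server-rendered text</article></body></html>"
spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

print(needs_playwright(static_page, "article"))  # False: requests is enough
print(needs_playwright(spa_shell, "article"))    # True: fall back to Playwright
```

Run this check once per site, not per page — pages on the same site almost always render the same way.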
All paragraph text from an article, one paragraph per line:

```python
paragraphs = [p.get_text(strip=True) for p in soup.select("article p")]
text = "\n".join(paragraphs)
```
Title, author, and date from typical article markup:

```python
title = soup.select_one("h1").get_text(strip=True)
author = soup.select_one(".byline a").get_text(strip=True)
date = soup.select_one("time").get("datetime")  # ISO format from attr
print(f"{title} by {author} ({date})")
```
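One trap in this pattern: select_one() returns None when nothing matches, so chaining .get_text() raises AttributeError on any page missing a byline or date. A small guard, sketched with a hypothetical helper:

```python
from bs4 import BeautifulSoup

def text_or_none(soup, css):
    # select_one returns None on no match; check before calling get_text
    node = soup.select_one(css)
    return node.get_text(strip=True) if node else None

html = "<article><h1>Title</h1></article>"  # no byline, no <time> tag
soup = BeautifulSoup(html, "html.parser")

print(text_or_none(soup, "h1"))         # Title
print(text_or_none(soup, ".byline a"))  # None, instead of a crash
```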
Strip scripts, styles, and boilerplate before extracting visible text:

```python
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()  # remove from tree
visible_text = soup.get_text(separator=' ', strip=True)
```
Scope extraction to the main content region:

```python
main = soup.select_one("main")
# get_text on main keeps p, h1-6, but skips nav/footer outside main
content = main.get_text(separator='\n', strip=True)
```
Paginate with a delay between requests:

```python
import time

all_text = []
for page in range(1, 11):
    r = requests.get(f"https://example.com/articles?page={page}", timeout=20)
    soup = BeautifulSoup(r.text, "lxml")
    for art in soup.select("article"):
        all_text.append(art.get_text(separator=' ', strip=True))
    time.sleep(1)  # respectful rate
```
Route through a proxy with a realistic User-Agent:

```python
proxies = {"https": "http://USER:[email protected]:8080"}
r = requests.get(url, proxies=proxies, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/130.0.0.0"
}, timeout=20)
soup = BeautifulSoup(r.text, "lxml")
text = soup.select_one("article").get_text(separator=' ', strip=True)
```
HTML often produces text like "  Hello\n   world  " — stray spaces and line breaks everywhere. Three clean-up patterns:
```python
import re

# Method 1: BeautifulSoup's built-in
text = soup.get_text(separator=' ', strip=True)

# Method 2: regex, normalize all whitespace runs to a single space
text = re.sub(r'\s+', ' ', raw_text).strip()

# Method 3: keep paragraph breaks but normalize within lines
text = re.sub(r'[ ]+', ' ', re.sub(r'\n{3,}', '\n\n', raw)).strip()
```
For one-off scraping (under 1,000 pages), your home IP is fine. Repeated scraping of the same site will eventually hit rate limits. For scaled scraping, use rotating residential proxies with Python requests — see our rotating proxies with Python requests guide for the implementation pattern.
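The core of that pattern is just round-robin selection. A minimal rotation sketch — the pool entries are placeholders for whatever endpoints your proxy provider gives you:

```python
import itertools

import requests

# Placeholder pool; substitute your provider's proxy endpoints.
PROXY_POOL = [
    "http://USER:[email protected]:8080",
    "http://USER:[email protected]:8080",
    "http://USER:[email protected]:8080",
]
proxy_cycle = itertools.cycle(PROXY_POOL)

def get_rotated(url, timeout=20):
    # Round-robin: each request goes out through the next proxy in the pool
    proxy = next(proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy},
                        timeout=timeout)
```

A real implementation also retries on a different proxy when one returns 429 or times out; the linked guide covers that.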
Public web text is generally scrapable for personal use under the hiQ v. LinkedIn precedent. Three lines that change the analysis: