How to Scrape Text From Any Website (Python)

Alex R. | Published Mon May 04 2026

Quick verdict: Use requests + BeautifulSoup for static pages, Playwright for JavaScript-rendered ones. Extract text with .get_text(separator=' ', strip=True) on a properly scoped element, never on the whole document. Add residential proxies when your IP starts getting rate-limited.

This guide covers the install, the static-page case (95% of real-world cases), the JavaScript case (the other 5% that trips up beginners), how to clean whitespace and skip navigation, and 6 working examples for common patterns.

Install

pip install requests beautifulsoup4 lxml

Static Pages — 5 Lines

import requests
from bs4 import BeautifulSoup

r = requests.get("https://en.wikipedia.org/wiki/Web_scraping")
soup = BeautifulSoup(r.text, "lxml")
print(soup.select_one("h1").get_text(strip=True))

That's the entire pipeline. Three more knobs you'll use:

  • get_text(separator=' ') — joins inline text with a separator instead of running together.
  • get_text(strip=True) — strips leading/trailing whitespace from each text node.
  • .select_one(css) vs .select(css) — first match vs all matches with CSS selectors.
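A quick demo of all three knobs together (the HTML snippet below is made up for illustration):

from bs4 import BeautifulSoup

html = "<ul><li>One</li><li>Two</li></ul><p> spaced <b>out</b> </p>"
demo = BeautifulSoup(html, "lxml")

print(demo.select_one("li").get_text())             # One  (first match only)
print([li.get_text() for li in demo.select("li")])  # ['One', 'Two']  (all matches)
print(demo.select_one("p").get_text(separator=' ', strip=True))  # spaced out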

JavaScript-Rendered Pages

If requests.get() returns HTML but you can't find the content you see in your browser, the page is JavaScript-rendered. requests doesn't run JS. Use Playwright:

pip install playwright
playwright install chromium

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa-app")
    page.wait_for_selector("article")  # wait for JS to render
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "lxml")
print(soup.select_one("article").get_text(separator=' ', strip=True))

Playwright is 10-100× slower than requests. Only use it when the page actually requires JS rendering.
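One way to keep the cheap path as the default: probe the raw HTML first and launch a browser only if the content is missing. A minimal sketch; the URL, the probe selector, and the get_html helper are illustrative, not a fixed API:

import requests
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def get_html(url, probe_selector="article"):
    # Cheap path: plain HTTP fetch.
    html = requests.get(url, timeout=20).text
    if BeautifulSoup(html, "lxml").select_one(probe_selector):
        return html  # content present in raw HTML, no browser needed
    # Expensive path: the content wasn't in the raw HTML, so render with a browser.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector(probe_selector)
        html = page.content()
        browser.close()
    return html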

6 Working Examples

1. All paragraph text from an article

paragraphs = [p.get_text(strip=True) for p in soup.select("article p")]
text = "

".join(paragraphs)

2. Article title + author + date

title = soup.select_one("h1").get_text(strip=True)
author = soup.select_one(".byline a").get_text(strip=True)
date = soup.select_one("time").get("datetime")  # ISO format from attr
print(f"{title} by {author} ({date})")

3. All visible text, excluding scripts/styles

for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()  # remove from tree
visible_text = soup.get_text(separator=' ', strip=True)

4. Text from a specific section, skipping inline elements

main = soup.select_one("main")
# scoping get_text to <main> keeps the article body but skips nav/footer outside it
content = main.get_text(separator='\n', strip=True)

5. Pagination + collected text

import time

import requests
from bs4 import BeautifulSoup

all_text = []
for page in range(1, 11):
    r = requests.get(f"https://example.com/articles?page={page}", timeout=20)
    soup = BeautifulSoup(r.text, "lxml")
    for art in soup.select("article"):
        all_text.append(art.get_text(separator=' ', strip=True))
    time.sleep(1)  # respectful rate

6. Through a residential proxy

proxies = {"https": "http://USER:[email protected]:8080"}
r = requests.get(url, proxies=proxies, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/130.0.0.0"
}, timeout=20)
soup = BeautifulSoup(r.text, "lxml")
text = soup.select_one("article").get_text(separator=' ', strip=True)

Whitespace Cleanup

HTML often produces text like " Hello world ". Three cleanup patterns:

import re

# Method 1: BeautifulSoup's built-in
text = soup.get_text(separator=' ', strip=True)

# Method 2: regex normalize all whitespace runs to single space
text = re.sub(r'\s+', ' ', raw_text).strip()

# Method 3: keep paragraph breaks but normalize within
text = re.sub(r'[ \t]+', ' ', re.sub(r'\n{3,}', '\n\n', raw_text)).strip()

When You Need Proxies

For one-off scraping (under 1,000 pages), your home IP is fine. For repeated scraping you'll hit rate limits. The threshold:

  • Major retailers (Amazon, Walmart, eBay): proxies needed after ~50 requests
  • Social platforms (LinkedIn, Twitter, TikTok): proxies needed within minutes
  • News sites and blogs: usually fine without proxies for <1,000 requests
  • Cloudflare-protected sites: proxies + TLS fingerprint matching needed; see our Cloudflare bypass guide

For scaled scraping, use rotating residential proxies with Python requests — see our rotating proxies with Python requests guide for the implementation pattern.
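The core of that pattern is small. A minimal sketch, assuming your provider gives you a pool of proxy endpoints; the URLs, credentials, and the fetch helper are placeholders:

import random
import requests

# Hypothetical proxy pool -- substitute your provider's real endpoints.
PROXY_POOL = [
    "http://USER:[email protected]:8080",
    "http://USER:[email protected]:8080",
    "http://USER:[email protected]:8080",
]

def fetch(url):
    proxy = random.choice(PROXY_POOL)  # different exit IP per request
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/130.0.0.0"},
        timeout=20,
    )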

Is It Legal?

Public web text is generally scrapable for personal use under the hiQ v. LinkedIn precedent. Three caveats change the analysis:

  • Login walls — bypassing authentication is a CFAA violation regardless of public data behind it.
  • Copyright — scraping is fine; republishing the scraped text as your own may not be.
  • Personal data — under GDPR, processing personal data (names, emails) requires a lawful basis even if the source is public.