Quick verdict: Use requests + BeautifulSoup for static pages, Playwright for JavaScript-rendered ones. The text extraction method is .get_text(separator=' ', strip=True) on a properly-scoped element selector — never on the whole document. Add residential proxies when your IP starts getting rate-limited.
This guide covers the install, the static-page case (95% of real-world tutorials), the JavaScript case (the other 5% that trips up beginners), how to clean whitespace and skip navigation, and 6 working examples for common patterns.
```bash
pip install requests beautifulsoup4 lxml
```
```python
import requests
from bs4 import BeautifulSoup

r = requests.get("https://en.wikipedia.org/wiki/Web_scraping")
soup = BeautifulSoup(r.text, "lxml")
print(soup.select_one("h1").get_text(strip=True))
```
That's the entire pipeline. Three more knobs you'll use:
- get_text(separator=' ') — joins inline text with a separator instead of running it together.
- get_text(strip=True) — strips leading/trailing whitespace from each text node.
- .select_one(css) vs .select(css) — first match vs all matches, using CSS selectors.

If requests.get() returns HTML but you can't find the content you see in your browser, the page is JavaScript-rendered. requests doesn't run JS. Use Playwright:
```bash
pip install playwright
playwright install chromium
```
```python
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa-app")
    page.wait_for_selector("article")  # wait for JS to render
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "lxml")
print(soup.select_one("article").get_text(separator=' ', strip=True))
```
Playwright is 10-100× slower than requests. Only use it when the page actually requires JS rendering.
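A quick way to apply that rule: fetch once with requests and check whether the selector you can see in the browser actually exists in the raw HTML. A minimal sketch — the helper name and sample HTML are illustrative, not from a real site:

```python
from bs4 import BeautifulSoup

def needs_playwright(html: str, selector: str) -> bool:
    # If a selector visible in the browser is absent from the raw HTML,
    # the content is injected by JavaScript after page load.
    return BeautifulSoup(html, "html.parser").select_one(selector) is None

static_page = "<html><body><article>server-rendered text</article></body></html>"
spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

print(needs_playwright(static_page, "article"))  # False: requests is enough
print(needs_playwright(spa_shell, "article"))    # True: fall back to Playwright
```

Run this check once per site, not per page — pages on the same site almost always render the same way.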
All paragraph text from an article, one paragraph per line:

```python
paragraphs = [p.get_text(strip=True) for p in soup.select("article p")]
text = "\n".join(paragraphs)
```
Title, author, and date from typical article markup:

```python
title = soup.select_one("h1").get_text(strip=True)
author = soup.select_one(".byline a").get_text(strip=True)
date = soup.select_one("time").get("datetime")  # ISO format from attr
print(f"{title} by {author} ({date})")
```
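One trap in this pattern: select_one() returns None when nothing matches, so chaining .get_text() raises AttributeError on any page missing a byline or date. A small guard, sketched with a hypothetical helper:

```python
from bs4 import BeautifulSoup

def text_or_none(soup, css):
    # select_one returns None on no match; check before calling get_text
    node = soup.select_one(css)
    return node.get_text(strip=True) if node else None

html = "<article><h1>Title</h1></article>"  # no byline, no <time> tag
soup = BeautifulSoup(html, "html.parser")

print(text_or_none(soup, "h1"))         # Title
print(text_or_none(soup, ".byline a"))  # None, instead of a crash
```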
Strip scripts, styles, and boilerplate before extracting visible text:

```python
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()  # remove from tree
visible_text = soup.get_text(separator=' ', strip=True)
```
Scope extraction to the main content region:

```python
main = soup.select_one("main")
# get_text on main keeps p, h1-6, but skips nav/footer outside main
content = main.get_text(separator='\n', strip=True)
```
Paginate with a delay between requests:

```python
import time

all_text = []
for page in range(1, 11):
    r = requests.get(f"https://example.com/articles?page={page}", timeout=20)
    soup = BeautifulSoup(r.text, "lxml")
    for art in soup.select("article"):
        all_text.append(art.get_text(separator=' ', strip=True))
    time.sleep(1)  # respectful rate
```
Route through a proxy with a realistic User-Agent:

```python
proxies = {"https": "http://USER:[email protected]:8080"}
r = requests.get(url, proxies=proxies, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/130.0.0.0"
}, timeout=20)
soup = BeautifulSoup(r.text, "lxml")
text = soup.select_one("article").get_text(separator=' ', strip=True)
```
HTML often produces text like "  Hello\n   world  " — stray spaces and line breaks everywhere. Three clean-up patterns:
```python
import re

# Method 1: BeautifulSoup's built-in
text = soup.get_text(separator=' ', strip=True)

# Method 2: regex, normalize all whitespace runs to a single space
text = re.sub(r'\s+', ' ', raw_text).strip()

# Method 3: keep paragraph breaks but normalize within lines
text = re.sub(r'[ ]+', ' ', re.sub(r'\n{3,}', '\n\n', raw)).strip()
```
For one-off scraping (under 1,000 pages), your home IP is fine. Repeated scraping of the same site will eventually hit rate limits. For scaled scraping, use rotating residential proxies with Python requests — see our rotating proxies with Python requests guide for the implementation pattern.
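The core of that pattern is just round-robin selection. A minimal rotation sketch — the pool entries are placeholders for whatever endpoints your proxy provider gives you:

```python
import itertools

import requests

# Placeholder pool; substitute your provider's proxy endpoints.
PROXY_POOL = [
    "http://USER:[email protected]:8080",
    "http://USER:[email protected]:8080",
    "http://USER:[email protected]:8080",
]
proxy_cycle = itertools.cycle(PROXY_POOL)

def get_rotated(url, timeout=20):
    # Round-robin: each request goes out through the next proxy in the pool
    proxy = next(proxy_cycle)
    return requests.get(url, proxies={"http": proxy, "https": proxy},
                        timeout=timeout)
```

A real implementation also retries on a different proxy when one returns 429 or times out; the linked guide covers that.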
Public web text is generally scrapable for personal use under the hiQ v. LinkedIn precedent. Three lines that change the analysis: