Quick answer: For most static sites — requests + BeautifulSoup to grab the URLs from <img> tags, then a worker pool to download. For lazy-loaded galleries (Instagram-style infinite scroll, React/Vue image grids), use Playwright with explicit scroll-and-wait. For volume (100k+ images), switch to async httpx with a rotating proxy pool. Always prefer the srcset highest-resolution candidate over src, dedup by SHA-256 of bytes, and respect robots.txt + copyright.
pip install requests beautifulsoup4 lxml "httpx[http2]" playwright pillow imagehash
playwright install chromium
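The quick answer says to respect robots.txt, and the standard library covers that. A minimal sketch using urllib.robotparser; the user-agent string should match what you send in your requests, and in real code you would cache one parser per host:

from urllib import robotparser
from urllib.parse import urljoin, urlparse

def allowed(url, user_agent="ImageBot/1.0"):
    root = f"{urlparse(url).scheme}://{urlparse(url).netloc}"
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    rp.read()  # fetches and parses the site's robots.txt
    return rp.can_fetch(user_agent, url)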
The simplest case — image URLs are in the HTML as <img src="...">.
import os
import hashlib
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup
HEADERS = {
"User-Agent": "Mozilla/5.0 (compatible; ImageBot/1.0; +https://example.com/bot)"
}
def best_url(img, page_url):
'''Pick highest-res candidate from srcset, fall back to src.'''
srcset = img.get("srcset") or img.get("data-srcset")
if srcset:
candidates = []
for part in srcset.split(","):
tokens = part.strip().split()
if len(tokens) >= 2 and tokens[-1].endswith("w"):
candidates.append((int(tokens[-1][:-1]), tokens[0]))
if candidates:
return urljoin(page_url, max(candidates)[1])
for attr in ("src", "data-src", "data-original", "data-lazy"):
if img.get(attr):
return urljoin(page_url, img[attr])
return None
def scrape_images(page_url, out_dir="images"):
os.makedirs(out_dir, exist_ok=True)
r = requests.get(page_url, headers=HEADERS, timeout=20)
r.raise_for_status()
soup = BeautifulSoup(r.text, "lxml")
seen = set()
for img in soup.find_all("img"):
url = best_url(img, page_url)
if not url or url.startswith("data:"):
continue
alt = img.get("alt", "")
try:
ir = requests.get(url, headers=HEADERS, timeout=20)
ir.raise_for_status()
except Exception as e:
print(f" skip {url}: {e}")
continue
digest = hashlib.sha256(ir.content).hexdigest()[:16]
if digest in seen:
continue
seen.add(digest)
ext = os.path.splitext(urlparse(url).path)[1] or ".jpg"
fname = f"{digest}{ext}"
with open(os.path.join(out_dir, fname), "wb") as f:
f.write(ir.content)
print(f" saved {fname} alt={alt!r}")
scrape_images("https://example.com/gallery")
For 1,000+ images you want concurrent downloads — a sync loop wastes 95% of the time waiting on I/O.
import asyncio
import hashlib
import os
from urllib.parse import urljoin
import httpx
from bs4 import BeautifulSoup
PROXY = "http://USER:[email protected]:8000"
CONCURRENCY = 20
async def fetch_page(client, url):
r = await client.get(url, timeout=20)
r.raise_for_status()
return r.text
async def download(client, url, out_dir, sem, seen):
async with sem:
try:
r = await client.get(url, timeout=20)
r.raise_for_status()
except Exception as e:
return None
digest = hashlib.sha256(r.content).hexdigest()[:16]
if digest in seen:
return None
seen.add(digest)
ext = os.path.splitext(url.split("?")[0])[1] or ".jpg"
path = os.path.join(out_dir, f"{digest}{ext}")
with open(path, "wb") as f:
f.write(r.content)
return path
async def main(page_url, out_dir="images"):
os.makedirs(out_dir, exist_ok=True)
async with httpx.AsyncClient(proxy=PROXY, http2=True,
headers={"User-Agent": "ImageBot/1.0"}) as client:
html = await fetch_page(client, page_url)
soup = BeautifulSoup(html, "lxml")
        urls = [urljoin(page_url, img["src"]) for img in soup.find_all("img") if img.get("src")]
sem = asyncio.Semaphore(CONCURRENCY)
seen = set()
tasks = [download(client, u, out_dir, sem, seen) for u in urls]
results = await asyncio.gather(*tasks)
print(f"saved {sum(1 for r in results if r)} / {len(urls)} images")
asyncio.run(main("https://example.com/gallery"))
On a Premium Residential pool, 20 concurrent workers comfortably hit 50–100 images/sec without tripping rate limits. Push higher (50–100 concurrent) only on residential, never on a single datacenter IP.
Modern sites defer image loading via IntersectionObserver or React virtualized lists. The HTML you fetch with requests has placeholders; the real src only appears after scroll. Use Playwright to drive a real browser.
import asyncio
from playwright.async_api import async_playwright
async def scrape_lazy(url):
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
proxy={"server": "http://gw.spyderproxy.com:8000",
"username": "USER", "password": "PASS"},
)
page = await browser.new_page()
await page.goto(url, wait_until="domcontentloaded")
# Scroll to bottom in steps so IntersectionObserver fires for each row
prev_height = 0
for _ in range(30):
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(800)
h = await page.evaluate("document.body.scrollHeight")
if h == prev_height:
break
prev_height = h
        urls = await page.evaluate(
            """() => Array.from(document.querySelectorAll('img'))
                     .map(i => i.currentSrc || i.src)
                     .filter(s => s && !s.startsWith('data:'))"""
        )
await browser.close()
return urls
urls = asyncio.run(scrape_lazy("https://example.com/feed"))
print(f"found {len(urls)} images")
Key Playwright tips: use img.currentSrc (not img.src) to get the resolution the browser actually picked from srcset, and scroll in steps with a wait between — one big scroll skips intermediate observer callbacks.
The srcset attribute lets a site offer multiple resolutions. Cheap scrapers grab the small src placeholder and end up with thumbnails. Parse srcset and pick the highest-width candidate.
def parse_srcset(srcset):
'''Return list of (url, descriptor_value, descriptor_type).'''
out = []
for part in srcset.split(","):
tokens = part.strip().split()
if not tokens:
continue
url = tokens[0]
if len(tokens) == 1:
out.append((url, 1.0, "x"))
else:
d = tokens[-1]
if d.endswith("w"):
out.append((url, float(d[:-1]), "w"))
elif d.endswith("x"):
out.append((url, float(d[:-1]), "x"))
return out
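Usage, picking the widest candidate from a parsed srcset (hypothetical URLs):

candidates = parse_srcset("small.jpg 480w, medium.jpg 800w, large.jpg 1600w")
best = max(candidates, key=lambda c: c[1])  # sort by descriptor value
print(best)  # ('large.jpg', 1600.0, 'w')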
For training image-language models you want the alt attribute and surrounding caption. Caption text is often in a sibling <figcaption> or a parent <figure>:
def get_caption(img):
fig = img.find_parent("figure")
if fig:
cap = fig.find("figcaption")
if cap:
return cap.get_text(strip=True)
return img.get("alt", "")
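Combined with best_url() from the first script, this yields (image URL, caption) pairs straight from a parsed page. A sketch, assuming both helpers are in scope:

def image_caption_pairs(soup, page_url):
    # Yield (absolute image URL, caption) tuples for an image-text dataset.
    for img in soup.find_all("img"):
        url = best_url(img, page_url)
        if url and not url.startswith("data:"):
            yield url, get_caption(img)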
SHA-256 of the bytes catches exact duplicates. For near-duplicates (resized, recompressed) use perceptual hashing:
from PIL import Image
import imagehash
def phash(path):
return str(imagehash.phash(Image.open(path)))
# Near-duplicates have a small Hamming distance between hashes, not identical values.
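imagehash objects subtract to give that Hamming distance, so a near-duplicate check is a single comparison. A sketch with an arbitrary threshold of 5 bits (tune it for your data):

def near_duplicate(path_a, path_b, threshold=5):
    # A small bit difference between perceptual hashes means "visually the same image".
    return imagehash.phash(Image.open(path_a)) - imagehash.phash(Image.open(path_b)) <= threshold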
Image hosts (Imgur, Cloudinary, Akamai CDNs) rate-limit aggressively by IP because their bandwidth bill scales with hot-linkers. Pull 5,000 images from one IP in 10 minutes and you're looking at HTTP 429 or 403. Rotating residential proxies solve this by spreading requests across thousands of consumer IPs.
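When a host does start returning 429 or 403, back off instead of hammering it. A minimal retry sketch for the httpx downloader; the attempt count and delays are arbitrary starting points:

import asyncio

async def get_with_retry(client, url, attempts=4):
    for i in range(attempts):
        r = await client.get(url, timeout=20)
        if r.status_code in (403, 429) and i < attempts - 1:
            await asyncio.sleep(2 ** i)  # back off 1s, 2s, 4s before retrying
            continue
        r.raise_for_status()
        return r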
A few final gotchas:
- Some publishers opt out of AI training with a noimageai meta tag; respect it.
- Strip EXIF metadata (piexif handles this) if redistributing images.
- Many CDNs block hotlinked requests that lack a Referer. Pass it: headers={"Referer": page_url}.
- Lazy-loading libraries often park the real URL in data-src rather than src.
- Check the response Content-Type; reject anything that isn't image/*.

Related: Scraping a site that needs login · Scrape text from a website · Python asyncio tutorial · Avoiding scraper detection.