Python Web Scraping That Doesn’t Break: A Practical Pattern for Junior/Mid Devs
Web scraping is easy to start and surprisingly hard to keep working. The “hello world” scraper (requests.get() + regex) usually dies the moment a site adds rate limits, changes HTML, or returns slightly different markup. In this hands-on guide, you’ll build a small but resilient scraping workflow using httpx (HTTP client), BeautifulSoup (HTML parsing), and a few production-grade habits: timeouts, retries, polite throttling, pagination, and saving results to SQLite.
Goal: scrape a paginated listing page, extract structured data, and persist it safely—without hammering the site.
Before You Scrape: The 3 Rules
- Check permissions. Look at the site’s Terms of Service and /robots.txt. Some sites explicitly disallow scraping. (A minimal programmatic robots.txt check is sketched after this list.)
- Be polite. Add delays, identify your scraper with a user agent, and keep concurrency low.
- Expect change. Write parsers that are tolerant (optional fields, fallback selectors), and log failures.
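If you want to check robots.txt in code rather than by eye, the standard library’s urllib.robotparser handles the parsing for you. Here’s a minimal sketch; the helper name allowed_by_robots and the "MyScraper" user-agent token are illustrative choices, and BASE_URL is the same placeholder used throughout this guide:

from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin

BASE_URL = "https://example.com"  # change me, as elsewhere in this guide


def allowed_by_robots(path: str, user_agent: str = "MyScraper") -> bool:
    """Return True if robots.txt permits this user agent to fetch the path."""
    rp = RobotFileParser()
    rp.set_url(urljoin(BASE_URL, "/robots.txt"))
    rp.read()  # fetches and parses robots.txt
    return rp.can_fetch(user_agent, urljoin(BASE_URL, path))


# Usage:
# if not allowed_by_robots("/products?page=1"):
#     raise SystemExit("robots.txt disallows this path; stop here.")

Keep in mind robots.txt only covers crawler rules; it doesn’t replace reading the Terms of Service.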
Setup
Install dependencies:
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install httpx beautifulsoup4 lxml
lxml makes HTML parsing faster and more forgiving.
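The examples below request the "lxml" parser explicitly. If you’d rather not hard-fail on machines where lxml isn’t installed, one optional hedge (a sketch, not required for the rest of the guide; make_soup is a made-up helper name) is to fall back to the stdlib parser:

from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html: str) -> BeautifulSoup:
    # Prefer lxml (faster, more forgiving); fall back to the stdlib parser if it's missing.
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(html, "html.parser")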
A Reusable Scraper Skeleton
Start with a clean structure: configuration, a fetch layer, a parse layer, and a persistence layer. Here’s a single-file example you can copy and run. You’ll need to change BASE_URL and CSS selectors to match your target site.
import time
import random
import sqlite3
from dataclasses import dataclass
from typing import Iterable, Optional
from urllib.parse import urljoin

import httpx
from bs4 import BeautifulSoup

BASE_URL = "https://example.com"          # change me
LISTING_PATH = "/products?page={page}"    # change me


@dataclass
class Item:
    title: str
    price: Optional[str]
    url: str


def make_client() -> httpx.Client:
    headers = {
        # Identify yourself. Consider adding contact info for ethical scraping.
        "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0; +https://your-site.example)",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }
    return httpx.Client(
        headers=headers,
        timeout=httpx.Timeout(10.0, connect=5.0),
        follow_redirects=True,
    )


def fetch_html(client: httpx.Client, url: str, retries: int = 3) -> str:
    """
    Fetch HTML with basic retry/backoff for transient failures.
    """
    backoff = 1.0
    for attempt in range(1, retries + 1):
        try:
            resp = client.get(url)
            # Handle common rate-limit signals
            if resp.status_code in (429, 503):
                raise httpx.HTTPStatusError(
                    "Rate limited / unavailable", request=resp.request, response=resp
                )
            resp.raise_for_status()
            return resp.text
        except (httpx.TimeoutException, httpx.TransportError, httpx.HTTPStatusError) as e:
            if attempt == retries:
                raise
            sleep_for = backoff + random.uniform(0, 0.5)
            print(f"[warn] fetch failed ({attempt}/{retries}) {url} -> {type(e).__name__}, sleeping {sleep_for:.2f}s")
            time.sleep(sleep_for)
            backoff *= 2
    raise RuntimeError("Unreachable")
Parsing HTML Reliably (Without Regex)
Now define how to extract items from the listing page. You’ll need to inspect the HTML of your target page and update the selectors.
def parse_listing(html: str) -> list[Item]:
    soup = BeautifulSoup(html, "lxml")
    items: list[Item] = []

    # Example selectors (change to match your site):
    # Suppose each product is in <article class="product">...
    for card in soup.select("article.product"):
        title_el = card.select_one(".product-title")
        link_el = card.select_one("a.product-link")
        if not title_el or not link_el or not link_el.get("href"):
            # Skip cards that don't match expected shape
            continue

        title = title_el.get_text(strip=True)
        url = urljoin(BASE_URL, link_el["href"])

        price_el = card.select_one(".product-price")
        price = price_el.get_text(strip=True) if price_el else None

        items.append(Item(title=title, price=price, url=url))

    return items


def has_next_page(html: str) -> bool:
    soup = BeautifulSoup(html, "lxml")
    # Example: a "Next" button with rel="next"
    return soup.select_one('a[rel="next"]') is not None
Why this approach holds up:
- You treat missing fields as normal (price can be None).
- You skip malformed cards instead of crashing the whole run.
- You keep parsing logic separate from network logic.
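One more layer of tolerance you can add is a helper that tries several selectors in order, so a renamed class doesn’t immediately break the parser. This is a sketch, not part of the code above; select_first and the candidate selector strings are placeholders you’d adapt to your site:

from typing import Optional

from bs4.element import Tag


def select_first(card: Tag, selectors: list[str]) -> Optional[Tag]:
    """Return the first element matched by any of the candidate selectors."""
    for sel in selectors:
        el = card.select_one(sel)
        if el is not None:
            return el
    return None


# Inside parse_listing you could then write, for example:
# title_el = select_first(card, [".product-title", "h2.title", "h2"])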
Saving Results to SQLite (Idempotently)
Dumping to JSON is fine for quick scripts, but SQLite is great when you want “run it again tomorrow” workflows. Use a unique constraint so you don’t insert duplicates.
def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS items (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            price TEXT,
            url TEXT NOT NULL UNIQUE,
            scraped_at TEXT NOT NULL DEFAULT (datetime('now'))
        );
        """
    )
    conn.commit()


def save_items(conn: sqlite3.Connection, items: Iterable[Item]) -> int:
    inserted = 0
    for it in items:
        try:
            conn.execute(
                "INSERT INTO items (title, price, url) VALUES (?, ?, ?)",
                (it.title, it.price, it.url),
            )
            inserted += 1
        except sqlite3.IntegrityError:
            # URL already exists - ignore
            pass
    conn.commit()
    return inserted
Putting It Together: Pagination + Polite Throttling
This main loop scrapes pages until there’s no next page (or until a max page limit). It also sleeps a bit between requests to reduce load.
def scrape(max_pages: int = 20) -> None:
    conn = sqlite3.connect("scrape.db")
    init_db(conn)

    with make_client() as client:
        page = 1
        total_seen = 0
        total_inserted = 0

        while page <= max_pages:
            url = urljoin(BASE_URL, LISTING_PATH.format(page=page))
            print(f"[info] fetching page {page}: {url}")

            html = fetch_html(client, url)
            items = parse_listing(html)
            total_seen += len(items)

            inserted = save_items(conn, items)
            total_inserted += inserted
            print(f"[info] page {page}: found={len(items)} inserted={inserted} total_inserted={total_inserted}")

            # Stop if no "next" page or if page looks empty (site may have changed)
            if not items or not has_next_page(html):
                break

            # Polite delay (tweak as needed)
            time.sleep(1.0 + random.uniform(0, 0.5))
            page += 1

    conn.close()
    print(f"[done] total_seen={total_seen}, total_inserted={total_inserted}")


if __name__ == "__main__":
    scrape(max_pages=50)
Debugging Tips When Selectors Don’t Match
- Print the first card’s HTML. After soup.select(...), inspect str(card)[:500] to verify you’re targeting the right elements.
- Log “why skipped.” Add counters for missing title/link so you can see what’s failing.
- Save raw HTML snapshots. When parsing fails, write html to a file like debug_page_3.html and inspect it locally. (All three tips are combined in the sketch after this list.)
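Here’s a minimal sketch that combines the three tips into one helper. It assumes the same example selectors as parse_listing; the function name debug_listing and the snapshot file naming are just one way to do it:

from bs4 import BeautifulSoup


def debug_listing(html: str, page: int) -> None:
    """Dump diagnostics for a listing page whose selectors aren't matching."""
    soup = BeautifulSoup(html, "lxml")
    cards = soup.select("article.product")

    if not cards:
        # Nothing matched at all: keep a raw snapshot so you can inspect it locally.
        with open(f"debug_page_{page}.html", "w", encoding="utf-8") as f:
            f.write(html)
        print(f"[debug] page {page}: no cards matched, wrote debug_page_{page}.html")
        return

    # Eyeball the first card to confirm you're targeting the right elements.
    print(str(cards[0])[:500])

    # Count *why* cards would be skipped instead of silently dropping them.
    missing_title = sum(1 for c in cards if not c.select_one(".product-title"))
    missing_link = sum(1 for c in cards if not c.select_one("a.product-link"))
    print(f"[debug] page {page}: cards={len(cards)} "
          f"missing_title={missing_title} missing_link={missing_link}")

Call it on any page where parse_listing unexpectedly returns zero items.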
Common Real-World Problems (And What To Do)
- Dynamic content (JS-rendered): If the HTML response doesn’t contain the data (only script tags), you’ll need a browser-based approach (e.g., Playwright) or use the site’s underlying API calls (often visible in DevTools Network tab).
- Bot protection / CAPTCHAs: Don’t play whack-a-mole. Consider an official API, partnership, or alternative dataset. If you’re authorized, browser automation may help—but stay within legal/ethical boundaries.
- Rate limits (429): Slow down, add exponential backoff, and keep concurrency low. Respect Retry-After headers when present (a small sketch follows this list).
- HTML changes: Write parsers with fallback selectors and treat missing fields as normal. Add monitoring (e.g., alert when “found items” suddenly drops to zero).
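For the rate-limit case specifically, here’s a small sketch of how fetch_html’s backoff could honor Retry-After. The helper name wait_before_retry is made up, and it only handles the integer-seconds form of the header (Retry-After can also be an HTTP date):

import random
import time

import httpx


def wait_before_retry(resp: httpx.Response, backoff: float) -> float:
    """Sleep for the server-suggested Retry-After if present, else use jittered backoff."""
    retry_after = resp.headers.get("Retry-After")
    if retry_after and retry_after.isdigit():
        sleep_for = float(retry_after)                 # server told us how long to wait
    else:
        sleep_for = backoff + random.uniform(0, 0.5)   # fall back to our own backoff
    time.sleep(sleep_for)
    return sleep_for


# In fetch_html, when resp.status_code is 429 or 503, you could call
# wait_before_retry(resp, backoff) instead of the plain time.sleep(...).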
Small Upgrades That Make a Big Difference
- Use structured logging. Even simple print() lines with tags ([info], [warn]) help when debugging.
- Store a “run id.” Add a run_id column so you can compare results across runs.
- Normalize data. Parse prices into numeric values (strip currency symbols, handle commas) so you can query them later (see the sketch after this list).
- Make it configurable. Put BASE_URL, selectors, and delays into a config file or CLI args.
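A minimal sketch of the price-normalization idea (parse_price is an illustrative helper; it assumes simple formats like "$1,299.00" and would need more work for locale-specific formats such as "1.299,00 €"):

import re
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Convert a scraped price string like '$1,299.00' into a float, or None."""
    if not raw:
        return None
    # Keep digits, the decimal point, and a minus sign; drop currency symbols and commas.
    cleaned = re.sub(r"[^0-9.\-]", "", raw)
    try:
        return float(cleaned)
    except ValueError:
        return None


# parse_price("$1,299.00") -> 1299.0
# parse_price(None)        -> None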
Wrap-Up
You now have a scraper pattern you can reuse: fetch with retries, parse with tolerant selectors, paginate safely, and persist idempotently. For many real websites, this approach is enough—and when it isn’t (JS rendering), you’ll know exactly why, and what tool class you need next.