Python Web Scraping That Doesn’t Break Immediately: Retries, Throttling, Parsing, and Storing Results
“Web scraping” often starts as requests.get() + copy/paste a selector… and then falls apart the first time a server rate-limits you or the HTML changes slightly. This hands-on guide shows a practical pattern you can reuse: resilient HTTP fetching (retries + backoff), polite throttling, structured parsing with BeautifulSoup, pagination, and saving results to SQLite so your scraper is restartable.
Important: Always review a site’s Terms of Service and robots rules, and avoid scraping personal data. Use conservative request rates and identify your client with a user-agent.
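As a quick programmatic check, Python's standard library can evaluate robots.txt rules for you. Here's a minimal sketch: the policy string and user-agent are placeholder examples, and in practice you'd point `RobotFileParser` at the live site's /robots.txt via `set_url()` + `read()` instead of parsing a string:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a robots.txt policy before fetching a URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

policy = "User-agent: *\nDisallow: /private/"
print(is_allowed(policy, "MyScraperBot", "https://example.com/listings"))  # True
print(is_allowed(policy, "MyScraperBot", "https://example.com/private/x"))  # False
```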
What we’ll build
A small scraper that:
- Fetches pages with retries, exponential backoff, and timeouts
- Throttles requests to avoid hammering the server
- Parses items from HTML into structured records
- Follows pagination until there are no more pages
- Stores results in SQLite with upserts (safe re-runs)
Install dependencies
You only need two libraries:
```bash
pip install requests beautifulsoup4
```
1) A resilient HTTP client (retries + backoff + polite headers)
Don’t treat HTTP as “always works.” You need timeouts, a retry strategy for transient failures, and a reasonable User-Agent. Here’s a drop-in helper:
```python
import random
import time
from typing import Optional

import requests
from requests import Response

DEFAULT_HEADERS = {
    "User-Agent": "MyScraperBot/1.0 (+contact: [email protected])",
    "Accept": "text/html,application/xhtml+xml",
}

def fetch_html(
    url: str,
    session: requests.Session,
    *,
    timeout: float = 15.0,
    max_retries: int = 5,
    base_backoff: float = 0.8,
    max_backoff: float = 10.0,
) -> str:
    """
    Fetch HTML with retries and exponential backoff + jitter.
    Retries on 429 and 5xx, plus network errors.
    """
    for attempt in range(1, max_retries + 1):
        try:
            resp: Response = session.get(url, timeout=timeout)
            # Handle rate limiting / transient server errors
            if resp.status_code in (429, 500, 502, 503, 504):
                raise requests.HTTPError(f"Transient HTTP {resp.status_code}", response=resp)
            resp.raise_for_status()
            return resp.text
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError) as e:
            if attempt == max_retries:
                raise
            # Exponential backoff with jitter
            backoff = min(max_backoff, base_backoff * (2 ** (attempt - 1)))
            jitter = random.uniform(0, 0.3 * backoff)
            sleep_for = backoff + jitter
            # If server provides Retry-After, respect it
            retry_after: Optional[str] = None
            if isinstance(e, requests.HTTPError) and e.response is not None:
                retry_after = e.response.headers.get("Retry-After")
            if retry_after and retry_after.isdigit():
                sleep_for = max(sleep_for, float(retry_after))
            time.sleep(sleep_for)
    # Unreachable
    raise RuntimeError("fetch_html failed unexpectedly")
```
Why this matters: retries + backoff smooth out random network glitches and temporary server throttling. Without this, your scraper will fail sporadically and be painful to debug.
2) Throttle requests (don’t get blocked)
Even if the site doesn’t block you, you can still overload it. Add a simple “wait between requests” policy:
```python
def throttle(min_delay: float = 1.0, max_delay: float = 2.0) -> None:
    """Sleep a random amount to reduce request bursts."""
    time.sleep(random.uniform(min_delay, max_delay))
```
Use this between page fetches. Randomized delays are harder to fingerprint than perfectly periodic traffic.
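If you'd rather enforce a hard floor on request spacing instead of sleeping a fixed random amount, a small stateful limiter works too. This is a sketch that isn't part of the scraper above; it only sleeps for whatever time remains since the last request:

```python
import random
import time

class Throttle:
    """Enforce a minimum interval between requests, plus random jitter."""

    def __init__(self, min_interval: float = 1.0, jitter: float = 0.5):
        self.min_interval = min_interval
        self.jitter = jitter
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self) -> None:
        # Only sleep for the remainder of the interval, not the full delay
        elapsed = time.monotonic() - self._last
        delay = self.min_interval + random.uniform(0, self.jitter) - elapsed
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
```

The advantage over a plain sleep: if parsing and database writes already took 0.8s, a 1.0s minimum interval only costs you 0.2s more.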
3) Parse HTML into structured data
BeautifulSoup works best when you extract fields defensively. Avoid selectors that depend on brittle nesting. Prefer stable attributes (like data-*) if available. Here’s a generic example that parses “cards”:
```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup

@dataclass
class Item:
    item_id: str
    title: str
    price: Optional[float]
    url: str

def parse_items(html: str, base_url: str) -> List[Item]:
    soup = BeautifulSoup(html, "html.parser")
    items: List[Item] = []
    # Example: each listing card is a <div class="card"> with a link and optional price
    for card in soup.select("div.card"):
        a = card.select_one("a.card-link")
        if not a or not a.get("href"):
            continue
        title = (a.get_text(strip=True) or "").strip()
        full_url = urljoin(base_url, a["href"])
        # Create a stable id from URL (or from a data attribute if present)
        item_id = card.get("data-id") or full_url
        price_el = card.select_one(".price")
        price = None
        if price_el:
            raw = price_el.get_text(strip=True)
            # Example cleanup: "$12.34" -> 12.34
            raw = raw.replace("$", "").replace(",", "")
            try:
                price = float(raw)
            except ValueError:
                price = None
        if title:
            items.append(Item(item_id=item_id, title=title, price=price, url=full_url))
    return items
```
Defensive parsing tips:
- Check for missing elements (`select_one` can return `None`)
- Normalize text with `get_text(strip=True)`
- Make IDs stable so you can re-run the scraper safely
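To see the defensive behavior in action, here's a small standalone demo in the same style as `parse_items`, run against hypothetical HTML where one card is missing its link and another its price:

```python
from bs4 import BeautifulSoup

html = """
<div class="card"><a class="card-link" href="/item/1">Lamp</a><span class="price">$12.34</span></div>
<div class="card"><span class="price">$9.99</span></div>
<div class="card"><a class="card-link" href="/item/3">Desk</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
results = []
for card in soup.select("div.card"):
    a = card.select_one("a.card-link")
    if not a or not a.get("href"):
        continue  # missing link: skip the card instead of crashing
    price_el = card.select_one(".price")
    # missing price: record None instead of raising AttributeError
    price = float(price_el.get_text(strip=True).lstrip("$")) if price_el else None
    results.append((a.get_text(strip=True), price))

print(results)  # [('Lamp', 12.34), ('Desk', None)]
```

The second card is silently skipped and the third yields a `None` price; neither aborts the run.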
4) Find and follow pagination
Most list pages have a “next” link. Keep it simple: if there’s no next link, stop.
```python
from typing import Optional

def find_next_page(html: str, base_url: str) -> Optional[str]:
    soup = BeautifulSoup(html, "html.parser")
    # Example: <a rel="next" href="...">Next</a>
    next_a = soup.select_one('a[rel="next"]')
    if next_a and next_a.get("href"):
        return urljoin(base_url, next_a["href"])
    # Fallback: common "Next" button class
    next_a = soup.select_one("a.next, a.pagination-next")
    if next_a and next_a.get("href"):
        return urljoin(base_url, next_a["href"])
    return None
```
5) Store results in SQLite (restartable scrapes)
Saving to a database means you can resume after failure and avoid duplicating work. SQLite is perfect for small/medium runs.
```python
import sqlite3
from typing import Iterable

def init_db(path: str = "items.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS items (
            item_id TEXT PRIMARY KEY,
            title TEXT NOT NULL,
            price REAL,
            url TEXT NOT NULL,
            scraped_at TEXT NOT NULL DEFAULT (datetime('now'))
        )
    """)
    conn.commit()
    return conn

def upsert_items(conn: sqlite3.Connection, items: Iterable[Item]) -> None:
    conn.executemany("""
        INSERT INTO items (item_id, title, price, url)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(item_id) DO UPDATE SET
            title = excluded.title,
            price = excluded.price,
            url = excluded.url,
            scraped_at = datetime('now')
    """, [(i.item_id, i.title, i.price, i.url) for i in items])
    conn.commit()
```
Why upsert? If the scraper is re-run, records update instead of duplicating. If a price changes, you capture the latest value.
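You can verify the re-run behavior with a toy in-memory run (schema trimmed to the relevant columns): inserting the same `item_id` twice leaves one row with the latest price.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_id TEXT PRIMARY KEY, title TEXT, price REAL)")

upsert = """
    INSERT INTO items (item_id, title, price) VALUES (?, ?, ?)
    ON CONFLICT(item_id) DO UPDATE SET title = excluded.title, price = excluded.price
"""
conn.execute(upsert, ("x1", "Lamp", 12.34))
conn.execute(upsert, ("x1", "Lamp", 9.99))  # same id, new price: updates in place

rows = conn.execute("SELECT item_id, price FROM items").fetchall()
print(rows)  # [('x1', 9.99)]
```

Note that `ON CONFLICT ... DO UPDATE` requires SQLite 3.24+, which ships with every currently supported Python release.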
6) Put it all together: a complete scraper
This script scrapes from a starting URL, follows pagination, and writes items to SQLite. Replace START_URL and the CSS selectors in parse_items to match your target site.
```python
import requests

START_URL = "https://example.com/listings"  # change me

def scrape(start_url: str) -> None:
    conn = init_db("items.db")
    with requests.Session() as session:
        session.headers.update(DEFAULT_HEADERS)
        url = start_url
        page = 1
        while url:
            print(f"[page {page}] GET {url}")
            html = fetch_html(url, session)
            items = parse_items(html, base_url=url)
            if items:
                upsert_items(conn, items)
                print(f"  saved {len(items)} items")
            else:
                print("  no items found (check selectors?)")
            next_url = find_next_page(html, base_url=url)
            url = next_url
            page += 1
            if url:
                throttle(1.2, 2.4)
    conn.close()
    print("done")

if __name__ == "__main__":
    scrape(START_URL)
```
Common failure modes (and how to handle them)
- You get blocked (403/429): slow down, add backoff (already included), and ensure your `User-Agent` is reasonable. Don't try to "bypass" protections; pick another data source or get permission.
- HTML changes: treat selectors as configuration. Keep parsing logic small and easy to update.
- Data is loaded via JavaScript: check the Network tab. Often there’s a JSON endpoint you can call directly (preferred over rendering pages).
- Duplicate items across pages: solve with stable IDs + upserts (already included).
- Partial runs: persist progress (e.g., last page URL) or deduplicate by checking `item_id` before inserting. The SQLite upsert covers most needs.
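On the JavaScript point: when you do find a JSON endpoint, the whole parsing step collapses to plain dict handling. Here is a sketch where the payload shape (and the commented-out endpoint URL) are hypothetical; map the API fields onto the same records the HTML parser produced:

```python
# Hypothetical payload, as if from: session.get("https://example.com/api/listings?page=1").json()
payload = {
    "results": [
        {"id": "a1", "title": "Lamp", "price_cents": 1234},
        {"id": "a2", "title": "Desk", "price_cents": None},
    ],
    "next_page": 2,
}

def extract_listings(payload: dict) -> list:
    """Convert one API page into the same record shape the HTML parser produced."""
    out = []
    for row in payload.get("results", []):
        cents = row.get("price_cents")
        out.append({
            "item_id": row["id"],
            "title": row["title"],
            "price": cents / 100 if cents is not None else None,
        })
    return out

print(extract_listings(payload))
```

JSON endpoints are usually more stable than HTML selectors, and pagination is often explicit (here, a `next_page` field) instead of something you have to scrape out of a link.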
Next steps to level up
Once the basics work, consider:
- Structured logging (timestamps, status codes, retry counts)
- Command-line options (start URL, max pages, output path)
- Incremental scraping (skip items already seen, scrape only new pages)
- Export to CSV/JSON for downstream processing
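As one concrete example of that last step, exporting the SQLite table to CSV needs only the standard library. A sketch assuming the `items` schema from earlier:

```python
import csv
import sqlite3

def export_csv(conn: sqlite3.Connection, path: str) -> int:
    """Dump the items table to a CSV file; returns the number of rows written."""
    rows = conn.execute(
        "SELECT item_id, title, price, url FROM items ORDER BY item_id"
    ).fetchall()
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["item_id", "title", "price", "url"])
        writer.writerows(rows)
    return len(rows)
```

Remember `newline=""`: without it, the `csv` module writes extra blank lines on Windows.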
With this pattern—resilient fetch, polite throttling, defensive parsing, and durable storage—you can build scrapers that behave well in production-like conditions and don’t implode the first time the network sneezes.