Python Web Scraping That Doesn’t Break: Requests + BeautifulSoup with Retries, Throttling, and Clean Data Output

Web scraping is easy when you copy/paste a few lines from a tutorial—until the first time the site rate-limits you, changes a CSS class, or returns “Access Denied” intermittently. This hands-on guide shows a junior/mid-friendly pattern for scraping pages reliably using requests + BeautifulSoup, with practical upgrades like retries, timeouts, throttling, user-agent headers, and structured output to CSV/SQLite.

Assumptions: You’re scraping public pages you’re allowed to access. Always check a site’s Terms of Service and robots.txt, and be respectful with request rates.
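The standard library can evaluate robots.txt rules for you before you send a single scraping request. A minimal sketch, using a made-up robots.txt body and example.com URLs (in practice you'd fetch the real file from the site's `/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in a real scraper, fetch it from
# https://<site>/robots.txt before crawling.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def is_allowed(url: str, user_agent: str = "*") -> bool:
    """True if the given agent may fetch this URL under the parsed rules."""
    return parser.can_fetch(user_agent, url)
```

`parser.crawl_delay("*")` also exposes any Crawl-delay directive, which is a sensible lower bound for your throttling interval.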

Setup

Create a virtual environment and install dependencies:

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install requests beautifulsoup4 lxml

We’ll use:

  • requests for HTTP
  • BeautifulSoup for parsing HTML
  • lxml as a fast parser backend

A “production-ish” HTTP client: headers, timeouts, retries, backoff

Scrapers fail most often because of flaky networking or temporary blocks. Start by centralizing your HTTP logic:

import random
import time
from typing import Optional

import requests
from requests import Response
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session() -> requests.Session:
    session = requests.Session()
    # A realistic User-Agent reduces “default bot” suspicion.
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/122.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    })
    # Retry strategy: retries on common transient errors + rate limits.
    retry = Retry(
        total=5,
        connect=5,
        read=5,
        status=5,
        backoff_factor=0.8,  # exponential-ish backoff: 0.8, 1.6, 3.2...
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=10)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def polite_sleep(min_s: float = 0.6, max_s: float = 1.4) -> None:
    time.sleep(random.uniform(min_s, max_s))


def fetch(session: requests.Session, url: str, timeout: float = 15.0) -> Optional[Response]:
    polite_sleep()
    resp = session.get(url, timeout=timeout)
    # If the site still returns an error after retries, treat it as failure.
    if resp.status_code >= 400:
        return None
    return resp

Why this matters:

  • timeout prevents hanging requests.
  • Retry handles temporary issues (including 429 rate limit).
  • polite_sleep() reduces load and bot detection risk.
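To get a feel for what backoff_factor=0.8 buys you: the delay urllib3 inserts before retry n is roughly backoff_factor * 2 ** (n - 1), capped at a maximum (note that urllib3 1.x skips the delay entirely on the first retry, while 2.x starts at backoff_factor). A quick sketch of the schedule:

```python
# Approximate urllib3 backoff schedule for backoff_factor=0.8.
BACKOFF_FACTOR = 0.8

def backoff_delay(retry_number: int, cap: float = 120.0) -> float:
    """Approximate delay before the n-th retry (n starting at 1), capped."""
    return min(cap, BACKOFF_FACTOR * 2 ** (retry_number - 1))

# Delays before retries 1 through 5: 0.8, 1.6, 3.2, 6.4, 12.8 seconds.
schedule = [backoff_delay(n) for n in range(1, 6)]
```

With 5 retries you wait at most roughly 25 seconds in backoff before giving up, which is a reasonable budget for a polite scraper.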

Parsing HTML safely: don’t assume everything exists

HTML changes. Your code shouldn’t crash because a price tag is missing. Use helper functions to extract text with sensible fallbacks:

from typing import Optional

from bs4 import BeautifulSoup
from bs4.element import Tag


def text_or_none(node: Optional[Tag]) -> Optional[str]:
    if not node:
        return None
    return node.get_text(strip=True) or None


def attr_or_none(node: Optional[Tag], attr: str) -> Optional[str]:
    if not node:
        return None
    return node.get(attr) or None


def parse_listing(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    # Example: a page with multiple "cards"
    cards = soup.select(".card")  # Adjust selector per site
    items: list[dict] = []
    for card in cards:
        title = text_or_none(card.select_one(".card-title"))
        price = text_or_none(card.select_one(".price"))
        link = attr_or_none(card.select_one("a"), "href")
        # Normalize relative links if needed later
        items.append({
            "title": title,
            "price": price,
            "link": link,
        })
    return items

Tip: Prefer stable selectors. If a site has semantic attributes (like data-testid) or consistent structure, use those over “random” class names.
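For example, a hashed utility class breaks on the site's next CSS build, while a data-testid attribute usually survives. The markup below is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: hashed CSS classes plus a semantic test attribute.
HTML = """
<div class="css-1x2y3z" data-testid="product-card">
  <span class="css-9q8r7s" data-testid="product-title">Widget</span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Fragile: this hashed class name changes whenever the site rebuilds its CSS.
fragile = soup.select_one(".css-9q8r7s")

# Stable: attribute selectors on semantic hooks survive most redesigns.
stable = soup.select_one('[data-testid="product-title"]')
```

Both selectors find the same element today, but only the second one is likely to keep working next month.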

End-to-end example: scrape multiple pages + dedupe + export to CSV

Let’s build a small CLI-like script that:

  • Visits a list of paginated URLs
  • Parses items
  • De-duplicates by link/title
  • Exports to output.csv
import csv
from urllib.parse import urljoin

BASE_URL = "https://example.com"
LISTING_URLS = [
    "https://example.com/products?page=1",
    "https://example.com/products?page=2",
    "https://example.com/products?page=3",
]


def normalize_item(item: dict) -> dict:
    # Convert relative link to absolute
    if item.get("link"):
        item["link"] = urljoin(BASE_URL, item["link"])
    return item


def export_csv(rows: list[dict], path: str) -> None:
    fieldnames = ["title", "price", "link"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def main() -> None:
    session = build_session()
    all_items: list[dict] = []
    for url in LISTING_URLS:
        resp = fetch(session, url)
        if not resp:
            print(f"Failed to fetch: {url}")
            continue
        items = parse_listing(resp.text)
        all_items.extend(normalize_item(x) for x in items)
        print(f"{url}: +{len(items)} items")

    # Dedupe by link (or title if link missing)
    seen = set()
    deduped: list[dict] = []
    for item in all_items:
        key = item.get("link") or item.get("title")
        if not key or key in seen:
            continue
        seen.add(key)
        deduped.append(item)

    export_csv(deduped, "output.csv")
    print(f"Exported {len(deduped)} unique items to output.csv")


if __name__ == "__main__":
    main()

Swap BASE_URL, LISTING_URLS, and CSS selectors in parse_listing() to match your target site.

Pagination the robust way: discover “next” links instead of guessing

Hardcoding page numbers works until the site changes. A better approach is “follow the next link until it disappears”:

from urllib.parse import urljoin


def find_next_page(html: str, current_url: str) -> Optional[str]:
    soup = BeautifulSoup(html, "lxml")
    next_a = soup.select_one("a[rel='next']") or soup.select_one("a.next")
    href = attr_or_none(next_a, "href")
    return urljoin(current_url, href) if href else None


def scrape_all_pages(start_url: str, max_pages: int = 50) -> list[dict]:
    session = build_session()
    url = start_url
    page = 0
    items: list[dict] = []
    while url and page < max_pages:
        page += 1
        resp = fetch(session, url)
        if not resp:
            print(f"Failed page {page}: {url}")
            break
        items.extend(parse_listing(resp.text))
        url = find_next_page(resp.text, url)
        print(f"Scraped page {page}, next={url}")
    return items

This pattern is more resilient because it follows the site’s own navigation.

Persisting data to SQLite for free querying

CSV is nice, but SQLite makes it easy to query results later (and avoids duplicates using a unique constraint):

import sqlite3

CREATE_SQL = """
CREATE TABLE IF NOT EXISTS items (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT,
    price TEXT,
    link TEXT UNIQUE
);
"""

INSERT_SQL = """
INSERT OR IGNORE INTO items (title, price, link)
VALUES (?, ?, ?);
"""


def save_to_sqlite(db_path: str, rows: list[dict]) -> None:
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(CREATE_SQL)
        for r in rows:
            conn.execute(INSERT_SQL, (r.get("title"), r.get("price"), r.get("link")))
        conn.commit()
    finally:
        conn.close()

Now you can run:

sqlite3 scraped.db "SELECT COUNT(*) FROM items;"
sqlite3 scraped.db "SELECT title, link FROM items LIMIT 10;"

Common failure modes (and fixes)

  • You get empty results, but the page loads in your browser.
    The content might be rendered by JavaScript. requests downloads the initial HTML, not the post-JS DOM. Options: find an underlying JSON API the page calls, or use a browser automation tool when necessary.

  • Random 403/429 errors.
    Slow down (increase polite_sleep), add jitter, rotate through a small set of user agents, and avoid scraping too fast. Retries help, but the real fix is respecting rate limits.

  • Selectors break after a redesign.
    Add defensive parsing, log samples of HTML when parsing fails, and prefer stable selectors (semantic attributes, consistent headings, URL patterns).

  • Data is messy (prices, whitespace, inconsistent formats).
    Normalize right after parsing: strip currency symbols, convert to decimals, standardize dates, and store both raw and cleaned fields if needed.
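For the last point, a small normalizer illustrates the idea. This is a sketch; the helper name and the assumption of "." as decimal separator with "," for thousands are mine:

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

def clean_price(raw: Optional[str]) -> Optional[Decimal]:
    """Strip currency symbols/whitespace and parse to Decimal.

    Assumes '.' is the decimal separator and ',' separates thousands.
    Returns None for missing or unparseable input.
    """
    if not raw:
        return None
    # Drop thousands separators, then keep only digits, dot, and sign.
    cleaned = re.sub(r"[^\d.\-]", "", raw.replace(",", ""))
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

Storing both the raw string and the cleaned Decimal makes it easy to audit parsing mistakes later.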

Checklist you can reuse on your next scraper

  • Session with headers + connection pooling
  • Timeouts on every request
  • Retries with backoff for 429 and 5xx
  • Throttling with jitter (sleep randomization)
  • HTML parsing with fallbacks (never assume elements exist)
  • Dedupe strategy (unique key like link)
  • Export to CSV and/or persist to SQLite
  • Logs for failures and samples for debugging


