Python Web Scraping with Requests + BeautifulSoup: Build a Resilient Scraper with Retries, Throttling, and Incremental CSV Output

Web scraping is easy to start and easy to break. A beginner’s scraper often works once, then fails on the first network hiccup, HTML change, or rate limit. In this hands-on guide, you’ll build a small but solid Python scraper using requests + BeautifulSoup with:

  • Reusable HTTP session + safe headers
  • Retries with exponential backoff
  • Polite throttling (and random jitter)
  • Robust parsing (avoid brittle selectors)
  • Incremental writes to CSV (resume-friendly)

Use case: scrape a list page of “products/articles” and then visit each detail page to extract fields like title, price, rating, and url. The exact website doesn’t matter—this pattern applies broadly.

1) Project Setup

Install dependencies:

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install requests beautifulsoup4 lxml

Why lxml? It’s faster and typically more tolerant of messy HTML than Python’s default parser.
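
BeautifulSoup lets you choose the parser per call, so if lxml is ever missing you can fall back to the standard-library parser. A minimal sketch (the make_soup helper is illustrative and isn’t reused later in the guide):

from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html: str) -> BeautifulSoup:
    # Prefer the fast, lenient lxml parser; fall back to the built-in one if lxml isn't installed.
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(html, "html.parser")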

2) Start with a “Good Citizen” HTTP Client

Many sites block requests that look like scripts. The goal is not to bypass security, but to behave like a normal browser and reduce accidental blocking.

import random
import time
from dataclasses import dataclass
from typing import Optional

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


@dataclass
class ScrapeConfig:
    base_url: str
    timeout_s: int = 15
    min_delay_s: float = 0.6
    max_delay_s: float = 1.4
    user_agent: str = (
        "Mozilla/5.0 (X11; Linux x86_64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0 Safari/537.36"
    )


def build_session(cfg: ScrapeConfig) -> requests.Session:
    session = requests.Session()
    session.headers.update({
        "User-Agent": cfg.user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.8",
        "Connection": "keep-alive",
    })
    retry = Retry(
        total=5,
        connect=5,
        read=5,
        backoff_factor=0.7,  # exponential backoff
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retry, pool_connections=20, pool_maxsize=20)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def polite_sleep(cfg: ScrapeConfig) -> None:
    time.sleep(random.uniform(cfg.min_delay_s, cfg.max_delay_s))


def fetch_html(session: requests.Session, url: str, cfg: ScrapeConfig) -> str:
    resp = session.get(url, timeout=cfg.timeout_s)
    # If the server sends a non-2xx status, raise for most errors (404s pass through so callers can skip them)
    if resp.status_code >= 400 and resp.status_code not in (404,):
        resp.raise_for_status()
    return resp.text

What you get: automatic retries for transient issues and rate limits (429) plus a simple sleep function to reduce hammering.
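
Before writing any parsing code, it’s worth a quick smoke test of the client. The snippet below assumes the definitions above; https://example.com is a placeholder for your target site:

cfg = ScrapeConfig(base_url="https://example.com")  # placeholder target
session = build_session(cfg)

html = fetch_html(session, cfg.base_url, cfg)
print(len(html), "bytes of HTML fetched")
polite_sleep(cfg)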

3) Parse HTML Without Overfitting to the Page

A common mistake is using hyper-specific CSS selectors that break when the site’s markup shifts. Prefer stable attributes (like data-*) or broader patterns.
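
To make that concrete, here’s a contrived contrast (the selectors are invented for illustration, and html is a page you already fetched): the first selector encodes the page’s exact layout and breaks the moment a wrapper div changes; the second targets semantics.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # `html` comes from fetch_html() above

# Brittle: depends on exact nesting and auto-generated class names
title = soup.select_one("div#root > div.col-md-8 > div.card__x92 > span")

# More robust: target semantics (data-* attributes, headings) instead of layout
title = soup.select_one("[data-testid='title']") or soup.select_one("h1")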

Here’s a parsing approach that works with many “card list” pages:

from bs4 import BeautifulSoup
from urllib.parse import urljoin


def parse_list_page(html: str, base_url: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")

    # Example strategy:
    # 1) Try to find links that look like item detail pages.
    # 2) Avoid nav/footer links by scoping to a main container if possible.
    main = soup.select_one("main") or soup  # fallback if there's no <main>

    links = []
    for a in main.select("a[href]"):
        href = a.get("href", "").strip()
        if not href:
            continue
        # Heuristic: keep only links that look like detail pages,
        # e.g. "/item/123", "/products/widget", "/articles/some-slug"
        if any(seg in href for seg in ("/item/", "/product", "/products/", "/article", "/articles/")):
            links.append(urljoin(base_url, href))

    # De-duplicate while keeping order
    seen = set()
    out = []
    for u in links:
        if u not in seen:
            seen.add(u)
            out.append(u)
    return out

This uses a heuristic, which you can tighten once you know the target site’s URL structure. The goal is to avoid selecting “every link on the page” and scraping junk.
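
For example, if you discover that detail URLs always look like /products/<numeric id> (a made-up pattern here), you can tighten the heuristic into an explicit regex:

import re
from urllib.parse import urljoin

# Hypothetical URL shape: detail pages always match "/products/<numeric id>"
DETAIL_RE = re.compile(r"^/products/\d+/?$")

def looks_like_detail(href: str) -> bool:
    # Assumes relative hrefs; adapt if the site emits absolute URLs.
    return bool(DETAIL_RE.match(href))

# Inside parse_list_page(), swap the `any(seg in href ...)` check for:
#     if looks_like_detail(href):
#         links.append(urljoin(base_url, href))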

4) Parse Detail Pages into a Structured Record

Detail pages usually have better structure: a product title, maybe a price element, and a rating. Your job is to extract fields in a way that tolerates missing data.

import re
from typing import TypedDict


class Item(TypedDict):
    url: str
    title: str
    price: Optional[str]
    rating: Optional[float]


def text_or_none(el) -> Optional[str]:
    if not el:
        return None
    txt = el.get_text(strip=True)
    return txt or None


def parse_rating(text: Optional[str]) -> Optional[float]:
    if not text:
        return None
    # Extract first float-like number: "4.6 out of 5" -> 4.6
    m = re.search(r"(\d+(?:\.\d+)?)", text)
    return float(m.group(1)) if m else None


def parse_detail_page(html: str, url: str) -> Item:
    soup = BeautifulSoup(html, "lxml")

    # Title: try common patterns
    title = (
        text_or_none(soup.select_one("h1"))
        or text_or_none(soup.select_one("[data-testid='title']"))
        or text_or_none(soup.select_one(".product-title"))
        or "UNKNOWN"
    )

    # Price: try common patterns
    price = (
        text_or_none(soup.select_one("[data-testid='price']"))
        or text_or_none(soup.select_one(".price"))
        or text_or_none(soup.find(string=re.compile(r"\$\s?\d")))
    )
    if isinstance(price, str) and len(price) > 60:
        # If we accidentally captured a long text node, discard it
        price = None

    # Rating: try common patterns
    rating_text = (
        text_or_none(soup.select_one("[data-testid='rating']"))
        or text_or_none(soup.select_one(".rating"))
        or text_or_none(soup.find(string=re.compile(r"out of 5|stars", re.I)))
    )
    rating = parse_rating(rating_text)

    return {"url": url, "title": title, "price": price, "rating": rating}

Key idea: you’re not writing “the one correct selector.” You’re writing a small set of fallbacks that cover typical variations and fail gracefully.
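
The same fallback mindset applies to cleaning values. If you later want the price as a number rather than raw text, a small normalizer helps; this sketch assumes prices like "$1,299.00" or "USD 19.99" and would need adjusting for other locales:

import re
from typing import Optional

def parse_price(text: Optional[str]) -> Optional[float]:
    # "$1,299.00" -> 1299.0, "USD 19.99" -> 19.99, no number -> None
    if not text:
        return None
    m = re.search(r"(\d{1,3}(?:,\d{3})*(?:\.\d+)?|\d+(?:\.\d+)?)", text)
    if not m:
        return None
    return float(m.group(1).replace(",", ""))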

5) Incremental CSV Output (So You Can Resume)

Scrapers that hold everything in memory tend to lose work if they crash halfway through. Instead, write each record as you go. Also keep a “seen URLs” set so you don’t duplicate entries if you rerun.

import csv
import os
from typing import Iterable

CSV_FIELDS = ["url", "title", "price", "rating"]


def load_seen_urls(csv_path: str) -> set[str]:
    if not os.path.exists(csv_path):
        return set()
    seen = set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            if row.get("url"):
                seen.add(row["url"])
    return seen


def append_rows(csv_path: str, rows: Iterable[dict]) -> None:
    file_exists = os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=CSV_FIELDS)
        if not file_exists:
            writer.writeheader()
        for r in rows:
            writer.writerow(r)

6) Put It Together: Crawl List Pages → Scrape Details

Below is a complete script you can run. It supports multiple list pages (pagination) and writes results incrementally.

from urllib.parse import urljoin


def scrape(cfg: ScrapeConfig, start_paths: list[str], out_csv: str) -> None:
    session = build_session(cfg)
    seen = load_seen_urls(out_csv)
    detail_urls: list[str] = []

    # 1) Collect detail URLs from list pages
    for path in start_paths:
        list_url = urljoin(cfg.base_url, path)
        print(f"[LIST] {list_url}")
        html = fetch_html(session, list_url, cfg)
        urls = parse_list_page(html, cfg.base_url)
        print(f" found {len(urls)} candidate detail urls")
        detail_urls.extend(urls)
        polite_sleep(cfg)

    # De-duplicate candidates
    unique_detail_urls = []
    seen_candidates = set()
    for u in detail_urls:
        if u not in seen_candidates:
            seen_candidates.add(u)
            unique_detail_urls.append(u)

    # 2) Visit each detail page and write incrementally
    written = 0
    for i, url in enumerate(unique_detail_urls, start=1):
        if url in seen:
            continue
        try:
            print(f"[{i}/{len(unique_detail_urls)}] {url}")
            html = fetch_html(session, url, cfg)
            item = parse_detail_page(html, url)
            append_rows(out_csv, [item])
            seen.add(url)
            written += 1
        except requests.HTTPError as e:
            # For junior-friendly logging, keep it simple
            print(f" HTTP error for {url}: {e}")
        except Exception as e:
            print(f" Error for {url}: {e}")
        polite_sleep(cfg)

    print(f"Done. Added {written} new rows to {out_csv}.")


if __name__ == "__main__":
    cfg = ScrapeConfig(base_url="https://example.com")
    # Replace these with real list endpoints for your target site
    start_paths = [
        "/products?page=1",
        "/products?page=2",
    ]
    scrape(cfg, start_paths, out_csv="items.csv")

How to adapt this quickly:

  • Set base_url to your target site.
  • Update start_paths to match the list/pagination URLs.
  • Tune parse_list_page() to detect real detail URLs (often easiest by inspecting the site’s link patterns).
  • Tune parse_detail_page() for the fields you actually need (a tuned example is sketched after this list).
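
As a concrete (entirely hypothetical) example, if your target marks titles with h1.product-name, prices with span.product-price, and ratings with a data-rating attribute, the tuned parser collapses to a few direct lookups, reusing the helpers defined earlier:

def parse_detail_page(html: str, url: str) -> Item:
    # Tuned for a hypothetical site; swap in the selectors you actually see in DevTools.
    soup = BeautifulSoup(html, "lxml")
    title = text_or_none(soup.select_one("h1.product-name")) or "UNKNOWN"
    price = text_or_none(soup.select_one("span.product-price"))
    rating_el = soup.select_one("[data-rating]")
    rating = parse_rating(rating_el.get("data-rating") if rating_el else None)
    return {"url": url, "title": title, "price": price, "rating": rating}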

7) Practical Debugging Tips (When It Doesn’t Work)

  • Check what you fetched: print resp.status_code and save a sample HTML to disk. Sometimes you’re receiving a “bot check” page instead of the real content.
  • Log the first match: if title is UNKNOWN, print soup.title and the first h1 content to see what the page actually contains.
  • Handle dynamic sites: if the page renders content via JS, requests won’t see it. Options:
    • Look for an underlying JSON endpoint (Network tab in DevTools); see the sketch after this list.
    • Use a browser automation tool (Selenium/Playwright) only when necessary.
  • Don’t scrape too fast: if you get many 429 responses, increase delays and reduce concurrency.
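
For the JSON-endpoint route, the same session and throttling still apply. The endpoint path and response shape below are invented placeholders; replace them with whatever the Network tab actually shows:

from urllib.parse import urljoin

# Assumes `session` and `cfg` from the earlier blocks; the path and payload shape are hypothetical.
api_url = urljoin(cfg.base_url, "/api/products?page=1")
resp = session.get(api_url, timeout=cfg.timeout_s, headers={"Accept": "application/json"})
resp.raise_for_status()
data = resp.json()

# Assumes a payload like {"items": [{"title": ..., "price": ...}, ...]}
for entry in data.get("items", []):
    print(entry.get("title"), entry.get("price"))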

8) Responsible Scraping Checklist

Before scraping a site, confirm you’re allowed to do so; the robots.txt check sketched after this checklist is one quick signal. Keep your load low, respect rate limits, and avoid collecting sensitive data. If the site offers an API, prefer it: APIs are usually more stable and intended for automated access.

  • Throttle requests (you already do).
  • Retry transient failures but don’t loop forever.
  • Store results incrementally.
  • Expect missing fields and HTML changes.
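
One lightweight, automatable signal is the site’s robots.txt, which the standard library can parse. A minimal sketch (robots.txt is only part of the picture; it doesn’t replace reading the site’s terms):

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def allowed_by_robots(base_url: str, path: str, user_agent: str = "*") -> bool:
    # Fetches and parses /robots.txt, then asks whether `path` may be crawled.
    rp = RobotFileParser()
    rp.set_url(urljoin(base_url, "/robots.txt"))
    rp.read()
    return rp.can_fetch(user_agent, urljoin(base_url, path))

# Example: allowed_by_robots("https://example.com", "/products?page=1")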

Next Step: Add Pagination Discovery (Optional Upgrade)

If your list pages include a “Next” link, you can auto-discover pages instead of hardcoding start_paths. Add a function that finds a next-page URL and loop until it’s missing. The same resilient patterns above still apply.
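
A minimal sketch of that idea; the rel="next" and "Next" link-text heuristics are assumptions, so adjust them to the pagination markup you actually see:

import re
from typing import Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def find_next_page(html: str, current_url: str) -> Optional[str]:
    soup = BeautifulSoup(html, "lxml")
    # Common patterns: <a rel="next">, <link rel="next">, or an <a> whose text is "Next"
    candidate = (
        soup.select_one("a[rel~='next']")
        or soup.select_one("link[rel~='next']")
        or soup.find("a", string=re.compile(r"^\s*next\s*[»>]?\s*$", re.I))
    )
    if candidate and candidate.get("href"):
        return urljoin(current_url, candidate["href"])
    return None

# In scrape(), fetch a list page, collect its detail URLs, then follow
# find_next_page(...) until it returns None (with a sensible page cap).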

With this structure in place, you’ve got a scraper that’s not just a demo—it’s a small tool you can run repeatedly, resume after failures, and adapt to new targets without rewriting everything.

