Python Web Scraping with Requests + BeautifulSoup: Build a Resilient Scraper with Retries, Throttling, and Incremental CSV Output
Web scraping is easy to start and easy to break. A junior-friendly scraper often works once, then fails on the first network hiccup, HTML change, or rate limit. In this hands-on guide, you’ll build a small but solid Python scraper using requests + BeautifulSoup with:
- Reusable HTTP session + safe headers
- Retries with exponential backoff
- Polite throttling (and random jitter)
- Robust parsing (avoid brittle selectors)
- Incremental writes to CSV (resume-friendly)
Use case: scrape a list page of “products/articles” and then visit each detail page to extract fields like title, price, rating, and url. The exact website doesn’t matter—this pattern applies broadly.
1) Project Setup
Install dependencies:
```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install requests beautifulsoup4 lxml
```
Why lxml? It’s faster and typically more tolerant of messy HTML than Python’s built-in html.parser.
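If you can’t guarantee lxml is installed everywhere the scraper runs, you can fall back to the standard-library parser. A minimal sketch (the make_soup helper is only an illustration and isn’t used by the code below):

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html: str) -> BeautifulSoup:
    # Hypothetical helper: prefer lxml, fall back to the built-in parser
    # if lxml isn't installed in this environment.
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(html, "html.parser")
```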
2) Start with a “Good Citizen” HTTP Client
Many sites block requests that look like scripts. The goal is not to bypass security, but to behave like a normal browser and reduce accidental blocking.
```python
import random
import time
from dataclasses import dataclass
from typing import Optional

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


@dataclass
class ScrapeConfig:
    base_url: str
    timeout_s: int = 15
    min_delay_s: float = 0.6
    max_delay_s: float = 1.4
    user_agent: str = (
        "Mozilla/5.0 (X11; Linux x86_64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0 Safari/537.36"
    )


def build_session(cfg: ScrapeConfig) -> requests.Session:
    session = requests.Session()
    session.headers.update({
        "User-Agent": cfg.user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.8",
        "Connection": "keep-alive",
    })
    retry = Retry(
        total=5,
        connect=5,
        read=5,
        backoff_factor=0.7,  # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retry, pool_connections=20, pool_maxsize=20)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def polite_sleep(cfg: ScrapeConfig) -> None:
    time.sleep(random.uniform(cfg.min_delay_s, cfg.max_delay_s))


def fetch_html(session: requests.Session, url: str, cfg: ScrapeConfig) -> str:
    resp = session.get(url, timeout=cfg.timeout_s)
    # If the server sends a non-2xx status, raise for most errors,
    # but tolerate 404s so one missing page doesn't abort the run.
    if resp.status_code >= 400 and resp.status_code != 404:
        resp.raise_for_status()
    return resp.text
```
What you get: automatic retries for transient issues and rate limits (429) plus a simple sleep function to reduce hammering.
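A quick usage sketch to sanity-check the client before wiring in any parsing (the base URL is a placeholder):

```python
cfg = ScrapeConfig(base_url="https://example.com")  # placeholder target
session = build_session(cfg)

html = fetch_html(session, cfg.base_url, cfg)
print(f"fetched {len(html)} characters of HTML")
polite_sleep(cfg)  # pause before the next request
```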
3) Parse HTML Without Overfitting to the Page
A common mistake is using hyper-specific CSS selectors that break when the site’s markup shifts. Prefer stable attributes (like data-*) or broader patterns.
Here’s a parsing approach that works with many “card list” pages:
```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def parse_list_page(html: str, base_url: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")

    # Example strategy:
    # 1) Try to find links that look like item detail pages.
    # 2) Avoid nav/footer links by scoping to a main container if possible.
    main = soup.select_one("main") or soup  # fallback if there's no <main>

    links = []
    for a in main.select("a[href]"):
        href = a.get("href", "").strip()
        if not href:
            continue
        # Heuristic: keep only links that look like detail pages,
        # e.g. "/item/123", "/products/widget", "/articles/some-slug"
        if any(seg in href for seg in ("/item/", "/product", "/products/", "/article", "/articles/")):
            links.append(urljoin(base_url, href))

    # De-duplicate while keeping order
    seen = set()
    out = []
    for u in links:
        if u not in seen:
            seen.add(u)
            out.append(u)
    return out
```
This uses a heuristic, which you can tighten once you know the target site’s URL structure. The goal is to avoid selecting “every link on the page” and scraping junk.
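For example, once you know the real URL pattern, you can swap the substring check inside parse_list_page() for a stricter regex. A sketch with a hypothetical pattern (adjust it to the paths you actually see):

```python
import re

# Hypothetical pattern: relative links like "/products/<slug>" or "/item/<numeric id>"
DETAIL_URL_RE = re.compile(r"^/(?:products/[\w-]+|item/\d+)/?$")


def looks_like_detail_url(href: str) -> bool:
    # Assumes hrefs are site-relative; strip the domain first if they're absolute.
    return bool(DETAIL_URL_RE.match(href))
```

Inside the loop, `if looks_like_detail_url(href):` then replaces the `any(...)` condition.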
4) Parse Detail Pages into a Structured Record
Detail pages usually have better structure: a product title, maybe a price element, and a rating. Your job is to extract fields in a way that tolerates missing data.
```python
import re
from typing import Optional, TypedDict


class Item(TypedDict):
    url: str
    title: str
    price: Optional[str]
    rating: Optional[float]


def text_or_none(el) -> Optional[str]:
    if not el:
        return None
    txt = el.get_text(strip=True)
    return txt or None


def parse_rating(text: Optional[str]) -> Optional[float]:
    if not text:
        return None
    # Extract the first float-like number: "4.6 out of 5" -> 4.6
    m = re.search(r"(\d+(?:\.\d+)?)", text)
    return float(m.group(1)) if m else None


def parse_detail_page(html: str, url: str) -> Item:
    soup = BeautifulSoup(html, "lxml")

    # Title: try common patterns
    title = (
        text_or_none(soup.select_one("h1"))
        or text_or_none(soup.select_one("[data-testid='title']"))
        or text_or_none(soup.select_one(".product-title"))
        or "UNKNOWN"
    )

    # Price: try common patterns
    price = (
        text_or_none(soup.select_one("[data-testid='price']"))
        or text_or_none(soup.select_one(".price"))
        or text_or_none(soup.find(string=re.compile(r"\$\s?\d")))
    )
    if isinstance(price, str) and len(price) > 60:
        # If we accidentally captured a long text node, discard it
        price = None

    # Rating: try common patterns
    rating_text = (
        text_or_none(soup.select_one("[data-testid='rating']"))
        or text_or_none(soup.select_one(".rating"))
        or text_or_none(soup.find(string=re.compile(r"out of 5|stars", re.I)))
    )
    rating = parse_rating(rating_text)

    return {"url": url, "title": title, "price": price, "rating": rating}
```
Key idea: you’re not writing “the one correct selector.” You’re writing a small set of fallbacks that cover typical variations and fail gracefully.
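If the `a or b or c` fallback chains start to repeat, a small helper keeps them readable. This is an optional refactor, not something the later code depends on:

```python
from typing import Optional

from bs4 import BeautifulSoup


def first_text(soup: BeautifulSoup, selectors: list[str]) -> Optional[str]:
    """Return the stripped text of the first selector that matches and has content."""
    for sel in selectors:
        el = soup.select_one(sel)
        if el:
            txt = el.get_text(strip=True)
            if txt:
                return txt
    return None


# Usage inside parse_detail_page():
# title = first_text(soup, ["h1", "[data-testid='title']", ".product-title"]) or "UNKNOWN"
```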
5) Incremental CSV Output (So You Can Resume)
Scrapers that hold everything in memory tend to lose work if they crash halfway through. Instead, write each record as you go. Also keep a “seen URLs” set so you don’t duplicate entries if you rerun.
```python
import csv
import os
from typing import Iterable

CSV_FIELDS = ["url", "title", "price", "rating"]


def load_seen_urls(csv_path: str) -> set[str]:
    if not os.path.exists(csv_path):
        return set()
    seen = set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            if row.get("url"):
                seen.add(row["url"])
    return seen


def append_rows(csv_path: str, rows: Iterable[dict]) -> None:
    file_exists = os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=CSV_FIELDS)
        if not file_exists:
            writer.writeheader()
        for r in rows:
            writer.writerow(r)
```
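A quick usage sketch of the two helpers together (the record values here are made up):

```python
seen = load_seen_urls("items.csv")

item = {
    "url": "https://example.com/item/1",  # made-up record for illustration
    "title": "Example Widget",
    "price": "$9.99",
    "rating": 4.5,
}

if item["url"] not in seen:
    append_rows("items.csv", [item])
    seen.add(item["url"])
```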
6) Put It Together: Crawl List Pages → Scrape Details
Below is a complete script you can run. It supports multiple list pages (pagination) and writes results incrementally.
```python
from urllib.parse import urljoin


def scrape(cfg: ScrapeConfig, start_paths: list[str], out_csv: str) -> None:
    session = build_session(cfg)
    seen = load_seen_urls(out_csv)
    detail_urls: list[str] = []

    # 1) Collect detail URLs from list pages
    for path in start_paths:
        list_url = urljoin(cfg.base_url, path)
        print(f"[LIST] {list_url}")
        html = fetch_html(session, list_url, cfg)
        urls = parse_list_page(html, cfg.base_url)
        print(f"  found {len(urls)} candidate detail urls")
        detail_urls.extend(urls)
        polite_sleep(cfg)

    # De-duplicate candidates while keeping order
    unique_detail_urls = []
    seen_candidates = set()
    for u in detail_urls:
        if u not in seen_candidates:
            seen_candidates.add(u)
            unique_detail_urls.append(u)

    # 2) Visit each detail page and write incrementally
    written = 0
    for i, url in enumerate(unique_detail_urls, start=1):
        if url in seen:
            continue
        try:
            print(f"[{i}/{len(unique_detail_urls)}] {url}")
            html = fetch_html(session, url, cfg)
            item = parse_detail_page(html, url)
            append_rows(out_csv, [item])
            seen.add(url)
            written += 1
        except requests.HTTPError as e:
            # For junior-friendly logging, keep it simple
            print(f"  HTTP error for {url}: {e}")
        except Exception as e:
            print(f"  Error for {url}: {e}")
        polite_sleep(cfg)

    print(f"Done. Added {written} new rows to {out_csv}.")


if __name__ == "__main__":
    cfg = ScrapeConfig(base_url="https://example.com")
    # Replace these with real list endpoints for your target site
    start_paths = [
        "/products?page=1",
        "/products?page=2",
    ]
    scrape(cfg, start_paths, out_csv="items.csv")
```
How to adapt this quickly:
- Set `base_url` to your target site.
- Update `start_paths` to match the list/pagination URLs.
- Tune `parse_list_page()` to detect real detail URLs (often easiest by inspecting the site’s link patterns).
- Tune `parse_detail_page()` for the fields you actually need.
7) Practical Debugging Tips (When It Doesn’t Work)
- Check what you fetched: print `resp.status_code` and save a sample HTML to disk (see the sketch after this list). Sometimes you’re receiving a “bot check” page instead of the real content.
- Log the first match: if `title` is `UNKNOWN`, print `soup.title` and the first `h1` content to see what the page actually contains.
- Handle dynamic sites: if the page renders content via JS, `requests` won’t see it. Options: look for an underlying JSON endpoint (Network tab in DevTools), or use a browser automation tool (Selenium/Playwright) only when necessary.
- Don’t scrape too fast: if you get many `429` responses, increase delays and reduce concurrency.
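Here’s a minimal sketch of the “save a sample HTML to disk” tip; the file name is arbitrary:

```python
def save_debug_html(html: str, path: str = "debug_page.html") -> None:
    # Dump the exact HTML the scraper received so you can open it in a browser
    # and compare it with what DevTools shows for the live page.
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    print(f"saved {len(html)} characters to {path}")


# e.g. right after fetch_html() while debugging:
# html = fetch_html(session, url, cfg)
# save_debug_html(html)
```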
8) Responsible Scraping Checklist
Before scraping a site, confirm you’re allowed to do so. Keep your load low, respect rate limits, and avoid collecting sensitive data. If the site offers an API, prefer it—APIs are usually more stable and intended for automated access.
- Throttle requests (you already do).
- Retry transient failures but don’t loop forever.
- Store results incrementally.
- Expect missing fields and HTML changes.
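One concrete way to check “am I allowed to fetch this path?” is the standard library’s robots.txt parser. A sketch, assuming the site publishes robots.txt at the usual location:

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def allowed_by_robots(base_url: str, path: str, user_agent: str = "*") -> bool:
    # Fetches <base_url>/robots.txt and asks whether this user agent may fetch the path.
    rp = RobotFileParser()
    rp.set_url(urljoin(base_url, "/robots.txt"))
    rp.read()
    return rp.can_fetch(user_agent, urljoin(base_url, path))


# e.g. before crawling a list page:
# if not allowed_by_robots(cfg.base_url, "/products?page=1"):
#     print("robots.txt disallows this path; skipping")
```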
Next Step: Add Pagination Discovery (Optional Upgrade)
If your list pages include a “Next” link, you can auto-discover pages instead of hardcoding start_paths. Add a function that finds a next-page URL and loop until it’s missing. The same resilient patterns above still apply.
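A sketch of that idea, assuming the site marks its pagination link with rel="next" or a plain “Next” label (adjust the selectors to your target):

```python
from typing import Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_next_page_url(html: str, current_url: str) -> Optional[str]:
    soup = BeautifulSoup(html, "lxml")
    # Prefer a rel="next" link; fall back to an anchor whose text is "Next".
    link = soup.select_one("a[rel~='next']") or soup.find(
        "a", string=lambda s: s and s.strip().lower() == "next"
    )
    if link and link.get("href"):
        return urljoin(current_url, link["href"])
    return None


# In scrape(), the hardcoded start_paths loop could become:
# url = urljoin(cfg.base_url, "/products?page=1")
# while url:
#     html = fetch_html(session, url, cfg)
#     detail_urls.extend(parse_list_page(html, cfg.base_url))
#     url = find_next_page_url(html, url)
#     polite_sleep(cfg)
```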
With this structure in place, you’ve got a scraper that’s not just a demo—it’s a small tool you can run repeatedly, resume after failures, and adapt to new targets without rewriting everything.