Python Web Scraping for Real Projects: Build a Resilient Scraper (Requests + BeautifulSoup + Retries + CSV)

Web scraping sounds simple: fetch a page, parse HTML, save data. In real projects, pages fail, HTML changes, you get blocked, and your “quick script” becomes a flaky mess. This hands-on guide shows a practical pattern for junior/mid developers: a small but resilient Python scraper with retries, timeouts, polite rate limiting, and clean parsing.

We’ll scrape a well-known practice website (books.toscrape.com) and export results to a CSV. The techniques apply to your own internal pages or permitted targets.

Before You Start: Scraping Rules You Should Actually Follow

  • Get permission for anything beyond personal learning. Many sites forbid scraping in their terms.

  • Be polite: add delays, avoid hammering endpoints, and identify your client with a User-Agent.

  • Expect change: HTML selectors will break. Write parsers that fail loudly and are easy to update.

  • Prefer official APIs when available—less brittle, less legal risk.

Project Setup

Create a virtual environment and install dependencies:

python -m venv .venv

# macOS/Linux
source .venv/bin/activate

# Windows
# .venv\Scripts\activate

pip install requests beautifulsoup4 lxml tenacity

Why these packages?

  • requests: HTTP client

  • beautifulsoup4 + lxml: fast HTML parsing

  • tenacity: clean retries with backoff (better than ad-hoc try/except)

Core Design: Separate Fetching, Parsing, and Persistence

A maintainable scraper usually has three layers:

  • Fetcher: HTTP, retries, timeouts, headers, rate limiting

  • Parser: turn HTML into structured data (dicts / dataclasses)

  • Writer: save to CSV/JSON/DB

This separation makes failures easier to debug and selectors easier to update.

Step 1: A Reliable HTTP Fetcher (Timeouts + Retries + Backoff)

Create scrape_books.py:

import csv
import random
import time
from dataclasses import dataclass
from typing import Iterable, Optional
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from tenacity import retry, stop_after_attempt, wait_exponential_jitter, retry_if_exception_type

BASE_URL = "https://books.toscrape.com/"
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; PracticalScraper/1.0; +https://example.com/bot)"
}


class FetchError(RuntimeError):
    pass


session = requests.Session()
session.headers.update(DEFAULT_HEADERS)


@retry(
    reraise=True,
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=8),
    retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch(url: str, timeout: float = 15.0) -> str:
    """
    Fetch HTML with retries. Retries on network issues and selected HTTP errors.
    """
    resp = session.get(url, timeout=timeout)
    # Retry on transient server problems (and some blocking pages)
    if resp.status_code in (429, 500, 502, 503, 504):
        raise FetchError(f"Transient HTTP {resp.status_code} for {url}")
    resp.raise_for_status()
    return resp.text


def polite_sleep(min_s: float = 0.6, max_s: float = 1.5) -> None:
    """Small random delay to reduce load and avoid looking like a bot."""
    time.sleep(random.uniform(min_s, max_s))

What this buys you:

  • timeout prevents hanging requests

  • Retries with exponential backoff avoid “fails once, script dies”

  • Session keeps connections warm (faster, fewer TCP handshakes)
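tenacity handles the backoff schedule for you, but it helps to see what "exponential backoff with jitter" actually computes. This hand-rolled sketch shows one common variant ("full jitter", where each wait is drawn uniformly between zero and the current delay); tenacity's wait_exponential_jitter uses a slightly different formula, but the effect is the same: waits grow exponentially and are randomized so many clients don't retry in lockstep.

```python
import random


def backoff_delays(initial: float = 1.0, cap: float = 8.0, attempts: int = 4) -> list[float]:
    """Compute the waits between attempts: delay doubles (1, 2, 4, ...) up to cap,
    and each actual wait is a random value in [0, delay] ("full jitter")."""
    delays = []
    for attempt in range(attempts - 1):  # no wait after the final attempt
        delay = min(cap, initial * (2 ** attempt))
        delays.append(random.uniform(0, delay))
    return delays
```

With the defaults above you get three waits (between four attempts), each bounded by 1, 2, and 4 seconds respectively.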

Step 2: Parse Books Cleanly (and Fail Loudly)

We’ll parse each listing page into structured data. Use a dataclass so your results are typed and consistent.

@dataclass
class Book:
    title: str
    price_gbp: float
    availability: str
    rating: Optional[int]
    product_url: str


RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_book_list(html: str, page_url: str) -> tuple[list[Book], Optional[str]]:
    """
    Returns (books, next_page_url).
    """
    soup = BeautifulSoup(html, "lxml")
    articles = soup.select("article.product_pod")
    if not articles:
        # If this happens, the site structure likely changed.
        raise ValueError("No book entries found. Selector may be outdated.")

    books: list[Book] = []
    for a in articles:
        title_el = a.select_one("h3 a")
        price_el = a.select_one(".price_color")
        avail_el = a.select_one(".availability")
        rating_el = a.select_one("p.star-rating")

        if not (title_el and price_el and avail_el):
            # Fail loudly: you want to notice changes early.
            raise ValueError("Missing expected fields on a book card.")

        title = title_el.get("title", "").strip()

        # Price like "£51.77"
        price_text = price_el.get_text(strip=True).replace("£", "")
        price = float(price_text)

        availability = " ".join(avail_el.get_text(" ", strip=True).split())

        # Rating is stored as a class: <p class="star-rating Three">
        rating = None
        if rating_el:
            classes = rating_el.get("class", [])
            # Find known rating word
            rating_word = next((c for c in classes if c in RATING_MAP), None)
            rating = RATING_MAP.get(rating_word) if rating_word else None

        rel_url = title_el.get("href")
        product_url = urljoin(page_url, rel_url)

        books.append(
            Book(
                title=title,
                price_gbp=price,
                availability=availability,
                rating=rating,
                product_url=product_url,
            )
        )

    next_el = soup.select_one("li.next a")
    next_url = urljoin(page_url, next_el["href"]) if next_el else None
    return books, next_url

Parser tips:

  • Use select/select_one with CSS selectors (readable, common)

  • Normalize text early (strip, collapse whitespace)

  • Throw explicit errors when expected structure disappears—silent failures are worse
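The normalization tip is worth making concrete. A small helper like this (clean_text and parse_price are illustrative names, not part of the script above) puts whitespace collapsing and currency stripping in one place, so every field goes through the same path:

```python
def clean_text(raw: str) -> str:
    """Collapse runs of whitespace (spaces, newlines, NBSP) into single spaces."""
    return " ".join(raw.split())


def parse_price(raw: str, currency: str = "£") -> float:
    """Strip a leading currency symbol and parse the remainder as a float."""
    return float(clean_text(raw).lstrip(currency))
```

str.split() with no arguments treats any whitespace run, including non-breaking spaces, as one separator, which is exactly what scraped text usually needs.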

Step 3: Pagination Loop + CSV Output

Now glue it together: start at the homepage, follow next links, and write rows to CSV.

def iter_books(start_url: str = BASE_URL, max_pages: int = 5) -> Iterable[Book]:
    url = start_url
    pages = 0
    while url and pages < max_pages:
        html = fetch(url)
        books, next_url = parse_book_list(html, url)
        for b in books:
            yield b
        pages += 1
        url = next_url
        if url:
            polite_sleep()


def write_csv(books: Iterable[Book], path: str) -> int:
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["title", "price_gbp", "availability", "rating", "product_url"],
        )
        writer.writeheader()
        for b in books:
            writer.writerow(
                {
                    "title": b.title,
                    "price_gbp": b.price_gbp,
                    "availability": b.availability,
                    "rating": b.rating,
                    "product_url": b.product_url,
                }
            )
            count += 1
    return count

Step 4: Add a Small CLI (So It’s Reusable)

A CLI makes your script easier to run in different environments (local, CI, cron, Docker).

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Scrape books.toscrape.com into CSV.")
    parser.add_argument("--pages", type=int, default=5, help="Max pages to scrape (default: 5)")
    parser.add_argument("--out", type=str, default="books.csv", help="Output CSV path")
    args = parser.parse_args()

    books = iter_books(start_url=BASE_URL, max_pages=args.pages)
    total = write_csv(books, args.out)
    print(f"Wrote {total} rows to {args.out}")

Run it:

python scrape_books.py --pages 3 --out books.csv

Common Production Problems (and Quick Fixes)

  • You get blocked (429 / captcha / unexpected HTML).

    • Slow down: increase polite_sleep() range.

    • Reduce concurrency (this example is single-threaded already).

    • Check if the site requires JavaScript rendering; if so, consider a headless browser (Playwright) only if allowed.

  • Your selectors break.

    • Log the failing URL and save the HTML to a debug file so you can inspect changes.

    • Make selectors as specific as needed, but not overly brittle.

  • Data quality issues (missing prices, weird text).

    • Add validation: e.g., ensure price_gbp > 0, title non-empty.

    • Normalize text consistently (whitespace, currency symbols, encoding).

  • Performance is slow.

    • Use requests.Session() (we did).

    • Batch writes (we write streaming rows; for DB use bulk inserts).

    • Only fetch what you need (avoid visiting each product page unless required).
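The validation idea can be a small guard function that returns problems instead of raising, so you can decide per-row whether to skip, log, or abort. This sketch assumes the Book dataclass from Step 2 (re-declared here so the snippet stands alone); validate_book is a hypothetical helper, not part of the script above:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Book:
    title: str
    price_gbp: float
    availability: str
    rating: Optional[int]
    product_url: str


def validate_book(b: Book) -> list[str]:
    """Return human-readable problems; an empty list means the row looks sane."""
    problems = []
    if not b.title.strip():
        problems.append("empty title")
    if b.price_gbp <= 0:
        problems.append(f"non-positive price: {b.price_gbp}")
    if b.rating is not None and not 1 <= b.rating <= 5:
        problems.append(f"rating out of range: {b.rating}")
    if not b.product_url.startswith("http"):
        problems.append(f"suspicious URL: {b.product_url}")
    return problems
```

Calling this in the write loop and logging any non-empty result catches bad selectors and encoding glitches before they pollute your CSV.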

Small Upgrade: Save the Raw HTML When Parsing Fails

This tiny helper can save you hours when a site changes:

def safe_parse(html: str, page_url: str) -> tuple[list[Book], Optional[str]]:
    try:
        return parse_book_list(html, page_url)
    except Exception as e:
        fname = f"debug_{int(time.time())}.html"
        with open(fname, "w", encoding="utf-8") as f:
            f.write(html)
        raise RuntimeError(
            f"Parse failed for {page_url}. Saved HTML to {fname}. Original error: {e}"
        )

If you adopt this, swap parse_book_list with safe_parse inside iter_books.

Wrap-Up: A Scraper You Can Actually Maintain

You now have a scraper that:

  • Uses timeouts and retries with exponential backoff

  • Respects the target with small random delays

  • Parses HTML into a structured Book model

  • Exports clean CSV output and supports CLI flags

Next steps (when you’re ready): store results in SQLite, add structured logging, and write a couple of tests for your parser using saved HTML fixtures. That’s how you turn scraping from a fragile script into a reliable tool.
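As a taste of the SQLite step, a minimal writer using the stdlib sqlite3 module might look like the sketch below. The write_sqlite function and the schema are illustrative, not part of the script above; it takes plain tuples so it stays decoupled from the Book dataclass:

```python
import sqlite3
from typing import Iterable


def write_sqlite(rows: Iterable[tuple], db_path: str = "books.db") -> int:
    """Bulk-insert (title, price_gbp, availability, rating, product_url) tuples."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS books (
                   title TEXT NOT NULL,
                   price_gbp REAL,
                   availability TEXT,
                   rating INTEGER,
                   product_url TEXT UNIQUE
               )"""
        )
        # executemany batches the inserts; UNIQUE on product_url plus
        # INSERT OR REPLACE makes re-runs idempotent.
        cur = conn.executemany(
            "INSERT OR REPLACE INTO books VALUES (?, ?, ?, ?, ?)", rows
        )
        conn.commit()
        return cur.rowcount
    finally:
        conn.close()
```

Feeding it from the scraper is one line: write_sqlite((b.title, b.price_gbp, b.availability, b.rating, b.product_url) for b in iter_books()).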

