Python Web Scraping for Real Projects: Build a Resilient Scraper (Requests + BeautifulSoup + Retries + CSV)
Web scraping sounds simple: fetch a page, parse HTML, save data. In real projects, pages fail, HTML changes, you get blocked, and your “quick script” becomes a flaky mess. This hands-on guide shows a practical pattern for junior/mid developers: a small but resilient Python scraper with retries, timeouts, polite rate limiting, and clean parsing.
We’ll scrape a well-known practice website (books.toscrape.com) and export results to a CSV. The techniques apply to your own internal pages or permitted targets.
Before You Start: Scraping Rules You Should Actually Follow
- Get permission for anything beyond personal learning. Many sites forbid scraping in their terms.
- Be polite: add delays, avoid hammering endpoints, and identify your client with a User-Agent.
- Expect change: HTML selectors will break. Write parsers that fail loudly and are easy to update.
- Prefer official APIs when available—less brittle, less legal risk.
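A quick way to honor a site's crawling rules is to check its robots.txt before fetching. Here's a minimal sketch using the standard library's urllib.robotparser; the helper name and the sample robots.txt are illustrative, not part of the scraper built below:

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse a robots.txt body and report whether url may be fetched."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Example policy: block /admin/ for all agents
ROBOTS = """User-agent: *
Disallow: /admin/
"""
```

In a real scraper you would fetch `https://<site>/robots.txt` once at startup and call the checker before each request.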
Project Setup
Create a virtual environment and install dependencies:
```shell
python -m venv .venv

# macOS/Linux
source .venv/bin/activate

# Windows
# .venv\Scripts\activate

pip install requests beautifulsoup4 lxml tenacity
```
Why these packages?
- requests: HTTP client
- beautifulsoup4 + lxml: fast HTML parsing
- tenacity: clean retries with backoff (better than ad-hoc try/except)
Core Design: Separate Fetching, Parsing, and Persistence
A maintainable scraper usually has three layers:
- Fetcher: HTTP, retries, timeouts, headers, rate limiting
- Parser: turn HTML into structured data (dicts / dataclasses)
- Writer: save to CSV/JSON/DB
This separation makes failures easier to debug and selectors easier to update.
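To make the separation concrete, here's a toy sketch of the three layers as plain functions (names and logic are illustrative, not the functions built below). The point is that each layer can be stubbed or tested on its own — in a test, a fake fetcher replaces the network entirely:

```python
import re
from typing import Iterable

def fake_fetch(url: str) -> str:
    # Fetcher layer: in tests, a stub like this stands in for the network call.
    return "<p class='price'>£5.00</p>"

def parse_prices(html: str) -> list[float]:
    # Parser layer: a pure function from HTML to data
    # (a crude regex here, just to illustrate the shape).
    return [float(m) for m in re.findall(r"£(\d+\.\d+)", html)]

def write_rows(rows: Iterable[float]) -> int:
    # Writer layer: here it only counts rows; the real one writes CSV.
    return sum(1 for _ in rows)
```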
Step 1: A Reliable HTTP Fetcher (Timeouts + Retries + Backoff)
Create scrape_books.py:
```python
import csv
import random
import time
from dataclasses import dataclass
from typing import Iterable, Optional
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)

BASE_URL = "https://books.toscrape.com/"
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; PracticalScraper/1.0; +https://example.com/bot)"
}


class FetchError(RuntimeError):
    pass


session = requests.Session()
session.headers.update(DEFAULT_HEADERS)


@retry(
    reraise=True,
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=8),
    retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch(url: str, timeout: float = 15.0) -> str:
    """Fetch HTML with retries. Retries on network issues and selected HTTP errors."""
    resp = session.get(url, timeout=timeout)
    # Retry on transient server problems (and some blocking pages)
    if resp.status_code in (429, 500, 502, 503, 504):
        raise FetchError(f"Transient HTTP {resp.status_code} for {url}")
    resp.raise_for_status()
    return resp.text


def polite_sleep(min_s: float = 0.6, max_s: float = 1.5) -> None:
    """Small random delay to reduce load and avoid looking like a bot."""
    time.sleep(random.uniform(min_s, max_s))
```
What this buys you:
- timeout prevents hanging requests
- Retries with exponential backoff avoid “fails once, script dies”
- Session keeps connections warm (faster, fewer TCP handshakes)
Step 2: Parse Books Cleanly (and Fail Loudly)
We’ll parse each listing page into structured data. Use a dataclass so your results are typed and consistent.
```python
@dataclass
class Book:
    title: str
    price_gbp: float
    availability: str
    rating: Optional[int]
    product_url: str


RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_book_list(html: str, page_url: str) -> tuple[list[Book], Optional[str]]:
    """Returns (books, next_page_url)."""
    soup = BeautifulSoup(html, "lxml")
    articles = soup.select("article.product_pod")
    if not articles:
        # If this happens, the site structure likely changed.
        raise ValueError("No book entries found. Selector may be outdated.")

    books: list[Book] = []
    for a in articles:
        title_el = a.select_one("h3 a")
        price_el = a.select_one(".price_color")
        avail_el = a.select_one(".availability")
        rating_el = a.select_one("p.star-rating")

        if not (title_el and price_el and avail_el):
            # Fail loudly: you want to notice changes early.
            raise ValueError("Missing expected fields on a book card.")

        title = title_el.get("title", "").strip()

        # Price like "£51.77"
        price_text = price_el.get_text(strip=True).replace("£", "")
        price = float(price_text)

        availability = " ".join(avail_el.get_text(" ", strip=True).split())

        # Rating is stored as a class: <p class="star-rating Three">
        rating = None
        if rating_el:
            classes = rating_el.get("class", [])
            # Find the known rating word among the classes
            rating_word = next((c for c in classes if c in RATING_MAP), None)
            rating = RATING_MAP.get(rating_word) if rating_word else None

        rel_url = title_el.get("href")
        product_url = urljoin(page_url, rel_url)

        books.append(
            Book(
                title=title,
                price_gbp=price,
                availability=availability,
                rating=rating,
                product_url=product_url,
            )
        )

    next_el = soup.select_one("li.next a")
    next_url = urljoin(page_url, next_el["href"]) if next_el else None
    return books, next_url
```
Parser tips:
- Use select / select_one with CSS selectors (readable, common)
- Normalize text early (strip, collapse whitespace)
- Raise explicit errors when expected structure disappears—silent failures are worse
Step 3: Pagination Loop + CSV Output
Now glue it together: start at the homepage, follow next links, and write rows to CSV.
```python
def iter_books(start_url: str = BASE_URL, max_pages: int = 5) -> Iterable[Book]:
    url = start_url
    pages = 0
    while url and pages < max_pages:
        html = fetch(url)
        books, next_url = parse_book_list(html, url)
        for b in books:
            yield b
        pages += 1
        url = next_url
        if url:
            polite_sleep()


def write_csv(books: Iterable[Book], path: str) -> int:
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["title", "price_gbp", "availability", "rating", "product_url"],
        )
        writer.writeheader()
        for b in books:
            writer.writerow(
                {
                    "title": b.title,
                    "price_gbp": b.price_gbp,
                    "availability": b.availability,
                    "rating": b.rating,
                    "product_url": b.product_url,
                }
            )
            count += 1
    return count
```
Step 4: Add a Small CLI (So It’s Reusable)
A CLI makes your script easier to run in different environments (local, CI, cron, Docker).
```python
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Scrape books.toscrape.com into CSV.")
    parser.add_argument("--pages", type=int, default=5, help="Max pages to scrape (default: 5)")
    parser.add_argument("--out", type=str, default="books.csv", help="Output CSV path")
    args = parser.parse_args()

    books = iter_books(start_url=BASE_URL, max_pages=args.pages)
    total = write_csv(books, args.out)
    print(f"Wrote {total} rows to {args.out}")
```
Run it:
python scrape_books.py --pages 3 --out books.csv
Common Production Problems (and Quick Fixes)
- You get blocked (429 / captcha / unexpected HTML).
  - Slow down: increase the polite_sleep() range.
  - Reduce concurrency (this example is single-threaded already).
  - Check if the site requires JavaScript rendering; if so, consider a headless browser (Playwright) only if allowed.
- Your selectors break.
  - Log the failing URL and save the HTML to a debug file so you can inspect changes.
  - Make selectors as specific as needed, but not overly brittle.
- Data quality issues (missing prices, weird text).
  - Add validation: e.g., ensure price_gbp > 0 and the title is non-empty.
  - Normalize text consistently (whitespace, currency symbols, encoding).
- Performance is slow.
  - Use requests.Session() (we did).
  - Batch writes (we write streaming rows; for a DB, use bulk inserts).
  - Only fetch what you need (avoid visiting each product page unless required).
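The validation advice above can be captured in a small helper. This is a hypothetical sketch (the function name and rules are assumptions, not part of the scraper built in this guide) that you would run on each row before writing it:

```python
def validate_row(row: dict) -> list[str]:
    """Return a list of problems with a scraped row; an empty list means valid.
    Illustrative helper -- adapt the rules to your own data."""
    problems = []
    # Title must be present and non-blank after stripping whitespace.
    if not str(row.get("title", "")).strip():
        problems.append("empty title")
    # Price must be a positive number.
    price = row.get("price_gbp")
    if not isinstance(price, (int, float)) or price <= 0:
        problems.append("price_gbp must be > 0")
    return problems
```

Rows that fail validation can be logged and skipped rather than silently polluting your CSV.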
Small Upgrade: Save the Raw HTML When Parsing Fails
This tiny helper can save you hours when a site changes:
```python
def safe_parse(html: str, page_url: str) -> tuple[list[Book], Optional[str]]:
    try:
        return parse_book_list(html, page_url)
    except Exception as e:
        fname = f"debug_{int(time.time())}.html"
        with open(fname, "w", encoding="utf-8") as f:
            f.write(html)
        raise RuntimeError(
            f"Parse failed for {page_url}. Saved HTML to {fname}. Original error: {e}"
        )
```
If you adopt this, call safe_parse instead of parse_book_list inside iter_books.
Wrap-Up: A Scraper You Can Actually Maintain
You now have a scraper that:
- Uses timeouts and retries with exponential backoff
- Respects the target with small random delays
- Parses HTML into a structured Book model
- Exports clean CSV output and supports CLI flags
Next steps (when you’re ready): store results in SQLite, add structured logging, and write a couple of tests for your parser using saved HTML fixtures. That’s how you turn scraping from a fragile script into a reliable tool.
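To show what a fixture-based parser test might look like, here's a minimal sketch against an inline HTML snippet mimicking one product card (in practice you'd load a saved debug_*.html file instead). The parse_card helper and the fixture are illustrative, not the parse_book_list function from this guide; it uses the stdlib "html.parser" backend so the test doesn't depend on lxml:

```python
from bs4 import BeautifulSoup

# Trimmed-down fixture mimicking one card from books.toscrape.com.
FIXTURE = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/a-book_1/index.html" title="A Book">A Book</a></h3>
  <p class="price_color">£51.77</p>
  <p class="availability">In stock</p>
</article>
"""

RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_card(html: str) -> dict:
    """Parse one product card into a plain dict (illustrative)."""
    card = BeautifulSoup(html, "html.parser").select_one("article.product_pod")
    rating_word = next(
        c for c in card.select_one("p.star-rating")["class"] if c in RATING_MAP
    )
    return {
        "title": card.select_one("h3 a")["title"],
        "price_gbp": float(card.select_one(".price_color").get_text(strip=True).lstrip("£")),
        "rating": RATING_MAP[rating_word],
    }
```

Because the fixture is frozen, this test only fails when your parsing logic changes — a clean signal, separate from live-site breakage.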