Python Web Scraping in Practice: Build a Resilient Scraper (Requests + BeautifulSoup + Retries + CSV)

Web scraping sounds easy: fetch HTML, parse it, done. In real projects, you’ll hit messy markup, pagination, flaky networks, rate limits, and pages that change over time. This guide shows a practical, junior/mid-friendly pattern for building a resilient Python scraper you can actually maintain.

We’ll build a small scraper that:

  • Downloads pages with timeouts, retries, and a polite delay
  • Parses items safely (handling missing fields)
  • Follows pagination
  • Saves results to CSV
  • Is structured so you can test/extend it later

Note: Always review a site’s terms and be respectful: throttle requests, cache where possible, and don’t scrape personal data.
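If you also want to honor robots.txt as part of being respectful, the standard library ships a parser; a minimal check could look like this (the URLs are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

if not rp.can_fetch("PracticalScraper", "https://example.com/products"):
    raise SystemExit("robots.txt disallows this path; stopping.")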

Setup

Install dependencies:

python -m venv .venv
source .venv/bin/activate    # macOS/Linux
# .venv\Scripts\activate     # Windows
pip install requests beautifulsoup4 lxml

We’ll use lxml because it’s fast and handles imperfect HTML better than the default parser.
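As a tiny illustration (separate from the scraper itself), lxml quietly repairs markup such as unclosed tags:

from bs4 import BeautifulSoup

messy = "<ul><li>One<li>Two</ul>"  # the <li> tags are never closed
soup = BeautifulSoup(messy, "lxml")
print([li.get_text() for li in soup.select("li")])  # ['One', 'Two']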

A practical project structure

Put this in scrape.py. Even for a “simple script”, separating fetch/parse/save will keep you sane:

  • fetch_html(url) handles HTTP concerns
  • parse_list_page(html) extracts items + next page URL
  • save_csv(rows) stores output

Step 1: A polite, retrying HTTP client

Network calls fail. Servers throttle. Your code should expect that.

import csv
import time
import random
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session() -> requests.Session:
    session = requests.Session()
    # Identify yourself. Some sites block generic/default user agents.
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (compatible; PracticalScraper/1.0; +https://example.com/bot)"
    })
    retry = Retry(
        total=5,
        connect=5,
        read=5,
        backoff_factor=0.6,  # exponential backoff: 0.6, 1.2, 2.4...
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=10)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def polite_sleep(min_s=0.6, max_s=1.4) -> None:
    # Jitter helps avoid hammering at an exact interval.
    time.sleep(random.uniform(min_s, max_s))


def fetch_html(session: requests.Session, url: str, timeout_s=15) -> str:
    resp = session.get(url, timeout=timeout_s)
    # Some scrapers still parse non-200 responses, but usually you want to fail clearly.
    if resp.status_code >= 400:
        raise RuntimeError(f"HTTP {resp.status_code} for {url}")
    return resp.text

Key ideas:

  • Retry handles common transient failures, including 429 Too Many Requests.
  • backoff_factor adds increasing delay between retries.
  • We include a User-Agent so our requests don’t look like a default bot.
  • polite_sleep() slows us down between pages.

Step 2: Parse a list page safely (and handle missing fields)

Scraping usually targets “list pages” (category pages, search results) that contain repeated “cards” for each item.

Let’s assume the list page contains elements like:

  • Item container: .product-card
  • Title: .product-title
  • Price: .price
  • Link: an <a> tag inside the card
  • Pagination link: a.next

You will need to adjust selectors for your target site, but the pattern stays the same.

def text_or_none(el):
    return el.get_text(strip=True) if el else None


def parse_list_page(html: str, base_url: str) -> tuple[list[dict], str | None]:
    soup = BeautifulSoup(html, "lxml")
    rows = []
    cards = soup.select(".product-card")

    for card in cards:
        title = text_or_none(card.select_one(".product-title"))
        price = text_or_none(card.select_one(".price"))
        link_el = card.select_one("a[href]")
        url = urljoin(base_url, link_el["href"]) if link_el else None

        # Example of defensive parsing: skip totally broken cards
        if not title or not url:
            continue

        rows.append({
            "title": title,
            "price": price,
            "url": url,
        })

    next_el = soup.select_one("a.next[href]")
    next_url = urljoin(base_url, next_el["href"]) if next_el else None
    return rows, next_url

Why this is robust:

  • We treat fields as optional: missing price doesn’t kill the scrape.
  • We use urljoin so relative links become absolute.
  • We skip unusable cards rather than crashing halfway through.
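To sanity-check the parser while tuning selectors, you can feed it a small in-memory snippet instead of re-fetching the site. The markup below is hypothetical but matches the selectors above:

sample = """
<div class="product-card">
  <a href="/items/1"><span class="product-title">Blue Widget</span></a>
  <span class="price">$19.99</span>
</div>
<a class="next" href="/products?page=2">Next</a>
"""

rows, next_url = parse_list_page(sample, base_url="https://example.com/products")
print(rows)      # [{'title': 'Blue Widget', 'price': '$19.99', 'url': 'https://example.com/items/1'}]
print(next_url)  # https://example.com/products?page=2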

Step 3: Crawl pagination without infinite loops

Pagination is where many scrapers accidentally DDoS a site or loop forever. Track visited URLs.

def crawl_list(start_url: str, max_pages: int = 10) -> list[dict]:
    session = make_session()
    all_rows: list[dict] = []
    url = start_url
    visited = set()
    pages = 0

    while url and pages < max_pages:
        if url in visited:
            break
        visited.add(url)

        print(f"[+] Fetching: {url}")
        html = fetch_html(session, url)
        rows, next_url = parse_list_page(html, base_url=url)
        all_rows.extend(rows)

        pages += 1
        polite_sleep()
        url = next_url

    return all_rows

Tips:

  • max_pages is a safety valve for “oops” moments.
  • visited prevents loops caused by weird pagination or tracking parameters.

Step 4: Save results to CSV (with stable columns)

CSV is a great first output format: easy to inspect, import into Excel, or feed into another system.

def save_csv(path: str, rows: list[dict]) -> None:
    if not rows:
        print("[!] No rows to save.")
        return

    fieldnames = ["title", "price", "url"]  # stable order
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        for r in rows:
            w.writerow({
                "title": r.get("title"),
                "price": r.get("price"),
                "url": r.get("url"),
            })

    print(f"[+] Saved {len(rows)} rows to {path}")

Putting it together: a runnable script

import argparse


def main():
    parser = argparse.ArgumentParser(description="Practical web scraper (list pages -> CSV)")
    parser.add_argument("--start", required=True, help="Start URL of the list page")
    parser.add_argument("--out", default="output.csv", help="Output CSV path")
    parser.add_argument("--max-pages", type=int, default=10, help="Max pages to crawl")
    args = parser.parse_args()

    rows = crawl_list(args.start, max_pages=args.max_pages)
    save_csv(args.out, rows)


if __name__ == "__main__":
    main()

Run it like:

python scrape.py --start "https://example.com/products?page=1" --out products.csv --max-pages 5

Real-world upgrades you’ll want next

  • Cache responses during development: Save HTML to disk so you don’t refetch the same pages while tuning selectors.
  • Normalize data: Strip currency symbols, parse numbers, standardize whitespace (see the sketch after this list).
  • Handle “detail pages”: If list pages don’t include enough fields, collect URLs first, then fetch each detail page with a second pass (still throttled).
  • Detect blocking: If you start seeing login pages, CAPTCHAs, or empty HTML, log it and stop rather than looping.
  • Prefer official APIs when available: They’re more stable, faster, and often explicitly allowed.
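As one concrete example of the normalization step, a small helper (the normalize_price name is just a suggestion) can turn strings like "$1,299.00" into numbers. It assumes dot-decimal prices, so adjust for other locales:

import re

def normalize_price(raw: str | None) -> float | None:
    # "$1,299.00" -> 1299.0; returns None for missing or unparseable values.
    if not raw:
        return None
    cleaned = re.sub(r"[^\d.]", "", raw.replace(",", ""))
    try:
        return float(cleaned)
    except ValueError:
        return None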

Common scraping pitfalls (and how to avoid them)

  • Selectors break silently: Add checks like “if 0 cards parsed, alert/fail” so you notice site changes (see the sketch after this list).
  • No timeouts: Always set timeout or your script may hang forever.
  • Over-parallelization: Don’t start with threads/async; get correctness first. If you scale up, do it carefully and politely.
  • Ignoring encoding: Write files with encoding="utf-8" and keep text normalized.
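For the first pitfall, one option is a thin wrapper around parse_list_page that fails loudly when zero cards come back (a sketch, not part of the script above):

def parse_list_page_strict(html: str, base_url: str) -> tuple[list[dict], str | None]:
    rows, next_url = parse_list_page(html, base_url)
    if not rows:
        # Either the markup changed or we're being served a block/login page.
        raise RuntimeError(f"Parsed 0 items from {base_url}; selectors may be stale.")
    return rows, next_url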

Where to go from here

Once you’re comfortable with this pattern, the next step is adding a “detail page” parser and turning your scraper into a small pipeline: collect URLs → fetch details → output normalized data. The same structure carries over to most domains (product listings, real estate, job posts); only the selectors and data-cleaning logic change.
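Here is one way that second pass could look; parse_detail_page and its .product-description / .sku selectors are placeholders you would swap for your target site:

def parse_detail_page(html: str) -> dict:
    # Hypothetical selectors; adjust for the real detail page.
    soup = BeautifulSoup(html, "lxml")
    return {
        "description": text_or_none(soup.select_one(".product-description")),
        "sku": text_or_none(soup.select_one(".sku")),
    }


def fetch_details(rows: list[dict]) -> list[dict]:
    session = make_session()
    enriched = []
    for row in rows:
        html = fetch_html(session, row["url"])
        enriched.append({**row, **parse_detail_page(html)})
        polite_sleep()  # still throttled on the second pass
    return enriched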

