Python Web Scraping in Practice: Build a Resilient Scraper (Requests + BeautifulSoup + Retries + CSV)
Web scraping sounds easy: fetch HTML, parse it, done. In real projects, you’ll hit messy markup, pagination, flaky networks, rate limits, and pages that change over time. This guide shows a practical, junior/mid-friendly pattern for building a resilient Python scraper you can actually maintain.
We’ll build a small scraper that:
- Downloads pages with timeouts, retries, and a polite delay
- Parses items safely (handling missing fields)
- Follows pagination
- Saves results to CSV
- Is structured so you can test/extend it later
Note: Always review a site’s terms and be respectful: throttle requests, cache where possible, and don’t scrape personal data.
Setup
Install dependencies:
python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows
# .venv\Scripts\activate
pip install requests beautifulsoup4 lxml
We’ll use lxml because it’s fast and handles imperfect HTML better than the default parser.
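If you'd rather not hard-depend on lxml, BeautifulSoup can fall back to the built-in parser. Here's a minimal sketch of that choice (the rest of this guide just passes "lxml" directly):

from bs4 import BeautifulSoup

def make_soup(html: str) -> BeautifulSoup:
    # Prefer lxml for speed and tolerance of messy markup;
    # fall back to the stdlib parser if lxml isn't installed.
    try:
        return BeautifulSoup(html, "lxml")
    except Exception:  # bs4 raises FeatureNotFound when a parser is missing
        return BeautifulSoup(html, "html.parser")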
A practical project structure
Put this in scrape.py. Even for a “simple script”, separating fetch/parse/save will keep you sane:
- fetch_html(url) handles HTTP concerns
- parse_list_page(html) extracts items + the next page URL
- save_csv(rows) stores output
Step 1: A polite, retrying HTTP client
Network calls fail. Servers throttle. Your code should expect that.
import csv
import time
import random
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session() -> requests.Session:
    session = requests.Session()
    # Identify yourself. Some sites block generic/default user agents.
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (compatible; PracticalScraper/1.0; +https://example.com/bot)"
    })
    retry = Retry(
        total=5,
        connect=5,
        read=5,
        backoff_factor=0.6,  # exponential backoff: 0.6, 1.2, 2.4...
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=10)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def polite_sleep(min_s=0.6, max_s=1.4) -> None:
    # Jitter helps avoid hammering at an exact interval.
    time.sleep(random.uniform(min_s, max_s))


def fetch_html(session: requests.Session, url: str, timeout_s=15) -> str:
    resp = session.get(url, timeout=timeout_s)
    # A non-200 response is sometimes still worth parsing, but usually you want to fail clearly.
    if resp.status_code >= 400:
        raise RuntimeError(f"HTTP {resp.status_code} for {url}")
    return resp.text
Key ideas:
- Retry handles common transient failures, including 429 Too Many Requests.
- backoff_factor adds increasing delay between retries.
- We include a User-Agent so our requests don’t look like a default bot.
- polite_sleep() slows us down between pages.
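Before wiring up parsing, it's worth smoke-testing the client on a single page. A quick, throwaway check (the URL is a placeholder; swap in a page you're allowed to fetch):

session = make_session()
try:
    # Placeholder URL for illustration only.
    html = fetch_html(session, "https://example.com/products?page=1")
    print(f"[+] Fetched {len(html)} characters")
except (RuntimeError, requests.RequestException) as exc:
    print(f"[!] Fetch failed: {exc}")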
Step 2: Parse a list page safely (and handle missing fields)
Scraping usually targets “list pages” (category pages, search results) that contain repeated “cards” for each item.
Let’s assume the list page contains elements like:
- Item container: .product-card
- Title: .product-title
- Price: .price
- Link: an <a> tag inside the card
- Pagination link: a.next
You will need to adjust selectors for your target site, but the pattern stays the same.
def text_or_none(el):
    return el.get_text(strip=True) if el else None


def parse_list_page(html: str, base_url: str) -> tuple[list[dict], str | None]:
    soup = BeautifulSoup(html, "lxml")
    rows = []
    cards = soup.select(".product-card")
    for card in cards:
        title = text_or_none(card.select_one(".product-title"))
        price = text_or_none(card.select_one(".price"))
        link_el = card.select_one("a[href]")
        url = urljoin(base_url, link_el["href"]) if link_el else None

        # Example of defensive parsing: skip totally broken cards
        if not title or not url:
            continue

        rows.append({
            "title": title,
            "price": price,
            "url": url,
        })

    next_el = soup.select_one("a.next[href]")
    next_url = urljoin(base_url, next_el["href"]) if next_el else None
    return rows, next_url
Why this is robust:
- We treat fields as optional: missing price doesn’t kill the scrape.
- We use urljoin so relative links become absolute.
- We skip unusable cards rather than crashing halfway through.
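To see that defensive behavior in action, you can feed parse_list_page a tiny hand-written snippet (this HTML is made up purely for illustration):

sample_html = """
<div class="product-card">
  <a href="/items/1"><span class="product-title">Blue Mug</span></a>
  <span class="price">$12.99</span>
</div>
<div class="product-card">
  <a href="/items/2"><span class="product-title">Mystery Mug</span></a>
</div>
<a class="next" href="?page=2">Next</a>
"""

rows, next_url = parse_list_page(sample_html, base_url="https://example.com/products")
print(rows)      # the second card keeps price=None instead of crashing
print(next_url)  # https://example.com/products?page=2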
Step 3: Crawl pagination without infinite loops
Pagination is where many scrapers accidentally DDoS a site or loop forever. Track visited URLs.
def crawl_list(start_url: str, max_pages: int = 10) -> list[dict]:
    session = make_session()
    all_rows: list[dict] = []
    url = start_url
    visited = set()
    pages = 0

    while url and pages < max_pages:
        if url in visited:
            break
        visited.add(url)

        print(f"[+] Fetching: {url}")
        html = fetch_html(session, url)
        rows, next_url = parse_list_page(html, base_url=url)
        all_rows.extend(rows)

        pages += 1
        polite_sleep()
        url = next_url

    return all_rows
Tips:
- max_pages is a safety valve for “oops” moments.
- visited prevents loops caused by weird pagination or tracking parameters (see the sketch after this list).
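If your target site appends tracking parameters (utm_source and friends) to pagination links, the same page can look “new” on every visit. One optional refinement, sketched under that assumption, is to normalize URLs before checking the visited set:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url: str) -> str:
    # Drop tracking parameters and sort what's left so that
    # equivalent URLs compare equal in the visited set.
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if not k.startswith("utm_")]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(sorted(query)), ""))

In crawl_list you would then add canonicalize(url) to visited (and check it) instead of the raw URL.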
Step 4: Save results to CSV (with stable columns)
CSV is a great first output format: easy to inspect, import into Excel, or feed into another system.
def save_csv(path: str, rows: list[dict]) -> None:
    if not rows:
        print("[!] No rows to save.")
        return

    fieldnames = ["title", "price", "url"]  # stable order
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        for r in rows:
            w.writerow({
                "title": r.get("title"),
                "price": r.get("price"),
                "url": r.get("url"),
            })

    print(f"[+] Saved {len(rows)} rows to {path}")
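After a run, a quick way to sanity-check the file is to read a few rows back with the same csv module:

with open("output.csv", newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.DictReader(f)):
        print(row)
        if i >= 4:  # peek at the first five rows only
            break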
Putting it together: a runnable script
import argparse


def main():
    parser = argparse.ArgumentParser(description="Practical web scraper (list pages -> CSV)")
    parser.add_argument("--start", required=True, help="Start URL of the list page")
    parser.add_argument("--out", default="output.csv", help="Output CSV path")
    parser.add_argument("--max-pages", type=int, default=10, help="Max pages to crawl")
    args = parser.parse_args()

    rows = crawl_list(args.start, max_pages=args.max_pages)
    save_csv(args.out, rows)


if __name__ == "__main__":
    main()
Run it like:
python scrape.py --start "https://example.com/products?page=1" --out products.csv --max-pages 5
Real-world upgrades you’ll want next
- Cache responses during development: Save HTML to disk so you don’t refetch the same pages while tuning selectors (a small sketch follows this list).
- Normalize data: Strip currency symbols, parse numbers, standardize whitespace (also sketched below).
- Handle “detail pages”: If list pages don’t include enough fields, collect URLs first, then fetch each detail page with a second pass (still throttled).
- Detect blocking: If you start seeing login pages, CAPTCHAs, or empty HTML, log it and stop rather than looping.
- Prefer official APIs when available: They’re more stable, faster, and often explicitly allowed.
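For the caching point above, a rough development-only sketch (the .cache directory name is arbitrary):

import hashlib
from pathlib import Path

CACHE_DIR = Path(".cache")  # arbitrary location, only used while developing

def fetch_html_cached(session: requests.Session, url: str) -> str:
    # Reuse a saved copy of the page if we've fetched this URL before.
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch_html(session, url)
    path.write_text(html, encoding="utf-8")
    return html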
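And for price normalization, one possible helper, assuming prices formatted like "$1,299.00" (other currencies and locales will need their own rules):

import re

def parse_price(raw: str | None) -> float | None:
    # "$1,299.00" -> 1299.0; returns None for missing or unparseable values.
    if not raw:
        return None
    cleaned = re.sub(r"[^\d.]", "", raw.replace(",", ""))
    try:
        return float(cleaned)
    except ValueError:
        return None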
Common scraping pitfalls (and how to avoid them)
- Selectors break silently: Add checks like “if 0 cards parsed, alert/fail” so you notice site changes (see the sketch after this list).
- No timeouts: Always set timeout or your script may hang forever.
- Over-parallelization: Don’t start with threads/async; get correctness first. If you scale up, do it carefully and politely.
- Ignoring encoding: Write files with encoding="utf-8" and keep text normalized.
Where to go from here
Once you’re comfortable with this pattern, the next step is adding a “detail page” parser and turning your scraper into a small pipeline: collect URLs → fetch details → output normalized data; a rough sketch of that second pass follows below. If you’re scraping a particular kind of site (e.g., product listings, real estate, job posts), leave a comment and I’ll adapt the selectors and data-cleaning logic to that domain, still in a maintainable, production-ish style.
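For illustration, that second pass might look roughly like this; the .description selector and the extra field are hypothetical and would need to match your actual detail pages:

def parse_detail_page(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")
    # Hypothetical selector: adjust to the real detail-page markup.
    return {"description": text_or_none(soup.select_one(".description"))}


def enrich_with_details(rows: list[dict], max_items: int | None = None) -> list[dict]:
    # Second pass: visit each collected URL, still throttled.
    session = make_session()
    for row in rows[:max_items]:
        html = fetch_html(session, row["url"])
        row.update(parse_detail_page(html))
        polite_sleep()
    return rows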