Python Web Scraping with Requests + BeautifulSoup: A Resilient, Maintainable Scraper (No Framework Magic)
Web scraping sounds easy until your script breaks on the second page, gets blocked, or produces messy data you can’t trust. In this hands-on guide, you’ll build a small but “production-minded” scraper using plain requests + BeautifulSoup—with retries, timeouts, pagination, polite rate limiting, and clean structured output.
We’ll target a “directory-style” site pattern (a list page linking to detail pages), which maps to many real projects: job boards, product catalogs, documentation indexes, etc.
What You’ll Build
- A scraper that fetches list pages and follows links to detail pages
- Safe networking: timeouts, retries, and backoff
- Polite scraping: a configurable delay and a clear `User-Agent`
- Robust parsing with CSS selectors and defensive fallbacks
- Output to JSONL (great for incremental datasets) and optional CSV
- Checkpointing so you can resume after failures
Install Dependencies
You only need two packages:
```bash
python -m pip install requests beautifulsoup4
```
(Optional) If you want fast, safe CSS selectors:
```bash
python -m pip install lxml
```
Pick a Target Pattern (List → Detail)
This tutorial uses generic selectors so you can adapt it quickly. Assume:
- The list page contains “cards” with a title and a link to a detail page
- The detail page contains fields like description, tags, and a published date
You’ll customize selectors in one place so your scraping logic stays readable.
Step 1: Create a Reusable HTTP Client (Timeouts, Retries, Backoff)
Most scraping pain is networking. Make this solid first.
```python
import random
import time
from dataclasses import dataclass
from typing import Any, Dict, Optional

import requests


@dataclass
class HttpConfig:
    base_headers: Dict[str, str]
    timeout_seconds: float = 15.0
    max_retries: int = 4
    backoff_base: float = 0.8   # first backoff sleep, in seconds; doubles each attempt
    jitter: float = 0.3         # random jitter to avoid thundering herd
    polite_delay: float = 1.0   # delay between requests


class HttpClient:
    def __init__(self, config: HttpConfig):
        self.config = config
        self.session = requests.Session()
        self.session.headers.update(config.base_headers)

    def get(self, url: str, *, params: Optional[Dict[str, Any]] = None) -> requests.Response:
        # Polite delay between requests
        time.sleep(self.config.polite_delay)
        last_exc = None
        for attempt in range(1, self.config.max_retries + 1):
            try:
                resp = self.session.get(url, params=params, timeout=self.config.timeout_seconds)
                # Treat 5xx as retryable; 4xx usually means "don't retry" (but 429 is special)
                if resp.status_code == 429 or 500 <= resp.status_code <= 599:
                    raise requests.HTTPError(f"Retryable status {resp.status_code}", response=resp)
                resp.raise_for_status()
                return resp
            except (requests.Timeout, requests.ConnectionError, requests.HTTPError) as exc:
                # Plain 4xx errors (other than 429) won't succeed on retry: fail fast
                if isinstance(exc, requests.HTTPError) and exc.response is not None:
                    status = exc.response.status_code
                    if status != 429 and status < 500:
                        raise
                last_exc = exc
                # Exponential backoff with jitter: the sleep doubles on each attempt
                sleep_s = self.config.backoff_base * (2 ** (attempt - 1)) + random.random() * self.config.jitter
                print(f"[warn] GET failed (attempt {attempt}/{self.config.max_retries}) {url} - {exc}. "
                      f"Backing off {sleep_s:.2f}s")
                time.sleep(sleep_s)
        raise RuntimeError(f"Failed to GET after retries: {url}") from last_exc
```
Why this matters: timeouts prevent hanging; retries handle flaky connections; backoff reduces pressure and helps you avoid bans.
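To see how the backoff spaces out retries, here is the schedule one common exponential policy produces: the first sleep is `backoff_base` seconds and each subsequent sleep doubles (jitter omitted here so the numbers are deterministic):

```python
def backoff_schedule(base: float, retries: int) -> list[float]:
    """Sleep durations for attempts 1..retries: base * 2**0, base * 2**1, ..."""
    return [base * (2 ** (attempt - 1)) for attempt in range(1, retries + 1)]

schedule = backoff_schedule(0.8, 4)
print(schedule)  # [0.8, 1.6, 3.2, 6.4]
```

With the default config, a fully failing URL costs about 12 seconds of backoff before the scraper gives up, which is slow enough to be polite but fast enough to keep a run moving.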
Step 2: Centralize Your Selectors (So You Can Fix Breakage Fast)
Sites change. If selectors are scattered everywhere, you’ll hate your life later. Put them in one config.
```python
from dataclasses import dataclass


@dataclass
class Selectors:
    # List page
    item_card: str = ".card"        # container for each item
    item_title: str = ".card h2"
    item_link: str = ".card a"
    # Detail page
    detail_title: str = "h1"
    detail_description: str = ".description"
    detail_tags: str = ".tags a"
    detail_date: str = "time"
```
When you adapt this to a real site, update these selectors first.
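For example, adapting to a hypothetical job board that renders each posting as `li.job` only means overriding the defaults; the scraping logic never changes. (The dataclass is repeated here so the snippet runs on its own.)

```python
from dataclasses import dataclass


@dataclass
class Selectors:
    # Same dataclass as above, repeated so this snippet is standalone
    item_card: str = ".card"
    item_title: str = ".card h2"
    item_link: str = ".card a"
    detail_title: str = "h1"
    detail_description: str = ".description"
    detail_tags: str = ".tags a"
    detail_date: str = "time"


# Hypothetical job-board layout: only the list-page selectors differ
job_sel = Selectors(
    item_card="li.job",
    item_title="li.job .job-title",
    item_link="li.job a.job-link",
)
print(job_sel.item_card)  # li.job
```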
Step 3: Parse List Pages and Extract Detail URLs
Use BeautifulSoup with defensive checks (missing elements happen).
```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse_list_page(html: str, *, base_url: str, sel: Selectors):
    soup = BeautifulSoup(html, "lxml")  # or "html.parser" if you didn't install lxml
    items = []
    for card in soup.select(sel.item_card):
        title_el = card.select_one(sel.item_title)
        link_el = card.select_one(sel.item_link)
        if not link_el or not link_el.get("href"):
            continue
        title = title_el.get_text(strip=True) if title_el else ""
        url = urljoin(base_url, link_el["href"])
        items.append({"title_hint": title, "detail_url": url})
    return items
```
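The list-parsing logic above can be exercised on a small inline HTML fragment before pointing it at a real site. This standalone sketch uses the same `.card` selectors and a made-up base URL, and shows how `urljoin` resolves relative hrefs:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

html = """
<div class="card"><h2>Widget A</h2><a href="/items/a">View</a></div>
<div class="card"><h2>Widget B</h2><a href="/items/b">View</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

items = []
for card in soup.select(".card"):
    title_el = card.select_one("h2")
    link_el = card.select_one("a")
    if not link_el or not link_el.get("href"):
        continue
    items.append({
        "title_hint": title_el.get_text(strip=True) if title_el else "",
        # Relative hrefs are resolved against the list page's URL
        "detail_url": urljoin("https://example.com/directory", link_el["href"]),
    })

print(items)
```

Feeding known HTML through your parser like this is also a cheap regression test when selectors change.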
Step 4: Parse Detail Pages into Clean Structured Data
On the detail page, extract a normalized record. Keep parsing strict and output consistent.
```python
def text_or_empty(soup: BeautifulSoup, css: str) -> str:
    el = soup.select_one(css)
    return el.get_text(" ", strip=True) if el else ""


def parse_detail_page(html: str, *, url: str, sel: Selectors):
    soup = BeautifulSoup(html, "lxml")
    title = text_or_empty(soup, sel.detail_title)
    description = text_or_empty(soup, sel.detail_description)
    tags = [a.get_text(strip=True) for a in soup.select(sel.detail_tags) if a.get_text(strip=True)]

    date_text = ""
    time_el = soup.select_one(sel.detail_date)
    if time_el:
        # Try the datetime attribute first, fall back to visible text
        date_text = time_el.get("datetime") or time_el.get_text(strip=True)

    return {
        "url": url,
        "title": title,
        "description": description,
        "tags": tags,
        "published": date_text,
    }
```
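A quick standalone check of the detail-page extraction, using an invented HTML fragment. Note how the `<time>` element's `datetime` attribute is preferred over its display text, which gives you a machine-readable date for free when the site provides one:

```python
from bs4 import BeautifulSoup

html = """
<h1>Widget A</h1>
<div class="description">A small but sturdy widget.</div>
<div class="tags"><a>tools</a><a>hardware</a></div>
<time datetime="2024-03-03">March 3, 2024</time>
"""
soup = BeautifulSoup(html, "html.parser")

time_el = soup.select_one("time")
record = {
    "title": soup.select_one("h1").get_text(strip=True),
    "description": soup.select_one(".description").get_text(" ", strip=True),
    "tags": [a.get_text(strip=True) for a in soup.select(".tags a")],
    # Prefer the machine-readable datetime attribute over the visible text
    "published": time_el.get("datetime") or time_el.get_text(strip=True),
}
print(record)
```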
Tip: Keep raw strings if you’re not sure about formats. You can normalize dates later once you confirm the site’s date patterns.
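When you do get to the normalization pass, one simple approach is to try a short list of formats you've actually observed on the site and fall back to the raw string rather than guessing. The formats below are illustrative assumptions; swap in whatever patterns your target actually uses:

```python
from datetime import datetime


def normalize_date(raw: str) -> str:
    """Try a few known formats; return an ISO date, or the raw string unchanged."""
    for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d", "%B %d, %Y", "%d %b %Y"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            pass
    return raw  # unknown format: keep the raw value rather than corrupt it


print(normalize_date("March 3, 2024"))  # 2024-03-03
print(normalize_date("yesterday"))      # yesterday (left as-is)
```

Keeping the fallback lossless means a new date format shows up in your output as an odd raw string you can grep for, instead of as silently wrong data.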
Step 5: Write Output Incrementally (JSONL) + Resume with a Checkpoint
Writing one big JSON at the end is risky: a crash loses everything. JSONL (one JSON per line) is perfect for scrapers.
```python
import json
from pathlib import Path


def load_checkpoint(path: Path) -> set[str]:
    if not path.exists():
        return set()
    return set(path.read_text(encoding="utf-8").splitlines())


def save_checkpoint(path: Path, seen: set[str]) -> None:
    path.write_text("\n".join(sorted(seen)), encoding="utf-8")


def append_jsonl(path: Path, record: dict) -> None:
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```
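The appeal of JSONL is that writing and reading are both line-at-a-time, so a crash mid-run loses at most one record. A minimal round-trip sketch (self-contained, writing to a temporary directory):

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    out = Path(d) / "output.jsonl"

    # Append records one line at a time, exactly like the scraper does
    for rec in ({"url": "https://example.com/a"}, {"url": "https://example.com/b"}):
        with out.open("a", encoding="utf-8") as f:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

    # Reading back is just parsing each line independently
    records = [json.loads(line) for line in out.read_text(encoding="utf-8").splitlines()]

print(len(records))  # 2
```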
Step 6: Put It Together (Paginate List Pages, Follow Links)
This is the “main loop”: fetch list pages, extract detail URLs, fetch details, save results, and checkpoint.
```python
def scrape_directory(
    start_url: str,
    *,
    page_param: str = "page",
    max_pages: int = 5,
    output_jsonl: str = "output.jsonl",
    checkpoint_file: str = "seen_urls.txt",
):
    sel = Selectors()
    http = HttpClient(
        HttpConfig(
            base_headers={
                "User-Agent": "ExampleScraper/1.0 (+contact: [email protected])",
                "Accept-Language": "en-US,en;q=0.8",
            },
            timeout_seconds=15.0,
            max_retries=4,
            polite_delay=1.0,
        )
    )

    out_path = Path(output_jsonl)
    ck_path = Path(checkpoint_file)
    seen = load_checkpoint(ck_path)

    for page in range(1, max_pages + 1):
        print(f"[info] Fetching list page {page}/{max_pages}")
        list_resp = http.get(start_url, params={page_param: page})
        list_items = parse_list_page(list_resp.text, base_url=start_url, sel=sel)

        if not list_items:
            print("[info] No items found; stopping early.")
            break

        for item in list_items:
            url = item["detail_url"]
            if url in seen:
                continue

            print(f"[info] Fetching detail: {url}")
            detail_resp = http.get(url)
            record = parse_detail_page(detail_resp.text, url=url, sel=sel)

            # Optionally store the list title hint for debugging mismatches
            if item.get("title_hint") and not record.get("title"):
                record["title_hint"] = item["title_hint"]

            append_jsonl(out_path, record)
            seen.add(url)

            # Save checkpoint frequently to be crash-safe
            if len(seen) % 10 == 0:
                save_checkpoint(ck_path, seen)

    save_checkpoint(ck_path, seen)
    print(f"[done] Saved {len(seen)} records to {out_path}")
```
Run it:
```python
if __name__ == "__main__":
    scrape_directory(
        start_url="https://example.com/directory",
        page_param="page",
        max_pages=10,
        output_jsonl="directory.jsonl",
        checkpoint_file="seen_urls.txt",
    )
```
Practical Debugging Checklist When It “Stops Working”
- Selectors broke: print a snippet or save the HTML to inspect, then update `Selectors` first.
- Blocked (403/429): increase `polite_delay`, reduce concurrency (we're single-threaded here), and make sure your `User-Agent` is honest.
- Empty output: the list page might load its content with JavaScript. If the HTML doesn't contain the items, you'll need a browser-based approach, but start by verifying the HTML you actually received.
- Data is messy: keep raw fields, then normalize in a separate “cleaning” script so scraping stays simple.
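The first two items in the checklist both start with the same move: look at the HTML your scraper actually received, not the HTML your browser shows you. A tiny helper makes that a one-liner at any point in the main loop (drop it in wherever parsing comes back empty):

```python
from pathlib import Path


def dump_html(html: str, name: str = "debug_page.html") -> Path:
    """Save the HTML you actually received so you can open it in a browser
    and check whether the expected elements are really there."""
    path = Path(name)
    path.write_text(html, encoding="utf-8")
    return path
```

For example, `dump_html(list_resp.text, "list_page_3.html")` right after a fetch lets you open the file and run your CSS selectors against it in the browser's dev tools.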
Optional: Convert JSONL to CSV
JSONL is great for scraping; CSV is great for spreadsheets. Convert after the scrape:
```python
import csv
import json
from pathlib import Path


def jsonl_to_csv(jsonl_path: str, csv_path: str):
    rows = []
    with Path(jsonl_path).open("r", encoding="utf-8") as f:
        for line in f:
            rows.append(json.loads(line))

    # Flatten tags into a single string for CSV
    for r in rows:
        r["tags"] = ", ".join(r.get("tags", []))

    fieldnames = sorted({k for r in rows for k in r.keys()})
    with Path(csv_path).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    jsonl_to_csv("directory.jsonl", "directory.csv")
```
Key Takeaways
- Start with a reliable HTTP layer (timeouts + retries + backoff). It prevents most scraping failures.
- Centralize selectors so site changes are a quick fix, not a rewrite.
- Write incrementally (JSONL) and checkpoint often so you can resume safely.
- Keep scraping focused on collection; do heavy normalization in a separate pass.
If you want, tell me what kind of site you’re scraping (list/detail structure + a sample HTML snippet), and I’ll adapt the selectors and parsing into a ready-to-run script for that layout.