Python Web Scraping for Real Projects: Requests + BeautifulSoup, Pagination, Retries, and Clean Data Output

Web scraping sounds easy until you hit the realities: pages that fail randomly, rate limits, inconsistent HTML, pagination quirks, and data you can’t reliably parse the same way twice. This hands-on guide shows a practical approach to scraping with requests + BeautifulSoup, adding the pieces you need for production-ish reliability: headers, timeouts, retries with backoff, polite rate limiting, pagination, and exporting to CSV (or SQLite).

Note: Scrape responsibly. Check a site’s terms and robots rules, don’t overload servers, and avoid collecting sensitive personal data.
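If you want to automate the robots check, Python’s standard library ships a robots.txt parser. A minimal sketch (the rules string and URLs are placeholders for illustration):

```python
from urllib import robotparser

def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "JuniorScraper") -> bool:
    """Decide whether robots.txt permits fetching `url` for our user agent."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Example robots.txt that blocks /admin/ for every crawler:
rules = "User-agent: *\nDisallow: /admin/\n"
print(allowed_by_robots(rules, "https://example.com/items"))    # True
print(allowed_by_robots(rules, "https://example.com/admin/x"))  # False
```

In a real scraper you would fetch `https://<host>/robots.txt` once per site and cache the parsed rules.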

What We’ll Build

We’ll create a small scraper that:

  • Fetches multiple pages (pagination)
  • Parses items from HTML using stable selectors
  • Handles transient failures with retries + backoff
  • Normalizes data and exports to CSV

You can adapt the same skeleton to scrape job boards, product lists, docs sites, or internal admin pages (when permitted).

Project Setup

Install dependencies:

python -m venv .venv

# macOS/Linux
source .venv/bin/activate

# Windows
# .venv\Scripts\activate
pip install requests beautifulsoup4 lxml

Why lxml? It’s fast and makes HTML parsing more forgiving.
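As a quick illustration of “forgiving”, lxml copes with the unclosed tags that real-world pages are full of:

```python
from bs4 import BeautifulSoup

# <li> tags are never closed -- common in hand-written HTML.
broken = "<ul><li>alpha<li>beta<li>gamma</ul>"
soup = BeautifulSoup(broken, "lxml")
print([li.get_text() for li in soup.select("li")])  # ['alpha', 'beta', 'gamma']
```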

Step 1: A Robust HTTP Client (Headers, Timeouts, Retries)

Many “it worked once” scrapers fail because they don’t set timeouts, don’t identify themselves, or crash on temporary 502/503 errors. Let’s build a small fetch helper.

import random
import time
from typing import Optional

import requests
from requests import Response

DEFAULT_HEADERS = {
    # A realistic User-Agent reduces basic blocks and is polite for admins.
    "User-Agent": "Mozilla/5.0 (compatible; JuniorScraper/1.0; +https://example.com/bot)",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch(
    url: str,
    session: requests.Session,
    *,
    headers: Optional[dict] = None,
    timeout: tuple[float, float] = (5.0, 20.0),
    max_retries: int = 4,
    base_backoff: float = 0.8,
    max_backoff: float = 8.0,
) -> Response:
    """
    Fetch a URL with retries + exponential backoff + jitter.
    timeout is (connect_timeout, read_timeout).
    """
    merged_headers = {**DEFAULT_HEADERS, **(headers or {})}
    last_exc: Optional[Exception] = None

    for attempt in range(1, max_retries + 1):
        try:
            resp = session.get(url, headers=merged_headers, timeout=timeout)
            # Retry on common transient errors (server overload, gateway issues)
            if resp.status_code in (429, 500, 502, 503, 504):
                raise requests.HTTPError(f"Transient HTTP {resp.status_code}", response=resp)
            resp.raise_for_status()
            return resp
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError) as exc:
            last_exc = exc
            if attempt == max_retries:
                break
            # Exponential backoff + jitter
            sleep_for = min(max_backoff, base_backoff * (2 ** (attempt - 1)))
            sleep_for += random.uniform(0, 0.3)  # jitter
            time.sleep(sleep_for)

    raise RuntimeError(f"Failed to fetch after {max_retries} attempts: {url}") from last_exc

This alone will save you hours. Notice we retry on 429 (rate limit) and common 5xx errors.
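If you prefer not to hand-roll the loop, requests can also retry transparently by mounting urllib3’s `Retry` on the session. A sketch of that alternative (parameter values here mirror the helper above, not a recommendation):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    """Build a Session whose GET requests retry transient failures automatically."""
    retry = Retry(
        total=4,
        backoff_factor=0.8,                         # sleeps grow: 0.8s, 1.6s, 3.2s, ...
        status_forcelist=(429, 500, 502, 503, 504), # same transient statuses as above
        allowed_methods=("GET",),                   # only retry idempotent requests
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

session = make_session()
# session.get(url, timeout=(5, 20)) now retries for you -- but note you lose
# the custom jitter and logging hooks the hand-written fetch() gives you.
```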

Step 2: Parse HTML with Stable Selectors

The most common scraping bug is “selector drift”: your scraper relies on a brittle CSS path that breaks the moment the site’s markup changes. Prefer:

  • Semantic attributes: data-*, aria-label, stable class names
  • Anchors you control (for internal apps)
  • Defensive parsing (missing fields shouldn’t crash the run)

Below is a template parser. You’ll need to adjust selectors for your target page (open DevTools → Inspect → find repeatable “cards” or rows).

from bs4 import BeautifulSoup

def parse_list_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")

    # Example structure:
    # <div class="item-card">
    #   <a class="item-card__title" href="/item/123">Title</a>
    #   <span class="item-card__meta">$19.99</span>
    # </div>
    items: list[dict] = []
    for card in soup.select(".item-card"):
        title_el = card.select_one(".item-card__title")
        price_el = card.select_one(".item-card__meta")

        title = title_el.get_text(strip=True) if title_el else ""
        href = title_el.get("href") if title_el else None
        price_text = price_el.get_text(strip=True) if price_el else ""

        # Normalize/clean
        if not title:
            continue  # Skip empty cards

        items.append({
            "title": title,
            "url": href,
            "price_raw": price_text,
        })
    return items

Key ideas:

  • select/select_one uses CSS selectors (friendly for web devs).
  • Each field is optional; missing elements become empty strings.
  • We skip obviously broken/empty records.

Step 3: Handle Pagination Without Guessing

Pagination patterns vary: query params (?page=2), “Next” links, or cursor tokens. The most reliable method is to parse the “Next” link out of the HTML rather than inventing URLs yourself.

from urllib.parse import urljoin

def find_next_page_url(html: str, current_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")

    # Common pattern: <a rel="next" href="?page=2">Next</a>
    next_link = soup.select_one('a[rel="next"]')
    if next_link and next_link.get("href"):
        return urljoin(current_url, next_link["href"])

    # Fallback pattern: a link with "Next" text
    for a in soup.select("a"):
        if a.get_text(strip=True).lower() in ("next", "next →", "older"):
            href = a.get("href")
            if href:
                return urljoin(current_url, href)

    return None

Using urljoin correctly handles relative URLs.
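A few urljoin cases worth knowing, since pages link in all three styles:

```python
from urllib.parse import urljoin

base = "https://example.com/items?page=1"

# Query-only link: keeps the path, replaces the query string
print(urljoin(base, "?page=2"))      # https://example.com/items?page=2

# Root-relative link: keeps only the scheme and host
print(urljoin(base, "/archive/2"))   # https://example.com/archive/2

# Absolute link: used as-is, base is ignored
print(urljoin(base, "https://other.example.com/p/3"))  # https://other.example.com/p/3
```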

Step 4: Put It Together (Scrape Multiple Pages)

This runner:

  • Starts at a URL
  • Repeats: fetch → parse items → find next → sleep politely
  • Stops when no “next” link exists or after a page limit
import csv
import time

import requests

def scrape_all(start_url: str, *, max_pages: int = 10, delay_s: float = 1.0) -> list[dict]:
    results: list[dict] = []
    with requests.Session() as session:
        url = start_url
        for page_num in range(1, max_pages + 1):
            resp = fetch(url, session)
            html = resp.text

            page_items = parse_list_page(html)
            results.extend(page_items)

            next_url = find_next_page_url(html, url)
            print(f"Page {page_num}: {len(page_items)} items (total {len(results)}) -> {next_url or 'END'}")

            if not next_url:
                break
            url = next_url
            time.sleep(delay_s)  # polite delay
    return results

def write_csv(rows: list[dict], filename: str) -> None:
    if not rows:
        print("No rows to write.")
        return
    fieldnames = sorted(rows[0].keys())
    with open(filename, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)

if __name__ == "__main__":
    data = scrape_all("https://example.com/items", max_pages=5, delay_s=1.2)
    write_csv(data, "items.csv")
    print("Done.")

Swap https://example.com/items and the selectors in parse_list_page to match your target site.

Step 5: Data Cleanup (Example: Parse Price Safely)

Scraped strings are messy. Keep both the raw value and a cleaned value when possible.

import re
from decimal import Decimal, InvalidOperation

_price_re = re.compile(r"([0-9]+(?:\.[0-9]{1,2})?)")

def parse_price(price_raw: str) -> Decimal | None:
    # Extract the first number like 19.99 from "$19.99" or "USD 19.99"
    m = _price_re.search(price_raw.replace(",", ""))
    if not m:
        return None
    try:
        return Decimal(m.group(1))
    except InvalidOperation:
        return None

Then enrich rows before writing:

for row in data:
    row["price"] = str(parse_price(row.get("price_raw", "")) or "")

Common Failure Modes (and Quick Fixes)

  • Blocked or served a “bot check” page: confirm your HTML really contains the items. Save resp.text to a file and open it. You may need authentication, cookies, or a different approach.
  • Inconsistent HTML: always treat fields as optional; log “bad cards” instead of crashing.
  • Rate limiting (429): increase delay_s, add more backoff, reduce concurrency.
  • Duplicate items across pages: dedupe using a stable key (e.g., item URL or ID) before exporting.
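For the duplicates point, a small order-preserving helper (assuming the item URL is your stable key) is usually enough:

```python
def dedupe(rows: list[dict], key: str = "url") -> list[dict]:
    """Keep the first row for each distinct key value, preserving order."""
    seen: set = set()
    unique: list[dict] = []
    for row in rows:
        k = row.get(key)
        if k in seen:
            continue  # already exported this item
        seen.add(k)
        unique.append(row)
    return unique

rows = [
    {"url": "/item/1", "title": "A"},
    {"url": "/item/2", "title": "B"},
    {"url": "/item/1", "title": "A (page 2 repeat)"},
]
print(dedupe(rows))  # /item/1 appears once; first occurrence wins
```

Run it on the results list right before write_csv.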

Next Steps

Once this baseline works, you can level it up:

  • Export to SQLite for incremental runs (only insert new items)
  • Add structured logging and a --resume flag
  • Use a config file for selectors so non-Python devs can tweak parsing
  • Add unit tests: feed saved HTML into parse_list_page() and verify output
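The SQLite idea above can be sketched with INSERT OR IGNORE plus a primary key, which makes re-runs idempotent for free (in-memory database here for illustration; use a file path in practice):

```python
import sqlite3

def save_items(conn: sqlite3.Connection, rows: list[dict]) -> int:
    """Insert rows keyed by URL; re-inserting an existing URL is a no-op."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, title TEXT, price_raw TEXT)"
    )
    cur = conn.executemany(
        "INSERT OR IGNORE INTO items (url, title, price_raw) "
        "VALUES (:url, :title, :price_raw)",
        rows,
    )
    conn.commit()
    return cur.rowcount  # rows actually inserted this run

conn = sqlite3.connect(":memory:")
batch = [{"url": "/item/1", "title": "One", "price_raw": "$9.99"}]
print(save_items(conn, batch))  # 1 -- new row inserted
print(save_items(conn, batch))  # 0 -- already stored, ignored
```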

Most importantly: treat your scraper like a small production service. Add retries, validate the HTML you’re actually receiving, and normalize data early. You’ll spend less time chasing “it broke overnight” mysteries and more time shipping useful datasets.

