Python Web Scraping with Requests + BeautifulSoup: A Practical Recipe (With Pagination, Retries, and CSV Export)

Web scraping sounds easy until you hit the real world: slow servers, transient errors, inconsistent HTML, pagination, and “why did my script suddenly break?” This hands-on guide shows a pragmatic way to scrape a typical listing page (like a blog index, product catalog, or job board) using requests + BeautifulSoup, while keeping the code junior-friendly and production-ish.

What you’ll build: a small scraper that:

  • Fetches multiple pages (pagination)
  • Parses item cards from HTML
  • Handles retries + timeouts politely
  • Normalizes URLs
  • Exports results to CSV (or JSON)

Prereqs: Python 3.10+ (the code uses the `str | None` union syntax, which requires 3.10).

1) Install dependencies and create a tiny project

Install packages:

python -m pip install requests beautifulsoup4 lxml

We’ll use lxml as the HTML parser because it’s fast and tolerant of messy markup.
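lxml is a third-party parser, so a script that hardcodes it will crash with bs4.FeatureNotFound on machines where it isn't installed. One defensive option (a small sketch, not required for the rest of the guide) is to fall back to the slower stdlib parser:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html: str) -> BeautifulSoup:
    """Prefer lxml for speed; fall back to the stdlib parser if it's missing."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.get_text())  # hello
```

The two parsers can differ slightly on badly broken markup, so for reproducible results it's still best to pin and install lxml explicitly.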

2) A realistic target HTML shape (what we’re parsing)

Most listing pages repeat a “card” structure. Imagine HTML like:

<article class="post-card">
  <h2><a href="/posts/hello-world">Hello World</a></h2>
  <time datetime="2026-02-01">Feb 1, 2026</time>
  <p class="excerpt">A quick intro to...</p>
</article>

Your job is to identify stable selectors (class names, tag patterns) and extract fields. Don’t overfit. If classes look auto-generated, prefer structural selectors (like “article then h2 a”).
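To make the difference concrete, here is a quick comparison against some hypothetical markup whose class name looks machine-generated (the class and URL below are made up for illustration; the stdlib parser is used so the snippet runs without lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with an auto-generated class name (e.g. from a CSS-in-JS build).
html = """
<article class="css-1x2y3z">
  <h2><a href="/posts/hello-world">Hello World</a></h2>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# Fragile: ties the scraper to a generated class that may change on every deploy.
fragile = soup.select("article.css-1x2y3z h2 a")

# Stable: relies only on tag structure, which usually survives restyling.
stable = soup.select("article h2 a")

print([a.get_text(strip=True) for a in stable])  # ['Hello World']
```

Both selectors match today, but only the structural one keeps working after the next CSS rebuild.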

3) Build a robust HTTP client (sessions, retries, headers)

Start with a requests.Session to reuse connections (faster) and add sane defaults: timeouts, headers, and retry logic for flaky endpoints.

import time
import random
from dataclasses import dataclass
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

DEFAULT_TIMEOUT = (5, 20)  # (connect timeout, read timeout)


def build_session() -> requests.Session:
    s = requests.Session()
    s.headers.update({
        "User-Agent": "Mozilla/5.0 (compatible; DevBlogScraper/1.0; +https://example.com)",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    })
    return s


def fetch_html(session: requests.Session, url: str, *, max_retries: int = 3) -> str:
    """
    Fetch HTML with basic retry/backoff.
    Retries on transient errors (429, 5xx) and network exceptions.
    """
    for attempt in range(1, max_retries + 1):
        try:
            resp = session.get(url, timeout=DEFAULT_TIMEOUT)
            # Handle rate limits / transient server issues
            if resp.status_code in (429, 500, 502, 503, 504):
                raise requests.HTTPError(f"Transient HTTP {resp.status_code}", response=resp)
            resp.raise_for_status()
            return resp.text
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == max_retries:
                raise
            # Exponential backoff + jitter
            sleep_s = (2 ** (attempt - 1)) + random.uniform(0.0, 0.5)
            time.sleep(sleep_s)
    raise RuntimeError("Unreachable")

Why this matters:

  • Timeouts prevent hanging forever.
  • Retries handle the common “works on my wifi” problem.
  • Jitter reduces thundering-herd behavior if many scripts retry at once.

4) Parse a listing page into structured data

Define a small data model and parsing function. Keep parsing logic separate from networking so it’s testable.

@dataclass
class PostItem:
    title: str
    url: str
    date_iso: str | None
    excerpt: str | None


def parse_listing(html: str, base_url: str) -> list[PostItem]:
    soup = BeautifulSoup(html, "lxml")
    items: list[PostItem] = []
    # Selector depends on your site. Adjust this to match "card" blocks.
    for card in soup.select("article.post-card"):
        a = card.select_one("h2 a")
        if not a or not a.get("href"):
            continue
        title = a.get_text(strip=True)
        url = urljoin(base_url, a["href"])
        time_el = card.select_one("time[datetime]")
        date_iso = time_el["datetime"].strip() if time_el and time_el.get("datetime") else None
        excerpt_el = card.select_one(".excerpt")
        excerpt = excerpt_el.get_text(" ", strip=True) if excerpt_el else None
        items.append(PostItem(title=title, url=url, date_iso=date_iso, excerpt=excerpt))
    return items

Tip: always normalize URLs with urljoin so relative links become absolute. This saves pain when exporting data.
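A quick demo of what urljoin does with the three kinds of href you'll typically encounter (the URLs are illustrative):

```python
from urllib.parse import urljoin

base = "https://example.com/blog"

# Root-relative href: resolved against the site root.
print(urljoin(base, "/posts/hello-world"))   # https://example.com/posts/hello-world

# Relative href: resolved against the base path (replaces the last segment).
print(urljoin(base, "posts/hello-world"))    # https://example.com/posts/hello-world

# Already-absolute href: passes through unchanged.
print(urljoin(base, "https://other.example/x"))  # https://other.example/x
```

Because absolute links pass through untouched, urljoin is safe to apply unconditionally to every href you extract.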

5) Pagination: follow “next page” links safely

Pagination can be:

  • ?page=2 query parameters
  • /page/2/ path segments
  • A “Next” link button
  • Infinite scroll (needs different approach)

For a junior-friendly scraper, the “Next link” approach is usually most stable. We’ll parse a[rel="next"] if present, and fall back to a CSS selector you can adjust.

def find_next_page(html: str, base_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    next_link = soup.select_one('a[rel="next"]')
    if not next_link:
        # Common fallback selector; adjust to match your target.
        next_link = soup.select_one("a.next, a.pagination-next")
    if next_link and next_link.get("href"):
        return urljoin(base_url, next_link["href"])
    return None

Safety rule: avoid guessing page numbers forever. Always set a max page limit.
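If your target uses ?page=N query parameters instead of a next link, you can generate the URLs up front with a hard cap baked in. This is a sketch that assumes the parameter is literally named page (check your site; only stdlib is used):

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def page_urls(listing_url: str, max_pages: int) -> list[str]:
    """Generate ?page=N variants of listing_url, capped at max_pages."""
    scheme, netloc, path, query, frag = urlsplit(listing_url)
    params = parse_qs(query)  # preserve any existing query parameters
    urls = []
    for n in range(1, max_pages + 1):
        params["page"] = [str(n)]
        urls.append(urlunsplit((scheme, netloc, path, urlencode(params, doseq=True), frag)))
    return urls


print(page_urls("https://example.com/blog", 3))
# ['https://example.com/blog?page=1',
#  'https://example.com/blog?page=2',
#  'https://example.com/blog?page=3']
```

With pre-generated URLs you should still stop early once a page returns zero items, since the cap is a ceiling rather than the real page count.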

6) Put it together: scrape multiple pages into one list

This function scrapes until there’s no next page (or until a max page cap):

def scrape_all(base_url: str, start_url: str, *, max_pages: int = 10) -> list[PostItem]:
    session = build_session()
    url = start_url
    all_items: list[PostItem] = []
    seen_urls: set[str] = set()
    for _ in range(max_pages):
        html = fetch_html(session, url)
        items = parse_listing(html, base_url)
        for it in items:
            if it.url in seen_urls:
                continue
            seen_urls.add(it.url)
            all_items.append(it)
        next_url = find_next_page(html, base_url)
        if not next_url:
            break
        url = next_url
        # Polite delay (basic). For large scrapes, use robots.txt-aware throttling.
        time.sleep(random.uniform(0.5, 1.2))
    return all_items

What’s the “seen_urls” set for? It avoids duplicates when pagination repeats items (common on “featured posts” pages).

7) Export to CSV (and JSON)

Once you have structured data, exporting is easy. Keep it simple and explicit.

import csv
import json
from dataclasses import asdict


def export_csv(items: list[PostItem], path: str) -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["title", "url", "date_iso", "excerpt"])
        w.writeheader()
        for it in items:
            w.writerow(asdict(it))


def export_json(items: list[PostItem], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump([asdict(it) for it in items], f, ensure_ascii=False, indent=2)

8) A complete runnable script

Here’s a full example you can paste into scrape_posts.py and run. You just need to set BASE_URL and START_URL to a real listing page on your site.

import time
import random
import csv
import json
from dataclasses import dataclass, asdict
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

DEFAULT_TIMEOUT = (5, 20)  # (connect timeout, read timeout)


@dataclass
class PostItem:
    title: str
    url: str
    date_iso: str | None
    excerpt: str | None


def build_session() -> requests.Session:
    s = requests.Session()
    s.headers.update({
        "User-Agent": "Mozilla/5.0 (compatible; DevBlogScraper/1.0; +https://example.com)",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    })
    return s


def fetch_html(session: requests.Session, url: str, *, max_retries: int = 3) -> str:
    for attempt in range(1, max_retries + 1):
        try:
            resp = session.get(url, timeout=DEFAULT_TIMEOUT)
            if resp.status_code in (429, 500, 502, 503, 504):
                raise requests.HTTPError(f"Transient HTTP {resp.status_code}", response=resp)
            resp.raise_for_status()
            return resp.text
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == max_retries:
                raise
            time.sleep((2 ** (attempt - 1)) + random.uniform(0.0, 0.5))
    raise RuntimeError("Unreachable")


def parse_listing(html: str, base_url: str) -> list[PostItem]:
    soup = BeautifulSoup(html, "lxml")
    items: list[PostItem] = []
    for card in soup.select("article.post-card"):
        a = card.select_one("h2 a")
        if not a or not a.get("href"):
            continue
        title = a.get_text(strip=True)
        url = urljoin(base_url, a["href"])
        time_el = card.select_one("time[datetime]")
        date_iso = time_el["datetime"].strip() if time_el and time_el.get("datetime") else None
        excerpt_el = card.select_one(".excerpt")
        excerpt = excerpt_el.get_text(" ", strip=True) if excerpt_el else None
        items.append(PostItem(title=title, url=url, date_iso=date_iso, excerpt=excerpt))
    return items


def find_next_page(html: str, base_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    next_link = soup.select_one('a[rel="next"]')
    if not next_link:
        next_link = soup.select_one("a.next, a.pagination-next")
    if next_link and next_link.get("href"):
        return urljoin(base_url, next_link["href"])
    return None


def scrape_all(base_url: str, start_url: str, *, max_pages: int = 10) -> list[PostItem]:
    session = build_session()
    url = start_url
    all_items: list[PostItem] = []
    seen_urls: set[str] = set()
    for _ in range(max_pages):
        html = fetch_html(session, url)
        items = parse_listing(html, base_url)
        for it in items:
            if it.url in seen_urls:
                continue
            seen_urls.add(it.url)
            all_items.append(it)
        next_url = find_next_page(html, base_url)
        if not next_url:
            break
        url = next_url
        time.sleep(random.uniform(0.5, 1.2))
    return all_items


def export_csv(items: list[PostItem], path: str) -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["title", "url", "date_iso", "excerpt"])
        w.writeheader()
        for it in items:
            w.writerow(asdict(it))


def export_json(items: list[PostItem], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump([asdict(it) for it in items], f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    BASE_URL = "https://example.com"
    START_URL = "https://example.com/blog"  # listing page
    items = scrape_all(BASE_URL, START_URL, max_pages=5)
    print(f"Scraped {len(items)} items")
    export_csv(items, "posts.csv")
    export_json(items, "posts.json")
    print("Wrote posts.csv and posts.json")

9) Common scraping pitfalls (and how to avoid them)

  • Selectors break often: pick stable hooks. If the site is yours, add dedicated attributes like data-testid="post-card" to make scraping (and testing!) reliable.

  • Rate limits: if you see 429s, increase delays and reduce concurrency. Don’t hammer endpoints.

  • JavaScript-rendered content: if the HTML response doesn’t contain the data, you might need the site’s underlying JSON API (best), or a browser automation tool (heavier).

  • Inconsistent dates: always treat optional fields as optional (use None), and normalize later.

  • Duplicate content: dedupe by canonical URL, not title.

10) A small “upgrade path” if you want to level up

Once the basic scraper works, here are practical improvements that don’t add too much complexity:

  • Persist raw HTML snapshots for debugging when parsing breaks.

  • Add logging via logging module instead of print.

  • Use robots.txt and crawl-delay policies when scraping external sites.

  • Add unit tests for parsing: keep sample HTML files in tests/fixtures/.
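For the robots.txt item, the stdlib already does the parsing via urllib.robotparser. Normally you would point it at a live /robots.txt with set_url() + read(); here the rules are fed in as an illustrative inline list so the snippet runs offline:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# In real use: rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
    "Crawl-delay: 2",
])

print(rp.can_fetch("DevBlogScraper", "https://example.com/blog"))    # True
print(rp.can_fetch("DevBlogScraper", "https://example.com/admin/"))  # False
print(rp.crawl_delay("DevBlogScraper"))                              # 2
```

Checking can_fetch() before each request, and using crawl_delay() as the lower bound for your polite sleep, gets you most of the way to a well-behaved crawler.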

With this structure—clean separation between fetching, parsing, pagination, and exporting—you can adapt the scraper to most “list + detail” websites without turning it into a tangled script. Swap selectors, adjust pagination detection, and you’re off.

