Python Web Scraping with Requests + BeautifulSoup: Pagination, Robust Selectors, and Export to CSV
Web scraping is a practical skill when you need data that isn’t available via an API: product listings, blog archives, documentation tables, job posts, etc. In this hands-on guide, you’ll build a scraper that:
- Fetches multiple pages (pagination)
- Uses robust CSS selectors (less brittle than “find the 3rd div”)
- Handles timeouts, retries, and polite rate limiting
- Extracts and normalizes data
- Exports results to CSV
Important: Always check the site’s Terms of Service and robots.txt, and scrape responsibly (low request rate, caching when possible).
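Checking robots.txt can itself be automated with the standard library. A minimal sketch (the bot name and URLs are placeholders; a real script would call `set_url()`/`read()` against the live file instead of parsing inline rules):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# In a live script you would point this at the real file:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse inline rules so the example runs offline.
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

print(rp.can_fetch("MyScraper/1.0", "https://example.com/items"))    # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/admin/x"))  # False
```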
Project Setup
Install dependencies:
```bash
python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows
# .venv\Scripts\activate
pip install requests beautifulsoup4 lxml
```
We’ll use:
- `requests` for HTTP
- `BeautifulSoup` with the fast `lxml` parser
Scrape a Paginated Listing: A Realistic Workflow
Most scraping tasks look like this:
- Open the listing page in your browser
- Identify the repeating “card” element for each item
- Figure out how pagination works (query params like `?page=2`, or “Next” links)
- Extract fields from each card (title, price, link, etc.)
- Follow item links if you need detail fields
We’ll implement a “listing + details” pattern that works on many sites.
1) A Reliable HTTP Client (Headers, Timeouts, Retries)
Many sites block requests that look like bots. A simple improvement is to send a normal User-Agent and use timeouts. For transient failures, retry with backoff.
```python
import time
import random
from typing import Optional

import requests

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/123.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch(session: requests.Session, url: str, *, timeout: int = 15, retries: int = 3) -> Optional[str]:
    """
    Fetch a URL and return HTML text.
    Retries on network errors and 5xx responses.
    """
    for attempt in range(1, retries + 1):
        try:
            resp = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
            if 500 <= resp.status_code < 600:
                raise requests.HTTPError(f"Server error {resp.status_code}")
            resp.raise_for_status()
            return resp.text
        except (requests.RequestException, requests.HTTPError) as e:
            if attempt == retries:
                print(f"[ERROR] Failed: {url} ({e})")
                return None
            # Exponential backoff + jitter
            sleep_s = (2 ** attempt) + random.uniform(0.1, 0.6)
            print(f"[WARN] {e} - retrying in {sleep_s:.1f}s (attempt {attempt}/{retries})")
            time.sleep(sleep_s)
    return None
```
2) Parse Listing Pages with Strong Selectors
When you inspect the listing page HTML, look for stable hooks:
- Semantic tags (
article,h2,time) - Attributes like
data-testid(often very stable) - Meaningful class names (better than auto-generated hashes)
Below is a generic example that expects each item to be in an `article` tag and contain:
- a title link: `a.item-link`
- a price: `.price`
- a short description: `.summary`
You will adapt selectors to your target site.
```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def parse_listing(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    items: list[dict] = []
    for card in soup.select("article"):
        a = card.select_one("a.item-link")
        if not a or not a.get("href"):
            continue
        title = a.get_text(strip=True)
        link = urljoin(base_url, a["href"])
        price_el = card.select_one(".price")
        price = price_el.get_text(strip=True) if price_el else None
        summary_el = card.select_one(".summary")
        summary = summary_el.get_text(" ", strip=True) if summary_el else None
        items.append({
            "title": title,
            "url": link,
            "price": price,
            "summary": summary,
        })
    return items
```
3) Handle Pagination: Query Params or “Next” Links
Pagination usually comes in two flavors:
- Query parameters: `/items?page=1`, `/items?page=2`
- Next link: a “Next” button whose `href` points to the next page
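For the query-parameter flavor, you can often build page URLs directly. A small sketch (the `page_url` helper and the parameter name `page` are assumptions; match them to your target site):

```python
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse


def page_url(base: str, page: int) -> str:
    """Build '/items?page=N' style URLs, preserving any existing query params."""
    parts = urlparse(base)
    query = parse_qs(parts.query)
    query["page"] = [str(page)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


print(page_url("https://example.com/items", 2))  # https://example.com/items?page=2
```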
Here’s a robust approach: try to find a “next” link. If you can’t, stop.
```python
def find_next_page(html: str, base_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    # Common patterns: rel="next", link text "Next", aria-label, etc.
    next_a = (
        soup.select_one('a[rel="next"]')
        or soup.select_one('a[aria-label="Next"]')
        or soup.find("a", string=lambda s: s and s.strip().lower() in {"next", "older", ">"})
    )
    if next_a and next_a.get("href"):
        return urljoin(base_url, next_a["href"])
    return None
```
4) Optional: Follow Item Links to Get Details
Often, the listing page doesn’t contain everything. You can fetch each item’s detail page and extract extra fields. This increases requests, so keep it polite.
```python
import re


def parse_details(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")
    # Example fields you might find on a detail page
    h1 = soup.select_one("h1")
    full_title = h1.get_text(strip=True) if h1 else None
    # Example: find an email pattern if present (not always allowed/ethical to collect)
    text = soup.get_text(" ", strip=True)
    email_match = re.search(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", text)
    email = email_match.group(0) if email_match else None
    return {
        "full_title": full_title,
        "email": email,
    }
```
Note: Don’t scrape personal data unless you have a legitimate reason and permission. Many sites prohibit it.
5) Put It Together: End-to-End Scraper
This script starts from a listing URL, walks pagination, extracts items, and optionally enriches each item with detail data.
```python
import csv


def scrape(start_url: str, *, max_pages: int = 5, delay_range=(0.5, 1.2), fetch_details: bool = False) -> list[dict]:
    results: list[dict] = []
    seen_urls: set[str] = set()
    with requests.Session() as session:
        url = start_url
        for page_num in range(1, max_pages + 1):
            html = fetch(session, url)
            if not html:
                break
            items = parse_listing(html, base_url=url)
            print(f"[INFO] Page {page_num}: found {len(items)} items")
            for item in items:
                if item["url"] in seen_urls:
                    continue
                seen_urls.add(item["url"])
                if fetch_details:
                    detail_html = fetch(session, item["url"])
                    if detail_html:
                        item.update(parse_details(detail_html))
                    # extra politeness for detail page fetches
                    time.sleep(random.uniform(*delay_range))
                results.append(item)
            next_url = find_next_page(html, base_url=url)
            if not next_url:
                print("[INFO] No next page link found. Stopping.")
                break
            url = next_url
            # politeness between listing pages
            time.sleep(random.uniform(*delay_range))
    return results


def save_csv(rows: list[dict], path: str) -> None:
    if not rows:
        print("[WARN] No rows to save.")
        return
    # Stable column order: union of keys in case detail fields exist
    fieldnames = sorted({k for row in rows for k in row.keys()})
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)
    print(f"[INFO] Saved {len(rows)} rows to {path}")


if __name__ == "__main__":
    START_URL = "https://example.com/items"  # <-- replace with a real listing URL you have permission to scrape
    data = scrape(START_URL, max_pages=10, fetch_details=False)
    save_csv(data, "output.csv")
```
Common Scraping Problems (and Fixes)
- Problem: Your selectors break.
  Fix: Prefer stable attributes like `data-testid` or semantic tags. Keep selectors short and meaningful. Add defensive checks (like `if not a: continue`).
- Problem: You get blocked (403/429).
  Fix: Slow down (`time.sleep`), add retries with backoff, ensure headers look normal, and don’t parallelize aggressively. Consider caching responses during development.
- Problem: Content is missing because it’s rendered by JavaScript.
  Fix: Check the Network tab for an underlying JSON endpoint. If there’s no API call to reuse, you may need a browser automation tool (that’s a different approach than `requests` scraping).
- Problem: URLs are relative.
  Fix: Always combine with `urljoin(base_url, href)`.
- Problem: Inconsistent formatting (prices/dates).
  Fix: Normalize values: strip whitespace, remove currency symbols, parse dates with a known format.
Practical Tips for Junior/Mid Developers
- Start small: scrape 1 page, print the parsed results, then add pagination.
- Log what you’re doing: page number, item count, failures.
- Save raw HTML while debugging: it helps you iterate on selectors without re-requesting the site.
- Respect limits: don’t hammer servers; a delay of 0.5–2s is often reasonable for personal scripts.
- Test selectors in isolation: copy a snippet of HTML into a small test and verify `soup.select()` works.
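The last tip can look like this: paste a representative card into a string and check your selectors against it. (The snippet below uses the stdlib `html.parser` so it runs without `lxml`; swap in `"lxml"` if you have it installed.)

```python
from bs4 import BeautifulSoup

# A representative "card" copied from the target page while debugging
SNIPPET = """
<article>
  <a class="item-link" href="/items/1">Widget</a>
  <span class="price">$9.99</span>
</article>
"""

soup = BeautifulSoup(SNIPPET, "html.parser")
card = soup.select_one("article")
print(card.select_one("a.item-link").get_text(strip=True))  # Widget
print(card.select_one(".price").get_text(strip=True))       # $9.99
```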
Next Steps
Once the basics work, you can level up by adding:
- Response caching (store HTML to disk keyed by URL)
- Structured validation (e.g., ensure every row has
titleandurl) - Incremental scraping (skip items you already scraped)
- JSON export alongside CSV
- Better normalization (prices to numbers, dates to ISO strings)
If you want, tell me what kind of site you’re scraping (blog archive, products, jobs) and what fields you need, and I’ll tailor selectors + pagination logic to that layout.