Python Web Scraping with Requests + BeautifulSoup: Pagination, Robust Selectors, and Export to CSV
Web scraping is a practical skill when you need data that isn’t available via an API: product listings, blog archives, documentation tables, job posts, etc. In this hands-on guide, you’ll build a scraper that:
- Fetches multiple pages (pagination)
- Uses robust CSS selectors (less brittle than “find the 3rd div”)
- Handles timeouts, retries, and polite rate limiting
- Extracts and normalizes data
- Exports results to CSV
Important: Always check the site’s Terms of Service and robots.txt, and scrape responsibly (low request rate, caching when possible).
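Checking robots.txt can itself be automated with the standard library. A minimal sketch (the bot name and URLs are placeholders; a real script would call `set_url()`/`read()` against the live file instead of parsing inline rules):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# In a live script you would point this at the real file:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse inline rules so the example runs offline.
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

print(rp.can_fetch("MyScraper/1.0", "https://example.com/items"))    # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/admin/x"))  # False
```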
Project Setup
Install dependencies:
```bash
python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows
# .venv\Scripts\activate
pip install requests beautifulsoup4 lxml
```
We’ll use:
- `requests` for HTTP
- `BeautifulSoup` with the fast `lxml` parser
Scrape a Paginated Listing: A Realistic Workflow
Most scraping tasks look like this:
- Open the listing page in your browser
- Identify the repeating “card” element for each item
- Figure out how pagination works (query params like `?page=2`, or “Next” links)
- Extract fields from each card (title, price, link, etc.)
- Follow item links if you need detail fields
We’ll implement a “listing + details” pattern that works on many sites.
1) A Reliable HTTP Client (Headers, Timeouts, Retries)
Many sites block requests that look like bots. A simple improvement is to send a normal User-Agent and use timeouts. For transient failures, retry with backoff.
```python
import time
import random
from typing import Optional

import requests

DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/123.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch(session: requests.Session, url: str, *, timeout: int = 15, retries: int = 3) -> Optional[str]:
    """
    Fetch a URL and return HTML text.
    Retries on network errors and 5xx responses.
    """
    for attempt in range(1, retries + 1):
        try:
            resp = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
            if 500 <= resp.status_code < 600:
                raise requests.HTTPError(f"Server error {resp.status_code}")
            resp.raise_for_status()
            return resp.text
        except (requests.RequestException, requests.HTTPError) as e:
            if attempt == retries:
                print(f"[ERROR] Failed: {url} ({e})")
                return None
            # Exponential backoff + jitter
            sleep_s = (2 ** attempt) + random.uniform(0.1, 0.6)
            print(f"[WARN] {e} - retrying in {sleep_s:.1f}s (attempt {attempt}/{retries})")
            time.sleep(sleep_s)
    return None
```
2) Parse Listing Pages with Strong Selectors
When you inspect the listing page HTML, look for stable hooks:
- Semantic tags (
article,h2,time) - Attributes like
data-testid(often very stable) - Meaningful class names (better than auto-generated hashes)
Below is a generic example that expects each item to be in an `article` tag and contain:
- a title link: `a.item-link`
- a price: `.price`
- a short description: `.summary`
You will adapt selectors to your target site.
```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def parse_listing(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    items: list[dict] = []
    for card in soup.select("article"):
        a = card.select_one("a.item-link")
        if not a or not a.get("href"):
            continue
        title = a.get_text(strip=True)
        link = urljoin(base_url, a["href"])
        price_el = card.select_one(".price")
        price = price_el.get_text(strip=True) if price_el else None
        summary_el = card.select_one(".summary")
        summary = summary_el.get_text(" ", strip=True) if summary_el else None
        items.append({
            "title": title,
            "url": link,
            "price": price,
            "summary": summary,
        })
    return items
```
3) Handle Pagination: Query Params or “Next” Links
Pagination usually comes in two flavors:
- Query parameters: `/items?page=1`, `/items?page=2`
- Next link: a “Next” button whose `href` points to the next page
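For the query-parameter flavor, you can often build page URLs directly. A small sketch (the `page_url` helper and the parameter name `page` are assumptions; match them to your target site):

```python
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse


def page_url(base: str, page: int) -> str:
    """Build '/items?page=N' style URLs, preserving any existing query params."""
    parts = urlparse(base)
    query = parse_qs(parts.query)
    query["page"] = [str(page)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


print(page_url("https://example.com/items", 2))  # https://example.com/items?page=2
```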
Here’s a robust approach: try to find a “next” link. If you can’t, stop.
```python
def find_next_page(html: str, base_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    # Common patterns: rel="next", link text "Next", aria-label, etc.
    next_a = (
        soup.select_one('a[rel="next"]')
        or soup.select_one('a[aria-label="Next"]')
        or soup.find("a", string=lambda s: s and s.strip().lower() in {"next", "older", ">"})
    )
    if next_a and next_a.get("href"):
        return urljoin(base_url, next_a["href"])
    return None
```
4) Optional: Follow Item Links to Get Details
Often, the listing page doesn’t contain everything. You can fetch each item’s detail page and extract extra fields. This increases requests, so keep it polite.
```python
import re


def parse_details(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")
    # Example fields you might find on a detail page
    h1 = soup.select_one("h1")
    full_title = h1.get_text(strip=True) if h1 else None
    # Example: find an email pattern if present (not always allowed/ethical to collect)
    text = soup.get_text(" ", strip=True)
    email_match = re.search(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", text)
    email = email_match.group(0) if email_match else None
    return {
        "full_title": full_title,
        "email": email,
    }
```
Note: Don’t scrape personal data unless you have a legitimate reason and permission. Many sites prohibit it.
5) Put It Together: End-to-End Scraper
This script starts from a listing URL, walks pagination, extracts items, and optionally enriches each item with detail data.
```python
import csv


def scrape(start_url: str, *, max_pages: int = 5, delay_range=(0.5, 1.2), fetch_details: bool = False) -> list[dict]:
    results: list[dict] = []
    seen_urls: set[str] = set()
    with requests.Session() as session:
        url = start_url
        for page_num in range(1, max_pages + 1):
            html = fetch(session, url)
            if not html:
                break
            items = parse_listing(html, base_url=url)
            print(f"[INFO] Page {page_num}: found {len(items)} items")
            for item in items:
                if item["url"] in seen_urls:
                    continue
                seen_urls.add(item["url"])
                if fetch_details:
                    detail_html = fetch(session, item["url"])
                    if detail_html:
                        item.update(parse_details(detail_html))
                    # extra politeness for detail page fetches
                    time.sleep(random.uniform(*delay_range))
                results.append(item)
            next_url = find_next_page(html, base_url=url)
            if not next_url:
                print("[INFO] No next page link found. Stopping.")
                break
            url = next_url
            # politeness between listing pages
            time.sleep(random.uniform(*delay_range))
    return results


def save_csv(rows: list[dict], path: str) -> None:
    if not rows:
        print("[WARN] No rows to save.")
        return
    # Stable column order: union of keys in case detail fields exist
    fieldnames = sorted({k for row in rows for k in row.keys()})
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)
    print(f"[INFO] Saved {len(rows)} rows to {path}")


if __name__ == "__main__":
    START_URL = "https://example.com/items"  # <-- replace with a real listing URL you have permission to scrape
    data = scrape(START_URL, max_pages=10, fetch_details=False)
    save_csv(data, "output.csv")
```
Common Scraping Problems (and Fixes)
- Problem: Your selectors break.
  Fix: Prefer stable attributes like `data-testid` or semantic tags. Keep selectors short and meaningful. Add defensive checks (like `if not a: continue`).
- Problem: You get blocked (403/429).
  Fix: Slow down (`time.sleep`), add retries with backoff, ensure headers look normal, and don’t parallelize aggressively. Consider caching responses during development.
- Problem: Content is missing because it’s rendered by JavaScript.
  Fix: Check the Network tab for an underlying JSON endpoint. If there’s no API call to reuse, you may need a browser automation tool (that’s a different approach than `requests` scraping).
- Problem: URLs are relative.
  Fix: Always combine with `urljoin(base_url, href)`.
- Problem: Inconsistent formatting (prices/dates).
  Fix: Normalize values: strip whitespace, remove currency symbols, parse dates with a known format.
Practical Tips for Junior/Mid Developers
- Start small: scrape 1 page, print the parsed results, then add pagination.
- Log what you’re doing: page number, item count, failures.
- Save raw HTML while debugging: it helps you iterate on selectors without re-requesting the site.
- Respect limits: don’t hammer servers; a delay of 0.5–2s is often reasonable for personal scripts.
- Test selectors in isolation: copy a snippet of HTML into a small test and verify `soup.select()` works.
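The last tip can look like this: paste a representative card into a string and check your selectors against it. (The snippet below uses the stdlib `html.parser` so it runs without `lxml`; swap in `"lxml"` if you have it installed.)

```python
from bs4 import BeautifulSoup

# A representative "card" copied from the target page while debugging
SNIPPET = """
<article>
  <a class="item-link" href="/items/1">Widget</a>
  <span class="price">$9.99</span>
</article>
"""

soup = BeautifulSoup(SNIPPET, "html.parser")
card = soup.select_one("article")
print(card.select_one("a.item-link").get_text(strip=True))  # Widget
print(card.select_one(".price").get_text(strip=True))       # $9.99
```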
Next Steps
Once the basics work, you can level up by adding:
- Response caching (store HTML to disk keyed by URL)
- Structured validation (e.g., ensure every row has
titleandurl) - Incremental scraping (skip items you already scraped)
- JSON export alongside CSV
- Better normalization (prices to numbers, dates to ISO strings)
If you want, tell me what kind of site you’re scraping (blog archive, products, jobs) and what fields you need, and I’ll tailor selectors + pagination logic to that layout.