Python Web Scraping That Doesn’t Break: Requests + BeautifulSoup with Retries, Throttling, and Clean Data Output
Web scraping is easy when you copy/paste a few lines from a tutorial—until the first time the site rate-limits you, changes a CSS class, or returns “Access Denied” intermittently. This hands-on guide shows a junior/mid-friendly pattern for scraping pages reliably using requests + BeautifulSoup, with practical upgrades like retries, timeouts, throttling, user-agent headers, and structured output to CSV/SQLite.
Assumptions: You’re scraping public pages you’re allowed to access. Always check a site’s Terms of Service and robots.txt, and be respectful with request rates.
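Python's standard library includes `urllib.robotparser` for exactly this check. A minimal sketch (the `robots.txt` body, paths, and user-agent string below are illustrative, not from a real site):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (a hardcoded example here; in practice
# you would point RobotFileParser at https://<site>/robots.txt and call .read()).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("MyScraper/1.0", "https://example.com/products"))     # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/admin/users"))  # False
```

A `Crawl-delay` directive, when present, is also a good lower bound for your throttling interval.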
Setup
Create a virtual environment and install dependencies:
```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install requests beautifulsoup4 lxml
```
We’ll use:
- `requests` for HTTP
- `BeautifulSoup` for parsing HTML
- `lxml` as a fast parser backend
A “production-ish” HTTP client: headers, timeouts, retries, backoff
Scrapers fail most often because of flaky networking or temporary blocks. Start by centralizing your HTTP logic:
```python
import random
import time
from typing import Optional

import requests
from requests import Response
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session() -> requests.Session:
    session = requests.Session()
    # A realistic User-Agent reduces "default bot" suspicion.
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/122.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    })
    # Retry strategy: retries on common transient errors + rate limits.
    retry = Retry(
        total=5,
        connect=5,
        read=5,
        status=5,
        backoff_factor=0.8,  # exponential-ish backoff: 0.8, 1.6, 3.2...
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=10)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def polite_sleep(min_s: float = 0.6, max_s: float = 1.4) -> None:
    time.sleep(random.uniform(min_s, max_s))


def fetch(session: requests.Session, url: str, timeout: float = 15.0) -> Optional[Response]:
    polite_sleep()
    resp = session.get(url, timeout=timeout)
    # If the site returns a hard block page, treat it as failure.
    if resp.status_code >= 400:
        return None
    return resp
```
Why this matters:
- `timeout` prevents hanging requests.
- `Retry` handles temporary issues (including 429 rate limits).
- `polite_sleep()` reduces load and bot-detection risk.
Parsing HTML safely: don’t assume everything exists
HTML changes. Your code shouldn’t crash because a price tag is missing. Use helper functions to extract text with sensible fallbacks:
```python
from typing import Optional

from bs4 import BeautifulSoup
from bs4.element import Tag


def text_or_none(node: Optional[Tag]) -> Optional[str]:
    if not node:
        return None
    return node.get_text(strip=True) or None


def attr_or_none(node: Optional[Tag], attr: str) -> Optional[str]:
    if not node:
        return None
    return node.get(attr) or None


def parse_listing(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    # Example: a page with multiple "cards"
    cards = soup.select(".card")  # Adjust selector per site
    items: list[dict] = []
    for card in cards:
        title = text_or_none(card.select_one(".card-title"))
        price = text_or_none(card.select_one(".price"))
        link = attr_or_none(card.select_one("a"), "href")
        # Normalize relative links if needed later
        items.append({
            "title": title,
            "price": price,
            "link": link,
        })
    return items
```
Tip: Prefer stable selectors. If a site has semantic attributes (like `data-testid`) or consistent structure, use those over “random” class names.
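For example, CSS attribute selectors let you target `data-testid` hooks instead of auto-generated class names. The HTML below is a made-up snippet to illustrate the idea:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the css-* classes are the kind of generated names
# that change on every redesign; the data-testid attributes tend to be stable.
HTML = """
<div class="css-x7k2q" data-testid="product-card">
  <h2 class="css-9fz1m" data-testid="product-title">Blue Widget</h2>
  <span class="css-q8r4n" data-testid="product-price">$19.99</span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
card = soup.select_one('[data-testid="product-card"]')
# Attribute selectors survive redesigns that shuffle the generated class names.
title = card.select_one('[data-testid="product-title"]').get_text(strip=True)
price = card.select_one('[data-testid="product-price"]').get_text(strip=True)
print(title, price)  # Blue Widget $19.99
```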
End-to-end example: scrape multiple pages + dedupe + export to CSV
Let’s build a small CLI-like script that:
- Visits a list of paginated URLs
- Parses items
- De-duplicates by link/title
- Exports to `output.csv`
```python
import csv
from urllib.parse import urljoin

BASE_URL = "https://example.com"
LISTING_URLS = [
    "https://example.com/products?page=1",
    "https://example.com/products?page=2",
    "https://example.com/products?page=3",
]


def normalize_item(item: dict) -> dict:
    # Convert relative link to absolute
    if item.get("link"):
        item["link"] = urljoin(BASE_URL, item["link"])
    return item


def export_csv(rows: list[dict], path: str) -> None:
    fieldnames = ["title", "price", "link"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def main() -> None:
    session = build_session()
    all_items: list[dict] = []

    for url in LISTING_URLS:
        resp = fetch(session, url)
        if not resp:
            print(f"Failed to fetch: {url}")
            continue
        items = parse_listing(resp.text)
        all_items.extend(normalize_item(x) for x in items)
        print(f"{url}: +{len(items)} items")

    # Dedupe by link (or title if link missing)
    seen = set()
    deduped: list[dict] = []
    for item in all_items:
        key = item.get("link") or item.get("title")
        if not key or key in seen:
            continue
        seen.add(key)
        deduped.append(item)

    export_csv(deduped, "output.csv")
    print(f"Exported {len(deduped)} unique items to output.csv")


if __name__ == "__main__":
    main()
```
Swap `BASE_URL`, `LISTING_URLS`, and the CSS selectors in `parse_listing()` to match your target site.
Pagination the robust way: discover “next” links instead of guessing
Hardcoding page numbers works until the site changes. A better approach is “follow the next link until it disappears”:
```python
from urllib.parse import urljoin


def find_next_page(html: str, current_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    next_a = soup.select_one("a[rel='next']") or soup.select_one("a.next")
    href = attr_or_none(next_a, "href")
    return urljoin(current_url, href) if href else None


def scrape_all_pages(start_url: str, max_pages: int = 50) -> list[dict]:
    session = build_session()
    url = start_url
    page = 0
    items: list[dict] = []
    while url and page < max_pages:
        page += 1
        resp = fetch(session, url)
        if not resp:
            print(f"Failed page {page}: {url}")
            break
        items.extend(parse_listing(resp.text))
        url = find_next_page(resp.text, url)
        print(f"Scraped page {page}, next={url}")
    return items
```
This pattern is more resilient because it follows the site’s own navigation.
Persisting data to SQLite for free querying
CSV is nice, but SQLite makes it easy to query results later (and avoids duplicates using a unique constraint):
```python
import sqlite3

CREATE_SQL = """
CREATE TABLE IF NOT EXISTS items (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT,
    price TEXT,
    link TEXT UNIQUE
);
"""

INSERT_SQL = """
INSERT OR IGNORE INTO items (title, price, link)
VALUES (?, ?, ?);
"""


def save_to_sqlite(db_path: str, rows: list[dict]) -> None:
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(CREATE_SQL)
        for r in rows:
            conn.execute(INSERT_SQL, (r.get("title"), r.get("price"), r.get("link")))
        conn.commit()
    finally:
        conn.close()
```
Now you can run:
```bash
sqlite3 scraped.db "SELECT COUNT(*) FROM items;"
sqlite3 scraped.db "SELECT title, link FROM items LIMIT 10;"
```
Common failure modes (and fixes)
- **You get empty results, but the page loads in your browser.** The content might be rendered by JavaScript. `requests` downloads the initial HTML, not the post-JS DOM. Options: find an underlying JSON API the page calls, or use a browser automation tool when necessary.
- **Random 403/429 errors.** Slow down (increase `polite_sleep`), add jitter, rotate through a small set of user agents, and avoid scraping too fast. Retries help, but the real fix is respecting rate limits.
- **Selectors break after a redesign.** Add defensive parsing, log samples of HTML when parsing fails, and prefer stable selectors (semantic attributes, consistent headings, URL patterns).
- **Data is messy (prices, whitespace, inconsistent formats).** Normalize right after parsing: strip currency symbols, convert to decimals, standardize dates, and store both raw and cleaned fields if needed.
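For the messy-data case, a small stdlib-only cleaner can run right after parsing. This sketch assumes prices use `.` as the decimal separator and `,` only as a thousands separator:

```python
import re
from decimal import Decimal
from typing import Optional


def clean_price(raw: Optional[str]) -> Optional[Decimal]:
    """Extract a decimal price from messy scraped text, or None if no number found."""
    if not raw:
        return None
    # Drop thousands separators, then grab the first number; currency
    # symbols, labels, and whitespace fall away naturally.
    match = re.search(r"-?\d+(?:\.\d+)?", raw.replace(",", ""))
    return Decimal(match.group()) if match else None


print(clean_price("  $1,299.99 "))    # 1299.99
print(clean_price("EUR 45"))          # 45
print(clean_price("Call for price"))  # None
```

`Decimal` avoids the float rounding surprises you get when summing or comparing prices later.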
Checklist you can reuse on your next scraper
- `Session` with headers + connection pooling
- Timeouts on every request
- Retries with backoff for 429 and 5xx
- Throttling with jitter (`sleep` randomization)
- HTML parsing with fallbacks (never assume elements exist)
- Dedupe strategy (unique key like `link`)
- Export to CSV and/or persist to SQLite
- Logs for failures and samples for debugging
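For the last checklist item, the stdlib `logging` module is enough to record failures together with an HTML sample. A minimal sketch (the logger name and truncation length are arbitrary choices):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("scraper")


def report_parse_failure(url: str, html: str, max_sample: int = 500) -> None:
    # Log a truncated HTML sample so you can see what the selector missed
    # without dumping megabytes into your log file.
    log.warning("Parse failure at %s; sample: %r", url, html[:max_sample])
```

Call it wherever `parse_listing()` returns zero items; a few saved samples make selector debugging far easier than a bare "failed" message.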