Python Web Scraping in 2026: A Practical, Repeatable Pattern (Requests + BeautifulSoup + Retries + Storage)
Web scraping sounds easy until you hit real-world problems: flaky connections, inconsistent HTML, rate limits, pagination, and “works on my machine” scripts that fall apart the next day. This hands-on guide shows a practical scraping pattern you can reuse across projects: fetch reliably, parse defensively, back off politely, and store results in a structured way.
What you’ll build: a small scraper that collects product-like items from a paginated listing, normalizes fields, and saves them to SQLite (and optionally CSV). The code examples are “drop-in runnable” once you set the target URLs and CSS selectors for your site.
1) Ground rules: scrape responsibly
- Check terms/robots: Many sites forbid scraping. Always confirm you’re allowed.
- Identify yourself: Use a clear `User-Agent` and optionally a contact URL/email.
- Throttle requests: Sleep between requests and use exponential backoff on errors.
- Avoid unnecessary load: Cache pages while developing and don’t hammer endpoints.
- Prefer APIs when available: If the site has a public API, use it instead of HTML parsing.
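To make the first rule concrete, here's a minimal sketch for checking a URL against robots.txt using the standard library's `urllib.robotparser`. The `is_allowed` helper and the sample rules are illustrative, not from any real site; in practice you'd download the live file from `https://<site>/robots.txt` first.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules you've already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules only -- fetch the real file from the target site.
SAMPLE_RULES = """\
User-agent: *
Disallow: /admin/
"""
```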
2) Project setup
Create a virtual environment and install dependencies:
```bash
python -m venv .venv

# macOS/Linux
source .venv/bin/activate
# Windows (PowerShell)
# .venv\Scripts\Activate.ps1

pip install requests beautifulsoup4 lxml tenacity
```
We’ll use:
- `requests` for HTTP
- `BeautifulSoup` (+ `lxml`) for parsing
- `tenacity` for clean retries with backoff
- `sqlite3` from the standard library for storage
3) Reliable fetching: timeouts, retries, and polite throttling
A surprising number of scraping scripts fail because they don’t set timeouts, don’t retry transient errors, and treat every response as valid HTML. Start with a hardened fetch function:
```python
import time
import random

import requests
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
)

SESSION = requests.Session()
SESSION.headers.update({
    "User-Agent": "ExampleScraper/1.0 (+https://your-site.example/scraper-info)"
})


class FetchError(Exception):
    pass


@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch(url: str) -> str:
    # Small jitter helps avoid "thundering herd" patterns.
    time.sleep(random.uniform(0.6, 1.4))

    resp = SESSION.get(url, timeout=(5, 20))  # (connect timeout, read timeout)

    # Handle basic rate limiting or temporary blocks
    if resp.status_code in (429, 503):
        raise FetchError(f"Rate-limited or unavailable: {resp.status_code}")

    resp.raise_for_status()

    # Basic sanity check: ensure we're getting HTML
    content_type = resp.headers.get("Content-Type", "")
    if "text/html" not in content_type and "application/xhtml+xml" not in content_type:
        raise FetchError(f"Unexpected Content-Type: {content_type}")

    return resp.text
```
This function gives you:
- Timeouts so the script doesn’t hang
- Retry with exponential backoff for transient failures
- Basic detection of rate limiting
- Content-Type sanity checks
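The "cache pages while developing" rule from earlier also fits here. Below is a minimal sketch of an on-disk cache wrapper; `cached_fetch`, the `.cache` directory, and the `fetch_fn` parameter (any callable that does the real HTTP work, such as the `fetch()` function above) are our own illustrative choices.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path(".cache")


def cached_fetch(url: str, fetch_fn) -> str:
    """Wrap any fetch function with a simple on-disk cache (dev use only).

    While you iterate on parsing code, this hits the network once per URL
    and serves every later call from disk.
    """
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.html"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    html = fetch_fn(url)
    cache_file.write_text(html, encoding="utf-8")
    return html
```

Delete the `.cache` directory (or bypass the wrapper) before production runs so you don't serve stale pages.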
4) Parse defensively: CSS selectors + “best effort” fields
HTML changes. A robust scraper doesn’t assume everything exists; it extracts fields carefully and defaults gracefully.
Let’s assume the listing page has “cards” like:
- Title: `.card-title`
- Price: `.price`
- Item URL: link on `a.card-link`
Here’s a parser that returns normalized items:
```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def clean_text(value: str | None) -> str:
    if not value:
        return ""
    return " ".join(value.split()).strip()


def parse_listing(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    items: list[dict] = []

    for card in soup.select(".card"):
        title_el = card.select_one(".card-title")
        title = clean_text(title_el.get_text() if title_el else None)

        price_el = card.select_one(".price")
        price = clean_text(price_el.get_text() if price_el else None)

        link_el = card.select_one("a.card-link")
        href = link_el.get("href") if link_el else None
        url = urljoin(base_url, href) if href else ""

        if not title and not url:
            # Skip malformed cards
            continue

        items.append({
            "title": title,
            "price_raw": price,
            "url": url,
        })

    return items
```
Tip: Keep your selectors near the top of the file (or in a config) so updating them is painless.
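One way to follow that tip is a single selectors dict at module level. A minimal sketch, where the `SELECTORS` dict and the `count_cards` helper are our own illustrative names (the snippet uses the stdlib `"html.parser"` so it runs without `lxml`; swap it back in if installed):

```python
from bs4 import BeautifulSoup

# One place to update when the target site's markup changes.
SELECTORS = {
    "card": ".card",
    "title": ".card-title",
    "price": ".price",
    "link": "a.card-link",
}


def count_cards(html: str, selectors: dict = SELECTORS) -> int:
    """Tiny sanity-check helper: how many cards does this page have?"""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selectors["card"]))
```

Pass `selectors` into your parsing functions the same way, so a site redesign means editing one dict rather than hunting through code.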
5) Pagination: stop conditions that won’t loop forever
Many sites paginate via `?page=2`, "Next" links, or infinite scrolling. The simplest reliable pattern: build a "next page URL," fetch it, parse items, and stop when any of the following holds:
- No items were found, or
- You detect there is no “Next” link, or
- You hit a max page limit (safety)
Example using a `?page=` parameter:

```python
def iter_pages(base_list_url: str, max_pages: int = 20):
    # Example: https://example.com/products?page=1
    for page in range(1, max_pages + 1):
        yield f"{base_list_url}?page={page}"
6) Store results properly: SQLite with upserts
Saving to a JSON file works for tiny scripts, but SQLite is a great “next step” because it’s built-in, queryable, and supports uniqueness constraints. We’ll store items keyed by URL.
```python
import sqlite3
from datetime import datetime, timezone

DB_PATH = "scraped_items.sqlite"


def init_db():
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS items (
                url TEXT PRIMARY KEY,
                title TEXT NOT NULL,
                price_raw TEXT,
                scraped_at TEXT NOT NULL
            )
        """)
        conn.commit()


def upsert_items(items: list[dict]):
    now = datetime.now(timezone.utc).isoformat()
    with sqlite3.connect(DB_PATH) as conn:
        conn.executemany("""
            INSERT INTO items (url, title, price_raw, scraped_at)
            VALUES (?, ?, ?, ?)
            ON CONFLICT(url) DO UPDATE SET
                title=excluded.title,
                price_raw=excluded.price_raw,
                scraped_at=excluded.scraped_at
        """, [
            (it["url"], it["title"], it.get("price_raw", ""), now)
            for it in items
            if it.get("url")
        ])
        conn.commit()
```
This ensures re-running your scraper updates existing rows instead of creating duplicates.
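The intro mentioned an optional CSV export; since the data already lives in SQLite, that's a short read-and-dump. A minimal sketch, where the `export_csv` name, file paths, and column order are our own choices:

```python
import csv
import sqlite3


def export_csv(db_path: str = "scraped_items.sqlite", csv_path: str = "items.csv") -> int:
    """Dump the items table to a CSV file; returns the number of rows written."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT url, title, price_raw, scraped_at FROM items ORDER BY url"
        ).fetchall()
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "title", "price_raw", "scraped_at"])
        writer.writerows(rows)
    return len(rows)
```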
7) Put it together: a runnable scraper script
Combine fetching, parsing, pagination, and storage:
```python
from urllib.parse import urlparse


def scrape_all(base_list_url: str, max_pages: int = 20) -> int:
    init_db()
    total_saved = 0

    # Use scheme+host as base for resolving relative links
    parsed = urlparse(base_list_url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"

    for page_url in iter_pages(base_list_url, max_pages=max_pages):
        html = fetch(page_url)
        items = parse_listing(html, base_url=base_url)

        if not items:
            # Stop if the page has no items (common stop condition)
            break

        upsert_items(items)
        total_saved += len(items)
        print(f"Page: {page_url} -> items: {len(items)}")

    return total_saved


if __name__ == "__main__":
    # Change this to your target listing URL (without ?page=)
    BASE_LIST_URL = "https://example.com/products"
    saved = scrape_all(BASE_LIST_URL, max_pages=30)
    print(f"Done. Processed ~{saved} items (including updates).")
```
To run:
```bash
python scraper.py
```
8) Debugging techniques that save hours
- Save raw HTML on failure: If parsing breaks, write the HTML to a file so you can inspect it.
- Log selectors and counts: Print how many cards you found on each page.
- Validate assumptions: If you expect a price, assert it’s present for at least some items.
- Watch for bot defenses: If you suddenly get “Access Denied” pages, you’re not parsing the real content.
Quick “save HTML snapshot” helper:
```python
def save_snapshot(html: str, filename: str = "snapshot.html"):
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)
```
9) When Requests + BeautifulSoup isn’t enough
Some sites render content with JavaScript. If the HTML response doesn’t include the data you see in the browser, you have three common options:
- Find the underlying JSON call: Open DevTools → Network → XHR/Fetch and use that endpoint directly.
- Use a headless browser: Playwright is a modern choice for JS-heavy sites.
- Use a site-provided feed: Sitemap, RSS, or public API endpoints can be more stable than HTML.
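For the first option, once DevTools reveals the endpoint, plain `requests` usually suffices and the response needs no HTML parsing at all. A sketch, where `API_URL`, the `page` parameter, and the `"results"` key are assumptions about a hypothetical site; inspect the real response shape in DevTools:

```python
import requests

# Hypothetical endpoint spotted in DevTools -> Network -> XHR/Fetch.
API_URL = "https://example.com/api/products"


def fetch_json_page(page: int) -> dict:
    """Hit the JSON endpoint the page itself uses; usually far more stable."""
    resp = requests.get(
        API_URL,
        params={"page": page},
        headers={"User-Agent": "ExampleScraper/1.0"},
        timeout=(5, 20),
    )
    resp.raise_for_status()
    return resp.json()


def extract_items(payload: dict) -> list[dict]:
    # The "results" key is an assumption; adjust to the real payload.
    return payload.get("results", [])
```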
If you want a minimal Playwright “render then parse” pattern, here’s a small example (optional dependency):
```python
# pip install playwright && playwright install
from playwright.sync_api import sync_playwright


def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(
            user_agent="ExampleScraper/1.0 (+https://your-site.example/scraper-info)"
        )
        page.goto(url, wait_until="networkidle", timeout=60_000)
        html = page.content()
        browser.close()
    return html
```
You can swap `fetch()` with `fetch_rendered()` for specific pages when you truly need JS rendering.
10) A reusable checklist for production-ish scrapers
- Use timeouts and retries with exponential backoff
- Throttle with jitter
- Parse defensively (missing fields won’t crash the run)
- Have stop conditions for pagination
- Store data with unique keys (SQLite upsert is perfect)
- Save snapshots for debugging
- Prefer JSON endpoints or official APIs when possible
With this pattern, you’ll write scrapers that survive real websites, not just demo pages. The biggest mindset shift is treating scraping like integration work: networks fail, HTML changes, and your code should bend instead of break.