Python Web Scraping in 2026: A Practical, Repeatable Pattern (Requests + BeautifulSoup + Retries + Storage)

Web scraping sounds easy until you hit real-world problems: flaky connections, inconsistent HTML, rate limits, pagination, and “works on my machine” scripts that fall apart the next day. This hands-on guide shows a practical scraping pattern you can reuse across projects: fetch reliably, parse defensively, back off politely, and store results in a structured way.

What you’ll build: a small scraper that collects product-like items from a paginated listing, normalizes fields, and saves them to SQLite (and optionally CSV). The code examples are “drop-in runnable” once you set the target URLs and CSS selectors for your site.

1) Ground rules: scrape responsibly

  • Check terms/robots: Many sites forbid scraping. Always confirm you’re allowed.
  • Identify yourself: Use a clear User-Agent and optionally a contact URL/email.
  • Throttle requests: Sleep between requests and use exponential backoff on errors.
  • Avoid unnecessary load: Cache pages while developing and don’t hammer endpoints.
  • Prefer APIs when available: If the site has a public API, use it instead of HTML parsing.
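The robots check from the first bullet can be sketched with the standard library alone. Everything here is illustrative: the user agent token is a placeholder, and how you fetch robots.txt (e.g. with requests) is up to you.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleScraper"  # placeholder: match your real User-Agent token

def robots_for(robots_lines: list[str]) -> RobotFileParser:
    """Build a parser from robots.txt lines you've already fetched."""
    rp = RobotFileParser()
    rp.parse(robots_lines)
    return rp

def is_allowed(rp: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the parsed robots.txt permits fetching this URL."""
    return rp.can_fetch(user_agent, url)
```

Keeping the parse step separate from the fetch step also makes the check easy to unit-test without touching the network.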

2) Project setup

Create a virtual environment and install dependencies:

python -m venv .venv

# macOS/Linux
source .venv/bin/activate

# Windows (PowerShell)
# .venv\Scripts\Activate.ps1

pip install requests beautifulsoup4 lxml tenacity

We’ll use:

  • requests for HTTP
  • BeautifulSoup (+ lxml) for parsing
  • tenacity for clean retries with backoff
  • sqlite3 from the standard library for storage

3) Reliable fetching: timeouts, retries, and polite throttling

A surprising number of scraping scripts fail because they don’t set timeouts, don’t retry transient errors, and treat every response as valid HTML. Start with a hardened fetch function:

import time
import random

import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

SESSION = requests.Session()
SESSION.headers.update({
    "User-Agent": "ExampleScraper/1.0 (+https://your-site.example/scraper-info)"
})

class FetchError(Exception):
    pass

@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch(url: str) -> str:
    # Small jitter helps avoid "thundering herd" patterns.
    time.sleep(random.uniform(0.6, 1.4))

    resp = SESSION.get(url, timeout=(5, 20))  # (connect timeout, read timeout)

    # Handle basic rate limiting or temporary blocks
    if resp.status_code in (429, 503):
        raise FetchError(f"Rate-limited or unavailable: {resp.status_code}")

    resp.raise_for_status()

    # Basic sanity check: ensure we're getting HTML
    content_type = resp.headers.get("Content-Type", "")
    if "text/html" not in content_type and "application/xhtml+xml" not in content_type:
        raise FetchError(f"Unexpected Content-Type: {content_type}")

    return resp.text

This function gives you:

  • Timeouts so the script doesn’t hang
  • Retry with exponential backoff for transient failures
  • Basic detection of rate limiting
  • Content-Type sanity checks

4) Parse defensively: CSS selectors + “best effort” fields

HTML changes. A robust scraper doesn’t assume everything exists; it extracts fields carefully and defaults gracefully.

Let’s assume the listing page has “cards” like:

  • Title: .card-title
  • Price: .price
  • Item URL: link on a.card-link

Here’s a parser that returns normalized items:

from bs4 import BeautifulSoup
from urllib.parse import urljoin

def clean_text(value: str | None) -> str:
    if not value:
        return ""
    return " ".join(value.split()).strip()

def parse_listing(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    items: list[dict] = []

    for card in soup.select(".card"):
        title_el = card.select_one(".card-title")
        title = clean_text(title_el.get_text() if title_el else None)

        price_el = card.select_one(".price")
        price = clean_text(price_el.get_text() if price_el else None)

        link_el = card.select_one("a.card-link")
        href = link_el.get("href") if link_el else None
        url = urljoin(base_url, href) if href else ""

        if not title and not url:
            # Skip malformed cards
            continue

        items.append({
            "title": title,
            "price_raw": price,
            "url": url,
        })

    return items

Tip: Keep your selectors near the top of the file (or in a config) so updating them is painless.
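In practice that tip can look like the sketch below; the selector values and the helper name are illustrative, not part of the original code.

```python
# Centralized selectors: one place to update when the site's markup changes.
# These values are illustrative; replace them with your target site's classes.
SELECTORS = {
    "card": ".card",
    "title": ".card-title",
    "price": ".price",
    "link": "a.card-link",
}

def select_text(card, key: str) -> str:
    """Best-effort text extraction using the shared selector table.

    `card` is any element with a BeautifulSoup-style select_one().
    """
    el = card.select_one(SELECTORS[key])
    return " ".join(el.get_text().split()) if el else ""
```

When the site redesigns, you edit one dict instead of hunting selectors through the whole file.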

5) Pagination: stop conditions that won’t loop forever

Many sites paginate via ?page=2, “Next” links, or infinite scrolling. For junior and mid-level devs, the simplest reliable pattern is: build a “next page” URL, fetch it, parse the items, and stop when any of the following holds:

  • No items were found, or
  • You detect there is no “Next” link, or
  • You hit a max page limit (safety)

Example using a ?page= parameter:

def iter_pages(base_list_url: str, max_pages: int = 20):
    # Example: https://example.com/products?page=1
    for page in range(1, max_pages + 1):
        yield f"{base_list_url}?page={page}"

6) Store results properly: SQLite with upserts

Saving to a JSON file works for tiny scripts, but SQLite is a great “next step” because it’s built-in, queryable, and supports uniqueness constraints. We’ll store items keyed by URL.

import sqlite3
from datetime import datetime, timezone

DB_PATH = "scraped_items.sqlite"

def init_db():
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS items (
                url TEXT PRIMARY KEY,
                title TEXT NOT NULL,
                price_raw TEXT,
                scraped_at TEXT NOT NULL
            )
        """)
        conn.commit()

def upsert_items(items: list[dict]):
    now = datetime.now(timezone.utc).isoformat()
    with sqlite3.connect(DB_PATH) as conn:
        conn.executemany("""
            INSERT INTO items (url, title, price_raw, scraped_at)
            VALUES (?, ?, ?, ?)
            ON CONFLICT(url) DO UPDATE SET
                title=excluded.title,
                price_raw=excluded.price_raw,
                scraped_at=excluded.scraped_at
        """, [
            (it["url"], it["title"], it.get("price_raw", ""), now)
            for it in items
            if it.get("url")
        ])
        conn.commit()

This ensures re-running your scraper updates existing rows instead of creating duplicates.
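The optional CSV export mentioned in the intro can read straight from the same database. A minimal sketch, assuming the items table above; the file names are just defaults.

```python
import csv
import sqlite3

DB_PATH = "scraped_items.sqlite"  # same database file as above

def export_csv(csv_path: str = "scraped_items.csv") -> int:
    """Dump the items table to CSV; returns the number of rows written."""
    conn = sqlite3.connect(DB_PATH)
    try:
        rows = conn.execute(
            "SELECT url, title, price_raw, scraped_at FROM items ORDER BY url"
        ).fetchall()
    finally:
        conn.close()

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "title", "price_raw", "scraped_at"])
        writer.writerows(rows)
    return len(rows)
```

Because SQLite remains the source of truth, you can regenerate the CSV at any time without re-scraping.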

7) Put it together: a runnable scraper script

Combine fetching, parsing, pagination, and storage:

from urllib.parse import urlparse

def scrape_all(base_list_url: str, max_pages: int = 20) -> int:
    init_db()
    total_saved = 0

    # Use scheme+host as base for resolving relative links
    parsed = urlparse(base_list_url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"

    for page_url in iter_pages(base_list_url, max_pages=max_pages):
        html = fetch(page_url)
        items = parse_listing(html, base_url=base_url)

        if not items:
            # Stop if the page has no items (common stop condition)
            break

        upsert_items(items)
        total_saved += len(items)
        print(f"Page: {page_url} -> items: {len(items)}")

    return total_saved

if __name__ == "__main__":
    # Change this to your target listing URL (without ?page=)
    BASE_LIST_URL = "https://example.com/products"
    saved = scrape_all(BASE_LIST_URL, max_pages=30)
    print(f"Done. Processed ~{saved} items (including updates).")

To run:

python scraper.py

8) Debugging techniques that save hours

  • Save raw HTML on failure: If parsing breaks, write the HTML to a file so you can inspect it.
  • Log selectors and counts: Print how many cards you found on each page.
  • Validate assumptions: If you expect a price, assert it’s present for at least some items.
  • Watch for bot defenses: If you suddenly get “Access Denied” pages, you’re not parsing the real content.

Quick “save HTML snapshot” helper:

def save_snapshot(html: str, filename: str = "snapshot.html"):
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)
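One way to wire snapshots into a parse step is a generic wrapper: run the parser, and on any failure save the HTML before re-raising. The wrapper name and filename are mine, not from the original code.

```python
def with_snapshot(parse_fn, html: str, filename: str = "failed_page.html"):
    """Run parse_fn(html); on failure, save the HTML for inspection and re-raise."""
    try:
        return parse_fn(html)
    except Exception:
        with open(filename, "w", encoding="utf-8") as f:
            f.write(html)
        raise  # re-raise so the failure is still visible upstream
```

Now every parsing crash leaves behind the exact page that caused it, which usually pinpoints the changed selector in minutes.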

9) When Requests + BeautifulSoup isn’t enough

Some sites render content with JavaScript. If the HTML response doesn’t include the data you see in the browser, you have three common options:

  • Find the underlying JSON call: Open DevTools → Network → XHR/Fetch and use that endpoint directly.
  • Use a headless browser: Playwright is a modern choice for JS-heavy sites.
  • Use a site-provided feed: Sitemap, RSS, or public API endpoints can be more stable than HTML.
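As a sketch of the first option: once DevTools shows you the JSON endpoint, call it directly. Every URL and field name below is a hypothetical stand-in for what you actually discover in the Network tab.

```python
import requests

API_URL = "https://example.com/api/products"  # hypothetical; find the real one in DevTools

def parse_api_payload(payload: dict) -> list[dict]:
    """Normalize one page of a hypothetical JSON response into our item dicts.

    The 'items', 'name', 'price', and 'url' keys are assumptions.
    """
    return [
        {
            "title": it.get("name", ""),
            "price_raw": str(it.get("price", "")),
            "url": it.get("url", ""),
        }
        for it in payload.get("items", [])
    ]

def fetch_items_json(page: int) -> list[dict]:
    resp = requests.get(
        API_URL,
        params={"page": page},
        headers={"User-Agent": "ExampleScraper/1.0"},
        timeout=(5, 20),
    )
    resp.raise_for_status()
    return parse_api_payload(resp.json())
```

JSON endpoints are usually more stable than HTML, and you skip BeautifulSoup entirely for those pages.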

If you want a minimal Playwright “render then parse” pattern, here’s a small example (optional dependency):

# pip install playwright && playwright install
from playwright.sync_api import sync_playwright

def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(
            user_agent="ExampleScraper/1.0 (+https://your-site.example/scraper-info)"
        )
        page.goto(url, wait_until="networkidle", timeout=60_000)
        html = page.content()
        browser.close()
        return html

You can swap fetch() with fetch_rendered() for specific pages when you truly need JS rendering.

10) A reusable checklist for production-ish scrapers

  • Use timeouts and retries with exponential backoff
  • Throttle with jitter
  • Parse defensively (missing fields won’t crash the run)
  • Have stop conditions for pagination
  • Store data with unique keys (SQLite upsert is perfect)
  • Save snapshots for debugging
  • Prefer JSON endpoints or official APIs when possible

With this pattern, you’ll write scrapers that survive real websites, not just demo pages. The biggest mindset shift is treating scraping like integration work: networks fail, HTML changes, and your code should bend instead of break.

