Python Web Scraping in 2026: A Practical, Repeatable Pattern (Requests + BeautifulSoup + Retries + Storage)
Web scraping sounds easy until you hit real-world problems: flaky connections, inconsistent HTML, rate limits, pagination, and “works on my machine” scripts that fall apart the next day. This hands-on guide shows a practical scraping pattern you can reuse across projects: fetch reliably, parse defensively, back off politely, and store results in a structured way.
What you’ll build: a small scraper that collects product-like items from a paginated listing, normalizes fields, and saves them to SQLite (and optionally CSV). The code examples are “drop-in runnable” once you set the target URLs and CSS selectors for your site.
1) Ground rules: scrape responsibly
- Check terms/robots: Many sites forbid scraping. Always confirm you’re allowed.
- Identify yourself: Use a clear `User-Agent` and optionally a contact URL/email.
- Throttle requests: Sleep between requests and use exponential backoff on errors.
- Avoid unnecessary load: Cache pages while developing and don’t hammer endpoints.
- Prefer APIs when available: If the site has a public API, use it instead of HTML parsing.
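To make the first rule concrete, here's a minimal sketch for checking a URL against robots.txt using the standard library's `urllib.robotparser`. The `is_allowed` helper and the sample rules are illustrative, not from any real site; in practice you'd download the live file from `https://<site>/robots.txt` first.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules you've already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules only -- fetch the real file from the target site.
SAMPLE_RULES = """\
User-agent: *
Disallow: /admin/
"""
```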
2) Project setup
Create a virtual environment and install dependencies:
```bash
python -m venv .venv

# macOS/Linux
source .venv/bin/activate
# Windows (PowerShell)
# .venv\Scripts\Activate.ps1

pip install requests beautifulsoup4 lxml tenacity
```
We’ll use:
- `requests` for HTTP
- `BeautifulSoup` (+ `lxml`) for parsing
- `tenacity` for clean retries with backoff
- `sqlite3` from the standard library for storage
3) Reliable fetching: timeouts, retries, and polite throttling
A surprising number of scraping scripts fail because they don’t set timeouts, don’t retry transient errors, and treat every response as valid HTML. Start with a hardened fetch function:
```python
import time
import random

import requests
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
)

SESSION = requests.Session()
SESSION.headers.update({
    "User-Agent": "ExampleScraper/1.0 (+https://your-site.example/scraper-info)"
})


class FetchError(Exception):
    pass


@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    retry=retry_if_exception_type((requests.RequestException, FetchError)),
)
def fetch(url: str) -> str:
    # Small jitter helps avoid "thundering herd" patterns.
    time.sleep(random.uniform(0.6, 1.4))

    resp = SESSION.get(url, timeout=(5, 20))  # (connect timeout, read timeout)

    # Handle basic rate limiting or temporary blocks
    if resp.status_code in (429, 503):
        raise FetchError(f"Rate-limited or unavailable: {resp.status_code}")

    resp.raise_for_status()

    # Basic sanity check: ensure we're getting HTML
    content_type = resp.headers.get("Content-Type", "")
    if "text/html" not in content_type and "application/xhtml+xml" not in content_type:
        raise FetchError(f"Unexpected Content-Type: {content_type}")

    return resp.text
```
This function gives you:
- Timeouts so the script doesn’t hang
- Retry with exponential backoff for transient failures
- Basic detection of rate limiting
- Content-Type sanity checks
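The "cache pages while developing" rule from earlier also fits here. Below is a minimal sketch of an on-disk cache wrapper; `cached_fetch`, the `.cache` directory, and the `fetch_fn` parameter (any callable that does the real HTTP work, such as the `fetch()` function above) are our own illustrative choices.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path(".cache")


def cached_fetch(url: str, fetch_fn) -> str:
    """Wrap any fetch function with a simple on-disk cache (dev use only).

    While you iterate on parsing code, this hits the network once per URL
    and serves every later call from disk.
    """
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.html"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    html = fetch_fn(url)
    cache_file.write_text(html, encoding="utf-8")
    return html
```

Delete the `.cache` directory (or bypass the wrapper) before production runs so you don't serve stale pages.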
4) Parse defensively: CSS selectors + “best effort” fields
HTML changes. A robust scraper doesn’t assume everything exists; it extracts fields carefully and defaults gracefully.
Let’s assume the listing page has “cards” like:
- Title: `.card-title`
- Price: `.price`
- Item URL: link on `a.card-link`
Here’s a parser that returns normalized items:
```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def clean_text(value: str | None) -> str:
    if not value:
        return ""
    return " ".join(value.split()).strip()


def parse_listing(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    items: list[dict] = []

    for card in soup.select(".card"):
        title_el = card.select_one(".card-title")
        title = clean_text(title_el.get_text() if title_el else None)

        price_el = card.select_one(".price")
        price = clean_text(price_el.get_text() if price_el else None)

        link_el = card.select_one("a.card-link")
        href = link_el.get("href") if link_el else None
        url = urljoin(base_url, href) if href else ""

        if not title and not url:
            # Skip malformed cards
            continue

        items.append({
            "title": title,
            "price_raw": price,
            "url": url,
        })

    return items
```
Tip: Keep your selectors near the top of the file (or in a config) so updating them is painless.
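One way to follow that tip is a single selectors dict at module level. A minimal sketch, where the `SELECTORS` dict and the `count_cards` helper are our own illustrative names (the snippet uses the stdlib `"html.parser"` so it runs without `lxml`; swap it back in if installed):

```python
from bs4 import BeautifulSoup

# One place to update when the target site's markup changes.
SELECTORS = {
    "card": ".card",
    "title": ".card-title",
    "price": ".price",
    "link": "a.card-link",
}


def count_cards(html: str, selectors: dict = SELECTORS) -> int:
    """Tiny sanity-check helper: how many cards does this page have?"""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selectors["card"]))
```

Pass `selectors` into your parsing functions the same way, so a site redesign means editing one dict rather than hunting through code.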
5) Pagination: stop conditions that won’t loop forever
Many sites paginate via `?page=2`, "Next" links, or infinite scrolling. The simplest reliable pattern: build a "next page URL," fetch it, parse items, and stop when any of the following holds:
- No items were found, or
- You detect there is no “Next” link, or
- You hit a max page limit (safety)
Example using a `?page=` parameter:

```python
def iter_pages(base_list_url: str, max_pages: int = 20):
    # Example: https://example.com/products?page=1
    for page in range(1, max_pages + 1):
        yield f"{base_list_url}?page={page}"
6) Store results properly: SQLite with upserts
Saving to a JSON file works for tiny scripts, but SQLite is a great “next step” because it’s built-in, queryable, and supports uniqueness constraints. We’ll store items keyed by URL.
```python
import sqlite3
from datetime import datetime, timezone

DB_PATH = "scraped_items.sqlite"


def init_db():
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS items (
                url TEXT PRIMARY KEY,
                title TEXT NOT NULL,
                price_raw TEXT,
                scraped_at TEXT NOT NULL
            )
        """)
        conn.commit()


def upsert_items(items: list[dict]):
    now = datetime.now(timezone.utc).isoformat()
    with sqlite3.connect(DB_PATH) as conn:
        conn.executemany("""
            INSERT INTO items (url, title, price_raw, scraped_at)
            VALUES (?, ?, ?, ?)
            ON CONFLICT(url) DO UPDATE SET
                title=excluded.title,
                price_raw=excluded.price_raw,
                scraped_at=excluded.scraped_at
        """, [
            (it["url"], it["title"], it.get("price_raw", ""), now)
            for it in items
            if it.get("url")
        ])
        conn.commit()
```
This ensures re-running your scraper updates existing rows instead of creating duplicates.
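The intro mentioned an optional CSV export; since the data already lives in SQLite, that's a short read-and-dump. A minimal sketch, where the `export_csv` name, file paths, and column order are our own choices:

```python
import csv
import sqlite3


def export_csv(db_path: str = "scraped_items.sqlite", csv_path: str = "items.csv") -> int:
    """Dump the items table to a CSV file; returns the number of rows written."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT url, title, price_raw, scraped_at FROM items ORDER BY url"
        ).fetchall()
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "title", "price_raw", "scraped_at"])
        writer.writerows(rows)
    return len(rows)
```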
7) Put it together: a runnable scraper script
Combine fetching, parsing, pagination, and storage:
```python
from urllib.parse import urlparse


def scrape_all(base_list_url: str, max_pages: int = 20) -> int:
    init_db()
    total_saved = 0

    # Use scheme+host as base for resolving relative links
    parsed = urlparse(base_list_url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"

    for page_url in iter_pages(base_list_url, max_pages=max_pages):
        html = fetch(page_url)
        items = parse_listing(html, base_url=base_url)

        if not items:
            # Stop if the page has no items (common stop condition)
            break

        upsert_items(items)
        total_saved += len(items)
        print(f"Page: {page_url} -> items: {len(items)}")

    return total_saved


if __name__ == "__main__":
    # Change this to your target listing URL (without ?page=)
    BASE_LIST_URL = "https://example.com/products"
    saved = scrape_all(BASE_LIST_URL, max_pages=30)
    print(f"Done. Processed ~{saved} items (including updates).")
```
To run:
```bash
python scraper.py
```
8) Debugging techniques that save hours
- Save raw HTML on failure: If parsing breaks, write the HTML to a file so you can inspect it.
- Log selectors and counts: Print how many cards you found on each page.
- Validate assumptions: If you expect a price, assert it’s present for at least some items.
- Watch for bot defenses: If you suddenly get “Access Denied” pages, you’re not parsing the real content.
Quick “save HTML snapshot” helper:
```python
def save_snapshot(html: str, filename: str = "snapshot.html"):
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)
```
9) When Requests + BeautifulSoup isn’t enough
Some sites render content with JavaScript. If the HTML response doesn’t include the data you see in the browser, you have three common options:
- Find the underlying JSON call: Open DevTools → Network → XHR/Fetch and use that endpoint directly.
- Use a headless browser: Playwright is a modern choice for JS-heavy sites.
- Use a site-provided feed: Sitemap, RSS, or public API endpoints can be more stable than HTML.
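For the first option, once DevTools reveals the endpoint, plain `requests` usually suffices and the response needs no HTML parsing at all. A sketch, where `API_URL`, the `page` parameter, and the `"results"` key are assumptions about a hypothetical site; inspect the real response shape in DevTools:

```python
import requests

# Hypothetical endpoint spotted in DevTools -> Network -> XHR/Fetch.
API_URL = "https://example.com/api/products"


def fetch_json_page(page: int) -> dict:
    """Hit the JSON endpoint the page itself uses; usually far more stable."""
    resp = requests.get(
        API_URL,
        params={"page": page},
        headers={"User-Agent": "ExampleScraper/1.0"},
        timeout=(5, 20),
    )
    resp.raise_for_status()
    return resp.json()


def extract_items(payload: dict) -> list[dict]:
    # The "results" key is an assumption; adjust to the real payload.
    return payload.get("results", [])
```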
If you want a minimal Playwright “render then parse” pattern, here’s a small example (optional dependency):
```python
# pip install playwright && playwright install
from playwright.sync_api import sync_playwright


def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(
            user_agent="ExampleScraper/1.0 (+https://your-site.example/scraper-info)"
        )
        page.goto(url, wait_until="networkidle", timeout=60_000)
        html = page.content()
        browser.close()
    return html
```
You can swap `fetch()` with `fetch_rendered()` for specific pages when you truly need JS rendering.
10) A reusable checklist for production-ish scrapers
- Use timeouts and retries with exponential backoff
- Throttle with jitter
- Parse defensively (missing fields won’t crash the run)
- Have stop conditions for pagination
- Store data with unique keys (SQLite upsert is perfect)
- Save snapshots for debugging
- Prefer JSON endpoints or official APIs when possible
With this pattern, you’ll write scrapers that survive real websites, not just demo pages. The biggest mindset shift is treating scraping like integration work: networks fail, HTML changes, and your code should bend instead of break.