Python Web Scraping in 2026: A Practical Pattern with httpx + BeautifulSoup + SQLite (Polite, Testable, Repeatable)
Web scraping is easy to start and easy to accidentally do badly: you get blocked, you scrape duplicate data, your script breaks when the HTML changes, or you can’t reproduce yesterday’s run. This guide shows a practical, “ship-it” scraping pattern for junior/mid developers using httpx (HTTP client), BeautifulSoup (HTML parsing), and sqlite3 (storage). You’ll build a small scraper that:
- Fetches pages politely (headers, timeouts, rate limiting)
- Parses stable selectors and fails loudly when markup changes
- Stores results idempotently (no duplicates)
- Supports incremental runs (resume / re-run safely)
Use this only where you have permission to scrape (your own site, explicit API/terms allowance, or written permission). Always respect robots.txt and site terms.
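Checking robots.txt can be automated with the standard library. Below is a minimal sketch using `urllib.robotparser`; the helper name `allowed_by_robots` and the sample rules are made up for illustration:

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse a robots.txt body and check whether user_agent may fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /admin/\n"
print(allowed_by_robots(rules, "ExampleScraper", "https://example.com/blog"))     # True
print(allowed_by_robots(rules, "ExampleScraper", "https://example.com/admin/x"))  # False
```

In a real scraper you would fetch the live file once (e.g. `rp.set_url("https://example.com/robots.txt"); rp.read()`) and cache the parser, rather than re-downloading it per request.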
Project setup
Create a virtual environment and install dependencies:
```shell
python -m venv .venv

# mac/linux
source .venv/bin/activate
# windows (powershell)
# .venv\Scripts\Activate.ps1

pip install httpx beautifulsoup4
```
We’ll use only standard library for storage (sqlite3) so there’s nothing else to install.
Target example: scrape a paginated “blog listing”
Most real scraping jobs look like:
- List pages: /blog?page=1, /blog?page=2, …
- Each list page has cards with title/link/date
- Optionally, detail pages with content
We’ll implement both phases: collect article URLs from list pages, then visit detail pages to extract a few fields.
1) A small “polite fetch” wrapper
Scrapers get flaky when you treat the network as reliable. Wrap requests with timeouts, retries, and friendly headers.
```python
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Optional

import httpx


@dataclass
class FetchConfig:
    base_url: str
    timeout_s: float = 20.0
    min_delay_s: float = 0.6  # rate limit between requests
    max_retries: int = 3
    user_agent: str = "Mozilla/5.0 (compatible; ExampleScraper/1.0; +https://example.com)"


class Fetcher:
    def __init__(self, cfg: FetchConfig) -> None:
        self.cfg = cfg
        self._client = httpx.Client(
            base_url=cfg.base_url,
            headers={
                "User-Agent": cfg.user_agent,
                "Accept": "text/html,application/xhtml+xml",
            },
            timeout=cfg.timeout_s,
            follow_redirects=True,
        )
        self._last_request_at = 0.0

    def close(self) -> None:
        self._client.close()

    def get_text(self, path: str) -> str:
        # Simple rate limiting
        elapsed = time.time() - self._last_request_at
        if elapsed < self.cfg.min_delay_s:
            time.sleep(self.cfg.min_delay_s - elapsed)

        last_exc: Optional[Exception] = None
        for attempt in range(1, self.cfg.max_retries + 1):
            try:
                resp = self._client.get(path)
                self._last_request_at = time.time()
                resp.raise_for_status()
                return resp.text
            except Exception as exc:
                last_exc = exc
                time.sleep(0.5 * attempt)  # basic backoff
        raise RuntimeError(
            f"Failed GET {path} after {self.cfg.max_retries} retries"
        ) from last_exc
```
Why this matters:
- `timeout` avoids hanging forever.
- `follow_redirects` handles common redirects to canonical URLs.
- The delay prevents hammering the site and helps avoid blocks.
2) Parse list pages with “assertive selectors”
Scrapers break when HTML changes. The trick is to fail loudly and early—so you notice changes immediately instead of silently collecting bad data.
```python
from bs4 import BeautifulSoup


def parse_list_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    # Example structure (you'll adapt selectors to your target site):
    # <article class="post-card">
    #   <a class="post-card__link" href="/posts/123">...</a>
    #   <h2 class="post-card__title">Title</h2>
    #   <time datetime="2026-02-01">...</time>
    # </article>
    cards = soup.select("article.post-card")
    if not cards:
        raise ValueError("No post cards found; the markup likely changed.")

    items: list[dict] = []
    for card in cards:
        a = card.select_one("a.post-card__link")
        title_el = card.select_one("h2.post-card__title")
        time_el = card.select_one("time[datetime]")
        if not a or not title_el:
            raise ValueError("Expected link/title in a card, but selector failed.")
        items.append(
            {
                "url": a.get("href"),
                "title": title_el.get_text(strip=True),
                "published_at": time_el.get("datetime") if time_el else None,
            }
        )
    return items


def parse_next_page_path(html: str) -> str | None:
    soup = BeautifulSoup(html, "html.parser")
    next_a = soup.select_one('a[rel="next"]')
    return next_a.get("href") if next_a else None
```
Two good habits here:
- Use `select()` / `select_one()` with CSS selectors (readable and common).
- Throw an exception if key elements are missing, rather than returning garbage.
3) Parse detail pages for richer fields
List pages usually lack full content. Add a detail parser that extracts a stable subset: canonical URL, summary, tags, etc.
```python
def parse_detail_page(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # Example:
    # <h1 class="post-title">...</h1>
    # <div class="post-content">...</div>
    # <meta property="og:url" content="https://site.com/posts/123">
    title = soup.select_one("h1.post-title")
    content = soup.select_one("div.post-content")
    og_url = soup.select_one('meta[property="og:url"]')
    if not title or not content:
        raise ValueError("Detail page missing title/content; markup changed.")

    text = content.get_text("\n", strip=True)
    canonical = og_url.get("content") if og_url else None
    return {
        "canonical_url": canonical,
        "title": title.get_text(strip=True),
        "content_text": text,
    }
```
Tip: Don’t try to scrape “everything.” Pick fields you actually need, and make them reliable.
4) Store results safely with SQLite (idempotent upserts)
When you rerun a scraper, you want it to update existing records instead of inserting duplicates. SQLite is perfect for a lightweight pipeline.
```python
import sqlite3
from pathlib import Path


def init_db(db_path: str = "scrape.db") -> sqlite3.Connection:
    Path(db_path).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS posts (
            url TEXT PRIMARY KEY,
            canonical_url TEXT,
            title TEXT NOT NULL,
            published_at TEXT,
            content_text TEXT,
            scraped_at TEXT NOT NULL DEFAULT (datetime('now'))
        )
        """
    )
    return conn


def upsert_post(conn: sqlite3.Connection, row: dict) -> None:
    conn.execute(
        """
        INSERT INTO posts (url, canonical_url, title, published_at, content_text)
        VALUES (:url, :canonical_url, :title, :published_at, :content_text)
        ON CONFLICT(url) DO UPDATE SET
            canonical_url = excluded.canonical_url,
            title = excluded.title,
            published_at = excluded.published_at,
            content_text = excluded.content_text,
            scraped_at = datetime('now')
        """,
        row,
    )
    conn.commit()
```
The PRIMARY KEY on url + ON CONFLICT gives you “run it as many times as you want” behavior.
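You can see the idempotency in isolation with an in-memory database. This is a stripped-down demo of the same PRIMARY KEY + ON CONFLICT pattern (two columns only, table and values invented for the demo):

```python
import sqlite3

# Minimal demo of the PRIMARY KEY + ON CONFLICT upsert pattern.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (url TEXT PRIMARY KEY, title TEXT NOT NULL)")

upsert = """
INSERT INTO posts (url, title) VALUES (:url, :title)
ON CONFLICT(url) DO UPDATE SET title = excluded.title
"""
conn.execute(upsert, {"url": "/posts/1", "title": "First draft"})
conn.execute(upsert, {"url": "/posts/1", "title": "Edited title"})  # same URL: update, not a duplicate

rows = conn.execute("SELECT url, title FROM posts").fetchall()
print(rows)  # [('/posts/1', 'Edited title')] -- one row, latest title
```

Running the two inserts in either order any number of times always leaves exactly one row per URL, which is exactly the rerun-safety the scraper relies on.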
5) Put it together: crawl, parse, store
This orchestrator does:
- Start at /blog?page=1 (or any path you choose)
- Parse cards and store list fields
- Visit each detail page and update with content
- Follow “next page” links until done
```python
from urllib.parse import urlparse


def normalize_path(href: str) -> str:
    """
    Convert full URLs to paths for httpx base_url usage.
    Keeps relative paths as-is.
    """
    if href.startswith(("http://", "https://")):
        parts = urlparse(href)
        return parts.path + (f"?{parts.query}" if parts.query else "")
    return href


def run_scrape() -> None:
    cfg = FetchConfig(base_url="https://example.com")
    fetcher = Fetcher(cfg)
    conn = init_db("scrape.db")
    try:
        page_path = "/blog?page=1"
        seen_pages = 0
        while page_path:
            seen_pages += 1
            html = fetcher.get_text(page_path)
            items = parse_list_page(html)
            for item in items:
                # Store basic metadata (content_text empty for now)
                row = {
                    "url": normalize_path(item["url"]),
                    "canonical_url": None,
                    "title": item["title"],
                    "published_at": item["published_at"],
                    "content_text": None,
                }
                upsert_post(conn, row)

                # Fetch detail and update
                detail_html = fetcher.get_text(row["url"])
                detail = parse_detail_page(detail_html)
                upsert_post(
                    conn,
                    {
                        **row,
                        "canonical_url": detail.get("canonical_url"),
                        "title": detail.get("title") or row["title"],
                        "content_text": detail.get("content_text"),
                    },
                )

            next_path = parse_next_page_path(html)
            page_path = normalize_path(next_path) if next_path else None

        print(f"Done. Crawled {seen_pages} pages.")
    finally:
        fetcher.close()
        conn.close()


if __name__ == "__main__":
    run_scrape()
```
Make it real: replace https://example.com and the CSS selectors in `parse_list_page` / `parse_detail_page` with selectors for your target site.
Debugging and “scraper hygiene” tips
- Start with a single page. Hardcode one list page and print extracted fields before adding pagination.
- Save HTML snapshots when things break. If a `ValueError` fires, write the HTML to `debug.html` so you can inspect it.
- Prefer stable attributes. Classes like `post-card__title` are okay; data attributes (`data-testid`, `data-*`) are often more stable.
- Don't over-parallelize. Concurrency can trigger rate limiting and bans. Get correctness first, then add careful concurrency if needed.
- Respect caching. If the site provides ETags/Last-Modified, you can store them and use conditional requests later (advanced but worth it at scale).
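The snapshot tip is easy to bake into a tiny wrapper. Here's one possible sketch; the helper name `parse_with_snapshot` is invented for this example, and it piggybacks on the fact that the parsers above raise `ValueError` on selector failure:

```python
from pathlib import Path


def parse_with_snapshot(html: str, parser, snapshot_path: str = "debug.html"):
    """Run a parser; if it raises ValueError, save the HTML for inspection, then re-raise."""
    try:
        return parser(html)
    except ValueError:
        Path(snapshot_path).write_text(html, encoding="utf-8")
        raise
```

Usage would look like `items = parse_with_snapshot(html, parse_list_page)`: on the happy path nothing changes, and when the markup shifts you get both the loud exception and the exact HTML that caused it.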
Common extension: incremental mode (scrape only “new” posts)
Once you have a database, you can avoid re-fetching detail pages you already scraped. For example, add a check like:
```python
def already_has_content(conn: sqlite3.Connection, url: str) -> bool:
    cur = conn.execute("SELECT content_text FROM posts WHERE url = ?", (url,))
    row = cur.fetchone()
    return bool(row and row[0])


# Then in the loop, before fetching detail:
if not already_has_content(conn, row["url"]):
    detail_html = fetcher.get_text(row["url"])
    ...
```
This turns your scraper into a daily job you can rerun safely.
Wrap-up
A reliable scraper isn’t “a loop that downloads pages.” It’s a small pipeline: polite fetching, assertive parsing, and idempotent storage. With the pattern above, you can confidently adapt selectors to your target site, rerun the job without duplicates, and detect markup changes quickly—exactly what you want when you’re maintaining scraping scripts on a real team.