Python Web Scraping in Practice: A Robust “Polite Scraper” with Requests + BeautifulSoup + SQLite

Web scraping sounds easy—until you hit rate limits, broken HTML, duplicate pages, or “it worked yesterday” failures. This hands-on guide shows how to build a small but production-minded scraper in Python that:

  • Fetches pages reliably with retries and timeouts
  • Respects servers with a rate limiter and a clear User-Agent
  • Parses imperfect HTML defensively
  • Stores results safely in SQLite (idempotent: re-runs won’t duplicate data)
  • Handles pagination and basic deduplication

What we’ll scrape: A listing page with multiple article cards linking to detail pages. You can adapt the selectors to your target site. Always check a site’s Terms of Service and comply with applicable laws and permissions.

Project Setup

Create a virtual environment and install dependencies:

python -m venv .venv

# macOS/Linux
source .venv/bin/activate
# Windows
# .venv\Scripts\activate

pip install requests beautifulsoup4

We’ll use only the standard library plus requests and beautifulsoup4; SQLite ships with Python.

Step 1: A “polite” HTTP client (timeouts, retries, headers)

Most scraping issues are networking issues: slow servers, flaky connections, transient 500s. Start by centralizing HTTP logic.

import time
import random
from dataclasses import dataclass
from typing import Optional

import requests
from requests import Response


@dataclass
class HttpConfig:
    timeout_s: float = 15.0
    max_retries: int = 3
    backoff_base_s: float = 0.8  # base delay for exponential backoff
    user_agent: str = "JuniorScraperBot/1.0 (+https://example.com/bot)"


class PoliteHttpClient:
    def __init__(self, config: HttpConfig):
        self.config = config
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": self.config.user_agent,
            "Accept": "text/html,application/xhtml+xml",
        })

    def get(self, url: str) -> Response:
        last_exc: Optional[Exception] = None
        for attempt in range(1, self.config.max_retries + 1):
            try:
                resp = self.session.get(url, timeout=self.config.timeout_s)
                # Retry on transient server errors and rate limits
                if resp.status_code in (429, 500, 502, 503, 504):
                    raise RuntimeError(f"Transient HTTP {resp.status_code}")
                resp.raise_for_status()
                return resp
            except Exception as exc:
                last_exc = exc
                # Exponential backoff with jitter: 0.8s, 1.6s, 3.2s, ...
                # (base ** attempt with a base below 1 would *shrink* each retry)
                sleep_s = self.config.backoff_base_s * (2 ** (attempt - 1)) + random.uniform(0, 0.25)
                time.sleep(sleep_s)
        raise RuntimeError(f"GET failed after retries: {url}") from last_exc

Why this matters: a single failed request shouldn’t crash your run. Also, always use a sensible User-Agent—don’t pretend to be a browser if you’re not.

Step 2: Add rate limiting (avoid hammering servers)

Even if your requests succeed, sending them too fast is a good way to get blocked. Add a simple limiter: at most 1 request every min_interval_s.

class RateLimiter:
    def __init__(self, min_interval_s: float = 1.0):
        self.min_interval_s = min_interval_s
        self._last_time = 0.0

    def wait(self) -> None:
        now = time.time()
        elapsed = now - self._last_time
        if elapsed < self.min_interval_s:
            time.sleep(self.min_interval_s - elapsed)
        self._last_time = time.time()

Use it right before each HTTP call.
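To see the limiter’s effect in isolation, here is a minimal, self-contained sketch. It re-declares a trimmed-down copy of the RateLimiter above so it runs on its own; in the real scraper you’d call limiter.wait() immediately before each http.get():

```python
import time


class RateLimiter:
    """Trimmed copy of the article's limiter: at most one call per min_interval_s."""
    def __init__(self, min_interval_s: float = 1.0):
        self.min_interval_s = min_interval_s
        self._last_time = 0.0

    def wait(self) -> None:
        now = time.time()
        elapsed = now - self._last_time
        if elapsed < self.min_interval_s:
            time.sleep(self.min_interval_s - elapsed)
        self._last_time = time.time()


limiter = RateLimiter(min_interval_s=0.5)

start = time.time()
for _ in range(3):
    limiter.wait()  # in the scraper: limiter.wait(); resp = http.get(url)
elapsed = time.time() - start

# The first call goes through immediately; the next two each wait ~0.5s,
# so three calls take at least ~1 second in total.
print(f"elapsed: {elapsed:.2f}s")
```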

Step 3: Parse listings and detail pages (defensive BeautifulSoup)

HTML is messy. Your scraper should survive missing elements and layout changes. Define small parsing functions that return None or default values instead of crashing.

Assume listing pages contain cards like:

  • Title inside .card a
  • Link in the href
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def parse_listing(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    items: list[dict] = []
    for a in soup.select(".card a"):
        title = (a.get_text(strip=True) or "").strip()
        href = a.get("href")
        if not href:
            continue
        url = urljoin(base_url, href)
        if not title:
            title = url  # fallback
        items.append({
            "title": title,
            "url": url,
        })
    return items

Now parse a detail page. Let’s say the article body lives inside an article element and we want a short text snippet.

def parse_detail(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    article = soup.select_one("article")
    if not article:
        return {"snippet": "", "word_count": 0}
    text = " ".join(article.get_text(" ", strip=True).split())
    snippet = text[:200] + ("…" if len(text) > 200 else "")
    word_count = len(text.split())
    return {"snippet": snippet, "word_count": word_count}

Tip: start with a small set of fields you actually need. Scraping everything “just in case” increases breakage and slows runs.

Step 4: Store results in SQLite (idempotent upserts)

Storing data in a database makes your scraper restartable and prevents duplicates. SQLite is perfect for small/medium jobs.

import sqlite3


def init_db(path: str = "scrape.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            url TEXT PRIMARY KEY,
            title TEXT NOT NULL,
            snippet TEXT NOT NULL,
            word_count INTEGER NOT NULL,
            scraped_at TEXT NOT NULL
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_pages_title ON pages(title)")
    conn.commit()
    return conn


def upsert_page(conn: sqlite3.Connection, row: dict) -> None:
    conn.execute("""
        INSERT INTO pages (url, title, snippet, word_count, scraped_at)
        VALUES (:url, :title, :snippet, :word_count, :scraped_at)
        ON CONFLICT(url) DO UPDATE SET
            title=excluded.title,
            snippet=excluded.snippet,
            word_count=excluded.word_count,
            scraped_at=excluded.scraped_at
    """, row)
    conn.commit()

With url as the primary key, re-running the scraper updates existing rows instead of creating duplicates.
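You can verify the idempotency directly with an in-memory database: upserting the same URL twice leaves exactly one row, with the newest values winning. A standalone sketch using the schema above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pages (
        url TEXT PRIMARY KEY,
        title TEXT NOT NULL,
        snippet TEXT NOT NULL,
        word_count INTEGER NOT NULL,
        scraped_at TEXT NOT NULL
    )
""")

row = {
    "url": "https://example.com/a",
    "title": "first pass",
    "snippet": "snippet text",
    "word_count": 10,
    "scraped_at": "2024-01-01T00:00:00+00:00",
}

sql = """
    INSERT INTO pages (url, title, snippet, word_count, scraped_at)
    VALUES (:url, :title, :snippet, :word_count, :scraped_at)
    ON CONFLICT(url) DO UPDATE SET
        title=excluded.title, snippet=excluded.snippet,
        word_count=excluded.word_count, scraped_at=excluded.scraped_at
"""
conn.execute(sql, row)                               # first run inserts
conn.execute(sql, {**row, "title": "second pass"})   # re-run updates in place

count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
title = conn.execute("SELECT title FROM pages").fetchone()[0]
print(count, title)  # 1 second pass
```

Note that ON CONFLICT … DO UPDATE requires SQLite 3.24+, which any recent Python ships with.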

Step 5: Put it together—crawl listings, visit details, save rows

This script will:

  • Loop through paginated listing URLs
  • Extract item URLs
  • Fetch each detail page
  • Store results
from datetime import datetime, timezone


def now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


def scrape(base_list_url: str, base_url: str, pages: int = 3) -> None:
    http = PoliteHttpClient(HttpConfig())
    limiter = RateLimiter(min_interval_s=1.0)
    conn = init_db("scrape.db")
    seen_urls: set[str] = set()

    for page in range(1, pages + 1):
        list_url = f"{base_list_url}?page={page}"
        limiter.wait()
        resp = http.get(list_url)
        items = parse_listing(resp.text, base_url=base_url)
        print(f"[listing] page={page} items={len(items)}")

        for item in items:
            url = item["url"]
            if url in seen_urls:
                continue
            seen_urls.add(url)

            limiter.wait()
            detail_resp = http.get(url)
            detail = parse_detail(detail_resp.text)

            row = {
                "url": url,
                "title": item["title"],
                "snippet": detail["snippet"],
                "word_count": detail["word_count"],
                "scraped_at": now_iso(),
            }
            upsert_page(conn, row)
            print(f"  [saved] {row['title']} ({row['word_count']} words)")

    conn.close()


if __name__ == "__main__":
    # Example:
    #   base_list_url: listing endpoint (add ?page=N)
    #   base_url: used for resolving relative URLs
    scrape(
        base_list_url="https://example.com/blog",
        base_url="https://example.com",
        pages=3,
    )

Adjust selectors in parse_listing/parse_detail and the URL patterns. Keep the rest: retries + throttling + DB upserts make your scraper “boringly reliable.”

Step 6: Common hardening tweaks (junior-friendly)

  • Skip already-scraped URLs: Instead of only seen_urls in memory, you can read known URLs from SQLite at startup.
  • Handle non-HTML content: Check Content-Type and skip PDFs or binaries unless you explicitly support them.
  • Detect layout changes: If your selectors return 0 results unexpectedly, log a warning and save the raw HTML for debugging.
  • Be gentle on errors: Wrap detail fetch + parse in try/except so one broken page doesn’t stop the run.
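For the Content-Type check, a small helper is enough. The function name here is my own invention; the part that matters is stripping the charset suffix that servers often append (e.g. "text/html; charset=utf-8"):

```python
def is_html_response(content_type: str) -> bool:
    """Return True for HTML-ish Content-Type headers, e.g. 'text/html; charset=utf-8'."""
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in ("text/html", "application/xhtml+xml")


# In the scrape loop, before parsing:
#   if not is_html_response(detail_resp.headers.get("Content-Type", "")):
#       continue  # skip PDFs, images, and other binaries

print(is_html_response("text/html; charset=utf-8"))  # True
print(is_html_response("application/pdf"))           # False
```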

Here’s a quick “load existing URLs” helper:

def load_existing_urls(conn: sqlite3.Connection) -> set[str]:
    rows = conn.execute("SELECT url FROM pages").fetchall()
    return {r[0] for r in rows}

Then in scrape():

# after init_db
seen_urls = load_existing_urls(conn)

Debugging workflow: make failures easy to reproduce

When parsing fails, you need the exact HTML that caused it. A small trick: dump problematic responses to disk.

from pathlib import Path


def dump_html(slug: str, html: str) -> None:
    Path("debug").mkdir(exist_ok=True)
    Path(f"debug/{slug}.html").write_text(html, encoding="utf-8")

Use it when parse_detail returns empty content or when selectors find nothing. This saves hours of guesswork.
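One way to wire that in, sketched standalone (dump_html is repeated from above so this runs on its own; the slug scheme is just one option):

```python
from pathlib import Path


def dump_html(slug: str, html: str) -> None:
    Path("debug").mkdir(exist_ok=True)
    Path(f"debug/{slug}.html").write_text(html, encoding="utf-8")


def slug_for(url: str) -> str:
    # Crude but filesystem-safe: keep alphanumerics, replace everything else with "-"
    return "".join(c if c.isalnum() else "-" for c in url)[:80]


# In the detail loop:
#   detail = parse_detail(detail_resp.text)
#   if detail["word_count"] == 0:
#       dump_html(slug_for(url), detail_resp.text)

print(slug_for("https://example.com/blog/post-1"))
```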

Where to go next

Once you’re comfortable with this pattern, you can expand it safely:

  • Concurrency: Add threading or async fetching only after your parsing/storage is solid (and keep rate limits!).
  • Structured exports: Add a small CLI command to export to CSV/JSON.
  • Better normalization: Store extra fields (author, date) and normalize into separate tables if needed.
  • Respect robots and crawl rules: Build allowlists of paths and handle “noindex/noarchive” policies if applicable.
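For robots rules, the standard library’s urllib.robotparser covers the basics. A sketch that parses rules from a string so it runs offline; in practice you’d point set_url() at the site’s /robots.txt and call read():

```python
from urllib.robotparser import RobotFileParser

# Example rules as a site might serve them at /robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

ua = "JuniorScraperBot/1.0"
print(rp.can_fetch(ua, "https://example.com/blog/post-1"))  # True
print(rp.can_fetch(ua, "https://example.com/private/x"))    # False
print(rp.crawl_delay(ua))                                   # 2
```

The Crawl-delay value pairs naturally with the RateLimiter from Step 2: use it as min_interval_s when the site specifies one.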

If you copy this into your blog’s codebase as a starter template, you’ll have a scraper that behaves well, survives transient failures, and can be rerun safely—exactly what you want in real projects.

