Python Web Scraping in Practice: Resilient Crawls with Requests, BeautifulSoup, and a SQLite Pipeline

Web scraping is easy when a page is simple—and frustrating when you hit rate limits, flaky HTML, and messy data. This hands-on guide shows a practical scraping workflow that junior/mid developers can ship: fetch pages reliably, parse consistently, avoid common traps, and store results in a database so you can resume and iterate.

What you’ll build: a small scraper that collects product-like cards (title, price, link) from one or more pages, normalizes the data, and saves it to SQLite with deduping and safe retries.

Project Setup

Create a virtual environment and install dependencies:

    python -m venv .venv
    # Windows: .venv\Scripts\activate
    source .venv/bin/activate
    pip install requests beautifulsoup4 lxml tenacity

We’ll use:

  • requests for HTTP
  • BeautifulSoup + lxml for parsing
  • tenacity for retries with backoff
  • sqlite3 (built-in) for storage

Step 1: Fetch Pages Like You Mean It

Most scraping failures are “network-ish”: timeouts, 429s, transient 5xx errors. You want a single function that handles headers, timeouts, retries, and respectful pacing.

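    import random
    import time

    import requests
    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

    DEFAULT_HEADERS = {
        "User-Agent": "Mozilla/5.0 (compatible; MiniScraper/1.0; +https://example.com/bot)",
        "Accept-Language": "en-US,en;q=0.9",
    }

    class FetchError(Exception):
        """Raised for responses that are worth retrying."""

    @retry(
        stop=stop_after_attempt(5),
        wait=wait_exponential(multiplier=1, min=1, max=20),
        # Retry only transient failures. Retrying on every RequestException would
        # also retry the HTTPError that raise_for_status() raises for a 404 or 403.
        retry=retry_if_exception_type((requests.ConnectionError, requests.Timeout, FetchError)),
        reraise=True,  # surface the last real exception instead of tenacity.RetryError
    )
    def fetch_html(url: str, session: requests.Session) -> str:
        # timeout=(connect, read), in seconds
        resp = session.get(url, headers=DEFAULT_HEADERS, timeout=(5, 20))

        # Handle common "try again later" scenarios
        if resp.status_code in (429, 500, 502, 503, 504):
            raise FetchError(f"Retryable status: {resp.status_code}")

        # Any other 4xx/5xx fails immediately without a retry
        resp.raise_for_status()
        return resp.text

    def polite_sleep(min_s=0.7, max_s=1.6):
        # Randomized pause between requests so the crawl does not hit the server in a fixed rhythm
        time.sleep(random.uniform(min_s, max_s))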

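Why this matters: retries should be centralized so every caller gets consistent behavior. Also, timeout=(connect, read) sets separate limits for opening the connection and for waiting on data from the server, so a stalled server cannot hang the crawl indefinitely.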
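
One refinement the fetch function above does not include: a 429 or 503 response often carries a Retry-After header saying how long to wait. The sketch below parses that header into a number of seconds. The helper name retry_after_seconds and its default and cap values are assumptions for illustration; wiring the result into your retry policy is left to you.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=5.0, cap=60.0):
    """Turn a Retry-After header value into a wait time in seconds.

    The header holds either a number of seconds ("30") or an HTTP date.
    `default` and `cap` are illustrative choices, not values from any standard.
    """
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        return min(float(value), cap)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable value: fall back to the default wait
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # Never return a negative wait, and never wait longer than the cap
    return max(0.0, min((when - now).total_seconds(), cap))
```

A caller could read resp.headers.get("Retry-After") before raising FetchError and sleep for the returned duration.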

Step 2: Parse HTML with Stable Selectors

Your parsing will break if you anchor on fragile structure. Prefer stable attributes like data-*, semantic class names, or repeated card containers.

Below is a parser that expects a “card” container and tries multiple selector fallbacks. Adapt selectors to your target site (and make sure you’re allowed to scrape it).

    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def parse_items(html: str, base_url: str) -> list[dict]:
        soup = BeautifulSoup(html, "lxml")

        # Example: cards could be <div class="product-card">...</div>
        cards = soup.select(".product-card, [data-testid='product-card'], .card.product")
        items = []

        for card in cards:
            # Title fallbacks
            title_el = card.select_one(".title, .product-title, [data-testid='title'], h2, h3")
            title = title_el.get_text(strip=True) if title_el else None

            # Price fallbacks
            price_el = card.select_one(".price, .product-price, [data-testid='price']")
            price_text = price_el.get_text(" ", strip=True) if price_el else None

            # Link fallbacks
            link_el = card.select_one("a[href]")
            href = link_el["href"] if link_el and link_el.has_attr("href") else None
            link = urljoin(base_url, href) if href else None

            if not title:
                # Skip incomplete cards; you can also log these for debugging.
                continue

            items.append({
                "title": title,
                "price_text": price_text,
                "link": link,
            })

        return items

Tip: During development, save HTML snapshots when parsing fails. It’s the fastest way to diagnose selector drift.
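
A minimal sketch of that snapshot habit, assuming a local debug directory and a file name derived from a hash of the URL. The function name save_snapshot and the naming scheme are illustrative choices, not part of the scraper above.

```python
import hashlib
from pathlib import Path


def save_snapshot(html: str, url: str, out_dir: str = "debug") -> Path:
    """Write the raw HTML to out_dir and return the file path.

    The file name comes from a short hash of the URL, so saving the same
    URL again overwrites the previous snapshot instead of piling up files.
    """
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    path = directory / f"page-{digest}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

Call it when parse_items returns an empty list, then open the saved file in a browser or editor to compare the real markup with your selectors.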

Step 3: Normalize Data (Prices, URLs, and Text)

Raw scraped fields are inconsistent. Normalize early so downstream code stays simple. Here’s a basic price parser that pulls the first decimal-ish number out of a string.

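    import re
    from decimal import Decimal, InvalidOperation

    # Try the grouped form first ("1,234.56", "1 234"). It requires at least one
    # thousands group; otherwise "1234.56" would be cut short at "123".
    # Assumes a dot as the decimal separator.
    _price_re = re.compile(r"(\d{1,3}(?:[,\s]\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?)")

    def normalize_price(price_text: str | None) -> Decimal | None:
        if not price_text:
            return None

        m = _price_re.search(price_text)
        if not m:
            return None

        # Drop thousands separators: commas and any whitespace, including non-breaking spaces
        raw = re.sub(r"[,\s]", "", m.group(1))
        try:
            return Decimal(raw)
        except InvalidOperation:
            return None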

Now enrich items with a price field:

    def enrich(items: list[dict]) -> list[dict]:
        out = []
        for it in items:
            out.append({
                **it,
                "price": normalize_price(it.get("price_text")),
            })
        return out
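
The same idea extends to the other two fields named in this step's title. Below is a hedged sketch for text and URLs: collapse whitespace in titles, and strip fragments and common tracking parameters from links so the same product does not appear under several URLs. The function names and the TRACKING_PARAMS set are assumptions; adjust the set to whatever your target site appends.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters that do not change the page content
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}


def normalize_text(text):
    """Collapse runs of whitespace and trim; return None for empty input."""
    if text is None:
        return None
    cleaned = re.sub(r"\s+", " ", text).strip()
    return cleaned or None


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and tracking parameters.

    The path is left untouched because paths can be case-sensitive.
    """
    if not url:
        return None
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(query), "")
    )
```

Both helpers could be called inside enrich alongside normalize_price, before the fingerprint is computed in the storage step.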

Step 4: Store Results in SQLite (With Deduping)

Saving to a DB gives you: deduping, resumability, and easy exports. We’ll dedupe by link if present, otherwise by title.

    import sqlite3
    from datetime import datetime, timezone

    def init_db(conn: sqlite3.Connection):
        conn.execute("""
            CREATE TABLE IF NOT EXISTS items (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                title TEXT NOT NULL,
                price REAL,
                price_text TEXT,
                link TEXT,
                fingerprint TEXT NOT NULL UNIQUE,
                scraped_at TEXT NOT NULL
            );
        """)
        conn.commit()

    def fingerprint(item: dict) -> str:
        # Prefer stable link-based identity
        return (item.get("link") or item["title"]).strip().lower()

    def upsert_items(conn: sqlite3.Connection, items: list[dict]):
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
        now = datetime.now(timezone.utc).isoformat()
        rows = []
        for it in items:
            rows.append((
                it["title"],
                # Stored as REAL, so the Decimal is converted to float here
                float(it["price"]) if it.get("price") is not None else None,
                it.get("price_text"),
                it.get("link"),
                fingerprint(it),
                now,
            ))

        # The UNIQUE constraint on fingerprint makes duplicates a silent no-op
        conn.executemany("""
            INSERT OR IGNORE INTO items (title, price, price_text, link, fingerprint, scraped_at)
            VALUES (?, ?, ?, ?, ?, ?);
        """, rows)
        conn.commit()

Why INSERT OR IGNORE? It’s a simple way to avoid duplicates without writing complex merge logic. For more advanced use cases, add an UPDATE path when data changes.
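
One possible shape for that UPDATE path is SQLite's ON CONFLICT ... DO UPDATE clause, which needs SQLite 3.24 or newer. The sketch below repeats the items schema so it runs on its own; the function name upsert_with_update is hypothetical, and which columns to overwrite is a design choice you should make per project.

```python
import sqlite3

# Same table shape as the items table in this step
SCHEMA = """
CREATE TABLE IF NOT EXISTS items (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    price REAL,
    price_text TEXT,
    link TEXT,
    fingerprint TEXT NOT NULL UNIQUE,
    scraped_at TEXT NOT NULL
);
"""

UPSERT_SQL = """
INSERT INTO items (title, price, price_text, link, fingerprint, scraped_at)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(fingerprint) DO UPDATE SET
    title = excluded.title,
    price = excluded.price,
    price_text = excluded.price_text,
    scraped_at = excluded.scraped_at;
"""


def upsert_with_update(conn, rows):
    """Insert new rows; for an existing fingerprint, overwrite the mutable columns."""
    conn.executemany(UPSERT_SQL, rows)
    conn.commit()
```

With this variant, a second crawl that sees a changed price replaces the stored price and timestamp instead of being ignored. If you need price history, write changes to a separate table rather than overwriting.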

Step 5: Crawl Multiple Pages (Pagination)

Pagination is where scrapers become “real.” Here’s a flexible approach: you provide a function that finds the “next page” URL, or you generate URLs for known page patterns.

Option A: follow “next” links:

    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def find_next_page(html: str, base_url: str) -> str | None:
        soup = BeautifulSoup(html, "lxml")
        next_el = soup.select_one("a[rel='next'], a.next, [data-testid='next']")
        if not next_el or not next_el.has_attr("href"):
            return None
        # Resolve relative hrefs against the page they were found on
        return urljoin(base_url, next_el["href"])

Option B: generate URLs by page number:

    def page_urls(template: str, start: int, end: int) -> list[str]:
        # template example: "https://example.com/products?page={page}"
        return [template.format(page=i) for i in range(start, end + 1)]

Putting It Together: A Minimal End-to-End Scraper

This script crawls starting from one URL, follows “next” links up to a limit, parses items, normalizes them, and saves them.

    import sqlite3
    from contextlib import closing

    import requests

    def run(start_url: str, db_path: str = "scrape.db", max_pages: int = 5):
        # sqlite3's own context manager only commits or rolls back; closing() also closes the connection
        with requests.Session() as session, closing(sqlite3.connect(db_path)) as conn:
            init_db(conn)

            url = start_url
            pages = 0

            while url and pages < max_pages:
                print(f"Fetching: {url}")
                html = fetch_html(url, session=session)

                items = parse_items(html, base_url=url)
                items = enrich(items)
                upsert_items(conn, items)
                print(f"  saved {len(items)} items (page {pages + 1})")

                url = find_next_page(html, base_url=url)
                pages += 1
                if url:
                    # Pause only when another request will follow
                    polite_sleep()

            total = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
            print(f"Done. Total stored items: {total}")

    if __name__ == "__main__":
        # Replace with a real URL you are allowed to scrape.
        run("https://example.com/products")
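
The loop in run trusts the site's "next" links. If a site's last page links back to an earlier one, the crawl will keep repeating pages until max_pages stops it. A hedged sketch of the same loop with a visited set, written as a generator that takes the fetch and next-page functions as parameters so it can be exercised without network access. The name walk_pages and its signature are assumptions for illustration.

```python
from typing import Callable, Iterator, Optional


def walk_pages(
    start_url: str,
    fetch: Callable[[str], str],
    find_next: Callable[[str, str], Optional[str]],
    max_pages: int = 5,
) -> Iterator[tuple[str, str]]:
    """Yield (url, html) pairs, stopping on a repeated URL or after max_pages."""
    seen = set()
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        # find_next receives the HTML and the URL it came from
        url = find_next(html, url)


# Usage with an in-memory fake site whose second page links back to the first
fake_site = {"p1": ("<html>A</html>", "p2"), "p2": ("<html>B</html>", "p1")}
for page_url, page_html in walk_pages(
    "p1",
    fetch=lambda u: fake_site[u][0],
    find_next=lambda html, u: fake_site[u][1],
):
    print(page_url, len(page_html))
```

In the real scraper you would pass a small wrapper around fetch_html (binding the session) as fetch, and find_next_page as find_next.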

Debugging Tricks That Save Hours

  • Print the first card HTML: when selectors fail, inspect cards[0].prettify() to see the real structure.
  • Persist raw HTML: save responses to debug/page-001.html when parsing yields zero items.
  • Log status codes and timing: a sudden spike of 429 responses often means you need slower pacing or fewer concurrent requests.
  • Detect “bot walls”: if HTML length becomes tiny or contains “enable JavaScript,” you may be hitting a challenge page (don’t try to bypass protections—use permitted APIs or ask for access).
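
The last check in the list can be automated as a rough heuristic. The sketch below flags a response as a likely challenge page when it is very short or contains a telltale phrase. The function name, the phrase list beyond "enable JavaScript", and the 2000-character threshold are all assumptions; tune them against real pages from your target site.

```python
# "enable javascript" comes from the tip above; the second phrase is an illustrative guess
BOT_WALL_HINTS = ("enable javascript", "verify you are human")


def looks_like_bot_wall(html: str, min_length: int = 2000) -> bool:
    """Heuristic only: True when the page is tiny or contains a challenge phrase."""
    lowered = html.lower()
    if any(hint in lowered for hint in BOT_WALL_HINTS):
        return True
    return len(html) < min_length
```

When it returns True, the sensible reaction is to stop the crawl and save the HTML for inspection, not to retry harder.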

Scraping Etiquette and Safety

Scraping is not just a technical problem. Keep it responsible:

  • Check the site’s terms and robots guidance, and prefer official APIs when available.
  • Use a clear User-Agent, and keep request rates low.
  • Cache responses during development to avoid hammering servers while debugging.
  • Never scrape personal data you don’t have the right to collect or store.
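
The robots check in the first point can be done with the standard library's urllib.robotparser. The sketch below parses robots.txt content that is already in hand, so it runs offline; the sample rules and the "MiniScraper" agent name are made up for illustration. Against a real site you would instead call set_url with the site's robots.txt address and then read.

```python
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content for demonstration
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = build_robots(SAMPLE_ROBOTS)
print(robots.can_fetch("MiniScraper", "https://example.com/products"))
print(robots.can_fetch("MiniScraper", "https://example.com/private/report"))
print(robots.crawl_delay("MiniScraper"))
```

If crawl_delay returns a number, it is a reasonable lower bound for the pause between requests. robots.txt is guidance, not permission, so the site's terms still apply.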

Next Steps: Make It Production-Ready

If you want to level this up:

  • Add a visited_urls table and resume capability (store the next URL and last run time).
  • Export to CSV/JSON with a tiny CLI flag (argparse).
  • Run it on a schedule (cron/GitHub Actions) with notifications if item counts drop.
  • Wrap parsing logic per site into “adapters” so you can support multiple sources cleanly.
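
As a starting point for the resume idea in the first item, here is a minimal sketch that stores only the next URL to fetch in a single-row table. It is simpler than a full visited_urls table, and the table name crawl_state and all three function names are hypothetical.

```python
import sqlite3


def init_state(conn: sqlite3.Connection) -> None:
    # CHECK (id = 1) keeps the table to a single row
    conn.execute("""
        CREATE TABLE IF NOT EXISTS crawl_state (
            id INTEGER PRIMARY KEY CHECK (id = 1),
            next_url TEXT,
            updated_at TEXT NOT NULL
        );
    """)
    conn.commit()


def save_next_url(conn: sqlite3.Connection, next_url) -> None:
    """Record where the crawl should continue; pass None when it finished."""
    conn.execute(
        "INSERT OR REPLACE INTO crawl_state (id, next_url, updated_at) "
        "VALUES (1, ?, datetime('now'));",
        (next_url,),
    )
    conn.commit()


def load_next_url(conn: sqlite3.Connection, default: str) -> str:
    """Return the saved URL, or `default` when nothing is saved or the crawl finished."""
    row = conn.execute("SELECT next_url FROM crawl_state WHERE id = 1").fetchone()
    return row[0] if row and row[0] else default
```

In run, you would call load_next_url(conn, start_url) before the loop and save_next_url(conn, url) after each page, so an interrupted crawl picks up where it stopped.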

With a reliable fetch layer, resilient selectors, normalization, and a storage pipeline, you’ll spend less time babysitting scrapes and more time using the data. Swap in your target site’s selectors and pagination rules, and you’ve got a practical scraper you can iterate on.

