Python Web Scraping in Practice: Resilient Crawls with Requests, BeautifulSoup, and a SQLite Pipeline
Web scraping is easy when a page is simple—and frustrating when you hit rate limits, flaky HTML, and messy data. This hands-on guide shows a practical scraping workflow that junior/mid developers can ship: fetch pages reliably, parse consistently, avoid common traps, and store results in a database so you can resume and iterate.
What you’ll build: a small scraper that collects product-like cards (title, price, link) from one or more pages, normalizes the data, and saves it to SQLite with deduping and safe retries.
Project Setup
Create a virtual environment and install dependencies:
python -m venv .venv
# Windows: .venv\Scripts\activate
source .venv/bin/activate
pip install requests beautifulsoup4 lxml tenacity
We’ll use:
- requests for HTTP
- BeautifulSoup + lxml for parsing
- tenacity for retries with backoff
- sqlite3 (built-in) for storage
Step 1: Fetch Pages Like You Mean It
Most scraping failures are “network-ish”: timeouts, 429s, transient 5xx errors. You want a single function that handles headers, timeouts, retries, and respectful pacing.
import random
import time

import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MiniScraper/1.0; +https://example.com/bot)",
    "Accept-Language": "en-US,en;q=0.9",
}


class FetchError(Exception):
    """Raised for HTTP statuses that are worth retrying."""


@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=20),
    # Retry only transient failures. Retrying every requests.RequestException
    # would also retry permanent errors such as a 404 from raise_for_status().
    retry=retry_if_exception_type((requests.ConnectionError, requests.Timeout, FetchError)),
)
def fetch_html(url: str, session: requests.Session) -> str:
    # timeout=(connect, read) in seconds
    resp = session.get(url, headers=DEFAULT_HEADERS, timeout=(5, 20))
    # Handle common "try again later" scenarios
    if resp.status_code in (429, 500, 502, 503, 504):
        raise FetchError(f"Retryable status: {resp.status_code}")
    # Any other 4xx/5xx fails immediately without a retry
    resp.raise_for_status()
    return resp.text


def polite_sleep(min_s=0.7, max_s=1.6):
    # Random pause between requests so pacing is not perfectly regular
    time.sleep(random.uniform(min_s, max_s))
Why this matters: retries should be centralized so every caller gets consistent behavior. Also, timeout=(connect, read) prevents hanging forever.
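To get a feel for what capped exponential backoff does to wait times, here is a minimal standard-library sketch. The helper name backoff_schedule and its parameters are hypothetical and only mirror the arguments passed to wait_exponential above; tenacity's exact internal formula may differ between versions, so treat the numbers as an illustration rather than a specification.

```python
def backoff_schedule(attempts: int, multiplier: float = 1.0,
                     min_s: float = 1.0, max_s: float = 20.0) -> list[float]:
    """Return illustrative wait times (seconds) between consecutive attempts.

    Hypothetical helper: doubles the wait each time and clamps it to
    [min_s, max_s]. It is not tenacity's implementation.
    """
    waits = []
    # With N attempts there are N - 1 pauses between them.
    for n in range(attempts - 1):
        raw = multiplier * (2 ** n)
        waits.append(max(min_s, min(raw, max_s)))
    return waits


if __name__ == "__main__":
    print(backoff_schedule(5))  # [1.0, 2.0, 4.0, 8.0]
    print(backoff_schedule(8))  # the last waits are capped at 20.0
```

Under these assumptions, five attempts spend roughly 15 seconds waiting in the worst case, on top of the per-request timeouts.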
Step 2: Parse HTML with Stable Selectors
Your parsing will break if you anchor on fragile structure. Prefer stable attributes like data-*, semantic class names, or repeated card containers.
Below is a parser that expects a “card” container and tries multiple selector fallbacks. Adapt selectors to your target site (and make sure you’re allowed to scrape it).
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse_items(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    # Example: cards could be <div class="product-card">...</div>
    cards = soup.select(".product-card, [data-testid='product-card'], .card.product")
    items = []
    for card in cards:
        # Title fallbacks
        title_el = card.select_one(".title, .product-title, [data-testid='title'], h2, h3")
        title = title_el.get_text(strip=True) if title_el else None
        # Price fallbacks
        price_el = card.select_one(".price, .product-price, [data-testid='price']")
        price_text = price_el.get_text(" ", strip=True) if price_el else None
        # Link fallbacks
        link_el = card.select_one("a[href]")
        href = link_el["href"] if link_el and link_el.has_attr("href") else None
        link = urljoin(base_url, href) if href else None
        if not title:
            # Skip incomplete cards; you can also log these for debugging.
            continue
        items.append({
            "title": title,
            "price_text": price_text,
            "link": link,
        })
    return items
Tip: During development, save HTML snapshots when parsing fails. It’s the fastest way to diagnose selector drift.
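The snapshot tip can be sketched with nothing but the standard library. The function name save_snapshot, the debug directory, and the page-NNN.html naming are assumptions chosen for illustration.

```python
from pathlib import Path


def save_snapshot(html: str, page_no: int, directory: str = "debug") -> Path:
    """Write raw HTML to <directory>/page-NNN.html and return the path.

    Hypothetical helper; the directory and file naming are illustrative.
    """
    out_dir = Path(directory)
    # Create the directory on first use; do nothing if it already exists.
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"page-{page_no:03d}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

A natural place to call it is right after parsing, for example when parse_items returns an empty list for a page that should have had cards.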
Step 3: Normalize Data (Prices, URLs, and Text)
Raw scraped fields are inconsistent. Normalize early so downstream code stays simple. Here’s a basic price parser that pulls the first decimal-ish number out of a string.
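import re
from decimal import Decimal, InvalidOperation

# Thousands-grouped numbers (1,299.00 or 1 299) are tried first; plain numbers
# (1299.00) are the fallback. The grouped form requires at least one separator,
# otherwise "1234.56" would be cut short at "123".
_price_re = re.compile(r"(\d{1,3}(?:[,\s]\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?)")


def normalize_price(price_text: str | None) -> Decimal | None:
    if not price_text:
        return None
    m = _price_re.search(price_text)
    if not m:
        return None
    # Strip thousands separators, including non-breaking spaces matched by \s
    raw = re.sub(r"[,\s]", "", m.group(1))
    try:
        return Decimal(raw)
    except InvalidOperation:
        return None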
Now enrich items with a price field:
def enrich(items: list[dict]) -> list[dict]:
    out = []
    for it in items:
        out.append({
            **it,
            "price": normalize_price(it.get("price_text")),
        })
    return out
Step 4: Store Results in SQLite (With Deduping)
Saving to a DB gives you: deduping, resumability, and easy exports. We’ll dedupe by link if present, otherwise by title.
import sqlite3
from datetime import datetime, timezone


def init_db(conn: sqlite3.Connection):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS items (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            price REAL,
            price_text TEXT,
            link TEXT,
            fingerprint TEXT NOT NULL UNIQUE,
            scraped_at TEXT NOT NULL
        );
    """)
    conn.commit()


def fingerprint(item: dict) -> str:
    # Prefer stable link-based identity
    return (item.get("link") or item["title"]).strip().lower()


def upsert_items(conn: sqlite3.Connection, items: list[dict]):
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
    now = datetime.now(timezone.utc).isoformat()
    rows = []
    for it in items:
        rows.append((
            it["title"],
            float(it["price"]) if it.get("price") is not None else None,
            it.get("price_text"),
            it.get("link"),
            fingerprint(it),
            now,
        ))
    # Rows whose fingerprint already exists are silently skipped
    conn.executemany("""
        INSERT OR IGNORE INTO items (title, price, price_text, link, fingerprint, scraped_at)
        VALUES (?, ?, ?, ?, ?, ?);
    """, rows)
    conn.commit()
Why INSERT OR IGNORE? It’s a simple way to avoid duplicates without writing complex merge logic. For more advanced use cases, add an UPDATE path when data changes.
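One way to sketch the UPDATE path is SQLite's ON CONFLICT ... DO UPDATE clause, available from SQLite 3.24. The example below runs against an in-memory database with the same table layout; the demo function and the sample rows are hypothetical and exist only to show the behavior.

```python
import sqlite3

# On a fingerprint collision, overwrite the mutable columns with the new values.
UPSERT_SQL = """
INSERT INTO items (title, price, price_text, link, fingerprint, scraped_at)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(fingerprint) DO UPDATE SET
    title = excluded.title,
    price = excluded.price,
    price_text = excluded.price_text,
    scraped_at = excluded.scraped_at;
"""


def demo() -> tuple:
    """Insert the same item twice with a changed price; return (row count, price)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE items (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            price REAL,
            price_text TEXT,
            link TEXT,
            fingerprint TEXT NOT NULL UNIQUE,
            scraped_at TEXT NOT NULL
        )
    """)
    link = "https://example.com/w"  # made-up sample data
    conn.execute(UPSERT_SQL, ("Widget", 9.99, "$9.99", link, link, "2024-01-01T00:00:00"))
    conn.execute(UPSERT_SQL, ("Widget", 7.49, "$7.49", link, link, "2024-01-02T00:00:00"))
    count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    price = conn.execute("SELECT price FROM items").fetchone()[0]
    conn.close()
    return count, price


if __name__ == "__main__":
    print(demo())  # (1, 7.49): one row, holding the latest price
```

The trade-off is that the table then keeps only the latest observation per item. If price history matters, a separate table with one row per observation is probably a better fit.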
Step 5: Crawl Multiple Pages (Pagination)
Pagination is where scrapers become “real.” Here’s a flexible approach: you provide a function that finds the “next page” URL, or you generate URLs for known page patterns.
Option A: follow “next” links:
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_next_page(html: str, base_url: str) -> str | None:
    soup = BeautifulSoup(html, "lxml")
    next_el = soup.select_one("a[rel='next'], a.next, [data-testid='next']")
    if not next_el or not next_el.has_attr("href"):
        return None
    # Resolve relative hrefs against the page they came from
    return urljoin(base_url, next_el["href"])
Option B: generate URLs by page number:
def page_urls(template: str, start: int, end: int) -> list[str]:
    # template example: "https://example.com/products?page={page}"
    return [template.format(page=i) for i in range(start, end + 1)]
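When following "next" links, it can help to guard against pages that link back to each other. The sketch below is one possible shape for that guard; walk_pages and its next_of callback are hypothetical names, and in a real crawl next_of would fetch the page and call something like find_next_page.

```python
from typing import Callable, Iterator, Optional


def walk_pages(start_url: str,
               next_of: Callable[[str], Optional[str]],
               max_pages: int = 5) -> Iterator[str]:
    """Yield URLs by following next_of, stopping on a cycle or at max_pages."""
    seen: set[str] = set()
    url: Optional[str] = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        yield url
        url = next_of(url)


if __name__ == "__main__":
    # A made-up link graph where page "c" points back to "a".
    links = {"a": "b", "b": "c", "c": "a"}
    print(list(walk_pages("a", links.get, max_pages=10)))  # ['a', 'b', 'c']
```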
Putting It Together: A Minimal End-to-End Scraper
This script crawls starting from one URL, follows “next” links up to a limit, parses items, normalizes them, and saves them.
import sqlite3
from contextlib import closing

import requests

# Assumes fetch_html, polite_sleep, parse_items, enrich, init_db, upsert_items,
# and find_next_page from the steps above live in the same module.


def run(start_url: str, db_path: str = "scrape.db", max_pages: int = 5):
    # closing() is needed because sqlite3's own context manager only manages
    # transactions; it does not close the connection.
    with requests.Session() as session, closing(sqlite3.connect(db_path)) as conn:
        init_db(conn)
        url = start_url
        pages = 0
        while url and pages < max_pages:
            print(f"Fetching: {url}")
            html = fetch_html(url, session=session)
            items = parse_items(html, base_url=url)
            items = enrich(items)
            upsert_items(conn, items)
            print(f"  saved {len(items)} items (page {pages + 1})")
            url = find_next_page(html, base_url=url)
            pages += 1
            polite_sleep()
        total = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
        print(f"Done. Total stored items: {total}")


if __name__ == "__main__":
    # Replace with a real URL you are allowed to scrape.
    run("https://example.com/products")
Debugging Tricks That Save Hours
- Print the first card HTML: when selectors fail, inspect cards[0].prettify() to see the real structure.
- Persist raw HTML: save responses to debug/page-001.html when parsing yields zero items.
- Log status codes and timing: a sudden spike of 429 often means you need slower pacing or fewer concurrent requests.
- Detect “bot walls”: if HTML length becomes tiny or contains “enable JavaScript,” you may be hitting a challenge page (don’t try to bypass protections; use permitted APIs or ask for access).
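The bot-wall check from the last bullet could look roughly like this. It is a heuristic sketch: the function name, the 2000-character threshold, and every marker other than "enable javascript" are assumptions, and a legitimate page with a noscript notice will trigger a false positive.

```python
# Phrases that often appear on challenge pages (illustrative, not exhaustive).
CHALLENGE_MARKERS = ("enable javascript", "captcha", "verify you are human")


def looks_like_bot_wall(html: str, min_length: int = 2000) -> bool:
    """Return True if the response is suspiciously short or mentions a challenge."""
    lowered = html.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return True
    return len(html) < min_length
```

When it returns True, the safest reaction is to stop the crawl and save the response for inspection rather than retry.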
Scraping Etiquette and Safety
Scraping is not just a technical problem. Keep it responsible:
- Check the site’s terms and robots guidance, and prefer official APIs when available.
- Use a clear User-Agent, and keep request rates low.
- Cache responses during development to avoid hammering servers while debugging.
- Never scrape personal data you don’t have the right to collect or store.
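Checking robots guidance can be automated with the standard library's urllib.robotparser. The sketch below parses a made-up robots.txt from a string so it runs without network access; build_robots and SAMPLE_ROBOTS are hypothetical names, and against a real site you would point the parser at the site's robots.txt with set_url() and read() instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robots(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rp = build_robots(SAMPLE_ROBOTS)
    print(rp.can_fetch("MiniScraper", "https://example.com/products"))   # True
    print(rp.can_fetch("MiniScraper", "https://example.com/private/x"))  # False
    print(rp.crawl_delay("MiniScraper"))                                 # 2
```

A Crawl-delay value, when present, is a reasonable lower bound for the pause between requests. Note that robots.txt does not replace reading the site's terms.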
Next Steps: Make It Production-Ready
If you want to level this up:
- Add a visited_urls table and resume capability (store the next URL and last run time).
- Export to CSV/JSON with a tiny CLI flag (argparse).
- Run it on a schedule (cron/GitHub Actions) with notifications if item counts drop.
- Wrap parsing logic per site into “adapters” so you can support multiple sources cleanly.
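As a starting point for the resume idea in the first bullet, here is a minimal sketch of a visited_urls table. The column layout and the helper names init_state, mark_visited, and resume_url are assumptions, not part of the scraper above.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional


def init_state(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS visited_urls (
            url TEXT PRIMARY KEY,
            next_url TEXT,
            visited_at TEXT NOT NULL
        )
    """)
    conn.commit()


def mark_visited(conn: sqlite3.Connection, url: str, next_url: Optional[str]) -> None:
    """Record that url was processed and which URL it pointed to next."""
    conn.execute(
        "INSERT OR REPLACE INTO visited_urls (url, next_url, visited_at) VALUES (?, ?, ?)",
        (url, next_url, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def resume_url(conn: sqlite3.Connection, start_url: str) -> Optional[str]:
    """Return start_url on a fresh database, else the next URL stored most recently.

    None means the previous crawl reached the last page.
    """
    row = conn.execute(
        "SELECT next_url FROM visited_urls ORDER BY rowid DESC LIMIT 1"
    ).fetchone()
    return start_url if row is None else row[0]
```

In the main loop, mark_visited would be called after upsert_items for each page, and run would start from resume_url(conn, start_url) instead of start_url.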
With a reliable fetch layer, resilient selectors, normalization, and a storage pipeline, you’ll spend less time babysitting scrapes and more time using the data. Swap in your target site’s selectors and pagination rules, and you’ve got a practical scraper you can iterate on.