Python Web Scraping in Practice: Build a Robust Scraper with Requests, BeautifulSoup, and SQLite
Web scraping sounds simple: fetch HTML, parse it, save data. In reality, pages change, requests fail, servers rate-limit you, and your “quick script” becomes fragile. This hands-on guide shows how to build a small but solid scraper in Python that can:
- Fetch pages reliably (timeouts, retries, backoff)
- Parse HTML with BeautifulSoup
- Handle pagination
- Throttle requests politely
- Store results in SQLite so you can resume runs
We’ll scrape a generic “quotes” style site (common in tutorials), but the patterns apply to product lists, blog archives, documentation pages, and more. Adjust selectors to your target.
Setup
Install dependencies:
```shell
python -m pip install requests beautifulsoup4
```
We’ll use the Python standard library for sqlite3, time, and urllib.parse.
Project Structure
One file is enough to start:
- scrape_quotes.py
- scraper.db (created automatically)
Step 1: A Reliable HTTP Client (Retries + Backoff + Headers)
Many scrapers fail because they assume the network is perfect. Let’s create a small fetch helper with sensible defaults:
```python
import time
import random

import requests

DEFAULT_HEADERS = {
    # Identify yourself. Some sites block empty/odd user agents.
    "User-Agent": "Mozilla/5.0 (compatible; DevBlogScraper/1.0; +https://example.com/bot)"
}

def fetch(url: str, session: requests.Session, *, timeout=15, retries=3, backoff=1.0) -> str:
    """
    Fetch a URL and return HTML text.
    Retries on transient errors with exponential backoff + jitter.
    """
    last_exc = None
    for attempt in range(1, retries + 1):
        try:
            resp = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
            # Treat 4xx as hard failures (usually), 5xx as retryable.
            if 500 <= resp.status_code < 600:
                raise requests.HTTPError(f"Server error {resp.status_code}")
            resp.raise_for_status()
            return resp.text
        except (requests.RequestException, requests.HTTPError) as exc:
            last_exc = exc
            if attempt == retries:
                break
            sleep_s = backoff * (2 ** (attempt - 1)) + random.uniform(0, 0.3)
            time.sleep(sleep_s)
    raise RuntimeError(f"Failed to fetch {url} after {retries} retries: {last_exc}")
```
Why this matters: timeouts prevent hanging; retries handle flaky connections; backoff reduces load and helps avoid bans.
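The exponential backoff schedule is easy to sanity-check in isolation. Here is a small, hypothetical helper (not part of the scraper itself) that computes the base delays fetch() would sleep between attempts, before jitter is added:

```python
def backoff_schedule(retries: int, backoff: float = 1.0) -> list[float]:
    """Base retry delays (before jitter): backoff * 2**(attempt-1) for each attempt."""
    return [backoff * (2 ** (attempt - 1)) for attempt in range(1, retries + 1)]

print(backoff_schedule(3))       # [1.0, 2.0, 4.0]
print(backoff_schedule(4, 0.5))  # [0.5, 1.0, 2.0, 4.0]
```

Doubling the delay on each attempt means a struggling server gets progressively more breathing room instead of a steady hammering.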
Step 2: Parse HTML with BeautifulSoup (Selectors You Can Maintain)
Parsing is where most “it worked yesterday” bugs happen. Prefer stable hooks (like IDs, semantic class names, or data attributes) over deep nested paths.
Here’s a parser for a page containing quote cards:
```python
from bs4 import BeautifulSoup

def parse_quotes(html: str):
    soup = BeautifulSoup(html, "html.parser")
    items = []
    # Example structure:
    # <div class="quote">
    #   <span class="text">...</span>
    #   <small class="author">...</small>
    #   <div class="tags"><a class="tag">...</a></div>
    # </div>
    for card in soup.select("div.quote"):
        text_el = card.select_one("span.text")
        author_el = card.select_one("small.author")
        tag_els = card.select("div.tags a.tag")

        text = text_el.get_text(strip=True) if text_el else ""
        author = author_el.get_text(strip=True) if author_el else ""
        tags = ",".join(t.get_text(strip=True) for t in tag_els)

        if text and author:
            items.append({"text": text, "author": author, "tags": tags})
    return items
```
If your target page uses different markup, change the select() / select_one() selectors and keep the rest of your pipeline the same.
Step 3: Pagination (Find “Next” Links Safely)
Hardcoding ?page=2 works until the site changes. A safer approach: parse the “Next” link from the page.
```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def find_next_page(html: str, base_url: str) -> str | None:
    soup = BeautifulSoup(html, "html.parser")
    next_el = soup.select_one("li.next a")
    if not next_el:
        return None
    href = next_el.get("href")
    if not href:
        return None
    return urljoin(base_url, href)
```
urljoin() handles relative links properly (e.g., /page/2/).
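To see why urljoin() matters, compare how it resolves a root-relative path versus an absolute URL (example URLs chosen to match this tutorial's target site):

```python
from urllib.parse import urljoin

# Root-relative path: resolved against the scheme + host of the base URL.
print(urljoin("https://quotes.toscrape.com/page/1/", "/page/2/"))
# -> https://quotes.toscrape.com/page/2/

# Absolute URL: returned unchanged, base is ignored.
print(urljoin("https://quotes.toscrape.com/page/1/", "https://example.com/other"))
# -> https://example.com/other
```

String concatenation gets both of these cases wrong, which is why hand-rolled URL joining is a common source of broken crawls.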
Step 4: Store Results in SQLite (So You Can Resume)
Saving to a database gives you:
- Deduplication (unique constraints)
- Resumability (track visited pages)
- Easy export later
```python
import sqlite3

def init_db(db_path="scraper.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            url TEXT PRIMARY KEY,
            scraped_at INTEGER NOT NULL
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS quotes (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            text TEXT NOT NULL,
            author TEXT NOT NULL,
            tags TEXT NOT NULL,
            UNIQUE(text, author)
        )
    """)
    conn.commit()
    return conn

def mark_page_done(conn, url: str):
    conn.execute(
        "INSERT OR REPLACE INTO pages(url, scraped_at) VALUES(?, strftime('%s','now'))",
        (url,),
    )
    conn.commit()

def page_already_done(conn, url: str) -> bool:
    row = conn.execute("SELECT 1 FROM pages WHERE url = ? LIMIT 1", (url,)).fetchone()
    return row is not None

def save_quotes(conn, items):
    conn.executemany(
        "INSERT OR IGNORE INTO quotes(text, author, tags) VALUES(?, ?, ?)",
        [(i["text"], i["author"], i["tags"]) for i in items],
    )
    conn.commit()
```
Notice UNIQUE(text, author) plus INSERT OR IGNORE. This prevents duplicates if you re-run the scraper.
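You can verify the dedup behavior in a few lines with an in-memory database: the same (text, author) pair is inserted twice, but only one row survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE quotes (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        text TEXT NOT NULL,
        author TEXT NOT NULL,
        tags TEXT NOT NULL,
        UNIQUE(text, author)
    )
""")

row = ("To be, or not to be.", "Shakespeare", "life")
# Second insert hits the UNIQUE(text, author) constraint and is silently skipped.
conn.execute("INSERT OR IGNORE INTO quotes(text, author, tags) VALUES(?, ?, ?)", row)
conn.execute("INSERT OR IGNORE INTO quotes(text, author, tags) VALUES(?, ?, ?)", row)

count = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
print(count)  # 1
```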
Step 5: Put It Together (A Polite Crawl Loop)
Now we’ll combine fetching, parsing, pagination, throttling, and persistence.
```python
import time
import random

import requests

def scrape(start_url: str, *, db_path="scraper.db", delay_range=(0.8, 1.8)):
    conn = init_db(db_path)
    session = requests.Session()

    url = start_url
    pages_scraped = 0
    quotes_saved_before = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]

    while url:
        if page_already_done(conn, url):
            # If you want to follow pagination even for done pages,
            # you can still fetch+parse next link here. For simplicity, we stop.
            print(f"[skip] Already scraped: {url}")
            break

        print(f"[fetch] {url}")
        html = fetch(url, session)

        items = parse_quotes(html)
        save_quotes(conn, items)
        mark_page_done(conn, url)
        pages_scraped += 1
        print(f"[ok] page={pages_scraped} items={len(items)} "
              f"total_quotes={conn.execute('SELECT COUNT(*) FROM quotes').fetchone()[0]}")

        url = find_next_page(html, base_url=url)

        # Throttle: small random delay to reduce request patterns.
        sleep_s = random.uniform(*delay_range)
        time.sleep(sleep_s)

    quotes_saved_after = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
    print(f"Done. Pages scraped: {pages_scraped}. "
          f"New quotes: {quotes_saved_after - quotes_saved_before}.")

if __name__ == "__main__":
    # Example target (replace with your own):
    scrape("https://quotes.toscrape.com/")
```
Tip: If you need to continue pagination even when the current page is already in the DB, remove the break and still compute url = find_next_page(...) (but be careful not to loop forever).
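The shape of that variant is easier to see without the network in the way. This sketch stands in a fake "site" (a dict mapping URL to items and a next link) for fetch/parse, skips saving for already-done pages but still follows pagination, and uses a visited set as the guard against looping forever:

```python
# Fake site: URL -> (items, next_url). Real code would call
# fetch() + parse_quotes() + find_next_page() instead.
PAGES = {
    "/page/1/": (["q1", "q2"], "/page/2/"),
    "/page/2/": (["q3"], "/page/3/"),
    "/page/3/": (["q4"], None),
}

def crawl(start, already_done):
    saved, visited = [], set()
    url = start
    while url and url not in visited:
        visited.add(url)              # guard: never revisit, even if "Next" cycles
        items, next_url = PAGES[url]  # stands in for fetch + parse
        if url not in already_done:
            saved.extend(items)       # stands in for save_quotes + mark_page_done
        url = next_url                # follow "Next" regardless of done status
    return saved

print(crawl("/page/1/", already_done={"/page/1/"}))  # ['q3', 'q4']
```

Page 1 is skipped for saving but its "Next" link is still followed, so pages 2 and 3 are scraped.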
Step 6: Export to CSV (Optional)
Once scraped, exporting is easy:
```python
import csv
import sqlite3

def export_csv(db_path="scraper.db", out_path="quotes.csv"):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT text, author, tags FROM quotes ORDER BY author, text"
    ).fetchall()

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(["text", "author", "tags"])
        w.writerows(rows)

    print(f"Exported {len(rows)} rows to {out_path}")
```
Common “Real World” Tweaks
- Respect robots.txt and terms: Always check whether scraping is allowed, and avoid hammering the site. Add longer delays for heavier pages.
- Handle 429 rate limits: If the server returns 429 Too Many Requests, increase your backoff and delay ranges. You can also read the Retry-After header.
- Use stable selectors: Prefer data-* attributes or semantic classes. If your selectors break often, consider asking the team that owns the site for an API instead.
- Incremental scraping: Store a "last seen" timestamp or page token in SQLite so you only fetch new content.
- Debugging HTML changes: Save the raw HTML of a failing page to disk so you can inspect it without re-fetching.
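The Retry-After tweak can be sketched as a small helper. This is a hypothetical function (not part of the scraper above), and it only handles the numeric-seconds form of the header; Retry-After may also be an HTTP-date, which this sketch falls back past. It also assumes a plain dict for simplicity, whereas requests' resp.headers is case-insensitive:

```python
def retry_after_seconds(headers: dict, default: float = 5.0) -> float:
    """Return a delay from a numeric Retry-After header, else a default.

    Hypothetical sketch: the HTTP-date form of Retry-After is not parsed
    and simply falls through to the default.
    """
    value = headers.get("Retry-After")
    if value is None:
        return default
    try:
        return max(float(value), 0.0)  # never sleep a negative duration
    except ValueError:
        return default  # e.g. an HTTP-date like "Wed, 21 Oct 2015 07:28:00 GMT"

print(retry_after_seconds({"Retry-After": "30"}))  # 30.0
print(retry_after_seconds({}))                     # 5.0
```

On a 429 response you would sleep for retry_after_seconds(resp.headers) before retrying, rather than using the normal backoff schedule.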
What You Now Have
This approach is intentionally simple but production-minded:
- Network resilience with timeouts, retries, and backoff
- Maintainable parsing via CSS selectors
- Pagination discovery via “Next” links
- Polite crawling with jittered delays
- Resumable storage with SQLite + uniqueness constraints
From here, the next “level up” is adding structured logging, better rate-limit handling (including Retry-After), and parallel fetching for sites that explicitly permit it. But for many junior/mid dev use cases—internal tools, data collection for QA, documentation audits—this is a reliable, practical baseline you can ship.