Python Web Scraping You Can Actually Maintain: Requests + BeautifulSoup + Retries + Storage

Web scraping is easy to demo and surprisingly easy to break in production. A small HTML change, a temporary 503, or getting rate-limited can turn a “quick script” into a flaky mess. This hands-on guide shows a practical pattern for junior/mid developers: fetch pages reliably, parse safely, paginate, and persist results in a way you can rerun without duplicating data.

We’ll scrape a public practice site (https://quotes.toscrape.com) so you can run the code immediately.

1) Install dependencies

You only need two libraries:

  • requests for HTTP
  • beautifulsoup4 for parsing HTML
```shell
pip install requests beautifulsoup4
```

2) A reliable HTTP client: session, timeouts, and headers

Use a requests.Session() to reuse connections (faster) and set consistent headers. Always set a timeout—otherwise your scraper can hang forever.

```python
import requests

BASE_URL = "https://quotes.toscrape.com"

def make_session() -> requests.Session:
    session = requests.Session()
    session.headers.update({
        # Some sites block "default" clients; a realistic UA helps.
        "User-Agent": "Mozilla/5.0 (compatible; PracticalScraper/1.0; +https://example.com/bot)",
        "Accept-Language": "en-US,en;q=0.9",
    })
    return session
```

Tip: Don’t pretend to be a browser to bypass protections you’re not allowed to bypass. Use a sensible User-Agent and follow the site’s policies.

3) Fetching pages with retries (the part most scripts forget)

Networks fail. Servers hiccup. Your script shouldn’t crash on the first 502. A simple retry loop with exponential backoff will make your scraper dramatically more stable.

```python
import time
from typing import Optional

def fetch_html(
    session: requests.Session,
    url: str,
    *,
    timeout: int = 15,
    max_retries: int = 4,
) -> str:
    last_error: Optional[Exception] = None
    for attempt in range(1, max_retries + 1):
        try:
            resp = session.get(url, timeout=timeout)
            # Retry on common transient statuses
            if resp.status_code in (429, 500, 502, 503, 504):
                raise requests.HTTPError(f"Transient HTTP {resp.status_code} for {url}")
            resp.raise_for_status()
            return resp.text
        # HTTPError is a subclass of RequestException, so one except clause covers both.
        except requests.RequestException as e:
            last_error = e
            if attempt < max_retries:
                # Exponential backoff: 1s, 2s, 4s... (no pointless sleep after the last attempt)
                time.sleep(2 ** (attempt - 1))
    raise RuntimeError(f"Failed to fetch {url} after {max_retries} attempts: {last_error}")
```

If you’re scraping many pages, add a small delay between requests (even when successful) to reduce load and avoid rate limits.
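One simple way to do that is a small helper that adds random jitter to a base delay, so requests don't land on a perfectly regular rhythm (the helper name and default values below are my own choices, not from any library):

```python
import random
import time

def polite_sleep(base: float = 0.5, jitter: float = 0.5) -> None:
    """Sleep for `base` seconds plus up to `jitter` seconds of random noise.

    Randomizing the delay spreads load and avoids the perfectly regular
    request cadence that some rate limiters flag as bot traffic.
    """
    time.sleep(base + random.uniform(0, jitter))
```

Call `polite_sleep()` after each successful fetch in your crawl loop.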

4) Parse HTML defensively with BeautifulSoup

HTML changes. Write parsing code that:

  • Targets stable containers (IDs/classes) when possible
  • Handles missing fields without crashing
  • Normalizes whitespace

On the quotes site, each quote is inside a div.quote. We’ll extract the quote text, author, and tags.

```python
from bs4 import BeautifulSoup
from dataclasses import dataclass

@dataclass(frozen=True)
class Quote:
    text: str
    author: str
    tags: tuple[str, ...]

def parse_quotes(html: str) -> tuple[list[Quote], str | None]:
    soup = BeautifulSoup(html, "html.parser")
    quotes: list[Quote] = []
    for q in soup.select("div.quote"):
        text_el = q.select_one("span.text")
        author_el = q.select_one("small.author")
        tag_els = q.select("div.tags a.tag")
        # Defensive parsing: skip items that don't match the expected structure
        if not text_el or not author_el:
            continue
        text = text_el.get_text(strip=True)
        author = author_el.get_text(strip=True)
        tags = tuple(t.get_text(strip=True) for t in tag_els)
        quotes.append(Quote(text=text, author=author, tags=tags))
    # Pagination: the "Next" button lives inside <li class="next">
    next_link = soup.select_one("li.next a")
    next_href = next_link["href"] if next_link and next_link.has_attr("href") else None
    return quotes, next_href
```

5) Pagination loop: crawl pages until there’s no “Next”

Now we connect fetching + parsing into a crawl loop. We’ll build absolute URLs and keep going until the next link disappears.

```python
import time
from urllib.parse import urljoin

def crawl_all_quotes(session: requests.Session) -> list[Quote]:
    url: str | None = BASE_URL
    all_quotes: list[Quote] = []
    while url:
        html = fetch_html(session, url)
        quotes, next_href = parse_quotes(html)
        all_quotes.extend(quotes)
        # Respectful pacing (adjust as needed)
        time.sleep(0.5)
        url = urljoin(BASE_URL, next_href) if next_href else None
    return all_quotes
```

This pattern is reusable: fetch → parse → follow next.
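That shape can be captured as a tiny generator that is generic over any fetch/parse pair. This is a sketch; the `crawl` function and its parameters are illustrative, not part of the article's code:

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")

def crawl(
    start_url: str,
    fetch: Callable[[str], str],
    parse: Callable[[str], Tuple[Iterable[T], Optional[str]]],
) -> Iterator[T]:
    """Yield parsed items page by page until parse() returns no next URL."""
    url: Optional[str] = start_url
    while url:
        items, url = parse(fetch(url))
        yield from items
```

The quotes crawler would plug in as roughly `crawl(BASE_URL, lambda u: fetch_html(session, u), parse_quotes)`, with the caveat that `parse_quotes` returns a relative href, so the parse callable would also need to apply `urljoin` before returning the next URL.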

6) Persist results safely: SQLite with “upsert” behavior

For real projects, saving to a CSV is fine, but it’s easy to create duplicates when rerunning. SQLite is a great middle ground: no server, easy dedupe, and queryable.

We’ll store quotes with a unique constraint on (text, author) so reruns don’t duplicate rows.

```python
import sqlite3

def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS quotes (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            text TEXT NOT NULL,
            author TEXT NOT NULL,
            tags TEXT NOT NULL,
            UNIQUE(text, author)
        )
    """)
    conn.commit()

def save_quotes(conn: sqlite3.Connection, quotes: list[Quote]) -> int:
    inserted = 0
    for q in quotes:
        # Store tags as a comma-separated string (simple and effective)
        tags_str = ",".join(q.tags)
        try:
            conn.execute(
                "INSERT INTO quotes (text, author, tags) VALUES (?, ?, ?)",
                (q.text, q.author, tags_str),
            )
            inserted += 1
        except sqlite3.IntegrityError:
            # Duplicate blocked by the UNIQUE constraint - ignore
            pass
    conn.commit()
    return inserted
```
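Note that this is strictly insert-or-ignore: a rerun never changes an existing row. If you later want reruns to refresh a row instead (say, the site's tags changed), SQLite 3.24+ supports a true upsert via `ON CONFLICT ... DO UPDATE`. A minimal sketch against the schema above (the `upsert_quote` helper is my own naming):

```python
import sqlite3

def upsert_quote(conn: sqlite3.Connection, text: str, author: str, tags: str) -> None:
    """Insert a quote, or update its tags if (text, author) already exists."""
    conn.execute(
        """
        INSERT INTO quotes (text, author, tags) VALUES (?, ?, ?)
        ON CONFLICT(text, author) DO UPDATE SET tags = excluded.tags
        """,
        (text, author, tags),
    )
```

The `excluded` pseudo-table refers to the row that would have been inserted, so the update pulls the new tag value rather than keeping the old one.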

7) A complete runnable script

Put everything together into a single file (for example, scrape_quotes.py) and run it.

```python
import sqlite3
import time
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://quotes.toscrape.com"


def make_session() -> requests.Session:
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (compatible; PracticalScraper/1.0; +https://example.com/bot)",
        "Accept-Language": "en-US,en;q=0.9",
    })
    return session


def fetch_html(session: requests.Session, url: str, *, timeout: int = 15, max_retries: int = 4) -> str:
    last_error: Optional[Exception] = None
    for attempt in range(1, max_retries + 1):
        try:
            resp = session.get(url, timeout=timeout)
            if resp.status_code in (429, 500, 502, 503, 504):
                raise requests.HTTPError(f"Transient HTTP {resp.status_code} for {url}")
            resp.raise_for_status()
            return resp.text
        except requests.RequestException as e:
            last_error = e
            if attempt < max_retries:
                time.sleep(2 ** (attempt - 1))
    raise RuntimeError(f"Failed to fetch {url} after {max_retries} attempts: {last_error}")


@dataclass(frozen=True)
class Quote:
    text: str
    author: str
    tags: tuple[str, ...]


def parse_quotes(html: str) -> tuple[list[Quote], str | None]:
    soup = BeautifulSoup(html, "html.parser")
    quotes: list[Quote] = []
    for q in soup.select("div.quote"):
        text_el = q.select_one("span.text")
        author_el = q.select_one("small.author")
        tag_els = q.select("div.tags a.tag")
        if not text_el or not author_el:
            continue
        quotes.append(Quote(
            text=text_el.get_text(strip=True),
            author=author_el.get_text(strip=True),
            tags=tuple(t.get_text(strip=True) for t in tag_els),
        ))
    next_link = soup.select_one("li.next a")
    next_href = next_link["href"] if next_link and next_link.has_attr("href") else None
    return quotes, next_href


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS quotes (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            text TEXT NOT NULL,
            author TEXT NOT NULL,
            tags TEXT NOT NULL,
            UNIQUE(text, author)
        )
    """)
    conn.commit()


def save_quotes(conn: sqlite3.Connection, quotes: list[Quote]) -> int:
    inserted = 0
    for q in quotes:
        tags_str = ",".join(q.tags)
        try:
            conn.execute(
                "INSERT INTO quotes (text, author, tags) VALUES (?, ?, ?)",
                (q.text, q.author, tags_str),
            )
            inserted += 1
        except sqlite3.IntegrityError:
            pass
    conn.commit()
    return inserted


def crawl_all_quotes(session: requests.Session) -> list[Quote]:
    url: str | None = BASE_URL
    all_quotes: list[Quote] = []
    while url:
        html = fetch_html(session, url)
        quotes, next_href = parse_quotes(html)
        all_quotes.extend(quotes)
        time.sleep(0.5)  # be polite
        url = urljoin(BASE_URL, next_href) if next_href else None
    return all_quotes


def main() -> None:
    session = make_session()
    quotes = crawl_all_quotes(session)
    conn = sqlite3.connect("quotes.db")
    init_db(conn)
    inserted = save_quotes(conn, quotes)
    conn.close()
    print(f"Scraped {len(quotes)} quotes. Inserted {inserted} new rows into quotes.db.")


if __name__ == "__main__":
    main()
```

After running, you can inspect the DB:

```shell
sqlite3 quotes.db "SELECT author, COUNT(*) FROM quotes GROUP BY author ORDER BY COUNT(*) DESC LIMIT 5;"
```

8) Practical checks before you scrape real sites

  • Read the site’s rules: check robots.txt and Terms. Don’t scrape content you’re not allowed to access.
  • Log what matters: URL, status codes, retry counts, and parse failures.
  • Prefer stable selectors: IDs/classes on a container usually break less than deeply nested selectors.
  • Design for reruns: use unique constraints, timestamps, and incremental crawling if you’ll run daily.
  • If the site is JS-rendered: don’t jump straight to browser automation—first look for underlying JSON/XHR endpoints in DevTools Network. If there’s truly no API, consider Playwright (headless browser) with the same patterns: retries, timeouts, and persistence.
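The robots.txt check can be automated with the standard library's `urllib.robotparser`. In the sketch below the rules are fed in directly via `parse()` so it runs offline; a real scraper would point `set_url()` at the live `/robots.txt` and call `read()` instead:

```python
from urllib.robotparser import RobotFileParser

def allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if `url` is crawlable for `user_agent` under `rules`.

    `rules` is the raw text of a robots.txt file.
    """
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch(user_agent, url)
```

Run this once at startup for each URL prefix you plan to crawl, and skip (or abort) if the answer is False.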

Wrap-up

A maintainable scraper is mostly boring engineering: timeouts, retries, defensive parsing, pagination, and deduped storage. If you adopt this pattern early, your “quick script” becomes a tool you can run repeatedly without babysitting.

