Python Web Scraping for Real Projects: Polite Crawling, Robust Parsing, and a “Save-As-You-Go” Pipeline

Web scraping tutorials often stop at “download a page and parse a title.” In real projects you need more: retries, rate limiting, caching, deduping, and a way to store partial progress so one failure doesn’t waste an hour of work.

This hands-on guide builds a small but production-minded scraping pipeline in Python using requests + BeautifulSoup. You’ll learn how to:

  • Fetch pages safely (timeouts, retries, backoff, user-agent, sessions)
  • Be polite (rate limiting, robots awareness)
  • Parse reliably (selectors, cleanup, defensive extraction)
  • Persist results incrementally (SQLite) and resume without duplication
  • Avoid re-downloading with HTTP caching

Note: Scrape only what you’re allowed to. Always check a site’s terms, and respect robots.txt and rate limits.

Setup

Install dependencies:

python -m pip install requests beautifulsoup4 requests-cache tenacity
  • requests: HTTP client
  • beautifulsoup4: HTML parsing
  • requests-cache: transparent caching to avoid re-downloading
  • tenacity: clean retry/backoff logic

1) Build a Safe HTTP Client (Session + Timeouts + Retries)

In scraping, “works on my machine” often means “fails randomly in production.” Networks are flaky and servers throttle. Use:

  • timeouts (always)
  • a session (connection pooling)
  • retries with exponential backoff for transient errors
  • clear headers (user agent, accept language)

import time
import random

import requests
import requests_cache
from requests.exceptions import RequestException, Timeout
from tenacity import retry, stop_after_attempt, wait_exponential_jitter, retry_if_exception_type

# Enable caching (SQLite file). Cached responses are reused automatically.
requests_cache.install_cache(
    cache_name="http_cache",
    backend="sqlite",
    expire_after=60 * 30,  # 30 minutes
)

session = requests.Session()
session.headers.update({
    "User-Agent": "MyScraperBot/1.0 (+https://example.com/contact)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.8",
})

def polite_sleep(min_s=0.4, max_s=1.2):
    # Small jitter helps avoid looking like a bot hammering at exact intervals.
    time.sleep(random.uniform(min_s, max_s))

@retry(
    reraise=True,
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=10),
    retry=retry_if_exception_type((RequestException, Timeout)),
)
def fetch(url: str) -> str:
    polite_sleep()
    resp = session.get(url, timeout=(5, 20))  # (connect timeout, read timeout)
    resp.raise_for_status()
    return resp.text

requests-cache makes reruns faster and kinder. If you’re iterating on parsing logic, you won’t hit the site repeatedly.

2) (Optional but Recommended) Check robots.txt Before Scraping

You don’t need a full crawler framework to be respectful. Python’s standard library can read robots.txt rules.

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

_robot_parsers = {}

def can_fetch(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    robots_url = f"{base}/robots.txt"
    rp = _robot_parsers.get(base)
    if rp is None:
        rp = RobotFileParser()
        rp.set_url(robots_url)
        try:
            rp.read()
        except Exception:
            # If robots.txt is unreachable, decide your policy:
            # return False for strict, True for permissive.
            return False
        _robot_parsers[base] = rp
    return rp.can_fetch(user_agent, url)

Before fetching a page, call can_fetch(url). If it returns False, skip it.
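To see how the rule matching behaves without touching the network, `RobotFileParser` can also parse a robots.txt supplied as in-memory lines via its `parse()` method. The rules below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed from lines instead of fetched from a URL.
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]
rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/products"))        # allowed
print(rp.can_fetch("*", "https://example.com/private/secret"))  # disallowed
```

This is also a handy way to unit-test your crawler's politeness logic without depending on a live site.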

3) Parse HTML Defensively (Selectors + Cleanup + Defaults)

HTML changes. Elements disappear. Classes get renamed. A robust parser avoids crashing and produces reasonable defaults.

Here’s a practical parser that extracts a page title, canonical URL, and a “price” if present (common e-commerce pattern). Adjust selectors to your target site.

from urllib.parse import urljoin

from bs4 import BeautifulSoup

def text_or_none(el):
    if not el:
        return None
    return " ".join(el.get_text(strip=True).split())

def parse_product_page(html: str, page_url: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    title = text_or_none(soup.select_one("h1")) or text_or_none(soup.select_one("title"))

    canonical_el = soup.select_one('link[rel="canonical"]')
    canonical = canonical_el["href"].strip() if canonical_el and canonical_el.has_attr("href") else page_url
    canonical = urljoin(page_url, canonical)

    # Example price selectors (pick what matches your site)
    price = (
        text_or_none(soup.select_one('[itemprop="price"]'))
        or text_or_none(soup.select_one(".price"))
        or text_or_none(soup.select_one('[data-test="price"]'))
    )

    return {
        "url": canonical,
        "title": title,
        "price": price,
    }

Key habits:

  • Prefer semantic selectors when available (itemprop, data-test)
  • Normalize whitespace (" ".join(...))
  • Always return keys even if values are None
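The normalization habit is easy to check in isolation. This stdlib-only sketch mirrors what text_or_none does after BeautifulSoup has extracted the raw text:

```python
def clean_text(raw):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    if not raw:
        return None
    return " ".join(raw.split())

print(clean_text("  Fancy\n\t Widget  "))  # Fancy Widget
print(clean_text(""))                      # None
```

Returning None for empty input (instead of raising) is what lets the parser above chain fallback selectors with `or`.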

4) Discover Links Safely (Stay on Domain + Avoid Junk URLs)

If you’re crawling multiple pages, you need a link extractor. Keep it constrained:

  • Stay on the same host (avoid accidental off-site crawling)
  • Skip fragments (#section) and common non-HTML targets (images, PDFs)
  • Normalize to absolute URLs

from urllib.parse import urlparse, urljoin

from bs4 import BeautifulSoup

SKIP_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".svg", ".pdf", ".zip")

def same_host(a: str, b: str) -> bool:
    return urlparse(a).netloc == urlparse(b).netloc

def extract_links(html: str, page_url: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select("a[href]"):
        href = a["href"].strip()
        if not href or href.startswith("#"):
            continue
        abs_url = urljoin(page_url, href)
        abs_url = abs_url.split("#", 1)[0]  # drop fragment
        if not same_host(abs_url, page_url):
            continue
        if abs_url.lower().endswith(SKIP_EXTENSIONS):
            continue
        links.append(abs_url)
    return links

5) Persist Incrementally with SQLite (Resume + Deduplicate)

Scrapers should be restartable. If your script crashes at page 450/1000, you want to resume at 451 without reprocessing everything.

SQLite is perfect for this: zero setup, fast enough for many workloads, and supports unique constraints.

import sqlite3

def init_db(db_path="scrape.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            url TEXT PRIMARY KEY,
            title TEXT,
            price TEXT,
            scraped_at TEXT DEFAULT (datetime('now'))
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS queue (
            url TEXT PRIMARY KEY
        )
    """)
    conn.commit()
    return conn

def enqueue(conn: sqlite3.Connection, url: str):
    conn.execute("INSERT OR IGNORE INTO queue(url) VALUES (?)", (url,))
    conn.commit()

def dequeue_batch(conn: sqlite3.Connection, n=20) -> list[str]:
    rows = conn.execute("SELECT url FROM queue LIMIT ?", (n,)).fetchall()
    urls = [r[0] for r in rows]
    conn.executemany("DELETE FROM queue WHERE url = ?", [(u,) for u in urls])
    conn.commit()
    return urls

def save_page(conn: sqlite3.Connection, item: dict):
    conn.execute(
        "INSERT OR REPLACE INTO pages(url, title, price) VALUES (?, ?, ?)",
        (item["url"], item["title"], item["price"]),
    )
    conn.commit()

def already_scraped(conn: sqlite3.Connection, url: str) -> bool:
    row = conn.execute("SELECT 1 FROM pages WHERE url = ? LIMIT 1", (url,)).fetchone()
    return row is not None

6) Put It Together: A Small, Polite Crawler

This example starts from a seed URL, scrapes pages, discovers links, and stores results. It stays on-domain and respects robots.txt.

def crawl(seed_url: str, max_pages: int = 200):
    conn = init_db()
    enqueue(conn, seed_url)
    processed = 0

    while processed < max_pages:
        batch = dequeue_batch(conn, n=10)
        if not batch:
            break
        for url in batch:
            if processed >= max_pages:
                break
            if already_scraped(conn, url):
                continue
            if not can_fetch(url):
                continue
            try:
                html = fetch(url)
            except Exception as e:
                # Log and continue. In real projects, store errors in a table.
                print(f"Fetch failed: {url} ({e})")
                continue

            item = parse_product_page(html, url)
            save_page(conn, item)
            processed += 1

            # Discover and enqueue more URLs
            for link in extract_links(html, url):
                if not already_scraped(conn, link):
                    enqueue(conn, link)

            print(f"[{processed}] {item['url']} title={item['title']!r}")

    conn.close()

if __name__ == "__main__":
    crawl("https://example.com/products", max_pages=100)

Tip: fetch returns only the response text, so caching metadata isn't visible in crawl. If you want to log cache hits, restructure fetch to return the Response object (requests-cache sets a from_cache attribute on each response) rather than calling session.get a second time, which would issue a duplicate request.

Common Improvements You Can Add Next

  • Return Response from fetch: store status code, headers, and from_cache metadata.
  • ETag/Last-Modified support: caching libraries handle much of this, but you can also store headers per URL.
  • Error table: store failures and retry later (useful when a site is temporarily down).
  • Content-type checks: skip non-HTML responses by inspecting Content-Type.
  • Parallelism carefully: start with sequential; if you add concurrency, keep the same politeness rules and limit the number of simultaneous requests.
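The error-table idea can be sketched with the same SQLite patterns used above. The table and helper names here are hypothetical, not part of the pipeline built so far:

```python
import sqlite3

def init_errors(conn: sqlite3.Connection):
    # One row per failing URL; re-recording a failure replaces the message.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS errors (
            url TEXT PRIMARY KEY,
            error TEXT,
            failed_at TEXT DEFAULT (datetime('now'))
        )
    """)
    conn.commit()

def record_error(conn, url: str, exc: Exception):
    conn.execute(
        "INSERT OR REPLACE INTO errors(url, error) VALUES (?, ?)",
        (url, repr(exc)),
    )
    conn.commit()

def failed_urls(conn) -> list[str]:
    return [r[0] for r in conn.execute("SELECT url FROM errors").fetchall()]

conn = sqlite3.connect(":memory:")
init_errors(conn)
record_error(conn, "https://example.com/p/1", TimeoutError("read timed out"))
print(failed_urls(conn))  # ['https://example.com/p/1']
```

On a later run you could re-enqueue failed_urls(conn) and delete rows that succeed, giving transient failures a second chance without blocking the main crawl.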

Debugging Checklist (When Scraping “Doesn’t Work”)

  • Are you being blocked? Try a slower rate and confirm your User-Agent is set.
  • Are selectors correct? Save sample HTML to disk and inspect it locally.
  • Is content loaded via JavaScript? If the HTML response lacks the data, you may need a JSON endpoint (often used by the site) or a browser-based approach.
  • Are you missing timeouts? Without timeouts, your script can hang indefinitely on one request.
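For the selector question above, a small helper that dumps the raw response to disk lets you open it in a browser or editor and confirm whether the data is in the HTML at all. The file name is arbitrary:

```python
from pathlib import Path

def dump_sample(html: str, path: str = "debug_sample.html") -> Path:
    """Write fetched HTML to disk for offline selector inspection."""
    out = Path(path)
    out.write_text(html, encoding="utf-8")
    return out

sample = dump_sample("<html><body><h1>Fancy Widget</h1></body></html>")
print("price" in sample.read_text(encoding="utf-8"))  # False
```

If a string you can see in the browser is missing from the saved file, the page is almost certainly rendered by JavaScript, which points you to the JSON-endpoint or browser-based approaches mentioned above.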

Wrap-Up

You now have a practical Python scraping foundation: a robust HTTP client, defensive parsing, link discovery, caching, and a persistent queue + results store to resume safely. This architecture scales surprisingly far for internal tools and small-to-mid scraping jobs—and it’s a solid base before you move to heavier crawling frameworks.

