Python Web Scraping for Real Projects: Polite Crawling, Robust Parsing, and a “Save-As-You-Go” Pipeline
Web scraping tutorials often stop at “download a page and parse a title.” In real projects you need more: retries, rate limiting, caching, deduping, and a way to store partial progress so one failure doesn’t waste an hour of work.
This hands-on guide builds a small but production-minded scraping pipeline in Python using requests + BeautifulSoup. You’ll learn how to:
- Fetch pages safely (timeouts, retries, backoff, user-agent, sessions)
- Be polite (rate limiting, robots awareness)
- Parse reliably (selectors, cleanup, defensive extraction)
- Persist results incrementally (SQLite) and resume without duplication
- Avoid re-downloading with HTTP caching
Note: Scrape only what you’re allowed to. Always check a site’s terms, and respect robots.txt and rate limits.
Setup
Install dependencies:
python -m pip install requests beautifulsoup4 requests-cache tenacity
- requests: HTTP client
- beautifulsoup4: HTML parsing
- requests-cache: transparent caching to avoid re-downloading
- tenacity: clean retry/backoff logic
1) Build a Safe HTTP Client (Session + Timeouts + Retries)
In scraping, “works on my machine” often means “fails randomly in production.” Networks are flaky and servers throttle. Use:
- timeouts (always)
- a session (connection pooling)
- retries with exponential backoff for transient errors
- clear headers (user agent, accept language)
```python
import time
import random
import requests
import requests_cache
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential_jitter,
    retry_if_exception_type,
)
from requests.exceptions import RequestException, Timeout

# Enable caching (SQLite file). Cached responses are reused automatically.
requests_cache.install_cache(
    cache_name="http_cache",
    backend="sqlite",
    expire_after=60 * 30,  # 30 minutes
)

session = requests.Session()
session.headers.update({
    "User-Agent": "MyScraperBot/1.0 (+https://example.com/contact)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.8",
})

def polite_sleep(min_s=0.4, max_s=1.2):
    # Small jitter helps avoid looking like a bot hammering at exact intervals.
    time.sleep(random.uniform(min_s, max_s))

@retry(
    reraise=True,
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=1, max=10),
    retry=retry_if_exception_type((RequestException, Timeout)),
)
def fetch(url: str) -> str:
    polite_sleep()
    resp = session.get(url, timeout=(5, 20))  # (connect timeout, read timeout)
    resp.raise_for_status()
    return resp.text
```
requests-cache makes reruns faster and kinder. If you’re iterating on parsing logic, you won’t hit the site repeatedly.
2) (Optional but Recommended) Check robots.txt Before Scraping
You don’t need a full crawler framework to be respectful. Python’s standard library can read robots.txt rules.
```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

_robot_parsers = {}

def can_fetch(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    robots_url = f"{base}/robots.txt"
    rp = _robot_parsers.get(base)
    if rp is None:
        rp = RobotFileParser()
        rp.set_url(robots_url)
        try:
            rp.read()
        except Exception:
            # If robots.txt is unreachable, decide your policy:
            # return False for strict, True for permissive.
            return False
        _robot_parsers[base] = rp
    return rp.can_fetch(user_agent, url)
```
Before fetching a page, call can_fetch(url). If it returns False, skip it.
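As a quick offline sanity check of the allow/deny logic, RobotFileParser can also be fed rules directly via parse() instead of read(), so you can exercise your policy without any network access. The rules below are a made-up example, not a real site's robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Offline sketch: parse() accepts robots.txt lines directly, so the
# allow/deny logic can be tested without fetching anything.
rules = [
    "User-agent: *",
    "Disallow: /admin/",
    "Crawl-delay: 2",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/products"))     # True: not disallowed
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False: under /admin/
print(rp.crawl_delay("*"))  # 2: the site asks for 2 seconds between requests
```

If a site declares a Crawl-delay, honoring it in polite_sleep is an easy win.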
3) Parse HTML Defensively (Selectors + Cleanup + Defaults)
HTML changes. Elements disappear. Classes get renamed. A robust parser avoids crashing and produces reasonable defaults.
Here’s a practical parser that extracts a page title, canonical URL, and a “price” if present (common e-commerce pattern). Adjust selectors to your target site.
```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def text_or_none(el):
    if not el:
        return None
    return " ".join(el.get_text(strip=True).split())

def parse_product_page(html: str, page_url: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    title = text_or_none(soup.select_one("h1")) or text_or_none(soup.select_one("title"))

    canonical_el = soup.select_one('link[rel="canonical"]')
    canonical = canonical_el["href"].strip() if canonical_el and canonical_el.has_attr("href") else page_url
    canonical = urljoin(page_url, canonical)

    # Example price selectors (pick what matches your site)
    price = (
        text_or_none(soup.select_one('[itemprop="price"]'))
        or text_or_none(soup.select_one(".price"))
        or text_or_none(soup.select_one('[data-test="price"]'))
    )

    return {
        "url": canonical,
        "title": title,
        "price": price,
    }
```
Key habits:
- Prefer semantic selectors when available (itemprop, data-test)
- Normalize whitespace (" ".join(...))
- Always return keys even if values are None
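To see these habits in action, here is a tiny self-contained check against inline sample HTML. The markup and selectors are invented for illustration; swap in your target site's structure:

```python
from bs4 import BeautifulSoup

def text_or_none(el):
    if not el:
        return None
    return " ".join(el.get_text(strip=True).split())

sample = """
<html>
  <head><title> Fallback   Title </title></head>
  <body>
    <h1>  Blue   Widget </h1>
    <span class="price"> $19.99 </span>
  </body>
</html>
"""
soup = BeautifulSoup(sample, "html.parser")

title = text_or_none(soup.select_one("h1")) or text_or_none(soup.select_one("title"))
price = text_or_none(soup.select_one('[itemprop="price"]')) or text_or_none(soup.select_one(".price"))
missing = text_or_none(soup.select_one(".does-not-exist"))  # no crash, just None

print(title)    # "Blue Widget" -- internal whitespace collapsed
print(price)    # "$19.99" -- fell through to the .price fallback
print(missing)  # None
```

Notice that the missing selector produces None rather than an AttributeError, which is exactly what keeps a long crawl from dying on one odd page.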
4) Discover Links Safely (Stay on Domain + Avoid Junk URLs)
If you’re crawling multiple pages, you need a link extractor. Keep it constrained:
- Stay on the same host (avoid accidental off-site crawling)
- Skip fragments (#section) and common non-HTML targets (images, PDFs)
- Normalize to absolute URLs
```python
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

SKIP_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".svg", ".pdf", ".zip")

def same_host(a: str, b: str) -> bool:
    return urlparse(a).netloc == urlparse(b).netloc

def extract_links(html: str, page_url: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select("a[href]"):
        href = a["href"].strip()
        if not href or href.startswith("#"):
            continue
        abs_url = urljoin(page_url, href)
        abs_url = abs_url.split("#", 1)[0]  # drop fragment
        if not same_host(abs_url, page_url):
            continue
        if abs_url.lower().endswith(SKIP_EXTENSIONS):
            continue
        links.append(abs_url)
    return links
```
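The normalization rules above are worth checking in isolation. This standalone sketch runs the same filtering logic over a handful of invented hrefs (the URLs are made up for the demo):

```python
from urllib.parse import urljoin, urlparse

page_url = "https://example.com/products/page2"
SKIP_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".svg", ".pdf", ".zip")

# A mix of hrefs you might find on a page (made up for the demo).
hrefs = [
    "/products/widget",          # site-relative -> kept, absolutized
    "#reviews",                  # fragment only -> skipped
    "photo.jpg",                 # image -> skipped
    "https://other-site.com/x",  # off-host -> skipped
    "../about?ref=1#team",       # relative with fragment -> kept, fragment dropped
]

links = []
for href in hrefs:
    if not href or href.startswith("#"):
        continue
    abs_url = urljoin(page_url, href).split("#", 1)[0]
    if urlparse(abs_url).netloc != urlparse(page_url).netloc:
        continue
    if abs_url.lower().endswith(SKIP_EXTENSIONS):
        continue
    links.append(abs_url)

print(links)
# ['https://example.com/products/widget', 'https://example.com/about?ref=1']
```

Note how urljoin resolves the ../ segment against the page's directory, and how query strings survive while fragments are dropped.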
5) Persist Incrementally with SQLite (Resume + Deduplicate)
Scrapers should be restartable. If your script crashes at page 450/1000, you want to resume at 451 without reprocessing everything.
SQLite is perfect for this: zero setup, fast enough for many workloads, and supports unique constraints.
```python
import sqlite3

def init_db(db_path="scrape.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            url TEXT PRIMARY KEY,
            title TEXT,
            price TEXT,
            scraped_at TEXT DEFAULT (datetime('now'))
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS queue (
            url TEXT PRIMARY KEY
        )
    """)
    conn.commit()
    return conn

def enqueue(conn: sqlite3.Connection, url: str):
    conn.execute("INSERT OR IGNORE INTO queue(url) VALUES (?)", (url,))
    conn.commit()

def dequeue_batch(conn: sqlite3.Connection, n=20) -> list[str]:
    rows = conn.execute("SELECT url FROM queue LIMIT ?", (n,)).fetchall()
    urls = [r[0] for r in rows]
    conn.executemany("DELETE FROM queue WHERE url = ?", [(u,) for u in urls])
    conn.commit()
    return urls

def save_page(conn: sqlite3.Connection, item: dict):
    conn.execute(
        "INSERT OR REPLACE INTO pages(url, title, price) VALUES (?, ?, ?)",
        (item["url"], item["title"], item["price"]),
    )
    conn.commit()

def already_scraped(conn: sqlite3.Connection, url: str) -> bool:
    row = conn.execute("SELECT 1 FROM pages WHERE url = ? LIMIT 1", (url,)).fetchone()
    return row is not None
```
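The dedup-and-resume behavior comes entirely from the two PRIMARY KEY constraints. This compact sketch exercises the same SQL against an in-memory database (in the real pipeline this would be the scrape.db file, and the URLs are invented):

```python
import sqlite3

# In-memory DB for the demo; the real pipeline uses a file so state survives restarts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS queue (url TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT, price TEXT)")

# Enqueue a duplicate: PRIMARY KEY + INSERT OR IGNORE deduplicates it silently.
for u in ["https://example.com/a", "https://example.com/a", "https://example.com/b"]:
    conn.execute("INSERT OR IGNORE INTO queue(url) VALUES (?)", (u,))

queued = [r[0] for r in conn.execute("SELECT url FROM queue ORDER BY url")]
print(queued)  # two unique URLs, not three

# "Scrape" one page, then check what a restart would see.
conn.execute(
    "INSERT OR REPLACE INTO pages(url, title, price) VALUES (?, ?, ?)",
    ("https://example.com/a", "Widget A", "$5.00"),
)
done = conn.execute("SELECT 1 FROM pages WHERE url = ?", ("https://example.com/a",)).fetchone()
print(done is not None)  # True -> already_scraped() would skip it on resume
```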
6) Put It Together: A Small, Polite Crawler
This example starts from a seed URL, scrapes pages, discovers links, and stores results. It stays on-domain and respects robots.txt.
```python
def crawl(seed_url: str, max_pages: int = 200):
    conn = init_db()
    enqueue(conn, seed_url)
    processed = 0

    while processed < max_pages:
        batch = dequeue_batch(conn, n=10)
        if not batch:
            break

        for url in batch:
            if processed >= max_pages:
                break
            if already_scraped(conn, url):
                continue
            if not can_fetch(url):
                continue

            try:
                html = fetch(url)
            except Exception as e:
                # Log and continue. In real projects, store errors in a table.
                print(f"Fetch failed: {url} ({e})")
                continue

            item = parse_product_page(html, url)
            save_page(conn, item)
            processed += 1

            # Discover and enqueue more URLs
            for link in extract_links(html, url):
                if not already_scraped(conn, link):
                    enqueue(conn, link)

            print(f"[{processed}] {item['url']} title={item['title']!r}")

    conn.close()

if __name__ == "__main__":
    crawl("https://example.com/products", max_pages=100)
```
Tip: If you want to log cache hits, don’t call session.get(url) a second time just to inspect the response — that issues a fresh request. Instead, restructure fetch to return the response object and read its from_cache attribute (set by requests-cache) on the response you already have.
Common Improvements You Can Add Next
- Return Response from fetch: store status code, headers, and from_cache metadata.
- ETag/Last-Modified support: caching libraries handle much of this, but you can also store headers per URL.
- Error table: store failures and retry later (useful when a site is temporarily down).
- Content-type checks: skip non-HTML responses by inspecting Content-Type.
- Parallelism, carefully: start with sequential; if you add concurrency, keep the same politeness rules and limit the number of simultaneous requests.
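The content-type check is the quickest of these to add. A minimal sketch, assuming the header string comes from resp.headers.get("Content-Type", "") on a requests response:

```python
# Sketch of a content-type gate: parse the MIME type (before any ";charset=...")
# and only hand text/html-ish responses to BeautifulSoup.
def is_html(content_type: str) -> bool:
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in ("text/html", "application/xhtml+xml")

print(is_html("text/html; charset=utf-8"))  # True
print(is_html("application/pdf"))           # False -> skip parsing this response
```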
Debugging Checklist (When Scraping “Doesn’t Work”)
- Are you being blocked? Try a slower rate and confirm your User-Agent is set.
- Are selectors correct? Save sample HTML to disk and inspect it locally.
- Is content loaded via JavaScript? If the HTML response lacks the data, you may need a JSON endpoint (often used by the site) or a browser-based approach.
- Are you missing timeouts? Without timeouts, your script can hang indefinitely on one request.
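For the "save sample HTML" step above, a two-liner is enough. This sketch writes to a hypothetical debug_sample.html in the temp directory; the HTML string stands in for whatever fetch(url) actually returned:

```python
import tempfile
from pathlib import Path

# Debugging aid: dump the HTML you actually received, so you can open it in a
# browser or editor and test selectors against reality rather than assumptions.
html = "<html><body><h1>What the server really sent</h1></body></html>"  # e.g. fetch(url)
sample_path = Path(tempfile.gettempdir()) / "debug_sample.html"  # hypothetical location
sample_path.write_text(html, encoding="utf-8")

print(sample_path.exists())  # True -- inspect the file with your browser or editor
```

Comparing this dump to what you see in the browser’s dev tools is also the fastest way to spot JavaScript-rendered content.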
Wrap-Up
You now have a practical Python scraping foundation: a robust HTTP client, defensive parsing, link discovery, caching, and a persistent queue + results store to resume safely. This architecture scales surprisingly far for internal tools and small-to-mid scraping jobs—and it’s a solid base before you move to heavier crawling frameworks.