Python Web Scraping in Practice: A Resilient Scraper with Requests, BeautifulSoup, Retries, and Caching
Web scraping is easy when the site is stable and your internet never hiccups. Real life is messier: timeouts happen, HTML changes, rate limits bite, and you end up re-downloading the same pages while debugging. In this hands-on guide, you’ll build a small-but-solid scraper in Python that:
- Fetches pages reliably with timeouts and retry/backoff
- Parses HTML with BeautifulSoup
- Caches responses to speed up development
- Extracts structured data and saves it to CSV
- Handles pagination in a clean, testable way
Ethics + legality note: Always check a website’s terms of service and robots.txt. Don’t scrape personal data, and don’t overload servers. Add delays and identify your scraper with a user agent.
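Checking robots.txt doesn't have to be manual: Python's standard library ships `urllib.robotparser` for exactly this. The sketch below parses a hypothetical robots.txt string offline for illustration; in a real scraper you'd call `rp.set_url(...)` and `rp.read()` to fetch the live file.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed offline for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

print(allowed(ROBOTS_TXT, "ExampleScraper", "https://example.com/products"))   # True
print(allowed(ROBOTS_TXT, "ExampleScraper", "https://example.com/private/x"))  # False
```

A quick `allowed(...)` check at the top of your crawl loop is cheap insurance against scraping paths the site has explicitly disallowed.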
1) Project setup
Create a virtual environment and install dependencies:
```bash
python -m venv .venv

# macOS/Linux
source .venv/bin/activate
# Windows
# .venv\Scripts\activate

pip install requests beautifulsoup4 requests-cache
```
We’ll use:
- `requests` for HTTP
- `beautifulsoup4` for HTML parsing
- `requests-cache` to cache GET responses locally during development
2) A reliable HTTP client (timeouts, retries, backoff)
Beginner scrapers often fail because they make one request and assume it always works. Instead, create a tiny client that retries common transient failures.
```python
import time
import random
from typing import Optional, Dict

import requests

DEFAULT_HEADERS = {
    "User-Agent": "ExampleScraper/1.0 (+https://your-site.example)"
}

RETRY_STATUS_CODES = {429, 500, 502, 503, 504}


def fetch(
    url: str,
    *,
    session: Optional[requests.Session] = None,
    headers: Optional[Dict[str, str]] = None,
    timeout: float = 15.0,
    max_retries: int = 4,
    backoff_base: float = 0.7,
) -> str:
    """
    Fetch HTML with retries + exponential backoff + jitter.
    Returns response text or raises an exception.
    """
    s = session or requests.Session()
    h = dict(DEFAULT_HEADERS)
    if headers:
        h.update(headers)

    last_exc: Optional[Exception] = None
    for attempt in range(max_retries + 1):
        try:
            resp = s.get(url, headers=h, timeout=timeout)
            if resp.status_code in RETRY_STATUS_CODES:
                raise requests.HTTPError(
                    f"Retryable status {resp.status_code} for {url}",
                    response=resp,
                )
            resp.raise_for_status()
            return resp.text
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError) as exc:
            last_exc = exc
            if attempt == max_retries:
                break
            # Exponential backoff with jitter
            sleep_s = (backoff_base * (2 ** attempt)) + random.uniform(0, 0.3)
            time.sleep(sleep_s)

    raise RuntimeError(f"Failed to fetch {url}") from last_exc
```
Why this matters: The best scraper isn’t the one that works once. It’s the one that keeps working when the network or server is flaky.
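To make the backoff concrete, here is the deterministic part of the sleep schedule `fetch()` follows with the default `backoff_base=0.7` (jitter adds up to 0.3 s on top of each wait):

```python
backoff_base = 0.7

# Sleep before retry attempts 1 through 4 (jitter excluded):
schedule = [backoff_base * (2 ** attempt) for attempt in range(4)]
print(schedule)  # [0.7, 1.4, 2.8, 5.6]
```

Roughly 10 seconds of total waiting across four retries: enough to ride out a brief outage without hammering a struggling server.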
3) Add caching so development is fast (and kinder)
While you’re iterating, caching prevents hammering the same pages repeatedly. With requests-cache, it’s one line.
```python
import requests_cache


def make_cached_session(cache_name: str = "http_cache", expire_seconds: int = 3600):
    # SQLite-backed cache stored as http_cache.sqlite by default
    return requests_cache.CachedSession(
        cache_name=cache_name,
        expire_after=expire_seconds,
        allowable_methods=("GET",),
    )
```
You’ll pass this session into fetch(). During debugging, you’ll see near-instant responses after the first request.
4) Parse HTML safely with BeautifulSoup
HTML is inconsistent. Always code defensively: check elements exist, strip text, and avoid brittle selectors when possible.
Let’s say you’re scraping a listing page where each item looks like this (simplified):
```html
<article class="product">
  <h2 class="title">Widget A</h2>
  <span class="price">$19.99</span>
  <a class="details" href="/products/widget-a">Details</a>
</article>
```
Here’s a parser that returns structured dictionaries:
```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def parse_list_page(html: str, base_url: str):
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for card in soup.select("article.product"):
        title_el = card.select_one(".title")
        price_el = card.select_one(".price")
        link_el = card.select_one("a.details")

        if not title_el or not link_el:
            # Skip malformed cards instead of crashing
            continue

        title = title_el.get_text(strip=True)
        price = price_el.get_text(strip=True) if price_el else None
        href = link_el.get("href")
        url = urljoin(base_url, href) if href else None

        items.append({
            "title": title,
            "price": price,
            "url": url,
        })
    return items
```
Selector tips:
- Prefer stable classes/attributes over deeply nested selectors.
- If you control the target site (internal tooling), add `data-test` or `data-scrape` attributes to make scraping robust.
- Fail “softly” (skip items) rather than failing the whole run.
5) Pagination: loop until there’s no “next”
Most real scrapers need to traverse multiple pages. A common pattern is: parse items, find the “next” link, repeat.
```python
def find_next_page(html: str, base_url: str):
    soup = BeautifulSoup(html, "html.parser")
    next_link = soup.select_one("a[rel='next'], a.next")
    if not next_link:
        return None
    href = next_link.get("href")
    return urljoin(base_url, href) if href else None
```
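One detail that trips people up: the “next” `href` is often relative, which is why both parsers resolve it with `urllib.parse.urljoin`. A few examples of how it resolves different href shapes against the current page URL:

```python
from urllib.parse import urljoin

page = "https://example.com/products?page=1"

# Absolute-path href, query-only href, and fully absolute href:
print(urljoin(page, "/products?page=2"))             # https://example.com/products?page=2
print(urljoin(page, "?page=2"))                      # https://example.com/products?page=2
print(urljoin(page, "https://cdn.example.com/p/3"))  # https://cdn.example.com/p/3
```

Because absolute hrefs pass through unchanged, the same code path handles every link style the site throws at you.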
Now we can write a simple crawl loop with a page limit to stay safe:
```python
def scrape_all(base_url: str, start_url: str, *, session, max_pages: int = 20):
    results = []
    url = start_url
    for _ in range(max_pages):
        html = fetch(url, session=session)
        results.extend(parse_list_page(html, base_url))

        next_url = find_next_page(html, base_url)
        if not next_url:
            break
        url = next_url

        # Polite delay (especially important if cache is cold)
        time.sleep(random.uniform(0.4, 0.9))
    return results
```
6) Save results to CSV (and keep it clean)
CSV is a great default output for junior/mid dev workflows: it’s inspectable, easy to import, and works everywhere.
```python
import csv


def save_csv(rows, path: str):
    if not rows:
        return
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```
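One caveat: `csv.DictWriter` raises `ValueError` if a later row contains a key the first row didn’t. That can’t happen here (every item has the same three keys), but once you start scraping detail pages with optional fields, a slightly more defensive variant is worth having. This sketch collects field names across all rows and fills gaps with an empty string:

```python
import csv


def save_csv_ragged(rows, path: str):
    """Like save_csv, but tolerates rows with differing keys."""
    if not rows:
        return
    # Union of keys across all rows, preserving first-seen order.
    fieldnames = list(dict.fromkeys(k for row in rows for k in row))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

The `restval=""` argument supplies the placeholder for any row that is missing one of the collected fields.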
7) Put it together: a runnable script
Below is a complete script you can adapt. Replace BASE_URL and START_URL with the site you’re scraping, and update the selectors in parse_list_page to match the page.
```python
import time
import random
import requests
import requests_cache
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import csv

BASE_URL = "https://example.com"
START_URL = "https://example.com/products?page=1"

DEFAULT_HEADERS = {
    "User-Agent": "ExampleScraper/1.0 (+https://your-site.example)"
}

RETRY_STATUS_CODES = {429, 500, 502, 503, 504}


def make_cached_session(cache_name: str = "http_cache", expire_seconds: int = 3600):
    return requests_cache.CachedSession(
        cache_name=cache_name,
        expire_after=expire_seconds,
        allowable_methods=("GET",),
    )


def fetch(url: str, *, session: requests.Session, timeout: float = 15.0,
          max_retries: int = 4, backoff_base: float = 0.7) -> str:
    last_exc = None
    for attempt in range(max_retries + 1):
        try:
            resp = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
            if resp.status_code in RETRY_STATUS_CODES:
                raise requests.HTTPError(
                    f"Retryable status {resp.status_code} for {url}",
                    response=resp,
                )
            resp.raise_for_status()
            return resp.text
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError) as exc:
            last_exc = exc
            if attempt == max_retries:
                break
            sleep_s = (backoff_base * (2 ** attempt)) + random.uniform(0, 0.3)
            time.sleep(sleep_s)
    raise RuntimeError(f"Failed to fetch {url}") from last_exc


def parse_list_page(html: str, base_url: str):
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for card in soup.select("article.product"):
        title_el = card.select_one(".title")
        price_el = card.select_one(".price")
        link_el = card.select_one("a.details")
        if not title_el or not link_el:
            continue
        title = title_el.get_text(strip=True)
        price = price_el.get_text(strip=True) if price_el else None
        href = link_el.get("href")
        url = urljoin(base_url, href) if href else None
        items.append({"title": title, "price": price, "url": url})
    return items


def find_next_page(html: str, base_url: str):
    soup = BeautifulSoup(html, "html.parser")
    next_link = soup.select_one("a[rel='next'], a.next")
    if not next_link:
        return None
    href = next_link.get("href")
    return urljoin(base_url, href) if href else None


def scrape_all(base_url: str, start_url: str, *, session: requests.Session, max_pages: int = 20):
    results = []
    url = start_url
    for _ in range(max_pages):
        html = fetch(url, session=session)
        results.extend(parse_list_page(html, base_url))
        next_url = find_next_page(html, base_url)
        if not next_url:
            break
        url = next_url
        time.sleep(random.uniform(0.4, 0.9))
    return results


def save_csv(rows, path: str):
    if not rows:
        print("No rows found.")
        return
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    print(f"Saved {len(rows)} rows to {path}")


def main():
    session = make_cached_session(expire_seconds=3600)
    rows = scrape_all(BASE_URL, START_URL, session=session, max_pages=20)
    save_csv(rows, "results.csv")


if __name__ == "__main__":
    main()
```
8) Practical troubleshooting checklist
- You get blocked (429 or CAPTCHA): slow down, reduce concurrency, rotate IPs only if allowed, and consider contacting the site for an API or permission.
- Selectors break: inspect the page and update `soup.select(...)`. Avoid super-deep CSS paths.
- Data is missing sometimes: treat optional fields as optional (`None`), skip malformed cards, and log failures.
- Pages differ by locale or A/B test: set consistent headers (e.g., `Accept-Language`) or detect variant layouts.
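On 429s specifically, well-behaved servers often tell you how long to back off via a `Retry-After` header. Here is a small helper (a sketch, not part of the script above) that turns that header into a sleep duration; it handles both the delta-seconds and HTTP-date forms the header allows:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, default: float = 5.0) -> float:
    """Parse a Retry-After header into seconds to wait."""
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():  # delta-seconds form, e.g. "120"
        return float(value)
    try:  # HTTP-date form, e.g. "Wed, 21 Oct 2026 07:28:00 GMT"
        target = parsedate_to_datetime(value)
        return max(0.0, (target - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return default

print(retry_after_seconds({"Retry-After": "120"}))  # 120.0
print(retry_after_seconds({}))                      # 5.0
```

Inside `fetch()`, you could sleep for `retry_after_seconds(resp.headers)` instead of the computed backoff when the status is 429, which respects the server’s own rate-limit window.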
Where to go next
Once you’re comfortable with this baseline, the usual next steps are:
- Scrape detail pages (follow each `url` and extract richer fields)
- Add structured logging and metrics (success rate, retries, response times)
- Persist to a database (SQLite/Postgres) instead of CSV
- Use async HTTP (e.g., `httpx`) when you need speed and can be polite
This approach—reliable fetching, defensive parsing, pagination, caching—covers a surprising number of real scraping jobs and gives you a foundation that won’t crumble the moment the network sneezes.