Python Web Scraping in Practice: Build a Polite Price Tracker with httpx, BeautifulSoup, and SQLite
Python web scraping is most useful when it solves a small, repeatable problem: collect public page data, normalize it, store it, and compare changes over time. In this tutorial, you will build a practical scraper that fetches product-like pages, extracts a title and price, saves results to SQLite, and avoids common beginner mistakes such as hammering servers or writing fragile parsing code.
The goal is not to scrape a specific real shop. Instead, we will create a reusable pattern you can adapt to pages you are allowed to crawl. Before scraping any site, check its terms of service, robots.txt, and rate limits. Do not scrape private data, login-only pages, or content you do not have permission to collect.
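The robots.txt check can be automated with the standard library's urllib.robotparser. The sketch below parses an inline sample so it runs without network access; the sample rules and the helper name build_robots_parser are assumptions for illustration, not something a real site publishes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_robots_parser(robots_txt):
    """Return a parser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_robots_parser(SAMPLE_ROBOTS_TXT)
agent = "LearningPriceTracker"

# True: the path is not covered by any Disallow rule.
allowed = parser.can_fetch(agent, "https://example.com/product/example-one")
# False: the path falls under /private/.
blocked = parser.can_fetch(agent, "https://example.com/private/report")
# Crawl-delay in seconds when the site declares one, otherwise None.
delay = parser.crawl_delay(agent)
```

Against a live site you would typically call parser.set_url() with the site's robots.txt address and then parser.read() instead of parsing an inline string. A robots.txt check does not replace reading the terms of service.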
What We Will Build
Our scraper will:
- Read a list of URLs from a Python list.
- Fetch each page with httpx.
- Parse HTML using BeautifulSoup.
- Extract a page title and price.
- Store each scrape result in SQLite.
- Handle errors, timeouts, and missing elements safely.
Install the dependencies first:
pip install httpx beautifulsoup4 lxml
Project Structure
Keep the project small and clear:
    price-tracker/
        scraper.py
        products.db
The database file will be created automatically. All the code below can live in scraper.py.
Create the Database Layer
SQLite suits a small scraping project well: it needs no server setup, and the sqlite3 module ships with Python's standard library. We will create one table called price_snapshots. Each run inserts a new row, so you can track price history instead of overwriting old data.
    import sqlite3
    from datetime import datetime, timezone

    DB_PATH = "products.db"


    def init_db():
        # Create the snapshots table on first run; later runs leave it as is.
        with sqlite3.connect(DB_PATH) as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS price_snapshots (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    url TEXT NOT NULL,
                    title TEXT,
                    price_cents INTEGER,
                    currency TEXT,
                    scraped_at TEXT NOT NULL
                )
            """)
            conn.commit()


    def save_snapshot(url, title, price_cents, currency):
        # Store timestamps in UTC so rows sort consistently.
        scraped_at = datetime.now(timezone.utc).isoformat()

        with sqlite3.connect(DB_PATH) as conn:
            conn.execute(
                """
                INSERT INTO price_snapshots
                    (url, title, price_cents, currency, scraped_at)
                VALUES (?, ?, ?, ?, ?)
                """,
                (url, title, price_cents, currency, scraped_at)
            )
            conn.commit()
Notice the use of parameterized SQL with ? placeholders. This is safer than formatting SQL strings manually and keeps your code ready for real-world inputs.
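To see why placeholders matter, here is a minimal sketch that stores a deliberately hostile title in an in-memory database. The table name snapshots and the helper store_and_read_back are made up for this demonstration and are not part of the tracker.

```python
import sqlite3
from contextlib import closing


def store_and_read_back(title):
    """Insert one title with a ? placeholder and return all stored rows."""
    with closing(sqlite3.connect(":memory:")) as conn:
        conn.execute("CREATE TABLE snapshots (title TEXT)")
        # The driver passes the value separately from the SQL text,
        # so quotes and semicolons in it are treated as plain data.
        conn.execute("INSERT INTO snapshots (title) VALUES (?)", (title,))
        return conn.execute("SELECT title FROM snapshots").fetchall()


hostile_title = "Mug'); DROP TABLE snapshots; --"
rows = store_and_read_back(hostile_title)
```

The title comes back unchanged and the table survives. Building the same statement with an f-string would place the quote character directly into the SQL text, which at best causes a syntax error.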
Fetch HTML Politely
A scraper should behave like a careful client. Set a timeout, identify your scraper with a clear User-Agent, and do not retry aggressively.
    import time

    import httpx

    HEADERS = {
        # Replace the contact placeholder with an address you actually monitor.
        "User-Agent": "LearningPriceTracker/1.0 (contact: [email protected])"
    }


    def fetch_html(url):
        try:
            response = httpx.get(
                url,
                headers=HEADERS,
                timeout=10.0,
                follow_redirects=True
            )
            # Turn 4xx and 5xx responses into HTTPStatusError.
            response.raise_for_status()
            return response.text
        except httpx.TimeoutException:
            print(f"Timeout while fetching {url}")
        except httpx.HTTPStatusError as exc:
            print(f"HTTP error {exc.response.status_code} for {url}")
        except httpx.RequestError as exc:
            print(f"Network error for {url}: {exc}")

        # Reached only when one of the handlers above ran.
        return None
The function returns None when something fails. This keeps the rest of the scraper simple: if there is no HTML, skip parsing and move to the next URL.
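If you do want retries, one cautious option is a small, bounded loop with growing pauses. The sketch below is an assumption layered on top of the tutorial: fetch_with_retries is a hypothetical helper that wraps any function returning HTML or None, and the sleep parameter exists only so the waiting can be replaced in tests.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times, doubling the pause after each failure."""
    for attempt in range(attempts):
        html = fetch(url)
        if html is not None:
            return html
        # Wait before the next try, but not after the final one.
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None


# Stand-in fetcher that fails twice and then succeeds.
calls = []
waits = []


def flaky_fetch(url):
    calls.append(url)
    return "<html></html>" if len(calls) == 3 else None


result = fetch_with_retries(flaky_fetch, "https://example.com/product/example-one",
                            sleep=waits.append)
```

In the scraper you could call fetch_with_retries(fetch_html, url). Because fetch_html returns None for every kind of failure, this sketch would also retry a 404, which is rarely useful; a fuller version might retry only on timeouts and 5xx responses.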
Parse Title and Price
HTML varies from site to site, so selectors are the most fragile part of scraping. In a real project, inspect the target page and choose stable CSS selectors. Avoid selectors that look auto-generated, such as .css-1x92abc, because they often change during deployments.
For this example, assume the page has a product title in h1 and a price in an element like <span class="price">$19.99</span>.
    import re

    from bs4 import BeautifulSoup


    def parse_price_to_cents(raw_price):
        if not raw_price:
            return None, None

        text = raw_price.strip()

        # Guess the currency from the symbol, if one is present.
        currency = None
        if "$" in text:
            currency = "USD"
        elif "€" in text:
            currency = "EUR"
        elif "£" in text:
            currency = "GBP"

        cleaned = re.sub(r"[^0-9.,]", "", text)
        if not cleaned:
            return None, currency

        # Handle simple prices such as 19.99 or 19,99
        normalized = cleaned.replace(",", ".")

        try:
            price_float = float(normalized)
            return int(round(price_float * 100)), currency
        except ValueError:
            return None, currency


    def parse_product(html):
        soup = BeautifulSoup(html, "lxml")

        title_el = soup.select_one("h1")
        price_el = soup.select_one(".price")

        # Fall back to None when an element is missing instead of raising.
        title = title_el.get_text(strip=True) if title_el else None
        raw_price = price_el.get_text(strip=True) if price_el else None

        price_cents, currency = parse_price_to_cents(raw_price)

        return {
            "title": title,
            "price_cents": price_cents,
            "currency": currency
        }
The parser is intentionally defensive. A missing title or price should not crash the entire scraping run. Instead, store what you can, log what failed, and improve your selectors later.
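One known gap: parse_price_to_cents() turns every comma into a decimal point, so a price such as $1,299.99 fails to parse and comes back as None. The sketch below is one possible heuristic, not a general solution. The function name price_text_to_cents is hypothetical, and it assumes that a separator followed by exactly one or two trailing digits is the decimal mark and that every other separator is a thousands grouping.

```python
import re
from decimal import Decimal


def price_text_to_cents(raw_price):
    """Convert price text to integer cents, or return None if no digits are found."""
    # Keep digits and separators, then drop separators at either end.
    cleaned = re.sub(r"[^0-9.,]", "", raw_price or "").strip(".,")
    if not cleaned:
        return None

    whole, fraction = cleaned, ""
    last_sep = max(cleaned.rfind("."), cleaned.rfind(","))
    # Assumption: one or two digits after the last separator means decimals.
    if last_sep != -1 and len(cleaned) - last_sep - 1 in (1, 2):
        whole, fraction = cleaned[:last_sep], cleaned[last_sep + 1:]

    # Any separators left in the whole part are treated as grouping marks.
    whole_digits = re.sub(r"[.,]", "", whole)
    amount = Decimal(f"{whole_digits}.{fraction or '0'}")
    return int(amount * 100)


us_price = price_text_to_cents("$1,299.99")
eu_price = price_text_to_cents("1.299,99 €")
no_decimals = price_text_to_cents("£1,299")
```

Decimal avoids the floating-point rounding step entirely. The heuristic still misreads prices with three decimal places, so check it against the formats your target pages actually use.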
Run the Scraper
Now connect the database, fetcher, parser, and saving logic. Add a short delay between requests to reduce load on the target server.
    URLS = [
        "https://example.com/product/example-one",
        "https://example.com/product/example-two",
    ]


    def scrape_all():
        init_db()

        for url in URLS:
            print(f"Scraping {url}")
            html = fetch_html(url)

            if html is not None:
                product = parse_product(html)

                save_snapshot(
                    url=url,
                    title=product["title"],
                    price_cents=product["price_cents"],
                    currency=product["currency"]
                )

                print(
                    "Saved:",
                    product["title"],
                    product["price_cents"],
                    product["currency"]
                )

            # Pause after every request, including failed ones, so a run of
            # errors does not turn into a burst of rapid requests.
            time.sleep(2)


    if __name__ == "__main__":
        scrape_all()
Run it:
python scraper.py
After a successful run, open the database with the sqlite3 command-line shell and query the snapshots:

    sqlite3 products.db

    SELECT url, title, price_cents, currency, scraped_at
    FROM price_snapshots
    ORDER BY scraped_at DESC;
Make Selectors Easier to Change
Hardcoding selectors inside parse_product() is fine for a tutorial, but it becomes painful when scraping multiple page types. A simple improvement is to move selectors into a dictionary.
    SELECTORS = {
        "title": "h1",
        "price": ".price"
    }


    def parse_product(html):
        soup = BeautifulSoup(html, "lxml")

        # Look selectors up by name so layout changes only touch SELECTORS.
        title_el = soup.select_one(SELECTORS["title"])
        price_el = soup.select_one(SELECTORS["price"])

        title = title_el.get_text(strip=True) if title_el else None
        raw_price = price_el.get_text(strip=True) if price_el else None

        price_cents, currency = parse_price_to_cents(raw_price)

        return {
            "title": title,
            "price_cents": price_cents,
            "currency": currency
        }
This small change makes maintenance easier. When the page layout changes, you update the selector configuration instead of digging through parsing logic.
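The same idea extends to several sites by keying the configuration on the hostname. This is a sketch under assumptions: the hostname shop.example.com, its selectors, and the helper selectors_for are all hypothetical.

```python
from urllib.parse import urlparse

# Fallback selectors used when a site has no entry of its own.
DEFAULT_SELECTORS = {"title": "h1", "price": ".price"}

# Hypothetical per-site overrides, keyed by hostname.
SITE_SELECTORS = {
    "shop.example.com": {
        "title": "h1.product-name",
        "price": "[itemprop='price']",
    },
}


def selectors_for(url):
    """Return the selector set for a URL, falling back to the defaults."""
    host = urlparse(url).hostname or ""
    # Site-specific entries override defaults key by key.
    return {**DEFAULT_SELECTORS, **SITE_SELECTORS.get(host, {})}


shop_selectors = selectors_for("https://shop.example.com/product/1")
other_selectors = selectors_for("https://other.example.org/item/2")
```

parse_product() could then accept the URL and call selectors_for(url) instead of reading a single global dictionary.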
Add Basic Change Detection
Storing snapshots is useful, but developers usually want to know when something changed. Add a helper that fetches the latest saved price before inserting a new one.
    def get_latest_price(url):
        with sqlite3.connect(DB_PATH) as conn:
            # The newest snapshot for this URL comes first.
            row = conn.execute(
                """
                SELECT price_cents, currency
                FROM price_snapshots
                WHERE url = ?
                ORDER BY scraped_at DESC
                LIMIT 1
                """,
                (url,)
            ).fetchone()

        if row is None:
            return None

        return {
            "price_cents": row[0],
            "currency": row[1]
        }
Then update the scraping loop:
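    latest = get_latest_price(url)
    old_cents = latest["price_cents"] if latest else None
    new_cents = product["price_cents"]

    # Compare only when both prices were parsed. A None price means a
    # selector missed, and dividing it by 100 would raise a TypeError.
    if old_cents is not None and new_cents is not None and old_cents != new_cents:
        old_price = old_cents / 100
        new_price = new_cents / 100
        print(f"Price changed for {url}: {old_price:.2f} -> {new_price:.2f}")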
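Place this block inside the loop, after parse_product() and before save_snapshot(), so the comparison runs against the previous snapshot rather than the row you are about to insert. In a production app, you could send an email, Slack message, or webhook instead of printing to the console.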
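Whichever channel you choose, it helps to keep the comparison itself in a small pure function that returns data instead of printing. The helper below, summarize_price_change, is a hypothetical sketch, and the shape of the returned dictionary is an assumption.

```python
def summarize_price_change(old_cents, new_cents):
    """Return a summary dict when the price changed, otherwise None."""
    # A missing price on either side means there is nothing to compare.
    if old_cents is None or new_cents is None or old_cents == new_cents:
        return None

    delta = new_cents - old_cents
    # Percent change is undefined when the old price was zero.
    percent = round(delta / old_cents * 100, 1) if old_cents else None

    return {
        "direction": "up" if delta > 0 else "down",
        "delta_cents": delta,
        "percent": percent,
    }


drop = summarize_price_change(2000, 1500)
unchanged = summarize_price_change(2000, 2000)
unknown = summarize_price_change(None, 2000)
```

A notifier can then decide what to do with the result, for example alerting only when the change exceeds a threshold you choose.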
Common Scraping Mistakes to Avoid
- No timeout: without a timeout, one slow request can freeze the whole script.
- No delay: fast loops can overload small websites and get your IP blocked.
- Fragile selectors: prefer semantic attributes, headings, stable classes, or structured data where available.
- Ignoring failures: log failed URLs so you can debug them later.
- Overwriting data: snapshots are better than single-row updates when you care about history.
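For the point about logging failures, the standard logging module is usually enough. The sketch below writes to an in-memory stream so it runs anywhere; the logger name price_tracker.failures and the helper build_failure_logger are assumptions for illustration.

```python
import io
import logging


def build_failure_logger(stream):
    """Return a logger that writes one line per failed URL to `stream`."""
    logger = logging.getLogger("price_tracker.failures")
    logger.setLevel(logging.WARNING)
    logger.propagate = False
    # Drop handlers from earlier calls so lines are not duplicated.
    logger.handlers.clear()

    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
failure_log = build_failure_logger(stream)
failure_log.warning(
    "fetch failed url=%s reason=%s",
    "https://example.com/product/example-one",
    "timeout",
)
logged = stream.getvalue()
```

In the scraper you would more likely attach a logging.FileHandler and call the logger from the except branches of fetch_html() in place of print().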
When to Use a Browser Automation Tool Instead
httpx and BeautifulSoup work well when the HTML you need is present in the server response. Some websites render important content with JavaScript after the page loads. In that case, a browser automation tool such as Playwright or Selenium may be more appropriate.
Before switching tools, inspect the page source and network requests. Sometimes the data comes from a JSON endpoint that is easier and more reliable to request directly. Browser automation is powerful, but it is slower and more operationally complex than plain HTTP scraping.
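Inspecting the page source sometimes also reveals structured data embedded as JSON-LD, which tends to be more stable than CSS classes. The sketch below uses only the standard library so it runs without extra packages; the sample HTML is invented, and the names JsonLdCollector and find_product_offer are hypothetical. It assumes the page follows the schema.org Product and Offer vocabulary, which many sites do not.

```python
import json
from html.parser import HTMLParser

# Invented sample page with a schema.org Product block.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Mug",
 "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}}
</script>
</head><body><h1>Example Mug</h1></body></html>
"""


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of every JSON-LD script tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks instead of failing the run.


def find_product_offer(html):
    """Return name, price, and currency from the first Product block, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for doc in collector.documents:
        if isinstance(doc, dict) and doc.get("@type") == "Product":
            offers = doc.get("offers")
            if isinstance(offers, list):
                offers = offers[0] if offers else None
            if isinstance(offers, dict):
                return {
                    "name": doc.get("name"),
                    "price": offers.get("price"),
                    "currency": offers.get("priceCurrency"),
                }
    return None


offer = find_product_offer(SAMPLE_HTML)
```

With BeautifulSoup already in the project, soup.find_all("script", type="application/ld+json") would replace the collector class. The price arrives as text, so it still needs converting to cents before storage.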
Conclusion
A good Python web scraping project is not just about extracting text from HTML. It needs polite request handling, defensive parsing, clear storage, and maintainable selectors. The pattern in this article gives you a practical base: fetch with httpx, parse with BeautifulSoup, store snapshots in SQLite, and compare new results with previous data. From here, you can add scheduling with cron, notifications, CSV exports, or per-site selector configurations.