Modern Python Web Scraping with Playwright: Handle JavaScript Sites, Waits, and Structured Output

Classic scrapers that fetch HTML with requests work great for simple pages, but many real sites render content with JavaScript, lazy-load lists, and hide data behind dynamic UI states. In those cases, a browser automation engine is often the most practical “scraping” tool.

In this hands-on guide you’ll build a small, reusable Python scraper using playwright that:

  • Loads JavaScript-rendered pages reliably
  • Waits for the right UI state (not arbitrary sleeps)
  • Extracts structured data via CSS selectors
  • Runs concurrently for speed (without melting the target site)
  • Saves results to JSON and SQLite

Use this only where you have permission to scrape. Respect robots policies, rate limits, and Terms of Service.
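If you want to honor robots rules programmatically, the standard library can already parse them. A minimal sketch using `urllib.robotparser` (the rules and URLs here are illustrative; in real use you would point `set_url()` at the site's actual `robots.txt` and call `read()`):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules parsed inline for the demo. In real use:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyScraper/1.0", "https://example.com/listing"))    # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/x"))  # False
```

Checking `can_fetch()` before each `page.goto()` is cheap insurance that your scraper stays within the site's stated policy.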

1) Setup: Playwright + a Clean Project Layout

Install Playwright and download browser binaries:

python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows
# .venv\Scripts\activate
pip install playwright
playwright install

Suggested structure:

scraper/
    scraper.py
    storage.py
    models.py
    output/
        results.json

You can keep it in one file for a quick demo, but splitting storage and scraping makes it easier to maintain.

2) The Core Idea: “Wait for State”, Then Extract

Most flaky scrapers fail because they guess timing. Instead of time.sleep(3), wait for a selector that signals the page is ready. Playwright gives you tools like:

  • page.goto(url, wait_until="domcontentloaded") or "networkidle"
  • page.wait_for_selector("css=...")
  • locator = page.locator("css=...") with locator.count(), locator.all_text_contents(), etc.

Let’s implement a practical extractor that can scrape “cards” from a listing page (title, URL, and price-like text). You’ll adapt the selectors to your target site.

3) A Working Scraper (Async) You Can Reuse

This example scrapes a list page, extracts items from repeated card elements, and then optionally visits detail pages. It’s structured so you can swap selectors without rewriting the whole scraper.

from __future__ import annotations

import asyncio
import json
import random
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Optional
from urllib.parse import urljoin

from playwright.async_api import async_playwright, Browser, Page


@dataclass
class Item:
    title: str
    url: str
    meta: Optional[str] = None  # e.g., price, subtitle, category


def jitter(min_ms: int = 150, max_ms: int = 450) -> float:
    """Small randomized delay to reduce burstiness (seconds)."""
    return random.randint(min_ms, max_ms) / 1000.0


async def new_page(browser: Browser) -> Page:
    context = await browser.new_context(
        user_agent=(
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        ),
        viewport={"width": 1280, "height": 800},
    )
    return await context.new_page()


async def scrape_listing(page: Page, url: str) -> list[Item]:
    await page.goto(url, wait_until="domcontentloaded")
    # Wait for the page to actually show cards/items.
    # Replace ".card" with a real selector for your site.
    await page.wait_for_selector(".card", timeout=15000)

    cards = page.locator(".card")
    count = await cards.count()
    items: list[Item] = []
    for i in range(count):
        card = cards.nth(i)
        # Replace these selectors to match your target HTML.
        title = (await card.locator(".title").inner_text()).strip()
        href = await card.locator("a").get_attribute("href")
        meta = None
        meta_locator = card.locator(".meta")
        if await meta_locator.count() > 0:
            meta = (await meta_locator.first.inner_text()).strip()
        if not href:
            continue
        # Convert relative URLs to absolute
        items.append(Item(title=title, url=urljoin(page.url, href), meta=meta))
    return items


async def scrape_detail(page: Page, item: Item) -> Item:
    await asyncio.sleep(jitter())
    await page.goto(item.url, wait_until="domcontentloaded")
    # Example: wait for a detail header; adjust selector for your site.
    await page.wait_for_selector("h1", timeout=15000)
    # Optionally enrich the item with detail data,
    # e.g., grab a longer description:
    desc_locator = page.locator(".description")
    if await desc_locator.count() > 0:
        desc = (await desc_locator.first.inner_text()).strip()
        item.meta = (item.meta + " | " if item.meta else "") + desc[:120]
    return item


async def bounded_gather(*coros, limit: int = 3):
    """Run coroutines with a concurrency limit."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))


async def main():
    START_URL = "https://example.com/listing"  # change me
    OUT_JSON = Path("output/results.json")
    OUT_JSON.parent.mkdir(parents=True, exist_ok=True)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        list_page = await new_page(browser)
        items = await scrape_listing(list_page, START_URL)

        # If you don't need detail pages, stop here.
        # For detail scraping, reuse a small pool of pages. Each task borrows
        # a page from the queue, so no two tasks ever share one page at once.
        page_pool: asyncio.Queue = asyncio.Queue()
        for _ in range(3):
            page_pool.put_nowait(await new_page(browser))

        async def detail_task(item: Item) -> Item:
            page = await page_pool.get()
            try:
                return await scrape_detail(page, item)
            finally:
                page_pool.put_nowait(page)

        enriched = await bounded_gather(*(detail_task(it) for it in items), limit=3)
        await browser.close()

    OUT_JSON.write_text(
        json.dumps([asdict(x) for x in enriched], ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    print(f"Saved {len(enriched)} items to {OUT_JSON}")


if __name__ == "__main__":
    asyncio.run(main())

What you must change: the CSS selectors (.card, .title, .meta) and the START_URL. Open DevTools in your browser, inspect an item card, and choose stable selectors (prefer data attributes like [data-testid="..."] if available).
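When converting extracted hrefs into absolute URLs, the stdlib's `urljoin` is worth knowing: it handles the edge cases (root-relative paths, `../` segments, already-absolute hrefs) that naive string concatenation gets wrong. A quick illustration with made-up URLs:

```python
from urllib.parse import urljoin

base = "https://example.com/listing/page2"

# Root-relative href: resolved against the origin, not the full path.
print(urljoin(base, "/items/42"))       # https://example.com/items/42

# Path-relative href: resolved against the base URL's directory.
print(urljoin(base, "items/42"))        # https://example.com/listing/items/42

# Already-absolute href: returned unchanged.
print(urljoin(base, "https://cdn.example.com/img.png"))  # https://cdn.example.com/img.png
```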

4) Reliable Waiting Patterns (Avoid Flakiness)

Here are practical rules that keep Playwright scrapers stable:

  • Wait for an element that indicates “data is ready”. Example: await page.wait_for_selector(".card").
  • Don’t wait for “network idle” blindly. Many sites keep connections open; prefer a selector wait.
  • Use Locator APIs. page.locator() is resilient and auto-retries some conditions.
  • Handle empty states. If a selector can be missing, check await locator.count() before reading.

If a site loads more items on scroll, you can combine scrolling with a “count increased” check:

async def scroll_to_load_more(page: Page, item_selector: str, max_rounds: int = 10):
    last_count = 0
    for _ in range(max_rounds):
        loc = page.locator(item_selector)
        count = await loc.count()
        if count == last_count:
            # Try one more scroll, then stop if still unchanged
            await page.mouse.wheel(0, 1200)
            await asyncio.sleep(0.6)
            if await loc.count() == last_count:
                break
        else:
            last_count = count
        await page.mouse.wheel(0, 1200)
        await asyncio.sleep(0.6)

5) Saving to SQLite (So You Can Deduplicate)

JSON is great for exporting, but SQLite is great for “I scrape every day and don’t want duplicates.” Here’s a tiny storage layer using Python’s built-in sqlite3:

import sqlite3
from typing import Iterable

from models import Item  # or import Item from your main file


def init_db(path: str = "output/items.db") -> None:
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS items (
            url TEXT PRIMARY KEY,
            title TEXT NOT NULL,
            meta TEXT
        )
        """
    )
    conn.commit()
    conn.close()


def upsert_items(items: Iterable[Item], path: str = "output/items.db") -> int:
    conn = sqlite3.connect(path)
    cur = conn.cursor()
    written = 0
    for it in items:
        cur.execute(
            """
            INSERT INTO items(url, title, meta) VALUES(?, ?, ?)
            ON CONFLICT(url) DO UPDATE SET
                title=excluded.title,
                meta=excluded.meta
            """,
            (it.url, it.title, it.meta),
        )
        written += 1
    conn.commit()
    conn.close()
    return written

Now you can call init_db() once, then upsert_items(enriched) after scraping.
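To see the dedup behavior in isolation, here is a self-contained sketch against an in-memory database with the same schema. Inserting the same URL twice leaves a single row, with the newer values winning:

```python
import sqlite3

# Same schema as the storage layer above, but in-memory for the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (url TEXT PRIMARY KEY, title TEXT NOT NULL, meta TEXT)"
)

def upsert(conn, url, title, meta=None):
    # url is the PRIMARY KEY, so a second insert updates instead of duplicating.
    conn.execute(
        """
        INSERT INTO items(url, title, meta) VALUES(?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET title=excluded.title, meta=excluded.meta
        """,
        (url, title, meta),
    )

upsert(conn, "https://example.com/a", "First title")
upsert(conn, "https://example.com/a", "Updated title", "price: 9.99")

rows = conn.execute("SELECT url, title, meta FROM items").fetchall()
print(rows)  # one row, carrying the updated title and meta
```

This is why URL works well as the dedup key: re-running the scraper refreshes rows instead of piling up duplicates.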

6) Practical Tips: Concurrency, Rate Limiting, and “Not Getting Blocked”

  • Limit concurrency. Start with 2–4 concurrent pages. More isn’t always faster if the site throttles.
  • Add jitter. Small random waits avoid robotic request patterns (see jitter()).
  • Prefer headless=True for servers, but debug with headless=False locally to watch what happens.
  • Cache what you can. If detail pages are stable, store them (or store hashes) and skip unchanged pages.
  • Handle timeouts gracefully. Wrap detail scraping in try/except and continue.
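The "cache what you can" tip can be as simple as storing a content hash per URL and skipping pages that haven't changed. A hypothetical sketch — the helper name and in-memory dict are illustrative; in practice you would persist the hashes, e.g. in the SQLite table:

```python
import hashlib

# Hypothetical cache: url -> sha256 of the last-seen page content.
seen_hashes: dict[str, str] = {}

def needs_processing(url: str, html: str) -> bool:
    """Return True if this page's content changed since the last run."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if seen_hashes.get(url) == digest:
        return False  # unchanged: skip re-extraction
    seen_hashes[url] = digest
    return True

print(needs_processing("https://example.com/a", "<h1>v1</h1>"))  # True  (first visit)
print(needs_processing("https://example.com/a", "<h1>v1</h1>"))  # False (unchanged)
print(needs_processing("https://example.com/a", "<h1>v2</h1>"))  # True  (changed)
```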

A simple “continue on error” wrapper for detail scraping:

async def safe_scrape_detail(page: Page, item: Item) -> Item:
    try:
        return await scrape_detail(page, item)
    except Exception as e:
        # Log minimal info; keep the pipeline moving
        item.meta = (item.meta + " | " if item.meta else "") + f"detail_error:{type(e).__name__}"
        return item

7) Quick Checklist Before You Ship a Scraper

  • Selectors are stable (prefer data-* attributes over fragile class names)
  • Waits are state-based (wait_for_selector), not sleep-based
  • Concurrency is bounded
  • Output is structured (JSON/SQLite) and deduplicated by a stable key (usually URL)
  • Errors don’t kill the whole run

If you adopt this approach, you’ll be able to scrape modern JavaScript-heavy sites with far fewer “works on my machine” moments—and you’ll have a small codebase that junior/mid developers can actually maintain.

