Python Web Scraping for JavaScript-Heavy Sites with Playwright (Async, Robust, and Practical)

Many modern sites render content with JavaScript, meaning a simple requests.get() often returns an empty shell (or placeholders) instead of the data you want. In those cases, you need a real browser engine—without turning your scraper into a fragile mess.

In this hands-on guide, you’ll build a small but production-minded scraper using Playwright and asyncio. You’ll learn how to:

  • Render JavaScript content reliably
  • Extract structured data with stable selectors
  • Paginate safely
  • Throttle, retry, and avoid common “it worked yesterday” failures
  • Export results to CSV/JSON

Use-case example: scrape product cards from a listing page that loads via JS and spans multiple pages.

Prerequisites and Setup

You’ll need Python 3.10+ and a virtual environment.

python -m venv .venv

# macOS/Linux
source .venv/bin/activate
# Windows
# .venv\Scripts\activate

pip install playwright
python -m playwright install

That last command installs browser binaries Playwright needs.

Pick a Target and Inspect the Page

Open your target page in the browser, right-click a product card (or any repeated item), and inspect it. Look for stable attributes like:

  • data-testid (best)
  • data-* attributes
  • semantic structure (e.g., an article with a heading)

Avoid selectors that depend on random class names like .xYz_123. They’ll break.

In the examples below, assume each item is a card like:

  • Card root: [data-testid="product-card"]
  • Name: [data-testid="product-name"]
  • Price: [data-testid="product-price"]
  • Next page button: [aria-label="Next"]

If your site uses different selectors, swap them accordingly.
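One way to make that swap painless is to keep every selector in a single mapping. A small sketch, using the selectors assumed above:

```python
# All page-specific selectors in one place, so switching to a different
# site means editing this dict rather than hunting through the scraper.
SELECTORS = {
    "card": '[data-testid="product-card"]',
    "name": '[data-testid="product-name"]',
    "price": '[data-testid="product-price"]',
    "next": '[aria-label="Next"]',
}
```

Later code can then reference SELECTORS["card"] instead of repeating hard-coded strings.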

A Minimal “Render + Extract” Scraper

This first script loads the page, waits for cards to appear, extracts fields, and prints them. It’s the core loop you’ll reuse.

import asyncio

from playwright.async_api import async_playwright

URL = "https://example.com/products"  # replace with your target


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # A realistic User-Agent helps avoid trivial bot filters.
        await page.set_extra_http_headers({
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        })

        await page.goto(URL, wait_until="domcontentloaded")

        # Wait for JS-rendered content.
        await page.wait_for_selector('[data-testid="product-card"]', timeout=15000)

        cards = await page.query_selector_all('[data-testid="product-card"]')
        for card in cards:
            name = await card.query_selector('[data-testid="product-name"]')
            price = await card.query_selector('[data-testid="product-price"]')
            name_text = (await name.inner_text()) if name else ""
            price_text = (await price.inner_text()) if price else ""
            print({"name": name_text.strip(), "price": price_text.strip()})

        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())

If this prints items consistently, you’ve validated: selectors, rendering, and wait strategy.

Make It Practical: Pagination + Export + Defensive Extraction

Now let’s turn this into something you can run on a schedule. We’ll add:

  • Pagination (click “Next” until it’s disabled or missing)
  • Graceful field extraction (avoid crashes from missing nodes)
  • Deduplication by URL or name
  • CSV + JSON export

import asyncio
import csv
import json
from dataclasses import dataclass, asdict
from typing import List, Set
from urllib.parse import urljoin

from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError

START_URL = "https://example.com/products"  # replace

CARD = '[data-testid="product-card"]'
NAME = '[data-testid="product-name"]'
PRICE = '[data-testid="product-price"]'
LINK = 'a[data-testid="product-link"]'
NEXT = '[aria-label="Next"]'


@dataclass
class Product:
    name: str
    price: str
    url: str


async def safe_text(el) -> str:
    if not el:
        return ""
    txt = await el.inner_text()
    return txt.strip()


async def safe_attr(el, attr: str) -> str:
    if not el:
        return ""
    val = await el.get_attribute(attr)
    return (val or "").strip()


async def scrape_listing(max_pages: int = 10) -> List[Product]:
    results: List[Product] = []
    seen: Set[str] = set()

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                       "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        )
        page = await context.new_page()

        await page.goto(START_URL, wait_until="domcontentloaded")
        await page.wait_for_selector(CARD, timeout=15000)

        current_page = 1
        while True:
            # Ensure cards are present on each page (especially after pagination)
            try:
                await page.wait_for_selector(CARD, timeout=15000)
            except PlaywrightTimeoutError:
                break

            cards = await page.query_selector_all(CARD)
            for card in cards:
                name_el = await card.query_selector(NAME)
                price_el = await card.query_selector(PRICE)
                link_el = await card.query_selector(LINK)

                name = await safe_text(name_el)
                price = await safe_text(price_el)
                href = await safe_attr(link_el, "href")

                # Normalize relative URLs against the current page URL
                if href and href.startswith("/"):
                    href = urljoin(page.url, href)

                # Choose a stable dedupe key. URL is best if available.
                key = href or name
                if not key or key in seen:
                    continue
                seen.add(key)
                results.append(Product(name=name, price=price, url=href))

            if current_page >= max_pages:
                break

            # Try to paginate
            next_btn = await page.query_selector(NEXT)
            if not next_btn:
                break

            # If the button is disabled, stop (common pattern)
            disabled = await next_btn.get_attribute("disabled")
            aria_disabled = await next_btn.get_attribute("aria-disabled")
            if disabled is not None or aria_disabled == "true":
                break

            # Click next and wait for navigation / content change.
            await next_btn.click()
            # A practical wait: if the site keeps the URL the same, at least
            # pause briefly, then re-wait for cards at the top of the loop.
            await page.wait_for_timeout(500)
            current_page += 1

        await context.close()
        await browser.close()

    return results


def export_csv(products: List[Product], path: str) -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
        writer.writeheader()
        for p in products:
            writer.writerow(asdict(p))


def export_json(products: List[Product], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump([asdict(p) for p in products], f, ensure_ascii=False, indent=2)


async def main():
    products = await scrape_listing(max_pages=5)
    print(f"Scraped {len(products)} products")
    export_csv(products, "products.csv")
    export_json(products, "products.json")


if __name__ == "__main__":
    asyncio.run(main())

This is intentionally “boring and reliable”: stable selectors, safe extraction, simple pagination, and exports you can hand to anyone.
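The exporters store price as raw text (e.g. "$1,299.99"). If you need numeric values for sorting or price alerts, a small parser helps. This is a sketch of my own (`parse_price` is not part of the article's scraper), and it assumes dot-decimal formatting:

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(text: str):
    """Pull the first numeric value out of a raw price string.

    Assumes dot-decimal formatting ("$1,299.99"); returns None when
    no number is present (e.g. "Sold out").
    """
    cleaned = text.replace(",", "")  # drop thousands separators
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    if not match:
        return None
    try:
        return Decimal(match.group())
    except InvalidOperation:
        return None
```

For example, parse_price("$1,299.99") yields Decimal("1299.99"). Keep the raw string in your export as well, so you can re-parse later if the site changes its formatting.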

Handling Slow Loads, Spinners, and “Content Appears Then Changes”

JS pages commonly show skeleton cards, then replace them. Two practical techniques:

  • Wait for a “real content” signal: e.g. price not empty, or a “loaded” marker disappears.
  • Wait for network to settle (carefully): wait_until="networkidle" can hang on sites with constant polling.

Example: wait until at least one card has a non-empty price:

# After waiting for the card selector:
await page.wait_for_function(
    """() => {
        const cards = document.querySelectorAll('[data-testid="product-card"]');
        if (!cards.length) return false;
        const price = cards[0].querySelector('[data-testid="product-price"]');
        return price && price.textContent.trim().length > 0;
    }""",
    timeout=15000
)

This reduces the “I scraped empty values” problem.

Retries and Throttling: Don’t Get Your IP Burned

Two rules keep you out of trouble:

  • Go slower than you think. Add small delays and don’t open 50 pages at once.
  • Retry failures with backoff. Transient timeouts happen.

Here’s a tiny async retry helper you can use around navigation steps:

import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry(fn: Callable[[], Awaitable[T]], attempts: int = 3, base_delay: float = 0.5) -> T:
    last_exc = None
    for i in range(attempts):
        try:
            return await fn()
        except Exception as e:
            last_exc = e
            await asyncio.sleep(base_delay * (2 ** i))
    raise last_exc

Use it like:

await retry(lambda: page.goto(START_URL, wait_until="domcontentloaded"))

Also consider a tiny per-page delay between pagination clicks:

await page.wait_for_timeout(800) # milliseconds
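A fixed delay is easy to fingerprint, since every request lands on the same beat. A hedged alternative (a sketch; `polite_pause` is a name I'm introducing, not a Playwright API) adds random jitter:

```python
import asyncio
import random


async def polite_pause(base_ms: float = 800, jitter_ms: float = 400) -> None:
    """Sleep for base_ms plus up to jitter_ms of random extra time."""
    await asyncio.sleep((base_ms + random.uniform(0, jitter_ms)) / 1000)
```

Call `await polite_pause()` between pagination clicks in place of a fixed wait_for_timeout.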

Common Gotchas (and Quick Fixes)

  • Cookie banners block clicks: detect and click “Accept” once, early. Use a selector like button:has-text("Accept").
  • Infinite scroll: scroll until item count stops increasing, then extract.
  • Fragile selectors: prefer data-testid, aria-label, headings, and stable structure.
  • Headless blocked: try headful mode (headless=False) to debug, then consider stealth strategies only if permitted.
  • Legal/ToS issues: always check site terms and robots guidance for your use case.
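For the infinite-scroll case, the "scroll until the item count stops increasing" idea can be sketched as a helper. This assumes a Playwright `page`; `scroll_until_stable` is a name of my own, not a Playwright API:

```python
async def scroll_until_stable(page, item_selector: str,
                              max_rounds: int = 20, pause_ms: int = 800) -> int:
    """Scroll down until the number of matched items stops increasing.

    Returns the final item count; max_rounds caps runaway pages.
    """
    prev = -1
    for _ in range(max_rounds):
        count = await page.locator(item_selector).count()
        if count == prev:  # no new items appeared: we're done
            break
        prev = count
        await page.mouse.wheel(0, 4000)         # scroll further down
        await page.wait_for_timeout(pause_ms)   # let lazy content load
    return prev
```

Usage: `await scroll_until_stable(page, '[data-testid="product-card"]')`, then extract cards as usual.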

Where to Take It Next

Once you have a reliable Playwright baseline, you can level up without rewriting everything:

  • Store results in SQLite/Postgres instead of CSV
  • Add structured logging (logging module) and metrics
  • Run in a container or scheduled job (cron/GitHub Actions)
  • Extract detail pages by visiting each product URL (with rate limits)
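For the SQLite option, a minimal sketch (`export_sqlite` is my own helper, mirroring the CSV/JSON exporters; it assumes rows as dicts, e.g. from `asdict(product)`):

```python
import sqlite3


def export_sqlite(rows, path: str) -> None:
    """Upsert scraped rows into a products table, keyed by URL.

    Re-running the scraper updates prices in place instead of
    piling up duplicate rows.
    """
    con = sqlite3.connect(path)
    try:
        con.execute(
            "CREATE TABLE IF NOT EXISTS products ("
            "url TEXT PRIMARY KEY, name TEXT, price TEXT)"
        )
        con.executemany(
            "INSERT INTO products (url, name, price) VALUES (:url, :name, :price) "
            "ON CONFLICT(url) DO UPDATE SET name=excluded.name, price=excluded.price",
            rows,
        )
        con.commit()
    finally:
        con.close()
```

Usage: `export_sqlite([asdict(p) for p in products], "products.db")`.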

If you can reliably render the page, wait for meaningful signals, and extract with stable selectors, you’re already ahead of most “quick scrape” scripts. Playwright gives you a real browser—your job is to keep the scraper disciplined, predictable, and gentle on the target site.

