Python Web Scraping in Practice: Extract Product Data with Playwright and BeautifulSoup

Many junior developers start web scraping with requests and BeautifulSoup. That works well for static HTML, but many modern websites render product cards, prices, tables, and search results with JavaScript. In those cases, your scraper may fetch an almost empty page.

This article shows a practical approach: use Playwright to load JavaScript-rendered pages, then use BeautifulSoup to parse the final HTML cleanly. We will build a small scraper that extracts product names, prices, and links, then saves them to a CSV file.

What We Are Building

We will create a scraper that:

  • opens a product listing page with a real browser engine,
  • waits until product cards are available,
  • parses the rendered HTML with BeautifulSoup,
  • extracts structured data,
  • writes the results to products.csv,
  • handles missing fields safely.

For the example, assume the page contains product cards like this:

    <article class="product-card">
      <a class="product-card__link" href="/products/wireless-mouse">
        <h2 class="product-card__title">Wireless Mouse</h2>
      </a>
      <span class="product-card__price">$24.99</span>
    </article>

Real websites will have different class names. The important skill is learning how to inspect the HTML and map selectors to the data you need.

Project Setup

Create a new folder and install the dependencies:

    mkdir product-scraper
    cd product-scraper
    python -m venv venv
    source venv/bin/activate    # macOS/Linux
    # venv\Scripts\activate     # Windows
    pip install playwright beautifulsoup4 lxml
    playwright install

The packages do different jobs:

  • playwright controls the browser and renders JavaScript.
  • beautifulsoup4 parses HTML.
  • lxml gives BeautifulSoup a fast parser.

Step 1: Load a JavaScript Page with Playwright

Create a file named scraper.py:

    from playwright.sync_api import sync_playwright


    def fetch_rendered_html(url: str) -> str:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page(
                user_agent=(
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/124.0 Safari/537.36"
                )
            )

            page.goto(url, wait_until="networkidle", timeout=30000)

            # Wait for product cards to appear.
            page.wait_for_selector(".product-card", timeout=10000)

            # Serialize the DOM as it looks after JavaScript has run.
            html = page.content()
            browser.close()

            return html


    if __name__ == "__main__":
        html = fetch_rendered_html("https://example.com/products")
        print(html[:1000])

The key line is page.content(). It returns the HTML after the browser has loaded and executed JavaScript. This is often the difference between an empty scrape and a successful one.

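wait_until="networkidle" waits until there have been no network connections for at least 500 ms. Pages that poll or stream data in the background may never reach that state and will hit the timeout instead, so treat it as a rough signal. wait_for_selector(".product-card") adds a more specific check: it waits for the content we actually care about.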

Step 2: Parse Product Cards with BeautifulSoup

Now add a parser function:

    from bs4 import BeautifulSoup
    from urllib.parse import urljoin


    def clean_text(value: str | None) -> str:
        # Collapse runs of whitespace; treat missing values as empty text.
        if not value:
            return ""
        return " ".join(value.split())


    def parse_products(html: str, base_url: str) -> list[dict]:
        soup = BeautifulSoup(html, "lxml")
        cards = soup.select(".product-card")

        products = []

        for card in cards:
            title_element = card.select_one(".product-card__title")
            price_element = card.select_one(".product-card__price")
            link_element = card.select_one(".product-card__link")

            title = clean_text(title_element.get_text() if title_element else None)
            price = clean_text(price_element.get_text() if price_element else None)

            relative_url = link_element.get("href") if link_element else None
            # urljoin() returns base_url for an empty href, so only join real links.
            product_url = urljoin(base_url, relative_url) if relative_url else ""

            # A card without a title is not useful, so skip it.
            if not title:
                continue

            products.append({
                "title": title,
                "price": price,
                "url": product_url,
            })

        return products

This parser is defensive. It checks whether each element exists before calling methods on it. That matters because production HTML is messy. Some cards may have missing prices, promotional labels, lazy-loaded content, or slightly different markup.

The urljoin() function converts relative URLs like /products/wireless-mouse into absolute URLs such as https://example.com/products/wireless-mouse.
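
The exact result of urljoin() depends on the shape of both the base URL and the href, which is easy to get wrong by eye. The short sketch below runs a few cases side by side; the CDN URL is a made-up example.

```python
from urllib.parse import urljoin

base = "https://example.com/products"

# Root-relative href: the path replaces everything after the host.
root_relative = urljoin(base, "/products/wireless-mouse")

# Bare relative href: resolved against the "directory" of the base URL,
# so the result depends on whether the base ends with a slash.
no_slash = urljoin(base, "wireless-mouse")
with_slash = urljoin(base + "/", "wireless-mouse")

# Already-absolute href: returned unchanged.
absolute = urljoin(base, "https://cdn.example.net/item/42")

# Empty href: the base URL itself comes back.
empty = urljoin(base, "")

print(root_relative)  # https://example.com/products/wireless-mouse
print(no_slash)       # https://example.com/wireless-mouse
print(with_slash)     # https://example.com/products/wireless-mouse
print(absolute)       # https://cdn.example.net/item/42
print(empty)          # https://example.com/products
```

The last case is worth remembering: a card with an empty href resolves to the listing page, not to a blank value.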

Step 3: Save Results to CSV

Python’s standard library already includes CSV support, so we do not need another dependency.

    import csv


    def save_products_to_csv(products: list[dict], filename: str) -> None:
        fieldnames = ["title", "price", "url"]

        # newline="" lets the csv module control line endings itself.
        with open(filename, mode="w", newline="", encoding="utf-8") as file:
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(products)
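
Before trusting the output, it can help to read the CSV back and compare it with what was written. The sketch below does a round trip through an in-memory buffer instead of a file, so it has no side effects. The two product rows are hypothetical and only mimic the shape parse_products() returns.

```python
import csv
import io

# Hypothetical rows; the second title contains a comma and a quote on purpose.
products = [
    {"title": "Wireless Mouse", "price": "$24.99",
     "url": "https://example.com/products/wireless-mouse"},
    {"title": 'Monitor, 27" curved', "price": "",
     "url": "https://example.com/products/monitor-27"},
]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["title", "price", "url"])
writer.writeheader()
writer.writerows(products)

# Rewind and parse the text we just produced.
buffer.seek(0)
rows = list(csv.DictReader(buffer))

# Simple data checks: nothing lost in the round trip, and a list of gaps.
round_trip_ok = rows == products
missing_price = [row["title"] for row in rows if not row["price"]]

print(round_trip_ok)
print(missing_price)
```

The csv module quotes the awkward title for us, which is the main reason to prefer it over joining strings with commas.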

Now combine everything into one working script:

    import csv
    from urllib.parse import urljoin

    from bs4 import BeautifulSoup
    from playwright.sync_api import sync_playwright


    def fetch_rendered_html(url: str) -> str:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page(
                user_agent=(
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/124.0 Safari/537.36"
                )
            )

            page.goto(url, wait_until="networkidle", timeout=30000)
            page.wait_for_selector(".product-card", timeout=10000)

            html = page.content()
            browser.close()

            return html


    def clean_text(value: str | None) -> str:
        if not value:
            return ""
        return " ".join(value.split())


    def parse_products(html: str, base_url: str) -> list[dict]:
        soup = BeautifulSoup(html, "lxml")
        cards = soup.select(".product-card")

        products = []

        for card in cards:
            title_element = card.select_one(".product-card__title")
            price_element = card.select_one(".product-card__price")
            link_element = card.select_one(".product-card__link")

            title = clean_text(title_element.get_text() if title_element else None)
            price = clean_text(price_element.get_text() if price_element else None)

            relative_url = link_element.get("href") if link_element else None
            # urljoin() returns base_url for an empty href, so only join real links.
            product_url = urljoin(base_url, relative_url) if relative_url else ""

            if not title:
                continue

            products.append({
                "title": title,
                "price": price,
                "url": product_url,
            })

        return products


    def save_products_to_csv(products: list[dict], filename: str) -> None:
        fieldnames = ["title", "price", "url"]

        with open(filename, mode="w", newline="", encoding="utf-8") as file:
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(products)


    def main() -> None:
        url = "https://example.com/products"

        html = fetch_rendered_html(url)
        products = parse_products(html, base_url=url)
        save_products_to_csv(products, "products.csv")

        print(f"Saved {len(products)} products to products.csv")


    if __name__ == "__main__":
        main()

Step 4: Add Pagination Support

Most useful scraping tasks involve more than one page. A simple pattern is to scrape page URLs in a loop:

    def build_page_url(base_url: str, page_number: int) -> str:
        return f"{base_url}?page={page_number}"


    def scrape_multiple_pages(base_url: str, total_pages: int) -> list[dict]:
        all_products = []

        for page_number in range(1, total_pages + 1):
            page_url = build_page_url(base_url, page_number)
            print(f"Scraping {page_url}")

            html = fetch_rendered_html(page_url)
            products = parse_products(html, base_url=page_url)
            all_products.extend(products)

        return all_products
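
The f-string in build_page_url() assumes base_url has no query string yet. If the listing URL already carries parameters such as a sort order, a variant along these lines keeps them intact. The param argument and the sample URLs are assumptions for illustration; a real site may use a different parameter name or paginate with path segments instead.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_url(base_url: str, page_number: int, param: str = "page") -> str:
    """Return base_url with the page parameter set, keeping other parameters."""
    parts = urlsplit(base_url)

    # Drop any existing page parameter so it is not duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, str(page_number)))

    return urlunsplit(parts._replace(query=urlencode(query)))


print(build_page_url("https://example.com/products", 2))
# https://example.com/products?page=2
print(build_page_url("https://example.com/products?sort=price", 3))
# https://example.com/products?sort=price&page=3
print(build_page_url("https://example.com/products?page=1&sort=price", 4))
# https://example.com/products?sort=price&page=4
```

For a base URL without a query string, the result is identical to the simple f-string version.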

Then update main():

    def main() -> None:
        base_url = "https://example.com/products"

        products = scrape_multiple_pages(base_url, total_pages=5)
        save_products_to_csv(products, "products.csv")

        print(f"Saved {len(products)} products to products.csv")

This approach is easy to understand, but it opens and closes a browser for every page because fetch_rendered_html() manages the browser internally. For small scraping jobs, that is fine. For larger jobs, you should reuse the same browser instance.

Step 5: Reuse the Browser for Better Performance

Here is a more efficient version that keeps one browser open while visiting multiple pages:

    def scrape_pages_with_single_browser(base_url: str, total_pages: int) -> list[dict]:
        all_products = []

        with sync_playwright() as p:
            # Launch Chromium once and reuse the same page for every URL.
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()

            for page_number in range(1, total_pages + 1):
                page_url = f"{base_url}?page={page_number}"
                print(f"Scraping {page_url}")

                page.goto(page_url, wait_until="networkidle", timeout=30000)
                page.wait_for_selector(".product-card", timeout=10000)

                html = page.content()
                products = parse_products(html, base_url=page_url)
                all_products.extend(products)

            browser.close()

        return all_products

This version is faster because browser startup is expensive. Starting Chromium once and reusing a page is usually enough for simple pagination.

Handling Common Scraping Problems

Web scraping becomes easier when you plan for failure. Here are common issues and practical fixes:

  • Selector not found: The page structure may have changed. Inspect the page and update selectors like .product-card.
  • Timeouts: Increase the timeout, or wait for a more reliable selector.
  • Empty results: Confirm that the content is visible in page.content(). Some data may come from an API instead of HTML.
  • Relative links: Use urljoin() instead of manually concatenating strings.
  • Duplicate rows: Deduplicate by product URL before saving.

A simple deduplication helper looks like this:

    def deduplicate_products(products: list[dict]) -> list[dict]:
        seen_urls = set()
        unique_products = []

        for product in products:
            url = product["url"]

            # Keep only the first product seen for each URL.
            if url in seen_urls:
                continue

            seen_urls.add(url)
            unique_products.append(product)

        return unique_products

Use it before writing the CSV:

    products = deduplicate_products(products)
    save_products_to_csv(products, "products.csv")
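
For the timeout case in the list above, raising the limit is not the only option: a failed page can also be retried after a pause. The helper below is a minimal sketch of that idea, not part of Playwright. The retry() name, its parameters, and the doubling delay are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action() until it succeeds, doubling the pause after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # Narrow this to the error you expect.
            if attempt == attempts:
                raise  # Out of attempts: let the caller see the failure.
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.1f}s")
            sleep(delay)

    raise AssertionError("unreachable")
```

A call such as retry(lambda: fetch_rendered_html(page_url)) would wrap the article's fetch function. In real use it is probably better to catch only Playwright's timeout error (importable as TimeoutError from playwright.sync_api) so that genuine bugs are not retried. The sleep parameter exists only so the pause can be replaced in tests.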

Respect Robots, Rate Limits, and Terms

A scraper should not behave like a denial-of-service tool. Even when you are writing a small script, keep it polite:

  • Check the website’s robots.txt and terms of use before scraping.
  • Do not scrape private, paywalled, or logged-in areas without permission.
  • Add delays between requests for larger jobs.
  • Cache results while developing so you do not reload the same page repeatedly.
  • Prefer official APIs when they are available.
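
Python's standard library can check robots.txt rules through urllib.robotparser. The sketch below parses robots.txt text directly so that it runs without network access. The sample rules and the "product-scraper" user agent name are invented for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content for the example site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Crawl-delay: 2
"""

robots = build_robots_checker(SAMPLE_ROBOTS_TXT)

listing_allowed = robots.can_fetch("product-scraper", "https://example.com/products?page=2")
account_allowed = robots.can_fetch("product-scraper", "https://example.com/account/orders")
delay_seconds = robots.crawl_delay("product-scraper")

print(listing_allowed)  # True
print(account_allowed)  # False
print(delay_seconds)    # 2
```

Against a live site, the same class can download the file itself with set_url() followed by read(). Note that robots.txt is advisory and does not replace reading the terms of use.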

You can add a simple delay between pages:

    import time

    for page_number in range(1, total_pages + 1):
        # scrape page here
        time.sleep(2)
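
The caching advice from the list above can be sketched with a small disk cache keyed by URL. This is one possible approach under simple assumptions: the cached_fetch() name is hypothetical, entries never expire, and the fetch function is passed in so the example can use a stand-in instead of a real browser.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch: Callable[[str], str], cache_dir: Path) -> str:
    """Return cached HTML for url if present; otherwise fetch and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Hash the URL so that any URL maps to a safe, fixed-length filename.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.html"

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    html = fetch(url)
    cache_file.write_text(html, encoding="utf-8")
    return html


# Demo with a stand-in fetch function that records how often it is called.
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return "<html><body>rendered page</body></html>"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "html-cache"
    first = cached_fetch("https://example.com/products", fake_fetch, cache)
    second = cached_fetch("https://example.com/products", fake_fetch, cache)

print(len(calls))  # 1
```

During development, passing fetch_rendered_html as the fetch argument means selector changes can be tested against saved HTML without visiting the site again. Delete the cache directory when fresh data is needed.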

When Not to Use Playwright

Playwright is powerful, but it is heavier than plain HTTP. Use requests when the HTML is already present in the server response. Use Playwright when:

  • the page depends on JavaScript rendering,
  • content appears after scrolling or clicking,
  • you need to interact with filters, forms, or buttons,
  • the site behaves differently from a basic HTTP request.

A good workflow is to start simple. Try requests first. If the content is missing, move to Playwright.
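
One way to decide whether the content is missing is to count the target elements in the raw server response. The sketch below does that with the standard library's HTML parser. Both HTML samples are invented: the first imitates a server-rendered page, the second a JavaScript shell that fills in its content later.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements_with_class(html: str, class_name: str) -> int:
    counter = ClassCounter(class_name)
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical responses: one rendered on the server, one rendered by JavaScript.
static_html = (
    '<main><article class="product-card featured">'
    "<h2>Wireless Mouse</h2></article></main>"
)
js_shell = '<div id="root"></div><script src="/static/app.js"></script>'

print(count_elements_with_class(static_html, "product-card"))  # 1
print(count_elements_with_class(js_shell, "product-card"))     # 0
```

In practice the HTML would come from a plain HTTP request, for example requests.get(url, timeout=10).text. A count of zero for a selector that is visible in the browser suggests the page needs Playwright.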

Conclusion

Python web scraping is most reliable when you separate browser automation from HTML parsing. Playwright handles JavaScript and page loading. BeautifulSoup handles extraction. CSV writing keeps the result usable for spreadsheets, data checks, or later imports.

The important habit is not memorizing selectors. It is building scrapers that are readable, defensive, and easy to adjust when the target page changes. Start with one page, validate the extracted data, add pagination, then improve performance only when the basic scraper is correct.

