Python Web Scraping in Practice: Build a Small Sitemap Crawler

Most web scraping tutorials start by extracting data from one page. Real projects usually need one extra step first: discovering which pages exist. In this hands-on guide, you will build a small Python crawler that starts from one URL, follows internal links, respects basic crawl limits, and exports discovered pages to a CSV file.

This is useful for junior and mid-level developers who need to audit a small site, collect URLs before scraping product pages, or check whether important pages are reachable from the homepage.

What We Will Build

The crawler will:

  • Start from a single URL.
  • Fetch HTML pages with httpx.
  • Parse links with BeautifulSoup.
  • Follow only internal links from the same domain.
  • Avoid crawling the same URL twice.
  • Limit the crawl depth and total page count.
  • Save results to pages.csv.

Install the required packages. The type hints in this guide use the str | None syntax, which needs Python 3.10 or newer:

pip install httpx beautifulsoup4

Project Structure

Create a new folder and add one file:

    crawler/
    └── sitemap_crawler.py

The full crawler will stay in one file so it is easy to run, read, and modify.

Step 1: Normalize and Filter URLs

Before crawling, you need clean URLs. The same page can appear with fragments, trailing slashes, or relative paths. Python’s urllib.parse module helps us resolve and normalize links.

    from urllib.parse import urljoin, urlparse, urldefrag


    def normalize_url(base_url: str, href: str) -> str | None:
        """
        Convert a raw href into an absolute URL and remove fragments.

        Example:
            base_url = "https://example.com/docs/"
            href = "../about#team"
            result = "https://example.com/about"
        """
        if not href:
            return None

        absolute_url = urljoin(base_url, href)
        clean_url, _fragment = urldefrag(absolute_url)
        parsed = urlparse(clean_url)

        if parsed.scheme not in {"http", "https"}:
            return None

        return clean_url.rstrip("/")

The function does four practical things. First, it turns relative links into absolute links. Second, it removes fragments such as #pricing, because they usually point to sections of the same page. Third, it rejects unsupported schemes such as mailto:, tel:, and javascript:. Fourth, it strips trailing slashes, so /about and /about/ are treated as the same page.
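To see what the standard-library calls do on their own, here is a minimal sketch that resolves a few sample hrefs. The base URL, the sample hrefs, and the helper name resolve are made up for illustration and are not part of the crawler.

```python
from urllib.parse import urldefrag, urljoin, urlparse

# Hypothetical page on which the links were found.
BASE_URL = "https://example.com/docs/"
# Made-up hrefs covering relative paths, fragments, and a non-HTTP scheme.
SAMPLE_HREFS = ["../about#team", "guide/", "/pricing#plans", "mailto:hi@example.com"]


def resolve(href: str) -> tuple[str, str]:
    """Return (absolute URL without fragment, scheme) for one raw href."""
    absolute_url = urljoin(BASE_URL, href)
    clean_url, _fragment = urldefrag(absolute_url)
    return clean_url, urlparse(clean_url).scheme


resolved = {href: resolve(href) for href in SAMPLE_HREFS}
for href, (url, scheme) in resolved.items():
    print(f"{href} -> {url} (scheme: {scheme})")
```

The mailto: link keeps its own scheme after urljoin, which is why the scheme check in normalize_url is enough to drop it.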

Step 2: Extract Links from HTML

Next, create a function that receives HTML and returns all valid links found in <a href="..."> tags.

    from bs4 import BeautifulSoup


    def extract_links(base_url: str, html: str) -> set[str]:
        soup = BeautifulSoup(html, "html.parser")
        links = set()

        for tag in soup.find_all("a", href=True):
            normalized = normalize_url(base_url, tag["href"])
            if normalized:
                links.add(normalized)

        return links

A set is used because the same link may appear multiple times on a page, especially in menus, footers, and call-to-action blocks.

Step 3: Keep the Crawler on One Domain

A safe crawler should not accidentally wander across the web. If you start on your company blog, you probably do not want to crawl GitHub, YouTube, LinkedIn, and every documentation site linked from your posts.

    def is_internal_url(start_url: str, candidate_url: str) -> bool:
        start_domain = urlparse(start_url).netloc
        candidate_domain = urlparse(candidate_url).netloc
        return start_domain == candidate_domain

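This simple check compares the netloc of both URLs, that is, the host plus any explicit port. Only exact matches pass, so www.example.com and example.com count as different sites. For many internal tools, this is enough. In production, you may also want to handle subdomains, such as allowing docs.example.com when starting from www.example.com.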
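One possible way to allow subdomains is to pass the root domain explicitly and accept any host that equals it or ends with it. This is a sketch, not part of the crawler above: the function name is_same_site and the root_domain parameter are assumptions made for illustration.

```python
from urllib.parse import urlparse


def is_same_site(candidate_url: str, root_domain: str) -> bool:
    """Hypothetical variant: accept root_domain and any of its subdomains."""
    # .hostname drops the port and lowercases the host; it is None for
    # URLs without a network location, such as mailto: links.
    host = urlparse(candidate_url).hostname or ""
    root = root_domain.lower()
    # The leading dot prevents "notexample.com" from matching "example.com".
    return host == root or host.endswith("." + root)


for url in [
    "https://docs.example.com/guide",
    "https://www.example.com:8443/",
    "https://notexample.com/",
]:
    print(url, is_same_site(url, "example.com"))
```

Passing the root domain by hand avoids guessing it from the start URL, which is hard to do correctly for domains such as example.co.uk.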

Step 4: Fetch Pages Safely

Now add a function for downloading one page. It should handle network errors, skip non-HTML responses, and avoid crashing the whole crawl because one URL is broken.

    import httpx


    def fetch_html(url: str, timeout: float = 10.0) -> str | None:
        headers = {
            "User-Agent": "SmallSitemapCrawler/1.0"
        }

        try:
            response = httpx.get(
                url,
                headers=headers,
                timeout=timeout,
                follow_redirects=True
            )
            response.raise_for_status()

            content_type = response.headers.get("content-type", "")
            if "text/html" not in content_type:
                return None

            return response.text

        except httpx.HTTPError as exc:
            print(f"Failed: {url} ({exc})")
            return None

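The User-Agent identifies your script. The crawler also follows redirects, which is useful when a site redirects from http to https, or from old URLs to new ones. The content-type check skips PDFs, images, and other non-HTML responses, and catching httpx.HTTPError covers both network failures and the 4xx and 5xx responses raised by raise_for_status().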

Step 5: Build the Crawl Queue

The crawler needs a queue of URLs to visit. Each queue item stores the URL and its depth from the starting page. Depth 0 means the start URL, depth 1 means links found on the start page, and so on.

    from collections import deque
    from dataclasses import dataclass


    @dataclass
    class CrawledPage:
        url: str
        depth: int
        title: str | None


    def get_page_title(html: str) -> str | None:
        soup = BeautifulSoup(html, "html.parser")
        if soup.title and soup.title.string:
            return soup.title.string.strip()
        return None


    def crawl_site(
        start_url: str,
        max_pages: int = 50,
        max_depth: int = 2
    ) -> list[CrawledPage]:
        start_url = start_url.rstrip("/")
        visited = set()
        results = []
        queue = deque()
        queue.append((start_url, 0))

        while queue and len(results) < max_pages:
            current_url, depth = queue.popleft()

            if current_url in visited:
                continue

            if depth > max_depth:
                continue

            visited.add(current_url)
            print(f"Crawling depth={depth}: {current_url}")

            html = fetch_html(current_url)
            if html is None:
                continue

            title = get_page_title(html)
            results.append(
                CrawledPage(
                    url=current_url,
                    depth=depth,
                    title=title
                )
            )

            links = extract_links(current_url, html)
            for link in sorted(links):
                if link not in visited and is_internal_url(start_url, link):
                    queue.append((link, depth + 1))

        return results

This function uses breadth-first crawling. That means it visits pages close to the homepage before going deeper. For site audits, this is usually better than diving deeply into one branch of the site immediately.
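The visiting order is easier to see without any network calls. The sketch below runs the same queue logic over a small in-memory link graph. The SITE dictionary and the crawl_order helper are invented for this illustration only.

```python
from collections import deque

# Hypothetical link graph standing in for a real site: page -> links on that page.
SITE = {
    "/": ["/blog", "/pricing"],
    "/blog": ["/blog/post-1", "/"],
    "/pricing": ["/contact"],
    "/blog/post-1": ["/blog/post-1/comments"],
    "/contact": [],
    "/blog/post-1/comments": [],
}


def crawl_order(start: str, max_depth: int) -> list[tuple[str, int]]:
    """Return (page, depth) pairs in the order a breadth-first crawl visits them."""
    visited = set()
    order = []
    queue = deque([(start, 0)])

    while queue:
        page, depth = queue.popleft()
        # Skip pages already seen and pages beyond the depth limit.
        if page in visited or depth > max_depth:
            continue
        visited.add(page)
        order.append((page, depth))
        for link in SITE.get(page, []):
            if link not in visited:
                queue.append((link, depth + 1))

    return order


print(crawl_order("/", max_depth=2))
```

With max_depth=2, both depth-1 pages are visited before any depth-2 page, and /blog/post-1/comments at depth 3 is dequeued but never visited.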

Step 6: Export Results to CSV

Once the crawl is complete, write the results to a CSV file. This makes the output easy to open in Excel, Google Sheets, or another script.

    import csv


    def save_to_csv(pages: list[CrawledPage], filename: str = "pages.csv") -> None:
        with open(filename, "w", newline="", encoding="utf-8") as file:
            writer = csv.DictWriter(
                file,
                fieldnames=["url", "depth", "title"]
            )
            writer.writeheader()

            for page in pages:
                writer.writerow({
                    "url": page.url,
                    "depth": page.depth,
                    "title": page.title or ""
                })

Step 7: Add a Command-Line Entry Point

Finally, add a simple main block so the crawler can run from the terminal.

    if __name__ == "__main__":
        START_URL = "https://example.com"

        pages = crawl_site(
            start_url=START_URL,
            max_pages=50,
            max_depth=2
        )

        save_to_csv(pages)
        print(f"Saved {len(pages)} pages to pages.csv")

Replace https://example.com with a real site you are allowed to crawl, then run:

python sitemap_crawler.py

After the script finishes, you should see a pages.csv file containing discovered URLs, crawl depth, and page titles.
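If you want to process the output in another script, a minimal sketch like the following reads the same three columns back and counts pages per depth. The helper name count_pages_by_depth and the sample rows are assumptions for illustration; the sample stands in for a real pages.csv.

```python
import csv
import io
from collections import Counter


def count_pages_by_depth(csv_file) -> Counter:
    """Count rows per crawl depth in a file shaped like pages.csv."""
    reader = csv.DictReader(csv_file)
    # CSV values are strings, so depth is converted back to an int.
    return Counter(int(row["depth"]) for row in reader)


# Made-up rows with the url, depth, title header written by save_to_csv.
sample = io.StringIO(
    "url,depth,title\n"
    "https://example.com,0,Home\n"
    "https://example.com/about,1,About\n"
    "https://example.com/blog,1,\n"
)

depth_counts = count_pages_by_depth(sample)
print(dict(depth_counts))
```

To run it on real output, pass open("pages.csv", newline="", encoding="utf-8") instead of the in-memory sample.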

Complete Code

    import csv
    from collections import deque
    from dataclasses import dataclass
    from urllib.parse import urljoin, urlparse, urldefrag

    import httpx
    from bs4 import BeautifulSoup


    @dataclass
    class CrawledPage:
        url: str
        depth: int
        title: str | None


    def normalize_url(base_url: str, href: str) -> str | None:
        if not href:
            return None

        absolute_url = urljoin(base_url, href)
        clean_url, _fragment = urldefrag(absolute_url)
        parsed = urlparse(clean_url)

        if parsed.scheme not in {"http", "https"}:
            return None

        return clean_url.rstrip("/")


    def extract_links(base_url: str, html: str) -> set[str]:
        soup = BeautifulSoup(html, "html.parser")
        links = set()

        for tag in soup.find_all("a", href=True):
            normalized = normalize_url(base_url, tag["href"])
            if normalized:
                links.add(normalized)

        return links


    def is_internal_url(start_url: str, candidate_url: str) -> bool:
        return urlparse(start_url).netloc == urlparse(candidate_url).netloc


    def fetch_html(url: str, timeout: float = 10.0) -> str | None:
        headers = {
            "User-Agent": "SmallSitemapCrawler/1.0"
        }

        try:
            response = httpx.get(
                url,
                headers=headers,
                timeout=timeout,
                follow_redirects=True
            )
            response.raise_for_status()

            content_type = response.headers.get("content-type", "")
            if "text/html" not in content_type:
                return None

            return response.text

        except httpx.HTTPError as exc:
            print(f"Failed: {url} ({exc})")
            return None


    def get_page_title(html: str) -> str | None:
        soup = BeautifulSoup(html, "html.parser")
        if soup.title and soup.title.string:
            return soup.title.string.strip()
        return None


    def crawl_site(
        start_url: str,
        max_pages: int = 50,
        max_depth: int = 2
    ) -> list[CrawledPage]:
        start_url = start_url.rstrip("/")
        visited = set()
        results = []
        queue = deque([(start_url, 0)])

        while queue and len(results) < max_pages:
            current_url, depth = queue.popleft()

            if current_url in visited:
                continue

            if depth > max_depth:
                continue

            visited.add(current_url)
            print(f"Crawling depth={depth}: {current_url}")

            html = fetch_html(current_url)
            if html is None:
                continue

            results.append(
                CrawledPage(
                    url=current_url,
                    depth=depth,
                    title=get_page_title(html)
                )
            )

            links = extract_links(current_url, html)
            for link in sorted(links):
                if link not in visited and is_internal_url(start_url, link):
                    queue.append((link, depth + 1))

        return results


    def save_to_csv(pages: list[CrawledPage], filename: str = "pages.csv") -> None:
        with open(filename, "w", newline="", encoding="utf-8") as file:
            writer = csv.DictWriter(
                file,
                fieldnames=["url", "depth", "title"]
            )
            writer.writeheader()

            for page in pages:
                writer.writerow({
                    "url": page.url,
                    "depth": page.depth,
                    "title": page.title or ""
                })


    if __name__ == "__main__":
        START_URL = "https://example.com"

        pages = crawl_site(
            start_url=START_URL,
            max_pages=50,
            max_depth=2
        )

        save_to_csv(pages)
        print(f"Saved {len(pages)} pages to pages.csv")

Practical Improvements

This crawler is intentionally small, but it gives you a strong base. Here are practical improvements you can add next:

  • Add a delay between requests with time.sleep() to reduce server load.
  • Read and respect robots.txt before crawling a site.
  • Store HTTP status codes, canonical URLs, and meta descriptions.
  • Skip URLs containing filters, tracking parameters, or logout paths.
  • Use async requests with httpx.AsyncClient for controlled parallel crawling.
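Three of the improvements above (a delay between requests, a robots.txt check, and skipping tracking or logout URLs) can be sketched with the standard library alone. Everything here is illustrative: the robots.txt content, the TRACKING_PARAMS and SKIP_PATH_PARTS lists, the delay value, and the helper names are assumptions, and a real crawl would download robots.txt from the target site first.

```python
import time
from urllib.parse import parse_qsl, urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "SmallSitemapCrawler/1.0"
REQUEST_DELAY_SECONDS = 0.1  # assumed value; use a longer delay on real sites
# Assumed examples of query parameters and path fragments worth skipping.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}
SKIP_PATH_PARTS = ("/logout", "/signout")


def load_robots(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that was fetched separately."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_skip(url: str) -> bool:
    """Return True for logout-style paths and URLs carrying tracking parameters."""
    parsed = urlparse(url)
    if any(part in parsed.path.lower() for part in SKIP_PATH_PARTS):
        return True
    query_keys = {
        key.lower()
        for key, _value in parse_qsl(parsed.query, keep_blank_values=True)
    }
    return bool(query_keys & TRACKING_PARAMS)


def may_crawl(robots: RobotFileParser, url: str) -> bool:
    """Combine the robots.txt rule check with the local skip filter."""
    return robots.can_fetch(USER_AGENT, url) and not should_skip(url)


# Made-up robots.txt content for the demonstration.
robots = load_robots("User-agent: *\nDisallow: /admin/\n")

for url in [
    "https://example.com/blog",
    "https://example.com/admin/users",
    "https://example.com/blog?utm_source=newsletter",
    "https://example.com/logout",
]:
    print(url, may_crawl(robots, url))
    time.sleep(REQUEST_DELAY_SECONDS)  # pause between iterations, as a crawl would
```

In the crawler, a check like may_crawl would sit next to is_internal_url before a link is queued, and the sleep would follow each call to fetch_html.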

Conclusion

A good scraper often starts with good URL discovery. This small Python crawler teaches the core pattern: fetch a page, parse links, filter URLs, track visited pages, and export clean results. Before using it on any website, confirm that crawling is allowed, keep request volume low, and avoid collecting data you do not need. From here, you can extend the same foundation into an SEO audit tool, broken-link checker, documentation crawler, or targeted web scraping pipeline.

