Defensive Scraping: Writing Resilient Selectors That Survive Site Redesigns


Technical Talk for Developers

It’s Monday morning. You sit down with your coffee, check your logs, and see a wall of red. Your primary data pipeline, which has been running smoothly for months, just crashed with AttributeError: 'NoneType' object has no attribute 'text'. You inspect the target website only to find that the developers swapped a <div> for a <section> or added a marketing banner that shifted every element’s index.

This is the "brittle scraper" trap. Most developers write extraction logic based on where an element sits in the layout rather than what the element represents. Defensive scraping flips the script: we write selectors that are decoupled from the visual layout and anchored to stable, semantic identifiers.

This guide covers strategies to build resilient scrapers that survive site updates, A/B tests, and framework-driven DOM changes.

The Hierarchy Trap: Why Scrapers Break

The most common reason scrapers fail is "DOM Coupling." This occurs when code relies on the exact path of HTML tags to find data. If you have ever used the "Copy XPath" feature in Chrome DevTools and ended up with something like /html/body/div[2]/div/div[3]/span[1], you have created a ticking time bomb.

Modern frontend development has made this problem worse in three ways:

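  1. Dynamic Class Names: CSS-in-JS libraries like styled-components or Emotion often generate hashed classes like css-1q2w3, which can change every time the site is redeployed. Utility classes from frameworks like Tailwind CSS (mt-4, for example) are not hashed, but they describe styling rather than meaning, so they change whenever the design is adjusted.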

  2. A/B Testing: Sites often serve different versions of a page to different users. A small layout change for a "Buy Now" button test can break an index-based selector.

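  3. Shadow DOM and Nesting: Deeply nested structures are prone to shifting. Adding a single wrapper div for an analytics script can break a global XPath. Content inside a shadow root is also out of reach for XPath; in Playwright, CSS and role-based locators pierce open shadow roots, but XPath does not.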

To build a resilient scraper, move away from "DOM coordinates" and toward "Semantic Anchors."
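
To make the difference concrete, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree on two hypothetical, well-formed snippets (a real page would usually need an HTML parser or a browser). The markup, the data-testid value, and the extract helper are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup before and after a redesign that adds a banner div.
BEFORE = """<html><body>
  <div><span data-testid="price">$149</span></div>
</body></html>"""

AFTER = """<html><body>
  <div class="banner"><span>Sale!</span></div>
  <div><span data-testid="price">$149</span></div>
</body></html>"""

POSITIONAL = "./body/div[1]/span[1]"        # "DOM coordinates"
SEMANTIC = ".//span[@data-testid='price']"  # "semantic anchor"


def extract(markup, path):
    """Return the text of the first node matching path, or None."""
    node = ET.fromstring(markup).find(path)
    return node.text if node is not None else None


for label, markup in (("before", BEFORE), ("after", AFTER)):
    print(label, extract(markup, POSITIONAL), extract(markup, SEMANTIC))
```

After the banner is added, the positional path silently returns "Sale!" instead of the price, while the attribute-based path still returns "$149". The silent wrong value is arguably worse than a crash, because nothing in the logs flags it.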

Strategy 1: Prioritize Semantic Locators

The most stable parts of a webpage are the attributes meant for machines or accessibility. Developers rarely change these, as doing so would break their own internal tests or make the site unusable for screen readers.

When selecting elements, follow this hierarchy of reliability:

  1. Test IDs: Attributes like data-testid or data-cy are added specifically for automated testing. They are the gold standard for scraping.

  2. Accessibility Attributes: Attributes like aria-label or role="button" are highly stable.

  3. Machine Names: The name attribute on inputs or id attributes, provided they aren't auto-generated.
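
One way to apply this hierarchy is a fallback chain that tries the most stable selector first. The sketch below assumes a Playwright page object (it only relies on page.locator() and Locator.count()); the selector values and the first_match helper are hypothetical.

```python
# Candidate selectors for a hypothetical "add to cart" button, most stable first.
CANDIDATES = [
    "[data-testid='add-to-cart']",  # 1. test ID
    "[aria-label='Add to cart']",   # 2. accessibility attribute
    "button[name='add-to-cart']",   # 3. machine name
]


def first_match(page, selectors):
    """Return (selector, locator) for the first selector that matches anything.

    Returns (None, None) when no candidate matches.
    """
    for selector in selectors:
        locator = page.locator(selector)
        # count() does not wait, so call this only after the page has loaded
        if locator.count() > 0:
            return selector, locator
    return None, None
```

Returning the selector that matched makes it possible to log when the scraper has fallen back to a lower tier, which is often an early sign that the site is changing.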

Using Substring Matching

If IDs or classes are partially dynamic (e.g., user_123, user_456), avoid matching the whole string. Instead, use CSS substring selectors:

# Selects any element where the ID starts with 'user_'
selector = "div[id^='user_']"

# Selects any element where the class contains 'product-card' 
# even if other dynamic classes are present
selector = "div[class*='product-card']"
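
Substring matching has a cost: *= matches anywhere in the attribute value, so it can also catch unrelated classes. The sketch below approximates, in plain Python, how three CSS attribute operators compare a value (assuming a non-empty search string); the attr_matches helper and the class names are hypothetical.

```python
def attr_matches(value: str, op: str, needle: str) -> bool:
    """Approximate how CSS attribute operators compare an attribute value."""
    if op == "^=":  # prefix match
        return value.startswith(needle)
    if op == "*=":  # substring match anywhere in the value
        return needle in value
    if op == "~=":  # exact match against one whitespace-separated token
        return needle in value.split()
    raise ValueError(f"unsupported operator: {op}")


for classes in ("product-card css-x92k", "product-card-skeleton", "related-product-cards"):
    print(
        classes,
        attr_matches(classes, "*=", "product-card"),
        attr_matches(classes, "~=", "product-card"),
    )
```

Only the first class string matches under ~=, while all three match under *=. When the stable part of the class is a whole token, [class~='product-card'] (or simply .product-card) is the tighter choice; keep *= for cases where the stable part is embedded in a longer generated name.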

Strategy 2: Text and Visual Matching

If a site lacks clean data attributes, the next best option is to look at the page like a user does. A "Login" button will almost always contain the text "Login," regardless of whether it is a <button>, an <a>, or a <div>.

Using Playwright’s locator API, you can target elements by their visible text, which is often more stable than the underlying HTML structure.

# Brittle: Targeting by class
page.locator(".btn-v2-blue").click()

# Robust: Targeting by the 'human' view
page.get_by_role("button", name="Add to Cart").click()

While text selection is powerful, be cautious with multi-language sites. If the site detects your IP and switches to Spanish, a search for "Add to Cart" will fail. In these cases, combine text matching with language-agnostic attributes.
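
One possible way to soften the language problem is to match the accessible name against a pattern that covers the translations you expect. The sketch below assumes a Playwright page object; the translated labels and the add_to_cart_button helper are illustrative assumptions, not values taken from any real site.

```python
import re

# Assumed translations of the button label; extend for the locales you actually see.
LABELS = ["Add to Cart", "Añadir al carrito", "In den Warenkorb"]
LABEL_PATTERN = re.compile("|".join(re.escape(label) for label in LABELS), re.IGNORECASE)


def add_to_cart_button(page):
    """Locate the button by role, accepting any of the known labels."""
    # get_by_role accepts a compiled pattern for the accessible name
    return page.get_by_role("button", name=LABEL_PATTERN)
```

It may also help to request a fixed language up front, for example by creating the browser context with browser.new_context(locale="en-US"), although a site that localizes purely by IP address can still ignore that setting.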

Strategy 3: The Anchor Point Strategy

The "Anchor Point" strategy involves finding a highly stable element (the Anchor) and navigating the DOM relative to it. This is particularly useful for scraping "Key-Value" pairs in tables or description lists.

Imagine a product specification table where the order of rows might change:

Spec      Value
Weight    1.2kg
Color     Space Gray

Instead of requesting the second row, find the text "Weight" (the Anchor) and then find its sibling.

Implementation with Playwright

Use the :has() pseudo-class or filter by inner locators to create a local scope.

# Find the row that contains the text "Weight" (has_text is a
# case-insensitive substring match), then read the second cell in that row
weight_value = page.locator("tr").filter(has_text="Weight").locator("td").nth(1).inner_text()

By using .filter(), you create a local scope. Even if the developers add five new rows to the table, the code will still find the "Weight" row correctly.
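
The same anchor idea works on static markup. As a minimal sketch, the standard-library xml.etree.ElementTree can read a well-formed table into a dictionary keyed by the label cell, so row order stops mattering. The table markup and the read_specs helper are hypothetical, and real-world HTML would typically need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical spec table with the rows in a different order than before.
TABLE = """<table>
  <tr><th>Spec</th><th>Value</th></tr>
  <tr><td>Color</td><td>Space Gray</td></tr>
  <tr><td>Weight</td><td>1.2kg</td></tr>
</table>"""


def read_specs(markup):
    """Map each row's first cell (the anchor) to its second cell (the value)."""
    specs = {}
    for row in ET.fromstring(markup).iter("tr"):
        cells = ["".join(cell.itertext()).strip() for cell in row.findall("td")]
        if len(cells) == 2:  # skips the header row, which uses <th>
            specs[cells[0]] = cells[1]
    return specs


specs = read_specs(TABLE)
print(specs["Weight"])
```

Looking values up by key also gives an exact match on the label, which avoids the substring ambiguity of text filters when a table contains both "Weight" and, say, "Net Weight".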

Practical Example: Refactoring Brittle to Robust

Consider this simplified HTML for a product page:

<div class="header_container">
    <div class="banner">Sale!</div>
    <section class="p-4 css-x92k">
        <h1 class="title">Wireless Headphones</h1>
        <div class="price-wrapper">
            <span class="old-price">$199</span>
            <span class="current-price">$149</span>
        </div>
        <button class="btn-primary-99">Buy Now</button>
    </section>
</div>

The Brittle Approach (Beautiful Soup)

This script relies on the exact index of the span, so it depends entirely on the order of elements in the document.

from bs4 import BeautifulSoup

html = "...(above html)..."
soup = BeautifulSoup(html, 'html.parser')

# This breaks if another span is added before the price or the spans are reordered
price = soup.find_all('span')[1].text
print(f"Price: {price}")

The Robust Approach (Playwright)

Refactoring this with defensive strategies involves finding the price relative to the "current-price" identifier or filtering the container.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example-store.com/product")

    # Strategy: Find the price wrapper, then the current price via substring
    price_locator = page.locator("[class*='price-wrapper']").locator("[class*='current-price']")

    price = price_locator.inner_text()
    print(f"Extraction Successful: {price}")
    browser.close()

Targeting the price-wrapper first limits the search area. Even if the rest of the page layout changes, the local search remains valid.

Failing Gracefully: Defensive Error Handling

Even the best selectors can fail if a site undergoes a total overhaul. Defensive scraping is about making those failures easier to fix.

Instead of letting a script crash with a generic error, implement "Soft Asserts." If a selector fails, capture the state of the page for instant debugging.

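import logging

from playwright.sync_api import Error as PlaywrightError


def safe_extract(parent, selector, name):
    """Return the text of `selector` inside `parent`, or None if extraction fails."""
    try:
        return parent.locator(selector).inner_text(timeout=5000)
    except PlaywrightError:
        # The target itself is missing, so log the parent's HTML to see what changed
        try:
            snippet = parent.evaluate("el => el.outerHTML", timeout=5000)
        except PlaywrightError:
            snippet = "<parent not found either>"
        logging.error("Failed to find %s. Parent HTML: %s", name, snippet)
        return None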

Logging the outerHTML of the parent element when a child fails is a lifesaver. It shows exactly what changed in the DOM without requiring you to manually reproduce the session in a browser.
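
A soft assert can also be applied to the extracted values themselves, so that one bad field is recorded without aborting the run. The following is a minimal sketch under stated assumptions: the SoftAsserts class, the record, and the price pattern are all hypothetical and would need adapting to your own data.

```python
import logging
import re

logger = logging.getLogger("scraper")


class SoftAsserts:
    """Collect validation failures instead of raising on the first one."""

    def __init__(self):
        self.failures = []

    def check(self, field, value, is_valid):
        """Record the field as failed if value is None or is_valid(value) is falsy."""
        ok = value is not None and bool(is_valid(value))
        if not ok:
            self.failures.append(field)
            logger.warning("Soft assert failed for %s: %r", field, value)
        return ok


# Hypothetical record, as it might look after a partial selector failure
record = {"title": "Wireless Headphones", "price": None}

checks = SoftAsserts()
checks.check("title", record["title"], lambda v: len(v.strip()) > 0)
checks.check("price", record["price"], lambda v: re.fullmatch(r"\$\d+(\.\d{2})?", v))
print(checks.failures)
```

Here only the price fails, so the run can continue and report every broken field at the end rather than stopping at the first one.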

Wrap Up

Building resilient scrapers requires a shift in mindset. Instead of asking "Where is this element?", ask "What makes this element unique?" By moving away from brittle DOM paths and embracing semantic anchors, you significantly reduce maintenance overhead.

Key Takeaways:

  • Avoid Absolute Paths: Never use auto-generated XPaths or CSS paths from DevTools.

  • Use Semantic Anchors: Prioritize data-testid, aria-labels, and stable text.

  • Scope Your Searches: Find a stable parent and search within it to avoid global layout shifts.

  • Fail Informatively: Log HTML snippets when extractions fail to speed up the refactoring process.

To take your scraping further, see our guides on Handling Dynamic Content with Playwright and Rotating Proxies to avoid detection while your selectors do their work. For a deeper look at how different proxy APIs and browser fingerprinting solutions compare, check the Proxy API and Browser Fingerprint Benchmark.
