Three Pillars of Advanced Scraping
Most Playwright tutorials stop at page.goto() and page.locator(). Production scrapers need to go deeper — intercepting the API calls that power dynamic pages, reaching into the Chromium engine via CDP, and evading the fingerprinting systems that block headless browsers within milliseconds of the first request.
This file covers three pillars:

Pillar 1 — Network Interception. page.route() intercepts every request before it leaves the browser: capture JSON from background XHR/fetch calls, mock API responses for testing, block analytics and images to speed up crawling, intercept WebSocket frames, and record or replay traffic as HAR files.

Pillar 2 — Chrome DevTools Protocol. CDP sessions reach below Playwright's high-level API for raw network events, performance metrics, JS coverage, and CPU/network/geolocation emulation.

Pillar 3 — Stealth & Anti-Bot Evasion. Detectors check navigator.webdriver, canvas fingerprints, WebGL strings, plugin arrays, mouse trajectories, timing anomalies, and IP reputation. This pillar covers baseline Chrome flags, playwright-stealth, manual fingerprint patches, human mouse simulation, proxy rotation, captcha solving, and Cloudflare bypass strategies.

Environment Setup
# Core — always required
pip install playwright
playwright install chromium  # preferred for CDP + stealth

# Stealth plugin (Python)
pip install playwright-stealth

# Node.js stealth alternative
npm install playwright-extra playwright-extra-plugin-stealth

# Optional helpers
pip install httpx faker  # proxy rotation, realistic UAs
Always Use Chromium for Advanced Scraping
Firefox and WebKit lack full CDP support. Chromium is also the only engine where playwright-stealth and manual fingerprint patches work as intended — they target Chrome's fingerprint surface, and bot detectors flag non-Chrome engines quickly regardless.
Universal Boilerplate
Start every advanced scraper with this skeleton, then layer in the patterns from each pillar. Apply stealth before goto(), register routes before navigation, open CDP session after context creation.
import asyncio
from playwright.async_api import async_playwright, Page, BrowserContext

STEALTH_ARGS = [
    "--no-sandbox",
    "--disable-blink-features=AutomationControlled",  # removes webdriver flag
    "--disable-infobars",
    "--disable-dev-shm-usage",
]

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=STEALTH_ARGS,
        )
        context: BrowserContext = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            locale="en-US",
            timezone_id="America/New_York",
        )
        page: Page = await context.new_page()
        # 1. Stealth patches (see Pillar 3)
        # 2. Route handlers (see Pillar 1)
        # 3. CDP session (see Pillar 2)
        # 4. Navigate last
        await page.goto("https://target.com", wait_until="networkidle")
        await browser.close()

asyncio.run(main())

Pillar 1 — Network Interception
page.route(pattern, handler) intercepts all matching requests before they leave the browser. The handler must call exactly one terminal method — route.continue_(), route.fulfill(), route.abort(), or route.fetch() followed by route.fulfill() — otherwise the request hangs forever.
| Goal | Method |
|---|---|
| Read-only capture, no side effects | page.on("response", ...) |
| Capture + pass through | route.fetch() → route.fulfill(response=resp) |
| Fully mock response | route.fulfill(status=200, body=...) |
| Block resource | route.abort() |
| Modify headers only | route.continue_(headers=...) |
| Redirect URL | route.continue_(url=new_url) |
| WebSocket tap | page.route_web_socket(...) |
| Full offline replay | context.route_from_har(...) |
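To make the choice concrete, the table above can be collapsed into a small dispatcher that maps each request to exactly one terminal action. This is a sketch: the pattern lists and endpoint names are illustrative, not taken from any real site.

```python
from fnmatch import fnmatchcase

BLOCKED_TYPES = {"image", "media", "font"}
MOCKED_PATTERNS = ["*/api/config*"]      # hypothetical endpoints served from canned fixtures
CAPTURED_PATTERNS = ["*/api/products*"]  # hypothetical endpoints whose JSON we harvest

def choose_action(resource_type: str, url: str) -> str:
    """Return the single terminal route method this request should get."""
    if resource_type in BLOCKED_TYPES:
        return "abort"
    if any(fnmatchcase(url, p) for p in MOCKED_PATTERNS):
        return "fulfill"            # fully mocked, never hits the network
    if any(fnmatchcase(url, p) for p in CAPTURED_PATTERNS):
        return "fetch+fulfill"      # capture the real body, then pass it through
    return "continue"               # everything else goes out untouched
```

The route handler then awaits exactly one of route.abort(), route.fulfill(), route.fetch() plus route.fulfill(), or route.continue_() based on the returned label, keeping the terminal-call guarantee in one place.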
Capturing API Responses
The most common use case: grab JSON from background XHR/fetch calls instead of scraping rendered HTML. This gives you clean, structured data directly from the source.
import json

captured_data = []

async def intercept(route, request):
    response = await route.fetch()  # fetch from the real server
    body = await response.body()
    try:
        captured_data.append(json.loads(body))
    except Exception:
        pass
    await route.fulfill(response=response)  # pass through to the page

await page.route("**/api/products*", intercept)
await page.goto("https://shop.example.com/catalog")
await page.wait_for_load_state("networkidle")
print(f"Captured {len(captured_data)} API responses")

# Simpler for read-only — no route needed, no risk of hanging
responses = []
page.on("response", lambda resp: responses.append(resp) if "api/items" in resp.url else None)
await page.goto("https://example.com")
await page.wait_for_load_state("networkidle")
for resp in responses:
    if resp.ok:
        data = await resp.json()
        print(data)

Intercepting Paginated APIs
all_pages = []

async def capture_page(route, request):
    response = await route.fetch()
    data = await response.json()
    all_pages.append(data)
    await route.fulfill(response=response)

await page.route("**/api/items*", capture_page)
await page.goto("https://example.com/items")

# Click "Load More" until it disappears
while await page.query_selector("#load-more"):
    await page.click("#load-more")
    await page.wait_for_load_state("networkidle")
print(f"Total pages captured: {len(all_pages)}")

Blocking Unwanted Resources
BLOCKED_TYPES = {"image", "media", "font", "stylesheet"}
BLOCKED_DOMAINS = {"google-analytics.com", "doubleclick.net", "hotjar.com", "facebook.com"}

async def smart_block(route, request):
    if request.resource_type in BLOCKED_TYPES:
        await route.abort()
        return
    if any(d in request.url for d in BLOCKED_DOMAINS):
        await route.abort()
        return
    await route.continue_()

await page.route("**/*", smart_block)

WebSocket Interception
Playwright 1.48+ exposes page.route_web_socket() and WebSocketRoute for full WS control — ideal for scraping real-time dashboards, live trading feeds, or chat applications.
messages = []

def ws_handler(ws_route):
    server = ws_route.connect_to_server()  # connect to the real server

    def from_server(message):
        messages.append(message)  # log inbound frames
        ws_route.send(message)    # forward to the page unchanged

    server.on_message(from_server)
    ws_route.on_message(server.send)  # forward page → server unchanged

await page.route_web_socket("wss://stream.example.com/**", ws_handler)
await page.goto("https://example.com/live")
await page.wait_for_timeout(10_000)  # collect 10s of frames
print(f"Captured {len(messages)} WS frames")

HAR Recording & Offline Replay
# RECORD — capture all API traffic to HAR
context = await browser.new_context(
    record_har_path="capture.har",
    record_har_url_filter="**/api/**",  # only API traffic
    record_har_content="attach",        # embed response bodies
)
page = await context.new_page()
await page.goto("https://example.com")
await page.wait_for_load_state("networkidle")
await context.close()  # HAR written on close

# REPLAY — serve all requests from HAR, no real network
context = await browser.new_context()
await context.route_from_har(
    "capture.har",
    url="**/api/**",
    update=False,  # strict mode — abort on miss
)
page = await context.new_page()
await page.goto("https://example.com")  # fully offline

Pillar 2 — Chrome DevTools Protocol (CDP)
CDP provides direct access to Chromium's internals below Playwright's high-level API. Open a session with page.context.new_cdp_session(page), then send commands and listen to events from any CDP domain. Always enable the domain before listening to its events.
# Open session scoped to a specific page
client = await page.context.new_cdp_session(page)
# Enable a domain before using it
await client.send("Network.enable")
# Send a command and get the result
result = await client.send("Network.getAllCookies")
cookies = result["cookies"]
# Listen to events
client.on("Network.requestWillBeSent", lambda p: print(p["request"]["url"]))
# Detach when done (optional — auto-closes with context)
await client.detach()

Network Domain — Richer Than page.route()
import base64

client = await page.context.new_cdp_session(page)
await client.send("Network.enable")

requests: dict = {}

def on_request(params):
    requests[params["requestId"]] = {
        "url": params["request"]["url"],
        "method": params["request"]["method"],
        "headers": params["request"]["headers"],
    }

def on_response(params):
    rid = params["requestId"]
    if rid in requests:
        requests[rid]["status"] = params["response"]["status"]
        requests[rid]["mime"] = params["response"]["mimeType"]

client.on("Network.requestWillBeSent", on_request)
client.on("Network.responseReceived", on_response)
await page.goto("https://example.com")

# Get a response body via CDP (bypasses Playwright's body restrictions)
async def get_body(request_id: str) -> bytes:
    result = await client.send("Network.getResponseBody", {"requestId": request_id})
    if result.get("base64Encoded"):
        return base64.b64decode(result["body"])
    return result["body"].encode()

Performance Metrics & JS Coverage
# Performance metrics
await client.send("Performance.enable")
metrics_result = await client.send("Performance.getMetrics")
metrics = {m["name"]: m["value"] for m in metrics_result["metrics"]}
print(f"DOM Content Loaded: {metrics.get('DOMContentLoaded', 0):.2f}ms")
print(f"JS Heap Used: {metrics.get('JSHeapUsedSize', 0) / 1e6:.1f} MB")

# JS coverage — which code actually ran?
await client.send("Profiler.enable")
await client.send("Profiler.startPreciseCoverage", {"callCount": True, "detailed": True})
await page.goto("https://example.com")
await page.wait_for_load_state("networkidle")
result = await client.send("Profiler.takePreciseCoverage")
for script in result["result"]:
    # Offsets live on the ranges inside each function entry
    ranges = [r for fn in script["functions"] for r in fn["ranges"]]
    total = sum(r["endOffset"] - r["startOffset"] for r in ranges)
    covered = sum(r["endOffset"] - r["startOffset"] for r in ranges if r["count"] > 0)
    pct = (covered / total * 100) if total else 0
    print(f"{script['url']}: {pct:.1f}% executed")

Emulation: CPU, Network & Geolocation
# CPU throttling — simulate a mid-range Android device (4× slowdown)
await client.send("Emulation.setCPUThrottlingRate", {"rate": 4})

# Network throttling — simulate a 3G connection
await client.send("Network.emulateNetworkConditions", {
    "offline": False,
    "downloadThroughput": 1.5 * 1024 * 1024 / 8,  # 1.5 Mbps
    "uploadThroughput": 750 * 1024 / 8,           # 750 Kbps
    "latency": 150,                               # 150ms RTT
})

# Simulate offline
await client.send("Network.emulateNetworkConditions", {
    "offline": True, "downloadThroughput": -1, "uploadThroughput": -1, "latency": 0,
})

# Override geolocation
await client.send("Emulation.setGeolocationOverride", {
    "latitude": 1.3521,  # Singapore
    "longitude": 103.8198,
    "accuracy": 100,
})

Full CDP Command Reference
| Domain | Command | Purpose |
|---|---|---|
| Network | Network.enable | Start network events |
| Network | Network.getResponseBody | Fetch body after load |
| Network | Network.setBlockedURLs | Block URL patterns |
| Network | Network.emulateNetworkConditions | Simulate slow network |
| Network | Network.getAllCookies | Read all cookies |
| Performance | Performance.getMetrics | Perf counters |
| Profiler | Profiler.startPreciseCoverage | JS coverage collection |
| Emulation | Emulation.setCPUThrottlingRate | CPU slowdown |
| Emulation | Emulation.setGeolocationOverride | Fake GPS coordinates |
| Emulation | Emulation.setDeviceMetricsOverride | Custom viewport |
| Security | Security.setIgnoreCertificateErrors | Skip TLS errors |
| Runtime | Runtime.evaluate | Run JS in page context |
| Runtime | Runtime.addBinding | Expose callback to page |
| Input | Input.dispatchKeyEvent | Synthetic keyboard |
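The Emulation and Network commands above pair naturally as reusable profiles. One convenient pattern is a preset dictionary fed to Network.emulateNetworkConditions (a sketch: the preset names and throughput numbers are illustrative approximations, not official Chrome profiles; note that CDP expects bytes per second, so advertised bit rates are divided by 8).

```python
# Hypothetical throttling presets — values are rough community approximations.
NETWORK_PRESETS = {
    "slow-3g": {"offline": False,
                "downloadThroughput": 500 * 1024 / 8,         # ~500 Kbps
                "uploadThroughput": 500 * 1024 / 8,
                "latency": 400},
    "fast-3g": {"offline": False,
                "downloadThroughput": 1.5 * 1024 * 1024 / 8,  # ~1.5 Mbps
                "uploadThroughput": 750 * 1024 / 8,
                "latency": 150},
    "offline": {"offline": True, "downloadThroughput": -1,
                "uploadThroughput": -1, "latency": 0},
}

async def apply_network_preset(client, name: str):
    """Apply a named preset to an existing CDP session."""
    await client.send("Network.emulateNetworkConditions", NETWORK_PRESETS[name])
```

Calling `await apply_network_preset(client, "slow-3g")` then re-running a scrape is a quick way to check that timeouts and retry logic hold up under degraded conditions.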
Pillar 3 — Stealth & Anti-Bot Evasion
Modern bot-detection systems check dozens of browser signals simultaneously. A headless Chromium without stealth patches is flagged within milliseconds. This section covers the full layered defence: baseline flags, the playwright-stealth library, manual fingerprint patches, human interaction simulation, proxy rotation, and captcha solving.
Threat Model — What Bot Detectors Check
| Signal | What Detectors Check | Countermeasure |
|---|---|---|
| navigator.webdriver | === true | Override to undefined before load |
| Chrome automation extension | window.chrome.app missing | Spoof window.chrome object |
| Permissions API | Returns denied for notifications | Patch API return value |
| Plugin array | Empty navigator.plugins | Inject fake plugin list (≥3 entries) |
| WebGL renderer | ANGLE headless string | Override via canvas hook |
| Canvas fingerprint | Deterministic pixel output | Add subtle random noise |
| User-Agent mismatch | UA says Windows, platform says Linux | Sync all UA fields |
| Mouse trajectory | Teleporting cursor, no mouseover | Bezier curve movement |
| IP reputation | Datacenter IP range | Residential proxy rotation |
| TLS fingerprint (JA3) | Node.js TLS differs from real Chrome | Use real Chromium binary |
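The "User-Agent mismatch" row is cheap to self-audit before deploying. The check below sketches the kind of cross-check detectors run between navigator.userAgent and navigator.platform; the rules are deliberately simplified and the platform strings are common values, not an exhaustive list.

```python
def ua_platform_consistent(user_agent: str, platform: str) -> bool:
    """Return False on the classic giveaway: the UA claims one OS while
    navigator.platform reports another (e.g. a Windows UA on a Linux host)."""
    if "Windows" in user_agent:
        return platform.startswith("Win")
    if "Mac OS X" in user_agent:
        return platform.startswith("Mac")
    if "Android" in user_agent:  # Android UAs also contain "Linux" — check first
        return platform in ("Android", "Linux armv8l", "Linux aarch64")
    if "Linux" in user_agent:
        return platform.startswith("Linux")
    return True  # unknown UA family — nothing to contradict
```

Run it against the values your stealth context actually exposes (via page.evaluate) before pointing the scraper at a protected site.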
Baseline Chromium Flags — Free Stealth
STEALTH_ARGS = [
    "--no-sandbox",
    "--disable-blink-features=AutomationControlled",  # removes webdriver flag
    "--disable-infobars",
    "--disable-dev-shm-usage",
    "--disable-browser-side-navigation",
    "--disable-features=IsolateOrigins,site-per-process",
    "--flag-switches-begin",
    "--disable-site-isolation-trials",
    "--flag-switches-end",
]

browser = await p.chromium.launch(headless=True, args=STEALTH_ARGS)

# Always set a matching real user-agent
context = await browser.new_context(
    user_agent=(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
)

playwright-stealth (Python)
playwright-stealth patches the most common fingerprint leaks automatically. Apply it to the page before any navigation.
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async with async_playwright() as p:
    browser = await p.chromium.launch(headless=True, args=STEALTH_ARGS)
    page = await browser.new_page()
    await stealth_async(page)  # ← MUST be called before goto()
    await page.goto("https://bot.sannysoft.com")  # detection test page
    await page.screenshot(path="stealth_test.png")

What playwright-stealth patches automatically
- navigator.webdriver → undefined
- window.chrome → realistic object
- navigator.plugins → fake 3-entry list
- navigator.languages → ["en-US","en"]
- WebGL vendor/renderer strings
- Permissions API
- window.outerWidth/Height
- screen properties
Manual Fingerprint Patches via addInitScript
When libraries are unavailable, patch manually. addInitScript runs before any page JavaScript — the only reliable injection point.
FINGERPRINT_PATCHES = """
// 1. Remove webdriver flag
Object.defineProperty(navigator, 'webdriver', {
    get: () => undefined, configurable: true,
});

// 2. Fake chrome object
window.chrome = {
    app: { isInstalled: false, InstallState: {}, RunningState: {} },
    runtime: {}, loadTimes: function() {}, csi: function() {},
};

// 3. Fake plugins (need at least 3)
Object.defineProperty(navigator, 'plugins', {
    get: () => {
        const plugins = [
            { name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer' },
            { name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai' },
            { name: 'Native Client', filename: 'internal-nacl-plugin' },
        ];
        plugins.__proto__ = PluginArray.prototype;
        return plugins;
    },
});

// 4. Fix permissions API (bind keeps `this` intact — an unbound call throws)
const _origQuery = window.navigator.permissions.query.bind(window.navigator.permissions);
window.navigator.permissions.query = (p) => (
    p.name === 'notifications'
        ? Promise.resolve({ state: Notification.permission })
        : _origQuery(p)
);

// 5. Canvas noise — subtle per-session randomisation
const _origGetContext = HTMLCanvasElement.prototype.getContext;
HTMLCanvasElement.prototype.getContext = function(type, ...args) {
    const ctx = _origGetContext.apply(this, [type, ...args]);
    if (type === '2d' && ctx) {
        const _origGetImageData = ctx.getImageData;
        ctx.getImageData = function(...a) {
            const img = _origGetImageData.apply(this, a);
            for (let i = 0; i < img.data.length; i += 100) {
                img.data[i] = img.data[i] ^ (Math.random() * 3 | 0);
            }
            return img;
        };
    }
    return ctx;
};
"""

await context.add_init_script(FINGERPRINT_PATCHES)  # apply to all pages in the context

Human Simulation — Mouse, Keyboard & Scroll
import asyncio
import random

# Natural mouse movement along a cubic Bezier curve
async def bezier_move(page, x1, y1, x2, y2, steps=30):
    cx1 = x1 + random.randint(20, 80); cy1 = y1 + random.randint(-40, 40)
    cx2 = x2 - random.randint(20, 80); cy2 = y2 + random.randint(-40, 40)
    for i in range(steps + 1):
        t = i / steps
        x = (1-t)**3*x1 + 3*(1-t)**2*t*cx1 + 3*(1-t)*t**2*cx2 + t**3*x2
        y = (1-t)**3*y1 + 3*(1-t)**2*t*cy1 + 3*(1-t)*t**2*cy2 + t**3*y2
        await page.mouse.move(x, y)
        await asyncio.sleep(random.uniform(0.005, 0.015))

# Human-like typing with per-keystroke delays
async def human_type(page, selector: str, text: str):
    await page.click(selector)
    await asyncio.sleep(random.uniform(0.3, 0.7))
    for char in text:
        await page.keyboard.type(char)
        await asyncio.sleep(random.uniform(0.05, 0.18))
    await asyncio.sleep(random.uniform(0.2, 0.5))

# Realistic scroll with jitter
async def scroll_down(page, total_px=2000, steps=15):
    per_step = total_px // steps
    for _ in range(steps):
        await page.mouse.wheel(0, per_step + random.randint(-30, 30))
        await asyncio.sleep(random.uniform(0.05, 0.2))

# Random wait helper — use between every action
async def jitter(min_ms=500, max_ms=2000):
    await asyncio.sleep(random.uniform(min_ms, max_ms) / 1000)

# Usage
await bezier_move(page, 100, 200, 400, 350)
await page.mouse.click(400, 350)
await jitter(300, 800)
await human_type(page, "#search", "playwright scraping")

Proxy Rotation
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "socks5://user:pass@proxy3.example.com:1080",
]

async def scrape_with_rotation(urls: list[str]):
    async with async_playwright() as p:
        results = []
        for i, url in enumerate(urls):
            proxy_url = PROXIES[i % len(PROXIES)]
            browser = await p.chromium.launch(
                proxy={"server": proxy_url},
                args=STEALTH_ARGS,
            )
            context = await browser.new_context(
                proxy={"server": proxy_url},
                user_agent="Mozilla/5.0 ...",
            )
            page = await context.new_page()
            await page.goto(url)
            results.append(await page.content())
            await browser.close()
        return results

# Residential proxy (e.g. Bright Data)
PROXY_CONFIG = {
    "server": "http://brd.superproxy.io:22225",
    "username": "YOUR_ZONE_USERNAME",
    "password": "YOUR_PASSWORD",
}
context = await browser.new_context(proxy=PROXY_CONFIG)

Captcha Solving
- 2captcha — pip install 2captcha-python. Sends the sitekey to the 2captcha API, which uses human workers to solve it and returns a token. Good for reCAPTCHA v2/v3. Costs ~$1–3 per 1000 solves.
- CapSolver — pip install capsolver. Supports reCAPTCHA, hCaptcha, and Cloudflare Turnstile via AntiTurnstileTaskProxyLess. Faster than 2captcha for Turnstile challenges.
- FlareSolverr — docker run -d -p 8191:8191 ghcr.io/flaresolverr/flaresolverr:latest, then send requests to its HTTP API. Returns cookies and HTML after solving the JS challenge. No per-solve cost, but requires self-hosting.

import httpx
async def flaresolverr_get(url: str) -> str:
    async with httpx.AsyncClient() as client:
        r = await client.post("http://localhost:8191/v1", json={
            "cmd": "request.get",
            "url": url,
            "maxTimeout": 60000,
        })
        return r.json()["solution"]["response"]

html = await flaresolverr_get("https://cloudflare-protected-site.com")

Cloudflare / DataDome / Imperva Strategies
Cloudflare Bot Management — Key Rules
1. Use real Chromium with all stealth patches applied.
2. Residential proxy — datacenter IPs are pre-blocked.
3. Add a realistic delay (3–8 s) before any interaction after landing.
4. Do NOT abort CSS or font resources — Cloudflare uses resource loading as a bot signal.
5. Set Sec-Fetch-* and Sec-CH-UA headers consistently.
context = await browser.new_context(
    extra_http_headers={
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-Ch-Ua": '"Chromium";v="124", "Google Chrome";v="124"',
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"Windows"',
    }
)

Combining All Three Pillars
When a task requires stealth + interception + CDP simultaneously, apply in this exact order: stealth patches first, then route handlers, then CDP session, then navigate.
async def full_stack_scrape(url: str) -> list:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, args=STEALTH_ARGS)
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        )
        page = await context.new_page()

        # 1. Stealth patches — BEFORE anything else
        await stealth_async(page)

        # 2. Network interception — registered before goto()
        captured = []

        async def capture_api(route, request):
            if "api/v2/products" in request.url:
                response = await route.fetch()
                captured.append(await response.json())
                await route.fulfill(response=response)
            else:
                await route.continue_()

        await page.route("**/*", capture_api)

        # 3. CDP session — after context creation
        client = await context.new_cdp_session(page)
        await client.send("Network.enable")

        # 4. Navigate — always last
        await jitter(500, 1200)   # pre-load delay
        await page.goto(url, wait_until="networkidle")
        await jitter(2000, 4000)  # post-load human pause

        # Scroll to trigger lazy-loaded API calls
        await scroll_down(page, total_px=3000)
        await page.wait_for_load_state("networkidle")

        await browser.close()
        return captured

Detection Testing Checklist
Always test your stealth setup against these pages before deploying. A green result on all four means your browser looks convincingly human.
TEST_URLS = [
    "https://bot.sannysoft.com",                 # webdriver, plugins, chrome object
    "https://fingerprintjs.com/demo/",           # fingerprint hash
    "https://deviceandbrowserinfo.com/",         # detailed browser signals
    "https://abrahamjuliot.github.io/creepjs/",  # overall CreepJS trust score
]

for url in TEST_URLS:
    await page.goto(url)
    await page.screenshot(path=f"test_{url.split('/')[2]}.png")

# Inline JS assertions — run after stealth applied
checks = await page.evaluate("""() => ({
    webdriver: navigator.webdriver,
    hasChrome: !!window.chrome,
    plugins: navigator.plugins.length,
    ua: navigator.userAgent,
    platform: navigator.platform,
    langs: navigator.languages,
})""")
assert checks["webdriver"] is None or checks["webdriver"] is False
assert checks["hasChrome"] is True
assert checks["plugins"] >= 3
assert "HeadlessChrome" not in checks["ua"]
print("✓ All stealth checks passed")

Common Pitfalls
| Pitfall | Fix |
|---|---|
| route.continue_() not awaited | Always await route.continue_() — unawaited routes hang forever |
| CDP session on wrong target | Use page.context.new_cdp_session(page), not browser.new_cdp_session() |
| navigator.webdriver still true | Stealth must be applied via addInitScript BEFORE goto() |
| Stealth plugin applied to page, not context | Apply stealth to the context for full coverage across all pages |
| HAR recording misses service workers | Set record_har_url_filter broadly; SW traffic needs CDP interception |
| Cloudflare JS challenge timeout | Increase wait_until="networkidle" timeout; add random pre-challenge delay |
| Blocking CSS/fonts on Cloudflare-protected sites | Do NOT block stylesheet or font resources — CF uses them as a bot signal |
| Datacenter IP blocked on first request | Residential proxy is non-negotiable for sites with strict IP reputation checks |
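The first pitfall (a handler that raises or returns without a terminal call) can be guarded centrally with a wrapper. This is a sketch under one possible error policy — aborting on failure; falling back to route.continue_() is another reasonable choice. The FakeRoute self-check below is a stand-in object so the pattern can be exercised without a browser.

```python
import asyncio
import functools

def always_terminal(handler):
    """Wrap a route handler so the request is never left hanging: if the
    handler raises before its terminal call, abort the route instead."""
    @functools.wraps(handler)
    async def wrapped(route, request):
        try:
            await handler(route, request)
        except Exception:
            await route.abort()  # terminal call even on failure
    return wrapped

@always_terminal
async def fragile_handler(route, request):
    raise RuntimeError("parse error before the terminal call")

# Quick self-check with a stand-in route object (no browser needed)
class FakeRoute:
    aborted = False
    async def abort(self):
        self.aborted = True

route = FakeRoute()
asyncio.run(fragile_handler(route, None))
print(route.aborted)
```

With real Playwright routes, register the wrapped handler as usual: `await page.route("**/*", always_terminal(my_handler))`.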
Anti-Patterns to Avoid
- Navigating before applying stealth patches — bot detectors see the real fingerprint on first load
- Calling route.continue_() without await — the request hangs indefinitely
- Using Firefox or WebKit for CDP-heavy scrapers — CDP is Chromium-only
- Blocking CSS and fonts on Cloudflare-protected sites — resource loading patterns are a detection signal
- Using datacenter IPs against sites with IP reputation systems — switch to residential proxies
- Teleporting the mouse directly to elements — always move along a Bezier curve first
- Opening a CDP session on browser instead of page.context
- Forgetting await context.close() before reading the HAR — the file is incomplete until the context closes
- Hard-coding time.sleep() instead of jitter() — fixed delays are trivially fingerprinted
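The last anti-pattern is easy to see in numbers: fixed delays have zero variance, a trivially detectable timing signature, while uniform jitter spreads out. A small sketch (the delay values are illustrative):

```python
import random
import statistics

def delay_signature(delays: list[float]) -> float:
    """Population standard deviation of inter-action delays.
    A value of exactly 0.0 means perfectly regular, machine-like timing."""
    return statistics.pstdev(delays)

fixed = [1.0] * 50                                        # time.sleep(1) everywhere
jittered = [random.uniform(0.5, 2.0) for _ in range(50)]  # jitter(500, 2000)

print(delay_signature(fixed))     # 0.0 — screams "bot"
print(delay_signature(jittered))  # nonzero — spread like a human
```

Real detectors look at richer statistics than variance alone, but any delay distribution collapsing to a single point is the cheapest signal to catch.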
Download playwright-scraping-advanced Skill
This .skill file contains four complete reference documents covering every aspect of advanced Playwright scraping — network interception, CDP commands, stealth fingerprint patches, and anti-bot strategies — ready to load into Claude or any AI tool as expert context for your scraping questions.