Three Pillars of Advanced Scraping
Most Playwright tutorials stop at page.goto() and page.locator(). Production scrapers need to go deeper — intercepting the API calls that power dynamic pages, reaching into the Chromium engine via CDP, and evading the fingerprinting systems that block headless browsers within milliseconds of the first request.
This file covers three pillars:

Pillar 1 — Network Interception. page.route() intercepts every request before it leaves the browser: capture JSON from background XHR/fetch calls, mock API responses for testing, block analytics and images to speed up crawling, intercept WebSocket frames, and record or replay traffic as HAR files.

Pillar 2 — Chrome DevTools Protocol. CDP sessions reach below Playwright's high-level API for raw network events, performance metrics, JS coverage, and CPU/network/geolocation emulation.

Pillar 3 — Stealth & Anti-Bot Evasion. Detectors check navigator.webdriver, canvas fingerprints, WebGL strings, plugin arrays, mouse trajectories, timing anomalies, and IP reputation. This pillar covers baseline Chrome flags, playwright-stealth, manual fingerprint patches, human mouse simulation, proxy rotation, captcha solving, and Cloudflare bypass strategies.

Environment Setup
# Core — always required
pip install playwright
playwright install chromium  # preferred for CDP + stealth

# Stealth plugin (Python)
pip install playwright-stealth

# Node.js stealth alternative
npm install playwright-extra playwright-extra-plugin-stealth

# Optional helpers
pip install httpx faker  # proxy rotation, realistic UAs
Always Use Chromium for Advanced Scraping
Firefox and WebKit lack full CDP support. Chromium is also the only engine where playwright-stealth and manual fingerprint patches work as intended — they target Chrome's fingerprint surface, and bot detectors flag non-Chrome engines quickly regardless.
Universal Boilerplate
Start every advanced scraper with this skeleton, then layer in the patterns from each pillar. Apply stealth before goto(), register routes before navigation, open CDP session after context creation.
import asyncio
from playwright.async_api import async_playwright, Page, BrowserContext

STEALTH_ARGS = [
    "--no-sandbox",
    "--disable-blink-features=AutomationControlled",  # removes webdriver flag
    "--disable-infobars",
    "--disable-dev-shm-usage",
]

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=STEALTH_ARGS,
        )
        context: BrowserContext = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            locale="en-US",
            timezone_id="America/New_York",
        )
        page: Page = await context.new_page()
        # 1. Stealth patches (see Pillar 3)
        # 2. Route handlers (see Pillar 1)
        # 3. CDP session (see Pillar 2)
        # 4. Navigate last
        await page.goto("https://target.com", wait_until="networkidle")
        await browser.close()

asyncio.run(main())

Pillar 1 — Network Interception
page.route(pattern, handler) intercepts all matching requests before they leave the browser. The handler must call exactly one terminal method — route.continue_(), route.fulfill(), route.abort(), or route.fetch() followed by route.fulfill() — otherwise the request hangs forever.
| Goal | Method |
|---|---|
| Read-only capture, no side effects | page.on("response", ...) |
| Capture + pass through | route.fetch() → route.fulfill(response=resp) |
| Fully mock response | route.fulfill(status=200, body=...) |
| Block resource | route.abort() |
| Modify headers only | route.continue_(headers=...) |
| Redirect URL | route.continue_(url=new_url) |
| WebSocket tap | page.route_web_socket(...) |
| Full offline replay | context.route_from_har(...) |
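To make the choice concrete, the table above can be collapsed into a small dispatcher that maps each request to exactly one terminal action. This is a sketch: the pattern lists and endpoint names are illustrative, not taken from any real site.

```python
from fnmatch import fnmatchcase

BLOCKED_TYPES = {"image", "media", "font"}
MOCKED_PATTERNS = ["*/api/config*"]      # hypothetical endpoints served from canned fixtures
CAPTURED_PATTERNS = ["*/api/products*"]  # hypothetical endpoints whose JSON we harvest

def choose_action(resource_type: str, url: str) -> str:
    """Return the single terminal route method this request should get."""
    if resource_type in BLOCKED_TYPES:
        return "abort"
    if any(fnmatchcase(url, p) for p in MOCKED_PATTERNS):
        return "fulfill"            # fully mocked, never hits the network
    if any(fnmatchcase(url, p) for p in CAPTURED_PATTERNS):
        return "fetch+fulfill"      # capture the real body, then pass it through
    return "continue"               # everything else goes out untouched
```

The route handler then awaits exactly one of route.abort(), route.fulfill(), route.fetch() plus route.fulfill(), or route.continue_() based on the returned label, keeping the terminal-call guarantee in one place.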
Capturing API Responses
The most common use case: grab JSON from background XHR/fetch calls instead of scraping rendered HTML. This gives you clean, structured data directly from the source.
import json

captured_data = []

async def intercept(route, request):
    response = await route.fetch()  # fetch from the real server
    body = await response.body()
    try:
        captured_data.append(json.loads(body))
    except Exception:
        pass
    await route.fulfill(response=response)  # pass through to the page

await page.route("**/api/products*", intercept)
await page.goto("https://shop.example.com/catalog")
await page.wait_for_load_state("networkidle")
print(f"Captured {len(captured_data)} API responses")

# Simpler for read-only — no route needed, no risk of hanging
responses = []
page.on("response", lambda resp: responses.append(resp) if "api/items" in resp.url else None)
await page.goto("https://example.com")
await page.wait_for_load_state("networkidle")
for resp in responses:
    if resp.ok:
        data = await resp.json()
        print(data)

Intercepting Paginated APIs
all_pages = []

async def capture_page(route, request):
    response = await route.fetch()
    data = await response.json()
    all_pages.append(data)
    await route.fulfill(response=response)

await page.route("**/api/items*", capture_page)
await page.goto("https://example.com/items")

# Click "Load More" until it disappears
while await page.query_selector("#load-more"):
    await page.click("#load-more")
    await page.wait_for_load_state("networkidle")
print(f"Total pages captured: {len(all_pages)}")

Blocking Unwanted Resources
BLOCKED_TYPES = {"image", "media", "font", "stylesheet"}
BLOCKED_DOMAINS = {"google-analytics.com", "doubleclick.net", "hotjar.com", "facebook.com"}

async def smart_block(route, request):
    if request.resource_type in BLOCKED_TYPES:
        await route.abort()
        return
    if any(d in request.url for d in BLOCKED_DOMAINS):
        await route.abort()
        return
    await route.continue_()

await page.route("**/*", smart_block)

WebSocket Interception
Playwright 1.48+ exposes page.route_web_socket() and WebSocketRoute for full WS control — ideal for scraping real-time dashboards, live trading feeds, or chat applications.
messages = []

def ws_handler(ws_route):
    server = ws_route.connect_to_server()  # connect to the real server

    def from_server(message):
        messages.append(message)  # log inbound frames
        ws_route.send(message)    # forward to the page unchanged

    server.on_message(from_server)
    ws_route.on_message(server.send)  # forward page → server unchanged

await page.route_web_socket("wss://stream.example.com/**", ws_handler)
await page.goto("https://example.com/live")
await page.wait_for_timeout(10_000)  # collect 10s of frames
print(f"Captured {len(messages)} WS frames")

HAR Recording & Offline Replay
# RECORD — capture all API traffic to HAR
context = await browser.new_context(
    record_har_path="capture.har",
    record_har_url_filter="**/api/**",  # only API traffic
    record_har_content="attach",        # embed response bodies
)
page = await context.new_page()
await page.goto("https://example.com")
await page.wait_for_load_state("networkidle")
await context.close()  # HAR written on close

# REPLAY — serve all requests from HAR, no real network
context = await browser.new_context()
await context.route_from_har(
    "capture.har",
    url="**/api/**",
    update=False,  # strict mode — abort on miss
)
page = await context.new_page()
await page.goto("https://example.com")  # fully offline

Pillar 2 — Chrome DevTools Protocol (CDP)
CDP provides direct access to Chromium's internals below Playwright's high-level API. Open a session with page.context.new_cdp_session(page), then send commands and listen to events from any CDP domain. Always enable the domain before listening to its events.
# Open session scoped to a specific page
client = await page.context.new_cdp_session(page)
# Enable a domain before using it
await client.send("Network.enable")
# Send a command and get the result
result = await client.send("Network.getAllCookies")
cookies = result["cookies"]
# Listen to events
client.on("Network.requestWillBeSent", lambda p: print(p["request"]["url"]))
# Detach when done (optional — auto-closes with context)
await client.detach()

Network Domain — Richer Than page.route()
import base64

client = await page.context.new_cdp_session(page)
await client.send("Network.enable")

requests: dict = {}

def on_request(params):
    requests[params["requestId"]] = {
        "url": params["request"]["url"],
        "method": params["request"]["method"],
        "headers": params["request"]["headers"],
    }

def on_response(params):
    rid = params["requestId"]
    if rid in requests:
        requests[rid]["status"] = params["response"]["status"]
        requests[rid]["mime"] = params["response"]["mimeType"]

client.on("Network.requestWillBeSent", on_request)
client.on("Network.responseReceived", on_response)
await page.goto("https://example.com")

# Get a response body via CDP (bypasses Playwright's body restrictions)
async def get_body(request_id: str) -> bytes:
    result = await client.send("Network.getResponseBody", {"requestId": request_id})
    if result.get("base64Encoded"):
        return base64.b64decode(result["body"])
    return result["body"].encode()

Performance Metrics & JS Coverage
# Performance metrics
await client.send("Performance.enable")
metrics_result = await client.send("Performance.getMetrics")
metrics = {m["name"]: m["value"] for m in metrics_result["metrics"]}
print(f"DOM Content Loaded: {metrics.get('DOMContentLoaded', 0):.2f}ms")
print(f"JS Heap Used: {metrics.get('JSHeapUsedSize', 0) / 1e6:.1f} MB")

# JS coverage — which code actually ran?
await client.send("Profiler.enable")
await client.send("Profiler.startPreciseCoverage", {"callCount": True, "detailed": True})
await page.goto("https://example.com")
await page.wait_for_load_state("networkidle")
result = await client.send("Profiler.takePreciseCoverage")
for script in result["result"]:
    # Offsets live on the ranges inside each function entry
    ranges = [r for fn in script["functions"] for r in fn["ranges"]]
    total = sum(r["endOffset"] - r["startOffset"] for r in ranges)
    covered = sum(r["endOffset"] - r["startOffset"] for r in ranges if r["count"] > 0)
    pct = (covered / total * 100) if total else 0
    print(f"{script['url']}: {pct:.1f}% executed")

Emulation: CPU, Network & Geolocation
# CPU throttling — simulate a mid-range Android device (4× slowdown)
await client.send("Emulation.setCPUThrottlingRate", {"rate": 4})

# Network throttling — simulate a 3G connection
await client.send("Network.emulateNetworkConditions", {
    "offline": False,
    "downloadThroughput": 1.5 * 1024 * 1024 / 8,  # 1.5 Mbps
    "uploadThroughput": 750 * 1024 / 8,           # 750 Kbps
    "latency": 150,                               # 150ms RTT
})

# Simulate offline
await client.send("Network.emulateNetworkConditions", {
    "offline": True, "downloadThroughput": -1, "uploadThroughput": -1, "latency": 0,
})

# Override geolocation
await client.send("Emulation.setGeolocationOverride", {
    "latitude": 1.3521,  # Singapore
    "longitude": 103.8198,
    "accuracy": 100,
})

Full CDP Command Reference
| Domain | Command | Purpose |
|---|---|---|
| Network | Network.enable | Start network events |
| Network | Network.getResponseBody | Fetch body after load |
| Network | Network.setBlockedURLs | Block URL patterns |
| Network | Network.emulateNetworkConditions | Simulate slow network |
| Network | Network.getAllCookies | Read all cookies |
| Performance | Performance.getMetrics | Perf counters |
| Profiler | Profiler.startPreciseCoverage | JS coverage collection |
| Emulation | Emulation.setCPUThrottlingRate | CPU slowdown |
| Emulation | Emulation.setGeolocationOverride | Fake GPS coordinates |
| Emulation | Emulation.setDeviceMetricsOverride | Custom viewport |
| Security | Security.setIgnoreCertificateErrors | Skip TLS errors |
| Runtime | Runtime.evaluate | Run JS in page context |
| Runtime | Runtime.addBinding | Expose callback to page |
| Input | Input.dispatchKeyEvent | Synthetic keyboard |
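The Emulation and Network commands above pair naturally as reusable profiles. One convenient pattern is a preset dictionary fed to Network.emulateNetworkConditions (a sketch: the preset names and throughput numbers are illustrative approximations, not official Chrome profiles; note that CDP expects bytes per second, so advertised bit rates are divided by 8).

```python
# Hypothetical throttling presets — values are rough community approximations.
NETWORK_PRESETS = {
    "slow-3g": {"offline": False,
                "downloadThroughput": 500 * 1024 / 8,         # ~500 Kbps
                "uploadThroughput": 500 * 1024 / 8,
                "latency": 400},
    "fast-3g": {"offline": False,
                "downloadThroughput": 1.5 * 1024 * 1024 / 8,  # ~1.5 Mbps
                "uploadThroughput": 750 * 1024 / 8,
                "latency": 150},
    "offline": {"offline": True, "downloadThroughput": -1,
                "uploadThroughput": -1, "latency": 0},
}

async def apply_network_preset(client, name: str):
    """Apply a named preset to an existing CDP session."""
    await client.send("Network.emulateNetworkConditions", NETWORK_PRESETS[name])
```

Calling `await apply_network_preset(client, "slow-3g")` then re-running a scrape is a quick way to check that timeouts and retry logic hold up under degraded conditions.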
Pillar 3 — Stealth & Anti-Bot Evasion
Modern bot-detection systems check dozens of browser signals simultaneously. A headless Chromium without stealth patches is flagged within milliseconds. This section covers the full layered defence: baseline flags, the playwright-stealth library, manual fingerprint patches, human interaction simulation, proxy rotation, and captcha solving.
Threat Model — What Bot Detectors Check
| Signal | What Detectors Check | Countermeasure |
|---|---|---|
| navigator.webdriver | === true | Override to undefined before load |
| Chrome automation extension | window.chrome.app missing | Spoof window.chrome object |
| Permissions API | Returns denied for notifications | Patch API return value |
| Plugin array | Empty navigator.plugins | Inject fake plugin list (≥3 entries) |
| WebGL renderer | ANGLE headless string | Override via canvas hook |
| Canvas fingerprint | Deterministic pixel output | Add subtle random noise |
| User-Agent mismatch | UA says Windows, platform says Linux | Sync all UA fields |
| Mouse trajectory | Teleporting cursor, no mouseover | Bezier curve movement |
| IP reputation | Datacenter IP range | Residential proxy rotation |
| TLS fingerprint (JA3) | Node.js TLS differs from real Chrome | Use real Chromium binary |
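The "User-Agent mismatch" row is cheap to self-audit before deploying. The check below sketches the kind of cross-check detectors run between navigator.userAgent and navigator.platform; the rules are deliberately simplified and the platform strings are common values, not an exhaustive list.

```python
def ua_platform_consistent(user_agent: str, platform: str) -> bool:
    """Return False on the classic giveaway: the UA claims one OS while
    navigator.platform reports another (e.g. a Windows UA on a Linux host)."""
    if "Windows" in user_agent:
        return platform.startswith("Win")
    if "Mac OS X" in user_agent:
        return platform.startswith("Mac")
    if "Android" in user_agent:  # Android UAs also contain "Linux" — check first
        return platform in ("Android", "Linux armv8l", "Linux aarch64")
    if "Linux" in user_agent:
        return platform.startswith("Linux")
    return True  # unknown UA family — nothing to contradict
```

Run it against the values your stealth context actually exposes (via page.evaluate) before pointing the scraper at a protected site.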
Baseline Chromium Flags — Free Stealth
STEALTH_ARGS = [
    "--no-sandbox",
    "--disable-blink-features=AutomationControlled",  # removes webdriver flag
    "--disable-infobars",
    "--disable-dev-shm-usage",
    "--disable-browser-side-navigation",
    "--disable-features=IsolateOrigins,site-per-process",
    "--flag-switches-begin",
    "--disable-site-isolation-trials",
    "--flag-switches-end",
]

browser = await p.chromium.launch(headless=True, args=STEALTH_ARGS)

# Always set a matching real user-agent
context = await browser.new_context(
    user_agent=(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
)

playwright-stealth (Python)
playwright-stealth patches the most common fingerprint leaks automatically. Apply it to the page before any navigation.
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async with async_playwright() as p:
    browser = await p.chromium.launch(headless=True, args=STEALTH_ARGS)
    page = await browser.new_page()
    await stealth_async(page)  # ← MUST be called before goto()
    await page.goto("https://bot.sannysoft.com")  # detection test page
    await page.screenshot(path="stealth_test.png")

What playwright-stealth patches automatically
- navigator.webdriver → undefined
- window.chrome → realistic object
- navigator.plugins → fake 3-entry list
- navigator.languages → ["en-US","en"]
- WebGL vendor/renderer strings
- Permissions API
- window.outerWidth/Height
- screen properties
Manual Fingerprint Patches via addInitScript
When libraries are unavailable, patch manually. addInitScript runs before any page JavaScript — the only reliable injection point.
FINGERPRINT_PATCHES = """
// 1. Remove webdriver flag
Object.defineProperty(navigator, 'webdriver', {
    get: () => undefined, configurable: true,
});

// 2. Fake chrome object
window.chrome = {
    app: { isInstalled: false, InstallState: {}, RunningState: {} },
    runtime: {}, loadTimes: function() {}, csi: function() {},
};

// 3. Fake plugins (need at least 3)
Object.defineProperty(navigator, 'plugins', {
    get: () => {
        const plugins = [
            { name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer' },
            { name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai' },
            { name: 'Native Client', filename: 'internal-nacl-plugin' },
        ];
        plugins.__proto__ = PluginArray.prototype;
        return plugins;
    },
});

// 4. Fix permissions API (bind keeps `this` intact — an unbound call throws)
const _origQuery = window.navigator.permissions.query.bind(window.navigator.permissions);
window.navigator.permissions.query = (p) => (
    p.name === 'notifications'
        ? Promise.resolve({ state: Notification.permission })
        : _origQuery(p)
);

// 5. Canvas noise — subtle per-session randomisation
const _origGetContext = HTMLCanvasElement.prototype.getContext;
HTMLCanvasElement.prototype.getContext = function(type, ...args) {
    const ctx = _origGetContext.apply(this, [type, ...args]);
    if (type === '2d' && ctx) {
        const _origGetImageData = ctx.getImageData;
        ctx.getImageData = function(...a) {
            const img = _origGetImageData.apply(this, a);
            for (let i = 0; i < img.data.length; i += 100) {
                img.data[i] = img.data[i] ^ (Math.random() * 3 | 0);
            }
            return img;
        };
    }
    return ctx;
};
"""

await context.add_init_script(FINGERPRINT_PATCHES)  # apply to all pages in the context

Human Simulation — Mouse, Keyboard & Scroll
import asyncio
import random

# Natural mouse movement along a cubic Bezier curve
async def bezier_move(page, x1, y1, x2, y2, steps=30):
    cx1 = x1 + random.randint(20, 80); cy1 = y1 + random.randint(-40, 40)
    cx2 = x2 - random.randint(20, 80); cy2 = y2 + random.randint(-40, 40)
    for i in range(steps + 1):
        t = i / steps
        x = (1-t)**3*x1 + 3*(1-t)**2*t*cx1 + 3*(1-t)*t**2*cx2 + t**3*x2
        y = (1-t)**3*y1 + 3*(1-t)**2*t*cy1 + 3*(1-t)*t**2*cy2 + t**3*y2
        await page.mouse.move(x, y)
        await asyncio.sleep(random.uniform(0.005, 0.015))

# Human-like typing with per-keystroke delays
async def human_type(page, selector: str, text: str):
    await page.click(selector)
    await asyncio.sleep(random.uniform(0.3, 0.7))
    for char in text:
        await page.keyboard.type(char)
        await asyncio.sleep(random.uniform(0.05, 0.18))
    await asyncio.sleep(random.uniform(0.2, 0.5))

# Realistic scroll with jitter
async def scroll_down(page, total_px=2000, steps=15):
    per_step = total_px // steps
    for _ in range(steps):
        await page.mouse.wheel(0, per_step + random.randint(-30, 30))
        await asyncio.sleep(random.uniform(0.05, 0.2))

# Random wait helper — use between every action
async def jitter(min_ms=500, max_ms=2000):
    await asyncio.sleep(random.uniform(min_ms, max_ms) / 1000)

# Usage
await bezier_move(page, 100, 200, 400, 350)
await page.mouse.click(400, 350)
await jitter(300, 800)
await human_type(page, "#search", "playwright scraping")

Proxy Rotation
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "socks5://user:pass@proxy3.example.com:1080",
]

async def scrape_with_rotation(urls: list[str]):
    async with async_playwright() as p:
        results = []
        for i, url in enumerate(urls):
            proxy_url = PROXIES[i % len(PROXIES)]
            browser = await p.chromium.launch(
                proxy={"server": proxy_url},
                args=STEALTH_ARGS,
            )
            context = await browser.new_context(
                proxy={"server": proxy_url},
                user_agent="Mozilla/5.0 ...",
            )
            page = await context.new_page()
            await page.goto(url)
            results.append(await page.content())
            await browser.close()
        return results

# Residential proxy (e.g. Bright Data)
PROXY_CONFIG = {
    "server": "http://brd.superproxy.io:22225",
    "username": "YOUR_ZONE_USERNAME",
    "password": "YOUR_PASSWORD",
}
context = await browser.new_context(proxy=PROXY_CONFIG)

Captcha Solving
- 2captcha — pip install 2captcha-python. Sends the sitekey to the 2captcha API, which uses human workers to solve it and returns a token. Good for reCAPTCHA v2/v3. Costs ~$1–3 per 1000 solves.
- CapSolver — pip install capsolver. Supports reCAPTCHA, hCaptcha, and Cloudflare Turnstile via AntiTurnstileTaskProxyLess. Faster than 2captcha for Turnstile challenges.
- FlareSolverr — docker run -d -p 8191:8191 ghcr.io/flaresolverr/flaresolverr:latest, then send requests to its HTTP API. Returns cookies and HTML after solving the JS challenge. No per-solve cost, but requires self-hosting.

import httpx
async def flaresolverr_get(url: str) -> str:
    async with httpx.AsyncClient() as client:
        r = await client.post("http://localhost:8191/v1", json={
            "cmd": "request.get",
            "url": url,
            "maxTimeout": 60000,
        })
        return r.json()["solution"]["response"]

html = await flaresolverr_get("https://cloudflare-protected-site.com")

Cloudflare / DataDome / Imperva Strategies
Cloudflare Bot Management — Key Rules
1. Use real Chromium with all stealth patches applied.
2. Residential proxy — datacenter IPs are pre-blocked.
3. Add a realistic delay (3–8 s) before any interaction after landing.
4. Do NOT abort CSS or font resources — Cloudflare uses resource loading as a bot signal.
5. Set Sec-Fetch-* and Sec-CH-UA headers consistently.
context = await browser.new_context(
    extra_http_headers={
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-Ch-Ua": '"Chromium";v="124", "Google Chrome";v="124"',
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"Windows"',
    }
)

Combining All Three Pillars
When a task requires stealth + interception + CDP simultaneously, apply in this exact order: stealth patches first, then route handlers, then CDP session, then navigate.
async def full_stack_scrape(url: str) -> list:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, args=STEALTH_ARGS)
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        )
        page = await context.new_page()

        # 1. Stealth patches — BEFORE anything else
        await stealth_async(page)

        # 2. Network interception — registered before goto()
        captured = []

        async def capture_api(route, request):
            if "api/v2/products" in request.url:
                response = await route.fetch()
                captured.append(await response.json())
                await route.fulfill(response=response)
            else:
                await route.continue_()

        await page.route("**/*", capture_api)

        # 3. CDP session — after context creation
        client = await context.new_cdp_session(page)
        await client.send("Network.enable")

        # 4. Navigate — always last
        await jitter(500, 1200)   # pre-load delay
        await page.goto(url, wait_until="networkidle")
        await jitter(2000, 4000)  # post-load human pause

        # Scroll to trigger lazy-loaded API calls
        await scroll_down(page, total_px=3000)
        await page.wait_for_load_state("networkidle")

        await browser.close()
        return captured

Detection Testing Checklist
Always test your stealth setup against these pages before deploying. A green result on all four means your browser looks convincingly human.
TEST_URLS = [
    "https://bot.sannysoft.com",                 # webdriver, plugins, chrome object
    "https://fingerprintjs.com/demo/",           # fingerprint hash
    "https://deviceandbrowserinfo.com/",         # detailed browser signals
    "https://abrahamjuliot.github.io/creepjs/",  # overall CreepJS trust score
]

for url in TEST_URLS:
    await page.goto(url)
    await page.screenshot(path=f"test_{url.split('/')[2]}.png")

# Inline JS assertions — run after stealth applied
checks = await page.evaluate("""() => ({
    webdriver: navigator.webdriver,
    hasChrome: !!window.chrome,
    plugins: navigator.plugins.length,
    ua: navigator.userAgent,
    platform: navigator.platform,
    langs: navigator.languages,
})""")
assert checks["webdriver"] is None or checks["webdriver"] is False
assert checks["hasChrome"] is True
assert checks["plugins"] >= 3
assert "HeadlessChrome" not in checks["ua"]
print("✓ All stealth checks passed")

Common Pitfalls
| Pitfall | Fix |
|---|---|
| route.continue_() not awaited | Always await route.continue_() — unawaited routes hang forever |
| CDP session on wrong target | Use page.context.new_cdp_session(page), not browser.new_cdp_session() |
| navigator.webdriver still true | Stealth must be applied via addInitScript BEFORE goto() |
| Stealth plugin applied to page, not context | Apply stealth to the context for full coverage across all pages |
| HAR recording misses service workers | Set record_har_url_filter broadly; SW traffic needs CDP interception |
| Cloudflare JS challenge timeout | Increase wait_until="networkidle" timeout; add random pre-challenge delay |
| Blocking CSS/fonts on Cloudflare-protected sites | Do NOT block stylesheet or font resources — CF uses them as a bot signal |
| Datacenter IP blocked on first request | Residential proxy is non-negotiable for sites with strict IP reputation checks |
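The first pitfall (a handler that raises or returns without a terminal call) can be guarded centrally with a wrapper. This is a sketch under one possible error policy — aborting on failure; falling back to route.continue_() is another reasonable choice. The FakeRoute self-check below is a stand-in object so the pattern can be exercised without a browser.

```python
import asyncio
import functools

def always_terminal(handler):
    """Wrap a route handler so the request is never left hanging: if the
    handler raises before its terminal call, abort the route instead."""
    @functools.wraps(handler)
    async def wrapped(route, request):
        try:
            await handler(route, request)
        except Exception:
            await route.abort()  # terminal call even on failure
    return wrapped

@always_terminal
async def fragile_handler(route, request):
    raise RuntimeError("parse error before the terminal call")

# Quick self-check with a stand-in route object (no browser needed)
class FakeRoute:
    aborted = False
    async def abort(self):
        self.aborted = True

route = FakeRoute()
asyncio.run(fragile_handler(route, None))
print(route.aborted)
```

With real Playwright routes, register the wrapped handler as usual: `await page.route("**/*", always_terminal(my_handler))`.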
Anti-Patterns to Avoid
- Navigating before applying stealth patches — bot detectors see the real fingerprint on first load
- Calling route.continue_() without await — the request hangs indefinitely
- Using Firefox or WebKit for CDP-heavy scrapers — CDP is Chromium-only
- Blocking CSS and fonts on Cloudflare-protected sites — resource loading patterns are a detection signal
- Using datacenter IPs against sites with IP reputation systems — switch to residential proxies
- Teleporting the mouse directly to elements — always move along a Bezier curve first
- Opening a CDP session on browser instead of page.context
- Forgetting await context.close() before reading the HAR — the file is incomplete until the context closes
- Hard-coding time.sleep() instead of jitter() — fixed delays are trivially fingerprinted
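The last anti-pattern is easy to see in numbers: fixed delays have zero variance, a trivially detectable timing signature, while uniform jitter spreads out. A small sketch (the delay values are illustrative):

```python
import random
import statistics

def delay_signature(delays: list[float]) -> float:
    """Population standard deviation of inter-action delays.
    A value of exactly 0.0 means perfectly regular, machine-like timing."""
    return statistics.pstdev(delays)

fixed = [1.0] * 50                                        # time.sleep(1) everywhere
jittered = [random.uniform(0.5, 2.0) for _ in range(50)]  # jitter(500, 2000)

print(delay_signature(fixed))     # 0.0 — screams "bot"
print(delay_signature(jittered))  # nonzero — spread like a human
```

Real detectors look at richer statistics than variance alone, but any delay distribution collapsing to a single point is the cheapest signal to catch.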
Download playwright-scraping-advanced Skill
This .skill file contains four complete reference documents covering every aspect of advanced Playwright scraping — network interception, CDP commands, stealth fingerprint patches, and anti-bot strategies — ready to load into Claude or any AI tool as expert context for your scraping questions.