Only enable Cloudflare bypass when the site explicitly requires it. Always test WITHOUT --cloudflare first.
Detection Indicators
Your site needs Cloudflare bypass if you see:
Display Requirements
Cloudflare bypass requires a visible browser (not headless). Cloudflare detects and blocks headless browsers.
- Windows: Uses native display automatically ✓
- macOS: Uses native display automatically ✓
- Linux desktop: Uses native display automatically ✓
- Linux servers (VPS without GUI): Auto-detects missing display and uses Xvfb (virtual display) ✓
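The platform rules above can be sketched as a small check. This is an illustrative approximation only, not the tool's actual detection code: the function name `needs_virtual_display` is hypothetical, and treating a set `DISPLAY` or `WAYLAND_DISPLAY` variable as "a display is available" is an assumption.

```python
import os
import sys
from typing import Mapping, Optional


def needs_virtual_display(
    platform: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return True when no display is found and a virtual one (e.g. Xvfb) is needed.

    Hypothetical helper: the real auto-detection may use different signals.
    """
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env

    # Windows and macOS provide a native display.
    if platform.startswith("win") or platform == "darwin":
        return False

    # Assumption: on Linux, a desktop session sets DISPLAY (X11)
    # or WAYLAND_DISPLAY (Wayland); a bare server sets neither.
    has_display = bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))
    return not has_display
```

Calling `needs_virtual_display()` with no arguments inspects the current machine; passing `platform` and `env` explicitly makes the check easy to exercise without changing the real environment.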
Inspector Usage
Strategies
Hybrid Mode (Recommended)
Browser verification once per 10 minutes, then fast HTTP with cached cookies. 20-100x faster than browser-only mode.
Do NOT set CONCURRENT_REQUESTS. Hybrid mode uses the Scrapy default of 16 for optimal performance.
spider.json
- Browser verifies Cloudflare once and caches cookies
- Subsequent requests use fast HTTP with cached cookies
- Auto-refreshes cookies every 10 minutes
- Falls back to browser if cookies become invalid
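The four behaviors listed above amount to a time-bounded cookie cache. The sketch below illustrates that idea under stated assumptions; it is not the project's implementation. The `CookieCache` class, the `verify_in_browser` callback, and the injected `clock` are hypothetical names, and the 600-second default mirrors CLOUDFLARE_COOKIE_REFRESH_THRESHOLD.

```python
import time
from typing import Callable, Dict, Optional

Cookies = Dict[str, str]


class CookieCache:
    """Caches cookies from one browser verification and refreshes them when stale."""

    def __init__(
        self,
        verify_in_browser: Callable[[], Cookies],  # hypothetical: runs the slow browser step
        refresh_threshold: float = 600.0,          # seconds, as in the 10-minute refresh
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._verify = verify_in_browser
        self._threshold = refresh_threshold
        self._clock = clock
        self._cookies: Optional[Cookies] = None
        self._fetched_at = 0.0

    def get(self) -> Cookies:
        """Return cached cookies, re-running verification if missing or stale."""
        stale = (
            self._cookies is None
            or self._clock() - self._fetched_at >= self._threshold
        )
        if stale:
            self._cookies = self._verify()
            self._fetched_at = self._clock()
        return self._cookies

    def invalidate(self) -> None:
        """Drop cached cookies so the next get() falls back to the browser."""
        self._cookies = None
```

A caller would attach `cache.get()` to each plain HTTP request and call `cache.invalidate()` when a response shows the cookies were rejected, which triggers a fresh browser verification on the next request.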
Browser-Only Mode (Legacy)
spider.json
Settings Reference
| Setting | Default | Description |
|---|---|---|
| CLOUDFLARE_ENABLED | false | Enable CF bypass |
| CLOUDFLARE_STRATEGY | "hybrid" | "hybrid" or "browser_only" |
| CLOUDFLARE_COOKIE_REFRESH_THRESHOLD | 600 | Seconds before cookie refresh |
| CF_MAX_RETRIES | 5 | Max verification attempts |
| CF_RETRY_INTERVAL | 1 | Seconds between retries |
| CF_POST_DELAY | 5 | Seconds after successful verification |
| CF_WAIT_SELECTOR | - | CSS selector to wait for before extracting |
| CF_WAIT_TIMEOUT | 10 | Max seconds to wait for selector |
| CF_PAGE_TIMEOUT | 120000 | Page navigation timeout (ms) |
| CONCURRENT_REQUESTS | - | Must be 1 for browser-only mode |
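As a rough illustration of how these defaults and the CONCURRENT_REQUESTS rules fit together, the sketch below merges user overrides onto the documented defaults and rejects the two combinations this page warns against. The setting names and default values come from the table; the `resolve_cf_settings` function and the decision to raise `ValueError` are assumptions made for the example, not the tool's behavior.

```python
from typing import Any, Dict

# Defaults copied from the settings table (settings without a default are omitted).
CF_DEFAULTS: Dict[str, Any] = {
    "CLOUDFLARE_ENABLED": False,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "CLOUDFLARE_COOKIE_REFRESH_THRESHOLD": 600,
    "CF_MAX_RETRIES": 5,
    "CF_RETRY_INTERVAL": 1,
    "CF_POST_DELAY": 5,
    "CF_WAIT_TIMEOUT": 10,
    "CF_PAGE_TIMEOUT": 120000,
}


def resolve_cf_settings(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Merge overrides onto the defaults and check the documented constraints.

    Hypothetical helper for illustration only.
    """
    settings = {**CF_DEFAULTS, **overrides}
    if not settings["CLOUDFLARE_ENABLED"]:
        return settings  # nothing to validate when bypass is off

    strategy = settings["CLOUDFLARE_STRATEGY"]
    if strategy not in ("hybrid", "browser_only"):
        raise ValueError(f"unknown CLOUDFLARE_STRATEGY: {strategy!r}")

    concurrency = settings.get("CONCURRENT_REQUESTS")
    if strategy == "browser_only" and concurrency != 1:
        raise ValueError("browser_only mode requires CONCURRENT_REQUESTS = 1")
    if strategy == "hybrid" and concurrency is not None:
        raise ValueError("hybrid mode: leave CONCURRENT_REQUESTS unset")
    return settings
```

For example, `resolve_cf_settings({"CLOUDFLARE_ENABLED": True})` returns the hybrid defaults unchanged, while adding `"CONCURRENT_REQUESTS": 8` to that call raises `ValueError`.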
Complete Spider Example
spider.json
Timeouts & Hang Prevention
Browser operation timeout: 300 seconds (5 minutes) per operation to prevent infinite hangs.
An operation that exceeds this limit raises TimeoutError instead of hanging forever. This protects against:
- Browser subprocess hangs
- Network stalls
- Infinite CF challenge loops
- Cross-thread asyncio deadlocks
Typical operation times:
- CF verification: 10-60 seconds
- Page load: 5-30 seconds
- Cookie refresh: 10-30 seconds
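A per-operation timeout of this kind can be approximated with `asyncio.wait_for`. The sketch below is one possible shape, assuming the browser step is an awaitable running on a responsive event loop; the tool's real mechanism may differ (a blocked loop or cross-thread deadlock needs a guard outside the loop). `run_with_timeout` and `BROWSER_OPERATION_TIMEOUT` are hypothetical names.

```python
import asyncio
from typing import Awaitable, TypeVar

T = TypeVar("T")

BROWSER_OPERATION_TIMEOUT = 300.0  # seconds, matching the 5-minute limit above


async def run_with_timeout(
    operation: Awaitable[T],
    timeout: float = BROWSER_OPERATION_TIMEOUT,
) -> T:
    """Await `operation`, raising TimeoutError if it runs longer than `timeout`."""
    try:
        return await asyncio.wait_for(operation, timeout=timeout)
    except asyncio.TimeoutError:
        # Before Python 3.11, asyncio.TimeoutError is a separate class,
        # so re-raise as the built-in TimeoutError for a uniform interface.
        raise TimeoutError(
            f"browser operation exceeded {timeout} seconds"
        ) from None
```

Because `wait_for` cancels the pending operation when the deadline passes, the caller gets an exception it can log or retry on instead of a crawl that stalls indefinitely.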
Troubleshooting
Crawl Hangs at “Getting/refreshing CF cookies”
Symptoms: Browser opens but never navigates. Logs show “Getting/refreshing CF cookies” but no progress.
Possible causes:
- Asyncio event loop mismatch (fixed in latest version)
- Browser subprocess issues: Chrome/nodriver incompatible with thread-based event loop
- Display/X11 issues on Linux servers
- Network/firewall blocking browser traffic
Works on One Machine But Not Another
Environmental factors affecting browser subprocesses:
- Python/asyncio version differences
- Display environment (X11 vs Wayland vs headless)
- Chrome/Chromium version and availability
- System resources and timing (race conditions)
- Network conditions (DNS, latency, firewalls)
- Security software interfering with browser
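When comparing two machines, it can help to capture the factors above in one place. The snippet below is a generic diagnostic sketch using only the Python standard library; it is not part of the CLI. The `environment_report` function is hypothetical, and the list of Chrome/Chromium binary names is an assumption that may not match every system.

```python
import os
import platform
import shutil
import sys
from typing import Dict, Optional


def environment_report() -> Dict[str, Optional[str]]:
    """Collect basic facts that commonly differ between machines."""
    # Assumed candidate binary names; adjust for your distribution.
    candidates = (
        "google-chrome",
        "google-chrome-stable",
        "chromium",
        "chromium-browser",
        "chrome",
    )
    chrome_path = next(
        (path for path in map(shutil.which, candidates) if path), None
    )
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "DISPLAY": os.environ.get("DISPLAY"),
        "WAYLAND_DISPLAY": os.environ.get("WAYLAND_DISPLAY"),
        "chrome": chrome_path,
        "xvfb": shutil.which("Xvfb"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this on the working and the failing machine and diffing the output narrows down whether the difference is the Python version, the display environment, or a missing browser binary.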
Diagnosing via Logs
Hybrid mode indicators:
Title Contamination
Related Guides
- Proxy Escalation: Combine with smart proxy usage
- Checkpoint Resume: Pause and resume long crawls