Crawl hundreds of websites in parallel with automatic resource management and intelligent parallelism detection.
Overview
The parallel-crawl script uses GNU parallel to run multiple ScrapAI spiders concurrently. It automatically detects system resources (CPU cores, available memory) and calculates optimal parallelism based on spider types (regular vs. Cloudflare-enabled).
Quick Start
Install GNU Parallel
# macOS
brew install parallel
# Debian/Ubuntu
sudo apt-get install parallel
Run All Spiders in Project
bin/parallel-crawl myproject
Run Specific Spiders
bin/parallel-crawl myproject spider1 spider2 spider3
How It Works
From bin/parallel-crawl:1-134:
#!/bin/bash
# Parallel crawler using GNU parallel
set -euo pipefail
PROJECT="$1"
shift
# Get spider list: all spiders in the project unless names were passed
if [ $# -eq 0 ]; then
  SPIDERS=$(./scrapai spiders list --project "$PROJECT" | grep '•' | awk '{print $2}')
else
  SPIDERS="$*"
fi
# Count Cloudflare-enabled spiders
CF_COUNT=$(python3 -c "
import sys
from core.db import get_db
from core.models import Spider
db = next(get_db())
names = sys.argv[1:]
count = 0
for name in names:
    spider = db.query(Spider).filter(Spider.name == name).first()
    if spider:
        for s in spider.settings:
            if s.key == 'CLOUDFLARE_ENABLED' and str(s.value).lower() in ('true', '1'):
                count += 1
                break
print(count)
" $SPIDERS)
# Spider counts used below (their definitions are not shown in this excerpt; assumed here)
SPIDER_COUNT=$(echo "$SPIDERS" | wc -w | tr -d ' ')
REGULAR_COUNT=$((SPIDER_COUNT - CF_COUNT))
# Auto-detect parallelism from system resources
CPU_CORES=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
AVAILABLE_MEM_MB=$(free -m | awk '/^Mem:/ {print $7}')
# Memory per spider: regular 200MB, Cloudflare 500MB
if [ "$CF_COUNT" -eq 0 ]; then
  MEM_PER_SPIDER=200
elif [ "$CF_COUNT" -eq "$SPIDER_COUNT" ]; then
  MEM_PER_SPIDER=500
else
  MEM_PER_SPIDER=$(( (REGULAR_COUNT * 200 + CF_COUNT * 500) / SPIDER_COUNT ))
fi
MEM_PARALLEL=$(( (AVAILABLE_MEM_MB - 2048) / MEM_PER_SPIDER ))
CPU_PARALLEL=$(( CPU_CORES * 80 / 100 ))
PARALLEL=$(( MEM_PARALLEL < CPU_PARALLEL ? MEM_PARALLEL : CPU_PARALLEL ))
# Run crawls in parallel
echo "$SPIDERS" | tr ' ' '\n' | parallel \
  -j "$PARALLEL" \
  --timeout 8h \
  --halt soon,fail=50% \
  --line-buffer \
  --tagstring "[{.}]" \
  "./scrapai crawl {} --project $PROJECT"
Resource Calculation
Memory-Based Parallelism
The script allocates memory per spider type:
Regular spiders: 200 MB each
Cloudflare spiders: 500 MB each (browser automation overhead)
Mixed fleet: Weighted average
Formula:
AVAILABLE_MEM_MB=$(free -m | awk '/^Mem:/ {print $7}')
MEM_PARALLEL=$(( (AVAILABLE_MEM_MB - 2048) / MEM_PER_SPIDER ))
The script reserves 2 GB (2048 MB) for the system and divides the remaining available memory by the per-spider allocation.
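As a worked illustration of the formula above, the following sketch recomputes the weighted per-spider allocation and the memory-based job limit for a mixed fleet. The helper names (mem_per_spider, mem_parallel) and the 8192 MB figure are hypothetical and do not come from bin/parallel-crawl; only the 200 MB, 500 MB, and 2048 MB constants are taken from the text.

```shell
#!/bin/sh
# Hypothetical helpers that mirror the arithmetic described above.

# mem_per_spider REGULAR_COUNT CF_COUNT -> weighted average MB per spider
# (integer division, so the result is rounded down; at least one count must be non-zero)
mem_per_spider() {
  echo $(( ($1 * 200 + $2 * 500) / ($1 + $2) ))
}

# mem_parallel AVAILABLE_MEM_MB MEM_PER_SPIDER -> jobs that fit after the 2 GB reserve
mem_parallel() {
  echo $(( ($1 - 2048) / $2 ))
}

per_spider=$(mem_per_spider 35 12)     # 35 regular + 12 Cloudflare spiders
echo "MEM_PER_SPIDER=$per_spider"      # (7000 + 6000) / 47 = 276
echo "MEM_PARALLEL=$(mem_parallel 8192 "$per_spider")"   # (8192 - 2048) / 276 = 22
```

Because shell arithmetic truncates, the weighted average slightly underestimates the true mean, which makes the resulting job count marginally optimistic.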
CPU-Based Parallelism
CPU_CORES=$(nproc)
CPU_PARALLEL=$(( CPU_CORES * 80 / 100 ))
Uses 80% of available cores to avoid saturating the system.
Final Parallelism
PARALLEL=$(( MEM_PARALLEL < CPU_PARALLEL ? MEM_PARALLEL : CPU_PARALLEL ))
[ "$PARALLEL" -lt 2 ] && PARALLEL=2
[ "$PARALLEL" -gt 20 ] && PARALLEL=20
Takes the minimum of memory-based and CPU-based limits, clamped between 2 and 20.
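The minimum-then-clamp step can be checked in isolation. The sketch below is a hypothetical helper (final_parallel is not a function in bin/parallel-crawl) that takes a memory-based limit and a core count and applies the same 80% rule and the 2 to 20 clamp described above.

```shell
#!/bin/sh
# final_parallel MEM_PARALLEL CPU_CORES -> job count after min() and clamping
# Hypothetical helper; variable names are prefixed to avoid clobbering globals.
final_parallel() {
  fp_jobs=$1
  fp_cpu=$(( $2 * 80 / 100 ))                       # 80% of cores, rounded down
  if [ "$fp_cpu" -lt "$fp_jobs" ]; then fp_jobs=$fp_cpu; fi   # minimum of the two limits
  if [ "$fp_jobs" -lt 2 ]; then fp_jobs=2; fi       # lower clamp
  if [ "$fp_jobs" -gt 20 ]; then fp_jobs=20; fi     # upper clamp
  echo "$fp_jobs"
}

final_parallel 22 10     # CPU-bound: min(22, 8) = 8
final_parallel 100 64    # both limits above 20, clamped to 20
final_parallel -5 8      # under 2 GB available: negative limit, clamped up to 2
```

The last call shows a consequence of the lower clamp: when less than 2 GB is available the memory-based limit is negative, yet two jobs still run, so very small machines may need a manual override.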
Example Output
$ bin/parallel-crawl news
==========================================
Parallel Crawler
==========================================
Project: news
Spiders: 47 (12 CF + 35 regular)
Parallel: 8 jobs
Timeout: 8h per spider
==========================================
Continue? (y/N): y
Starting parallel crawl...
[bbc_co_uk] Starting crawl...
[guardian] Starting crawl...
[reuters] Starting crawl...
[cnn] Starting crawl...
[bbc_co_uk] ✓ Crawled 1,247 pages
[guardian] ✓ Crawled 892 pages
[ap_news] Starting crawl...
[reuters] ✓ Crawled 2,103 pages
...
Advanced Usage
Custom Parallelism
Override auto-detection:
# Force 4 parallel jobs
echo "$SPIDERS" | tr ' ' '\n' | parallel -j 4 \
  "./scrapai crawl {} --project myproject"
Timeout Control
# 2-hour timeout per spider
parallel --timeout 2h ...
# No timeout (dangerous for stuck spiders)
parallel --timeout 0 ...
Failure Handling
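From bin/parallel-crawl:127:
--halt soon,fail=50%
Once 50% of jobs have failed, GNU parallel stops starting new jobs and lets the jobs that are already running finish. This prevents wasting resources on a broken configuration.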
Other halt strategies:
--halt now,fail=1      # Stop immediately on first failure
--halt soon,fail=10%   # Stop if 10% fail
--halt never           # Continue even if all fail
Progress Monitoring
# Add progress bar
parallel --progress ...
# Show ETA
parallel --eta ...
# Both
parallel --progress --eta ...
Job Log
# Log all job completions
parallel --joblog crawl_log.txt ...
# Resume from log (skip completed jobs)
parallel --joblog crawl_log.txt --resume ...
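A job log can also be inspected after a run. The sketch below lists the commands that failed, assuming GNU parallel's tab-separated joblog layout with a header row and the columns Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command. The function name failed_jobs is hypothetical.

```shell
#!/bin/sh
# failed_jobs JOBLOG -> print the command of every job that exited non-zero
# or was terminated by a signal.
# Assumed column order: Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
failed_jobs() {
  awk -F '\t' 'NR > 1 && ($7 != 0 || $8 != 0) { print $9 }' "$1"
}

# Usage (crawl_log.txt is the file name used above):
#   failed_jobs crawl_log.txt
```

The output is one command per line, so it can be reviewed by hand or used to decide which spiders to rerun.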
Resource Management
Memory Limits
Why 200MB for regular spiders?
Scrapy framework: ~50 MB
Downloaded pages in memory: ~100 MB
Extraction libraries: ~50 MB
Why 500MB for Cloudflare spiders?
Above base: 200 MB
Browser process (Chromium): ~200 MB
Rendering overhead: ~100 MB
CPU Scheduling
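GNU parallel does not schedule CPU time itself. It starts up to -j jobs as ordinary processes, and the operating system's scheduler shares the CPU among them:
Jobs of equal priority receive roughly equal CPU time
I/O-bound tasks (most scrapers) yield the CPU while they wait
Network-bound tasks have minimal CPU impact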
Disk I/O
Each spider writes to a separate output file:
data/{spider_name}/YYYY-MM-DD/crawl_HHMMSS.jsonl
Because no two spiders write to the same file, there is no file-level write contention between them.
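Since the date directory and the time stamp in this layout are zero-padded, a plain lexicographic sort orders the files chronologically. The sketch below uses that property to find the newest output for one spider. The function name latest_crawl_file and the optional data directory argument are hypothetical, and the helper assumes file names contain no newlines.

```shell
#!/bin/sh
# latest_crawl_file SPIDER_NAME [DATA_DIR] -> path of the newest crawl output
# DATA_DIR defaults to "data", matching the layout shown above.
latest_crawl_file() {
  find "${2:-data}/$1" -type f -name 'crawl_*.jsonl' | LC_ALL=C sort | tail -n 1
}

# Usage:
#   wc -l "$(latest_crawl_file bbc_co_uk)"
```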
Patterns and Best Practices
Small Fleet (< 10 spiders)
# Just run them all
bin/parallel-crawl myproject
Auto-detection handles everything.
Medium Fleet (10-50 spiders)
# Prioritize by importance
bin/parallel-crawl myproject high_priority_1 high_priority_2 ...
# Then run the rest
bin/parallel-crawl myproject
Large Fleet (50+ spiders)
Split by type:
# Run Cloudflare spiders first (slower, memory-intensive)
# Assumes the list output labels Cloudflare-enabled spiders with the word "cloudflare"
bin/parallel-crawl myproject $(./scrapai spiders list --project myproject | \
  grep -i cloudflare | awk '{print $2}')
# Then run regular spiders
bin/parallel-crawl myproject $(./scrapai spiders list --project myproject | \
  grep '•' | grep -vi cloudflare | awk '{print $2}')
Split by schedule:
# Morning batch (9am cron)
bin/parallel-crawl news bbc guardian cnn reuters
# Evening batch (9pm cron)
bin/parallel-crawl news nytimes wapo ft bloomberg
Memory-Constrained Systems
# Reduce parallelism
echo "$SPIDERS" | tr ' ' '\n' | parallel -j 2 ...
# Or run sequentially
for spider in $SPIDERS; do
  ./scrapai crawl "$spider" --project myproject
done
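When running sequentially it can help to keep going after a failure and report which spiders need a rerun. The following is a minimal sketch under stated assumptions: crawl_sequentially and the CRAWL_CMD override are hypothetical names introduced for illustration, and the crawl invocation is the ./scrapai crawl command shown above.

```shell
#!/bin/sh
# crawl_sequentially PROJECT SPIDER... -> run spiders one at a time and
# print the names of those whose crawl command exited non-zero.
# CRAWL_CMD is a hypothetical override (for dry runs); it defaults to ./scrapai.
crawl_sequentially() {
  cs_project=$1
  shift
  cs_failed=""
  for cs_spider in "$@"; do
    ${CRAWL_CMD:-./scrapai} crawl "$cs_spider" --project "$cs_project" \
      || cs_failed="$cs_failed $cs_spider"
  done
  echo "Failed:${cs_failed:- none}"
}

# Usage:
#   crawl_sequentially myproject spider1 spider2 spider3
```

Only one spider runs at a time, so peak memory is bounded by the largest single spider rather than by the fleet.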
Comparison with Airflow
| Feature | Parallel-Crawl | Airflow |
| --- | --- | --- |
| Setup | None (just GNU parallel) | Docker + configuration |
| Scheduling | Cron jobs | Built-in scheduler |
| Monitoring | Terminal output + logs | Web UI + graphs |
| Parallelism | Auto-detected | Manual configuration |
| Retry logic | Manual (rerun command) | Automatic with backoff |
| Use case | Ad-hoc batch crawls | Production scheduling |
When to use parallel-crawl:
One-time crawls of many sites
Testing spider fleet
Resource-constrained environments
Simple cron-based scheduling
When to use Airflow:
Production deployments
Complex dependencies between spiders
Team collaboration
Historical execution tracking
Integration with Cron
Daily Crawl of All Spiders
# crontab -e
0 2 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news >> logs/crawl.log 2>&1
Runs at 2am daily.
Weekday vs. Weekend
# Weekdays: full crawl
0 2 * * 1-5 cd /path/to/scrapai-cli && bin/parallel-crawl news
# Weekends: high-priority only
0 2 * * 0,6 cd /path/to/scrapai-cli && bin/parallel-crawl news priority1 priority2
Staggered Batches
# Batch 1: 2am
0 2 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news $(cat batch1.txt)
# Batch 2: 8am
0 8 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news $(cat batch2.txt)
# Batch 3: 2pm
0 14 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news $(cat batch3.txt)
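The batch files above are plain lists of spider names that the shell word-splits via $(cat ...). One way to produce them, sketched here under assumptions, is to distribute a list of names round-robin across N files. The function name split_batches and the input file spiders.txt are hypothetical; the batchN.txt naming follows the cron entries above.

```shell
#!/bin/sh
# split_batches LIST_FILE N [OUT_DIR]
# Distribute spider names (one per line, blank lines skipped) round-robin
# into OUT_DIR/batch1.txt .. OUT_DIR/batchN.txt. OUT_DIR defaults to ".".
split_batches() {
  awk -v n="$2" -v dir="${3:-.}" '
    NF { print $1 > (dir "/batch" (count % n + 1) ".txt"); count++ }
  ' "$1"
}

# Usage:
#   ./scrapai spiders list --project news | grep '•' | awk '{print $2}' > spiders.txt
#   split_batches spiders.txt 3
```

Each batch file that receives at least one name is overwritten, but a file that receives none is left untouched, so stale batch files should be removed before reducing the fleet.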
Troubleshooting
GNU Parallel Not Found
❌ GNU parallel is not installed
# Install on macOS
brew install parallel
# Install on Linux (Debian/Ubuntu)
sudo apt-get install parallel
Out of Memory Errors
Symptom: Spiders crash with “Killed” or OOM errors.
Solution: Reduce parallelism or split the fleet.
# Check available memory
free -h
# Reduce parallelism
echo "$SPIDERS" | tr ' ' '\n' | parallel -j 2 ...
Some Spiders Time Out
Symptom: “SIGTERM” or timeout messages.
Solution: Increase the timeout or run slow spiders separately.
# Increase timeout
parallel --timeout 12h ...
# Run slow spiders separately
bin/parallel-crawl news fast_spider1 fast_spider2
./scrapai crawl slow_spider --project news # Run alone
Jobs Not Starting
Check if parallel is actually running:
Check logs:
tail -f ~/.parallel/tmp/*
See Also
Airflow Integration: Production scheduling with Apache Airflow
Checkpoint Resume: Pause and resume long crawls