Parallel Crawling

Crawl hundreds of websites in parallel, with the number of concurrent jobs sized automatically from available CPU cores and memory.

Overview

The parallel-crawl script uses GNU parallel to run multiple ScrapAI spiders concurrently. It automatically detects system resources (CPU cores, available memory) and calculates optimal parallelism based on spider types (regular vs. Cloudflare-enabled).

Quick Start

Install GNU Parallel

brew install parallel

Run All Spiders in Project

bin/parallel-crawl myproject

Run Specific Spiders

bin/parallel-crawl myproject spider1 spider2 spider3

How It Works

From bin/parallel-crawl:1-134:

#!/bin/bash
# Parallel crawler using GNU parallel

set -euo pipefail

PROJECT="$1"
shift

# Get spider list
if [ $# -eq 0 ]; then
    SPIDERS=$(./scrapai spiders list --project "$PROJECT" | grep '•' | awk '{print $2}')
else
    SPIDERS="$@"
fi

# Count Cloudflare-enabled spiders
CF_COUNT=$(python3 -c "
import sys  # needed for sys.argv below

from core.db import get_db
from core.models import Spider

db = next(get_db())
names = sys.argv[1:]
count = 0
for name in names:
    spider = db.query(Spider).filter(Spider.name == name).first()
    if spider:
        for s in spider.settings:
            if s.key == 'CLOUDFLARE_ENABLED' and str(s.value).lower() in ('true', '1'):
                count += 1
                break
print(count)
" $SPIDERS)

# Auto-detect parallelism from system resources
CPU_CORES=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
AVAILABLE_MEM_MB=$(free -m | awk '/^Mem:/ {print $7}')

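# Fleet totals used below (not shown in the original excerpt; assumed definitions)
SPIDER_COUNT=$(echo "$SPIDERS" | wc -w | tr -d ' ')
REGULAR_COUNT=$(( SPIDER_COUNT - CF_COUNT ))

# Memory per spider: regular 200MB, Cloudflare 500MB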
if [ "$CF_COUNT" -eq 0 ]; then
    MEM_PER_SPIDER=200
elif [ "$CF_COUNT" -eq "$SPIDER_COUNT" ]; then
    MEM_PER_SPIDER=500
else
    MEM_PER_SPIDER=$(( (REGULAR_COUNT * 200 + CF_COUNT * 500) / SPIDER_COUNT ))
fi

MEM_PARALLEL=$(( (AVAILABLE_MEM_MB - 2048) / MEM_PER_SPIDER ))
CPU_PARALLEL=$(( CPU_CORES * 80 / 100 ))
PARALLEL=$(( MEM_PARALLEL < CPU_PARALLEL ? MEM_PARALLEL : CPU_PARALLEL ))

# Run crawls in parallel
echo "$SPIDERS" | tr ' ' '\n' | parallel \
    -j "$PARALLEL" \
    --timeout 8h \
    --halt soon,fail=50% \
    --line-buffer \
    --tagstring "[{.}]" \
    "./scrapai crawl {} --project $PROJECT"

Resource Calculation

Memory-Based Parallelism

The script allocates memory per spider type:

Regular spiders: 200 MB each
Cloudflare spiders: 500 MB each (browser automation overhead)
Mixed fleet: Weighted average

Formula:

AVAILABLE_MEM_MB=$(free -m | awk '/^Mem:/ {print $7}')
MEM_PARALLEL=$(( (AVAILABLE_MEM_MB - 2048) / MEM_PER_SPIDER ))

The script reserves 2 GB (2048 MB) for the system and divides the remaining available memory by the per-spider allocation.
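As a worked example of the formula, take the 47-spider fleet from the example output (35 regular, 12 Cloudflare) and assume 8192 MB of available memory. The memory figure is a hypothetical input, not a measured value.

```bash
# Fleet composition from the example output; AVAILABLE_MEM_MB is an assumed value
REGULAR_COUNT=35
CF_COUNT=12
SPIDER_COUNT=$(( REGULAR_COUNT + CF_COUNT ))
AVAILABLE_MEM_MB=8192

# Weighted average: (35*200 + 12*500) / 47 = 13000 / 47, truncated to 276
MEM_PER_SPIDER=$(( (REGULAR_COUNT * 200 + CF_COUNT * 500) / SPIDER_COUNT ))

# (8192 - 2048) / 276 = 6144 / 276, truncated to 22
MEM_PARALLEL=$(( (AVAILABLE_MEM_MB - 2048) / MEM_PER_SPIDER ))

echo "MEM_PER_SPIDER=${MEM_PER_SPIDER} MEM_PARALLEL=${MEM_PARALLEL}"
# prints: MEM_PER_SPIDER=276 MEM_PARALLEL=22
```

Shell arithmetic truncates, so the per-spider figure is rounded down, which slightly overestimates how many jobs fit in memory.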

CPU-Based Parallelism

CPU_CORES=$(nproc)
CPU_PARALLEL=$(( CPU_CORES * 80 / 100 ))

Uses 80% of the detected cores, rounded down by integer division (for example, 8 cores gives 6 jobs), to leave headroom for the rest of the system.

Final Parallelism

PARALLEL=$(( MEM_PARALLEL < CPU_PARALLEL ? MEM_PARALLEL : CPU_PARALLEL ))
[ "$PARALLEL" -lt 2 ] && PARALLEL=2
[ "$PARALLEL" -gt 20 ] && PARALLEL=20

Takes the minimum of memory-based and CPU-based limits, clamped between 2 and 20.
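The three steps can be combined into one function for trying out different machine sizes. This is a sketch: calc_parallel and its argument order are hypothetical, and only the constants (2048 MB reserve, 80%, bounds of 2 and 20) come from the script.

```bash
# calc_parallel CORES AVAILABLE_MEM_MB MEM_PER_SPIDER  (hypothetical helper)
calc_parallel() {
    local cores=$1 mem_mb=$2 per_spider=$3
    local mem_parallel=$(( (mem_mb - 2048) / per_spider ))
    local cpu_parallel=$(( cores * 80 / 100 ))
    local parallel=$cpu_parallel
    # Take the smaller of the two limits
    if [ "$mem_parallel" -lt "$parallel" ]; then parallel=$mem_parallel; fi
    # Clamp to the range 2..20
    if [ "$parallel" -lt 2 ]; then parallel=2; fi
    if [ "$parallel" -gt 20 ]; then parallel=20; fi
    echo "$parallel"
}

calc_parallel 10 8192 276    # 10 cores, 8 GB available: the CPU limit (8) wins
calc_parallel 2 1024 200     # small machine: both limits fall below 2, clamped up
calc_parallel 64 65536 200   # large machine: clamped down to 20
```

On a machine with less than 2 GB available the memory term goes negative, so the lower clamp is what keeps the job count usable.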

Example Output

$ bin/parallel-crawl news

==========================================
Parallel Crawler
==========================================
Project:  news
Spiders:  47 (12 CF + 35 regular)
Parallel: 8 jobs
Timeout:  8h per spider
==========================================

Continue? (y/N): y

Starting parallel crawl...

[bbc_co_uk]  Starting crawl...
[guardian]   Starting crawl...
[reuters]    Starting crawl...
[cnn]        Starting crawl...
[bbc_co_uk]  ✓ Crawled 1,247 pages
[guardian]   ✓ Crawled 892 pages
[ap_news]    Starting crawl...
[reuters]    ✓ Crawled 2,103 pages
...

Advanced Usage

Custom Parallelism

Override auto-detection:

# Force 4 parallel jobs
echo "$SPIDERS" | tr ' ' '\n' | parallel -j 4 \
    "./scrapai crawl {} --project myproject"

Timeout Control

# 2-hour timeout per spider
parallel --timeout 2h ...

# No timeout (dangerous for stuck spiders)
parallel --timeout 0 ...

Failure Handling

From bin/parallel-crawl:127:

--halt soon,fail=50%

With soon, GNU parallel stops starting new jobs once 50% of jobs have failed and lets the jobs already running finish. This avoids wasting resources on a broken configuration. Other halt strategies:

--halt now,fail=1     # Stop immediately on first failure
--halt soon,fail=10%  # Stop if 10% fail
--halt never          # Continue even if all fail

Progress Monitoring

# Show progress counters (use --bar for a progress bar)
parallel --progress ...

# Show ETA
parallel --eta ...

# Both
parallel --progress --eta ...

Job Log

# Log all job completions
parallel --joblog crawl_log.txt ...

# Resume from log (skip completed jobs)
parallel --joblog crawl_log.txt --resume ...
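The job log is a tab-separated file whose columns include Exitval (7th), Signal (8th), and Command (9th), so failed or killed crawls can be listed with awk. A minimal sketch, with failed_jobs as a hypothetical helper name:

```bash
# List the commands of jobs that exited non-zero or were killed by a signal.
failed_jobs() {
    local joblog=$1
    # NR > 1 skips the header row
    awk -F'\t' 'NR > 1 && ($7 != 0 || $8 != 0) { print $9 }' "$joblog"
}

# Usage: failed_jobs crawl_log.txt
```

GNU parallel also provides --resume-failed, which reruns the failed jobs recorded in the job log.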

Resource Management

Memory Limits

Why 200MB for regular spiders?

Scrapy framework: ~50 MB
Downloaded pages in memory: ~100 MB
Extraction libraries: ~50 MB

Why 500MB for Cloudflare spiders?

Regular-spider baseline (above): 200 MB
Browser process (Chromium): ~200 MB
Rendering overhead: ~100 MB

CPU Scheduling

GNU parallel only caps how many jobs run at once; the operating system's scheduler divides CPU time among them:

Jobs at the same priority get roughly equal CPU time
I/O-bound tasks (most scrapers) yield CPU automatically
Network-bound tasks have minimal CPU impact

Disk I/O

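Each spider writes to its own output file:

data/{spider_name}/YYYY-MM-DD/crawl_HHMMSS.jsonl

Because no two spiders write to the same file, there is no file-level write contention, although all crawls still share the same disk bandwidth.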
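To illustrate the naming pattern, the sketch below builds a path of that shape for a given spider. output_path is a hypothetical helper; the real path is generated by ScrapAI itself, and taking the timestamp from the local clock is an assumption.

```bash
# Build data/{spider_name}/YYYY-MM-DD/crawl_HHMMSS.jsonl for the current time.
output_path() {
    local spider=$1
    echo "data/${spider}/$(date +%Y-%m-%d)/crawl_$(date +%H%M%S).jsonl"
}

output_path bbc_co_uk
output_path guardian
```

The spider name is part of the directory, so two crawls running at the same second still resolve to different files.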

Patterns and Best Practices

Small Fleet (< 10 spiders)

# Just run them all
bin/parallel-crawl myproject

Auto-detection handles everything.

Medium Fleet (10-50 spiders)

# Prioritize by importance
bin/parallel-crawl myproject high_priority_1 high_priority_2 ...

# Then run the rest
bin/parallel-crawl myproject

Large Fleet (50+ spiders)

Split by type:

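# Run Cloudflare spiders first (slower, memory-intensive).
# Assumes the list output mentions "cloudflare" for those spiders.
# If nothing matches, the script receives no names and runs every spider.
bin/parallel-crawl myproject $(./scrapai spiders list --project myproject | \
    grep '•' | grep -i cloudflare | awk '{print $2}')

# Then run regular spiders
bin/parallel-crawl myproject $(./scrapai spiders list --project myproject | \
    grep '•' | grep -vi cloudflare | awk '{print $2}')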

Split by schedule:

# Morning batch (9am cron)
bin/parallel-crawl news bbc guardian cnn reuters

# Evening batch (9pm cron)
bin/parallel-crawl news nytimes wapo ft bloomberg

Memory-Constrained Systems

# Reduce parallelism
echo "$SPIDERS" | tr ' ' '\n' | parallel -j 2 ...

# Or run sequentially
for spider in $SPIDERS; do
    ./scrapai crawl "$spider" --project myproject
done

Comparison with Airflow

Feature       Parallel-Crawl             Airflow
Setup         None (just GNU parallel)   Docker + configuration
Scheduling    Cron jobs                  Built-in scheduler
Monitoring    Terminal output + logs     Web UI + graphs
Parallelism   Auto-detected              Manual configuration
Retry logic   Manual (rerun command)     Automatic with backoff
Use case      Ad-hoc batch crawls        Production scheduling

When to use parallel-crawl:

One-time crawls of many sites
Testing spider fleet
Resource-constrained environments
Simple cron-based scheduling

When to use Airflow:

Production deployments
Complex dependencies between spiders
Team collaboration
Historical execution tracking

Integration with Cron

Daily Crawl of All Spiders

# crontab -e
0 2 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news >> logs/crawl.log 2>&1

Runs at 2am daily.

Weekday vs. Weekend

# Weekdays: full crawl
0 2 * * 1-5 cd /path/to/scrapai-cli && bin/parallel-crawl news

# Weekends: high-priority only
0 2 * * 0,6 cd /path/to/scrapai-cli && bin/parallel-crawl news priority1 priority2

Staggered Batches

# Batch 1: 2am
0 2 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news $(cat batch1.txt)

# Batch 2: 8am
0 8 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news $(cat batch2.txt)

# Batch 3: 2pm
0 14 * * * cd /path/to/scrapai-cli && bin/parallel-crawl news $(cat batch3.txt)
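The batch1.txt through batch3.txt files are not created by the script. One way to produce them is a round-robin split of the spider list; split_batches and the batchN.txt naming are assumptions of this sketch.

```bash
# split_batches LIST_FILE N OUT_DIR  (hypothetical helper)
# Reads one spider name per line and writes OUT_DIR/batch1.txt .. batchN.txt.
split_batches() {
    local list=$1 n=$2 out=$3
    mkdir -p "$out"
    # Skip blank lines (NF) and deal names out to the batch files in turn
    awk -v n="$n" -v out="$out" '
        NF { file = out "/batch" (count % n + 1) ".txt"; print > file; count++ }
    ' "$list"
}

# Usage (writes batch files into the current directory):
#   ./scrapai spiders list --project news | grep '•' | awk '{print $2}' > spiders.txt
#   split_batches spiders.txt 3 .
```

Round-robin assignment keeps the batches the same size to within one spider; it does not balance by crawl duration or by spider type.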

Troubleshooting

GNU Parallel Not Found

❌ GNU parallel is not installed

# Install on macOS
brew install parallel

# Install on Linux
sudo apt-get install parallel

Out of Memory Errors

Symptom: Spiders crash with “Killed” or OOM errors.

Solution: Reduce parallelism or split the fleet.

# Check available memory
free -h

# Reduce parallelism
echo "$SPIDERS" | tr ' ' '\n' | parallel -j 2 ...
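Note that free exists only on Linux, while the install instructions also cover macOS. A minimal sketch of a lookup that works on both follows; the function name available_mem_mb, the halving of total memory on macOS, and the 4096 MB default are all assumptions for illustration, not part of the script.

```bash
# Hypothetical helper: print an estimate of available memory in MB.
available_mem_mb() {
    local mb="" bytes
    if command -v free >/dev/null 2>&1; then
        # Linux (procps): column 7 of the Mem: row is "available"
        mb=$(free -m 2>/dev/null | awk '/^Mem:/ {print $7}')
    elif [ "$(uname -s)" = "Darwin" ]; then
        # macOS: hw.memsize is TOTAL bytes; assume half is usable (rough guess)
        bytes=$(sysctl -n hw.memsize 2>/dev/null || echo 0)
        mb=$(( bytes / 1024 / 1024 / 2 ))
    fi
    # Fall back to an assumed 4096 MB if nothing usable was detected
    case "$mb" in
        ''|0|*[!0-9]*) mb=4096 ;;
    esac
    echo "$mb"
}

echo "Available memory: $(available_mem_mb) MB"
```

The result can be compared against the 200 MB and 500 MB per-spider budgets to pick a value for -j by hand.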

Some Spiders Time Out

Symptom: “SIGTERM” or timeout messages.

Solution: Increase the timeout or exclude slow spiders from the batch.

# Increase timeout
parallel --timeout 12h ...

# Run slow spiders separately
bin/parallel-crawl news fast_spider1 fast_spider2
./scrapai crawl slow_spider --project news  # Run alone

Jobs Not Starting

Check if parallel is actually running:

ps aux | grep parallel

Check logs:

tail -f ~/.parallel/tmp/*

See Also

Airflow Integration: Production scheduling with Apache Airflow
Checkpoint Resume: Pause and resume long crawls
