Apache Airflow Integration

Schedule and monitor ScrapAI spiders at scale with Apache Airflow. Each spider becomes a DAG with automatic discovery, project-based organization, and optional S3 upload.

Overview

The Airflow integration provides:

Automatic DAG generation from your spider database
Project-based organization with filtering and access control
Scheduled crawls with configurable intervals
Real-time monitoring with logs and execution history
S3 upload with gzip compression (optional)

Architecture

┌─────────────────────┐
│   Airflow Web UI    │  Port 8080
│   (Browse/Trigger)  │
└──────────┬──────────┘
           │
┌──────────▼──────────┐
│  Airflow Scheduler  │  Reads DAG files
│  (Manages Schedule) │  every few minutes
└──────────┬──────────┘
           │
┌──────────▼──────────┐
│  DAG Generator      │  Queries ScrapAI DB
│  (Python script)    │  Generates DAGs dynamically
└──────────┬──────────┘
           │
┌──────────▼──────────┐
│  ScrapAI Database   │  Your spider configs
│  (PostgreSQL)       │
└──────────┬──────────┘
           │
┌──────────▼──────────┐
│  Bash Operator      │  Executes:
│  (Run Task)         │  ./scrapai crawl {name}
└─────────────────────┘

Quick Start

1. Configure Environment

Add to your .env file:

# Airflow admin credentials
_AIRFLOW_WWW_USER_USERNAME=admin
_AIRFLOW_WWW_USER_PASSWORD=your_secure_password

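# Set Airflow UID to match your user. A .env file does not run shell
# command substitution, so paste the number printed by `id -u`
# (1000 below is only an example value).
AIRFLOW_UID=1000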

# Database connection (must match your ScrapAI database)
DB_HOST=host.docker.internal
DB_PORT=5432
DB_NAME=scrapai
DB_USER=postgres
DB_PASSWORD=your_password
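
The DB_* settings above eventually have to become a single connection string. The following is a minimal sketch of that step, assuming a SQLAlchemy-style PostgreSQL URL; `build_db_url` is a hypothetical helper for illustration, not a function from ScrapAI.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=os.environ) -> str:
    """Assemble a PostgreSQL URL from the DB_* variables (hypothetical helper)."""
    user = quote_plus(env["DB_USER"])
    # Escape characters such as '@' or ':' that would otherwise break the URL.
    password = quote_plus(env["DB_PASSWORD"])
    host = env.get("DB_HOST", "host.docker.internal")
    port = env.get("DB_PORT", "5432")
    name = env["DB_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"
```

Printing the result of `build_db_url()` inside the container is a quick way to confirm that the values from .env actually reached the Airflow services.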

2. Start Airflow

docker compose -f docker-compose.airflow.yml up -d

Wait 1-2 minutes for initialization.

3. Access Web UI

Open http://localhost:8080 and log in with your credentials. You’ll see DAGs for each spider in your database, named {project}_{spider_name}.

DAG Generation

DAGs are generated dynamically from your spider database. The generator runs on scheduler refresh (every few minutes).

DAG Naming Convention

Pattern: {project}_{spider_name}

Examples:

news_bbc_co_uk
climate_team_climate_news
default_example_spider (if no project set)
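
The naming rule can be sketched in a few lines. `make_dag_id` is a hypothetical name used only for this illustration; the fallback to "default" follows the last example above.

```python
from typing import Optional


def make_dag_id(project: Optional[str], spider_name: str) -> str:
    """Return '{project}_{spider_name}', using 'default' when no project is set."""
    return f"{project or 'default'}_{spider_name}"
```

For example, `make_dag_id("news", "bbc_co_uk")` gives `news_bbc_co_uk`, and a spider with no project ends up under the `default_` prefix.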

DAG Configuration

Each DAG includes:

dag = DAG(
    dag_id=f"{project}_{spider_name}",
    schedule_interval=None,  # Manual triggering by default
    tags=['scrapai', f'project:{project}', 'spider'],
    catchup=False,
    max_active_runs=1,  # Prevent concurrent runs
)

Task Structure

Each DAG has 2-3 tasks:

crawl_spider: Runs ./scrapai crawl {spider_name} --timeout 28800
- 8-hour graceful timeout
- 9-hour hard kill as fallback
verify_results: Runs ./scrapai show {spider_name} --limit 5
- Verifies data was extracted
- Shows sample of results
upload_to_s3 (optional): Compresses and uploads to S3
- Only runs if S3 credentials are configured
- Gzip compression before upload
- Preserves folder structure
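
The task list above can be summarized as a small planning function. This is a sketch, not the generator's actual code: `plan_tasks` and the two constants are hypothetical names, and mapping the 9-hour hard kill to a `timedelta` (as one would pass to an Airflow `execution_timeout`) is an assumption.

```python
from datetime import timedelta

GRACEFUL_TIMEOUT_SECONDS = 8 * 60 * 60  # value passed to --timeout (28800)
HARD_KILL_TIMEOUT = timedelta(hours=9)  # assumed fallback limit enforced by Airflow


def plan_tasks(spider_name: str, s3_enabled: bool) -> list:
    """Return (task_id, action) pairs in execution order."""
    tasks = [
        ("crawl_spider", f"./scrapai crawl {spider_name} --timeout {GRACEFUL_TIMEOUT_SECONDS}"),
        ("verify_results", f"./scrapai show {spider_name} --limit 5"),
    ]
    if s3_enabled:
        # The upload step exists only when S3 credentials are configured.
        tasks.append(("upload_to_s3", "gzip the latest crawl file and upload it"))
    return tasks
```

Keeping the hard limit one hour above the graceful one gives the crawler time to shut down cleanly before the task is killed.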

Scheduling Spiders

By default, spiders have no schedule (manual triggering only). To add scheduling:

Option 1: Database Column

Add a schedule_interval column to your spiders table:

ALTER TABLE spiders ADD COLUMN schedule_interval VARCHAR(50);

-- Set daily schedule for a spider
UPDATE spiders SET schedule_interval = '0 0 * * *' WHERE name = 'bbc_co_uk';

Option 2: Edit DAG Generator

Modify airflow/dags/scrapai_spider_dags.py:

# Custom schedule logic
if spider.name.startswith('news_'):
    schedule_interval = '@daily'
elif spider.name.startswith('research_'):
    schedule_interval = '@weekly'
else:
    schedule_interval = None

Common Schedules

Interval	Cron Expression	Description
`@hourly`	`0 * * * *`	Every hour at minute 0
`@daily`	`0 0 * * *`	Daily at midnight
`@weekly`	`0 0 * * 0`	Weekly on Sunday
Custom	`0 */6 * * *`	Every 6 hours
Custom	`0 9 * * 1-5`	Weekdays at 9am
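
A malformed expression is easy to store in the schedule_interval column, so a small check before handing the value to Airflow can help. The sketch below covers only the three presets from the table and counts cron fields; `normalize_schedule` is a hypothetical helper, and it does not validate the content of each field.

```python
from typing import Optional

# Presets taken from the table above.
PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
}


def normalize_schedule(value: Optional[str]) -> Optional[str]:
    """Expand a preset to cron form and reject strings without five fields."""
    if value is None or not value.strip():
        return None  # no schedule: manual triggering only
    value = value.strip()
    cron = PRESETS.get(value, value)
    if len(cron.split()) != 5:
        raise ValueError(f"expected 5 cron fields, got: {value!r}")
    return cron
```

A value with a missing field, such as `0 /6 * *`, raises `ValueError` here instead of failing later in the scheduler.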

Project-Based Organization

Filtering by Project

Go to Airflow UI → DAGs page
Click a project tag: project:your_project_name
See only that project’s spiders

Environment Variable Filter

Limit which projects appear in Airflow:

# In .env
AIRFLOW_PROJECT_FILTER=news,research,climate

Only spiders from those projects will generate DAGs.
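
One way such a filter could be applied is sketched below. Both function names are hypothetical, and treating an unset or empty variable as "no filtering" is an assumption rather than documented behavior.

```python
from typing import Optional, Set


def parse_project_filter(raw: Optional[str]) -> Optional[Set[str]]:
    """Parse 'news,research' into a set; None means no filtering (assumed)."""
    if not raw or not raw.strip():
        return None
    return {name.strip() for name in raw.split(",") if name.strip()}


def is_allowed(project: str, allowed: Optional[Set[str]]) -> bool:
    """A project passes when there is no filter or it is listed in the filter."""
    return allowed is None or project in allowed
```

The generator would call `parse_project_filter(os.environ.get("AIRFLOW_PROJECT_FILTER"))` once and then skip any spider whose project fails `is_allowed`.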

Triggering Crawls

Via Web UI

Go to DAGs page
Find your spider DAG
Click the “Play” button (▶)
Monitor progress in real-time

Via CLI

# Trigger a specific spider
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  airflow dags trigger {project}_{spider_name}

# Example
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  airflow dags trigger news_bbc_co_uk

Via REST API

curl -X POST \
  http://localhost:8080/api/v1/dags/{project}_{spider_name}/dagRuns \
  -H "Content-Type: application/json" \
  -u "admin:your_password" \
  -d '{"conf": {}}'
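
The same call can be made from Python with only the standard library. This sketch mirrors the curl command above and assumes basic authentication is enabled for the Airflow API; `build_trigger_request` is a hypothetical helper name.

```python
import base64
import json
import urllib.request


def build_trigger_request(base_url: str, dag_id: str, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that starts a new DAG run."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns",
        data=json.dumps({"conf": {}}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen(...)` sends the request; building it separately keeps the URL and payload easy to inspect first.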

Monitoring

View Execution Logs

Click DAG name
Select a DAG run (date/time)
Click task (green/red box)
Click “Log” button

Execution History

Each DAG shows:

Last run status (success/fail)
Run duration
Success rate over time
Task dependencies graph

Stats Available

Duration: How long each crawl took
Records scraped: From verify task output
Failures: Which spiders are broken
Trends: Performance over time

S3 Integration

Upload crawl results to S3-compatible storage with automatic gzip compression.

Configuration

Add to .env:

S3_ACCESS_KEY=your_access_key
S3_SECRET_KEY=your_secret_key
S3_ENDPOINT=https://s3.amazonaws.com
S3_BUCKET=scrapai-crawls

The DAG generator automatically enables S3 upload if all credentials are present.
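
The "all credentials present" rule might look like the following. This is a sketch under the assumption that an empty value counts as missing; `s3_upload_enabled` is a hypothetical name.

```python
import os

REQUIRED_S3_VARS = ("S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_ENDPOINT", "S3_BUCKET")


def s3_upload_enabled(env=os.environ) -> bool:
    """Return True only when every S3 variable is set to a non-empty value."""
    return all(env.get(name, "").strip() for name in REQUIRED_S3_VARS)
```

If the upload_to_s3 task is missing from a DAG, checking each of these four variables inside the container is a reasonable first step.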

Upload Behavior

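Abridged from airflow/dags/scrapai_spider_dags.py:61-139 (imports and S3 client setup omitted):

def upload_to_s3(spider_name: str, **context):
    # Find latest crawl file (dated folders and timestamped names sort chronologically)
    data_dir = SCRAPAI_PATH / 'data' / spider_name
    crawl_files = sorted(glob(str(data_dir / '**' / 'crawl_*.jsonl'), recursive=True), reverse=True)
    latest_path = Path(crawl_files[0])
    gz_path = latest_path.with_name(latest_path.name + '.gz')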
    
    # Compress to .jsonl.gz
    with open(latest_path, 'rb') as f_in:
        with gzip.open(gz_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    
    # Preserve folder structure: spider_name/date/filename.gz
    relative_path = gz_path.relative_to(SCRAPAI_PATH / 'data')
    s3_key = str(relative_path)
    
    # Upload
    s3_client.upload_file(str(gz_path), s3_bucket, s3_key)
    
    # Clean up local files after successful upload
    gz_path.unlink()
    latest_path.unlink()

Compression savings: Typically 70-90% for JSONL text data.

S3 path structure: s3://bucket/spider_name/YYYY-MM-DD/crawl_HHMMSS.jsonl.gz
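
The compression step and the key layout can be tried locally without any S3 access. The sketch below is a standalone illustration, not the code from scrapai_spider_dags.py; `compress_for_upload` and its parameters are hypothetical names.

```python
import gzip
import shutil
from pathlib import Path
from typing import Tuple


def compress_for_upload(src: Path, data_root: Path) -> Tuple[Path, str]:
    """Gzip src next to itself and return (gz_path, s3_key)."""
    gz_path = src.with_name(src.name + ".gz")
    with open(src, "rb") as f_in, gzip.open(gz_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    # The key keeps the path below the data directory: spider_name/date/file.gz
    s3_key = gz_path.relative_to(data_root).as_posix()
    return gz_path, s3_key
```

Comparing `gz_path.stat().st_size` with `src.stat().st_size` on a real crawl file shows the savings for your own data, which depend on how repetitive the records are.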

Access Control (RBAC)

Creating Project-Specific Roles

Go to Security → List Roles
Click “+” to add new role
Name: project_news_admin
Select permissions:
- can_read on DAG:news_*
- can_edit on DAG:news_*
- can_trigger on DAG:news_*

Creating Users

Go to Security → List Users
Click “+” to add new user
Assign role: project_news_admin

Permission Levels

Role	Can View	Can Trigger	Can Edit	Can Delete
Admin	All DAGs	Yes	Yes	Yes
Project Admin	Project DAGs	Yes	Yes	Yes
Project User	Project DAGs	Yes	Yes	No
Viewer	Project DAGs	No	No	No

Programmatic Access Control

Uncomment in airflow/dags/scrapai_spider_dags.py:193-196:

dag = DAG(
    # ... other settings ...
    access_control={
        f'{project}_admin': {'can_read', 'can_edit', 'can_delete'},
        f'{project}_user': {'can_read', 'can_edit'},
    },
)

Then create matching roles in Airflow UI.
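
Because the role names are derived from the project name, the mapping can be produced by a small function. This is a sketch with a hypothetical helper name; it reproduces the permissions shown in the snippet above and nothing more.

```python
from typing import Dict, Set


def build_access_control(project: str) -> Dict[str, Set[str]]:
    """Return the per-project role-to-permissions mapping for a DAG."""
    return {
        f"{project}_admin": {"can_read", "can_edit", "can_delete"},
        f"{project}_user": {"can_read", "can_edit"},
    }
```

For the `news` project this yields the roles `news_admin` and `news_user`, which are the names that would then need to exist in the Airflow UI.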

Alerting

Email Notifications

Edit DEFAULT_DAG_ARGS in scrapai_spider_dags.py:50-58:

DEFAULT_DAG_ARGS = {
    'owner': 'scrapai',
    'email': ['your-email@example.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    # ... other settings ...
}

Configure SMTP

Add to docker-compose.airflow.yml environment:

AIRFLOW__SMTP__SMTP_HOST: smtp.gmail.com
AIRFLOW__SMTP__SMTP_PORT: 587
AIRFLOW__SMTP__SMTP_USER: your-email@gmail.com
AIRFLOW__SMTP__SMTP_PASSWORD: your-app-password
AIRFLOW__SMTP__SMTP_MAIL_FROM: your-email@gmail.com

Custom Alerts

Add custom task after verify:

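notify_task = BashOperator(
    task_id='send_notification',
    # Adjacent string literals are joined, so the command stays on one shell line.
    # The JSON body is wrapped in single quotes so the shell passes it through intact.
    bash_command=(
        'curl -X POST https://your-webhook.com/notify '
        f'-d \'{{"spider": "{spider.name}", "status": "complete"}}\''
    ),
)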

crawl_task >> verify_task >> notify_task
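
Hand-written quoting is fragile once spider names or URLs contain special characters. A sketch of a safer way to assemble the command, using only the standard library, follows; `build_notify_command` is a hypothetical helper and the webhook URL remains a placeholder.

```python
import json
import shlex


def build_notify_command(spider_name: str, webhook_url: str) -> str:
    """Build a curl command whose URL and JSON body are shell-quoted."""
    payload = json.dumps({"spider": spider_name, "status": "complete"})
    return f"curl -X POST {shlex.quote(webhook_url)} -d {shlex.quote(payload)}"
```

The returned string can be passed as `bash_command`; `json.dumps` handles the JSON escaping and `shlex.quote` handles the shell escaping, so neither needs to be done by hand.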

Management Commands

# Start Airflow
docker compose -f docker-compose.airflow.yml up -d

# Stop Airflow
docker compose -f docker-compose.airflow.yml down

# View logs
docker compose -f docker-compose.airflow.yml logs -f airflow-scheduler

# Restart scheduler (to pick up DAG changes)
docker compose -f docker-compose.airflow.yml restart airflow-scheduler

# List all DAGs
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  airflow dags list

# Pause/unpause DAG
docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  airflow dags pause {dag_id}

docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  airflow dags unpause {dag_id}

# Reset everything (WARNING: deletes all Airflow data)
docker compose -f docker-compose.airflow.yml down -v

Troubleshooting

DAGs Not Showing Up

Check DAG file for errors:

docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  python /opt/airflow/dags/scrapai_spider_dags.py

Check scheduler logs:

docker compose -f docker-compose.airflow.yml logs airflow-scheduler

Verify database connection:

docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  python -c "from core.db import SessionLocal; print(SessionLocal())"

Spider Crawls Failing

Check task logs in Airflow UI:

Click failed task (red box)
Click “Log” button
Look for error messages

Test spider manually:

# Open a shell inside the container
docker compose -f docker-compose.airflow.yml exec airflow-webserver bash

# Try running spider
cd /opt/scrapai
source .venv/bin/activate
./scrapai crawl {spider_name} --project {project}

Database Connection Issues

Use host.docker.internal instead of localhost:

# In .env
DB_HOST=host.docker.internal

Test connectivity from container:

docker compose -f docker-compose.airflow.yml exec airflow-webserver \
  ping -c 3 host.docker.internal

Best Practices

Resource Management

Set max_active_runs=1 to prevent concurrent runs
Use execution_timeout to prevent runaway tasks
Monitor memory usage for large crawls

Scheduling Strategy

High-frequency sites (news): @hourly or 0 */6 * * *
Daily updates: @daily (midnight) or 0 9 * * * (9am)
Weekly archives: 0 0 * * 0 (Sunday midnight)
Manual only: None (on-demand triggering)

Monitoring

Set up email alerts for failures
Review execution times weekly
Check success rates for broken spiders
Monitor S3 storage growth

See Also

Parallel Crawling: Run multiple spiders simultaneously with GNU parallel
Security: Security validation and agent safety features
