Database-First Spider Management

ScrapAI stores spiders as database rows, not Python files. This architectural choice enables powerful management patterns impossible with file-based scrapers.

The Traditional Approach: File-Based Spiders

In a typical Scrapy project:

Traditional Scrapy Project

project/
├── spiders/
│   ├── bbc_spider.py
│   ├── cnn_spider.py
│   ├── nytimes_spider.py
│   ├── guardian_spider.py
│   └── ... 96 more files
├── settings.py
└── scrapy.cfg

Problems at scale:

No Central Inventory

Which spiders exist? Grep the filesystem. Which are active? Check each file.

Hard to Batch Update

Change DOWNLOAD_DELAY across 100 spiders? Edit 100 files or write a script.

No Metadata

When was this spider created? By whom? For what project? Add comments and hope.

Code Drift

5 developers write spiders in 5 different styles. No consistency, harder to review.

The ScrapAI Approach: Database-First

ScrapAI Database Schema

spiders
├── id (primary key)
├── name (unique, indexed)
├── allowed_domains (JSON)
├── start_urls (JSON)
├── source_url (text)
├── active (boolean)
├── project (string)
├── callbacks_config (JSON)
├── created_at (timestamp)
└── updated_at (timestamp)

spider_rules
├── id (primary key)
├── spider_id (foreign key)
├── allow_patterns (JSON)
├── deny_patterns (JSON)
├── callback (string)
├── follow (boolean)
└── priority (integer)

spider_settings
├── id (primary key)
├── spider_id (foreign key)
├── key (string)
├── value (string)
└── type (string)

One spider class, many configurations. The DatabaseSpider loads any config from the database at runtime.

Benefits of Database-First

1. Central Inventory

List all spiders with a single query:

CLI

./scrapai spiders list --project news

SQL

SELECT name, active, created_at, updated_at 
FROM spiders 
WHERE project = 'news'
ORDER BY created_at DESC;

Output:

Name            Active  Created              Updated
─────────────────────────────────────────────────────
bbc_co_uk       Yes     2026-01-15 10:23:11  2026-02-20 14:55:32
cnn_com         Yes     2026-01-15 11:10:45  2026-01-15 11:10:45
guardian_co_uk  No      2026-01-14 09:12:33  2026-02-18 16:22:10

No filesystem traversal, no parsing Python files, no guessing.

2. Batch Updates

Change settings across multiple spiders:

-- Increase delay for all news spiders
UPDATE spider_settings 
SET value = '2'
WHERE key = 'DOWNLOAD_DELAY' 
  AND spider_id IN (SELECT id FROM spiders WHERE project = 'news');

With files: Write a script to parse and edit 100 Python files. Hope regex doesn’t break syntax. With database: One SQL query. Done.

3. Rich Metadata

Every spider tracks:

Temporal Data
Project Organization
Activity Status

Temporal Queries

-- Spiders created in the last 30 days
SELECT name, created_at 
FROM spiders 
WHERE created_at > NOW() - INTERVAL '30 days';

-- Spiders not updated in 90 days (stale?)
SELECT name, updated_at 
FROM spiders 
WHERE updated_at < NOW() - INTERVAL '90 days';

Project Queries

-- Count spiders per project
SELECT project, COUNT(*) as spider_count
FROM spiders
GROUP BY project
ORDER BY spider_count DESC;

-- Output:
-- news: 45 spiders
-- ecommerce: 23 spiders
-- forums: 12 spiders

Activity Queries

-- Active vs inactive spiders
SELECT active, COUNT(*) as count
FROM spiders
GROUP BY active;

-- Output:
-- Active: 78
-- Inactive: 22

4. Import/Export as Data

Spiders are JSON-serializable data structures:

{
  "name": "bbc_co_uk",
  "source_url": "https://bbc.co.uk/",
  "allowed_domains": ["bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/"],
  "rules": [
    {
      "allow": ["/news/articles/.*"],
      "callback": "parse_article"
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1
  }
}

Use cases:

Share spider configs between projects
Version control configs (not code)
Backup/restore spider configurations
Distribute configs to other teams

5. Consistency and Validation

All configs go through the same validation pipeline:

Validation via Pydantic

from pydantic import BaseModel, Field, validator
import re

class SpiderConfig(BaseModel):
    name: str = Field(pattern=r"^[a-zA-Z0-9_-]+$")  # Strict naming
    allowed_domains: list[str] = Field(min_items=1)
    start_urls: list[str] = Field(min_items=1)
    rules: list[RuleConfig] = []
    settings: dict = {}
    
    @validator("start_urls")
    def validate_urls(cls, urls):
        for url in urls:
            if not url.startswith(("http://", "https://")):
                raise ValueError(f"Invalid URL: {url}")
            # Block private IPs
            if any(ip in url for ip in ["127.0.0.1", "localhost", "10.", "192.168."]):
                raise ValueError(f"Private IP blocked: {url}")
        return urls

Result: Uniform structure, validated inputs, no malformed configs.

Database Schema Deep Dive

Spider Table

integer

required

Primary key, auto-increment

name

string

required

Unique spider identifier. Must match ^[a-zA-Z0-9_-]+$.Example: bbc_co_uk, amazon_electronics

allowed_domains

JSON

required

List of domains the spider can crawl. Scrapy’s domain restriction.Example: ["bbc.co.uk", "www.bbc.co.uk"]

start_urls

JSON

required

Initial URLs to start crawling from.Example: ["https://www.bbc.co.uk/news"]

source_url

string

The original URL provided by the user when creating the spider.Example: https://bbc.co.uk/

active

boolean

default:"true"

Enable/disable spider without deletion. Inactive spiders are skipped.

project

string

default:"default"

Project grouping for multi-project setups.Example: news, ecommerce, research

callbacks_config

JSON

Custom callback definitions for non-article content (products, jobs, listings).

{
  "parse_product": {
    "extract": {
      "name": {"css": "h1.product-name::text"},
      "price": {"css": "span.price::text"}
    }
  }
}

created_at

timestamp

When the spider was created (auto-set).

updated_at

timestamp

Last update timestamp (auto-updated on change).

SpiderRule Table

Maps to Scrapy’s Rule and LinkExtractor:

Scrapy Equivalent

Rule(
    LinkExtractor(
        allow=[r"/news/articles/.*"],
        deny=[r"/news/.*#comments"]
    ),
    callback="parse_article",
    follow=False
)

Database representation:

allow_patterns

JSON

Regex patterns to match URLs for crawling.Example: ["/news/articles/.*", "/sport/.*/articles/.*"]

deny_patterns

JSON

Regex patterns to exclude URLs.Example: ["/news/.*#comments", "/gallery/.*"]

callback

string

Callback method name (parse_article, parse_product, etc.) or null for link following only.

boolean

default:"true"

Whether to follow links matched by this rule.

priority

integer

default:"0"

Rule execution order (higher priority first).

SpiderSetting Table

Key-value pairs for Scrapy settings:

Common Settings
Advanced Settings

Key	Value Example	Description
`EXTRACTOR_ORDER`	`["newspaper", "trafilatura"]`	Extraction fallback order
`DOWNLOAD_DELAY`	`2`	Seconds between requests
`CONCURRENT_REQUESTS`	`16`	Parallel request limit
`CLOUDFLARE_ENABLED`	`true`	Enable Cloudflare bypass
`ROBOTSTXT_OBEY`	`true`	Respect robots.txt

Key	Value Example	Description
`CUSTOM_SELECTORS`	`{"title": "h1.title"}`	Custom CSS selectors
`PLAYWRIGHT_WAIT_SELECTOR`	`div.content`	Wait for element before extraction
`PLAYWRIGHT_DELAY`	`3`	Additional wait time (seconds)
`INFINITE_SCROLL`	`true`	Enable infinite scroll
`MAX_SCROLLS`	`10`	Max scroll iterations

Querying the Database

ScrapAI provides a safe SQL query interface:

Read-Only Queries

# List all spiders
./scrapai db query "SELECT name, project, active FROM spiders"

# Count items per spider
./scrapai db query "SELECT spider_id, COUNT(*) FROM scraped_items GROUP BY spider_id"

# Find spiders with Cloudflare enabled
./scrapai db query "SELECT s.name FROM spiders s JOIN spider_settings ss ON s.id = ss.spider_id WHERE ss.key = 'CLOUDFLARE_ENABLED' AND ss.value = 'true'"

UPDATE/DELETE queries require confirmation via --confirm flag with row count to prevent accidental data loss.

./scrapai db query "DELETE FROM spiders WHERE name = 'test_spider'" --confirm 1

Real-World Patterns

Pattern 1: Fleet Health Check

Run test crawls on all spiders monthly:

Test All Spiders

# Get active spiders
./scrapai db query "SELECT name FROM spiders WHERE active = true AND project = 'news'" > active_spiders.txt

# Test each spider (5 sample URLs)
while read spider; do
  echo "Testing $spider..."
  ./scrapai crawl "$spider" --project news --limit 5
done < active_spiders.txt

# Check for extraction failures
./scrapai db query "SELECT spider_id, COUNT(*) FROM scraped_items WHERE scraped_at > NOW() - INTERVAL '1 day' GROUP BY spider_id HAVING COUNT(*) = 0"

Pattern 2: Bulk Configuration Changes

Rate Limit All Spiders

-- Increase delay for aggressive sites
UPDATE spider_settings 
SET value = '3'
WHERE key = 'DOWNLOAD_DELAY' 
  AND spider_id IN (
    SELECT id FROM spiders 
    WHERE allowed_domains::text LIKE '%amazon%' 
       OR allowed_domains::text LIKE '%ebay%'
  );

Pattern 3: Spider Versioning

Export before making changes:

Backup Before Update

# Export current config
./scrapai spiders export bbc_co_uk --project news > bbc_v1.json

# Make changes (via agent or manual edit)
# ... modify rules, settings, etc.

# Test new version
./scrapai crawl bbc_co_uk --project news --limit 10

# Rollback if needed
./scrapai spiders import bbc_v1.json --project news

Pattern 4: Multi-Project Management

Project Isolation

# Create separate projects for different use cases
./scrapai spiders import news_spider.json --project production
./scrapai spiders import news_spider.json --project staging
./scrapai spiders import news_spider.json --project testing

# Each project has independent configs and data
./scrapai crawl news_spider --project production
./scrapai crawl news_spider --project testing --limit 5

PostgreSQL vs SQLite

SQLite (Default)
PostgreSQL (Production)

Best for:

Single-user development
Small to medium scale (< 100 spiders)
Simple deployment (no external database)

Performance:

Read-heavy: Excellent (with WAL mode)
Write-heavy: Good (sequential writes)
Concurrent access: Limited (single writer)

Configuration:

.env

DATABASE_URL=sqlite:///scrapai.db

ScrapAI auto-enables WAL mode for better concurrency:

cursor.execute("PRAGMA journal_mode=WAL")

Best for:

Multi-user teams
Large scale (100+ spiders)
High concurrency (parallel crawls)

Performance:

Read-heavy: Excellent
Write-heavy: Excellent
Concurrent access: Excellent (MVCC)

Configuration:

.env

DATABASE_URL=postgresql://user:password@localhost:5432/scrapai

Migration:

# Update .env with PostgreSQL URL
./scrapai db migrate
./scrapai db transfer sqlite:///scrapai.db

Key Takeaways

Spiders are Data

Not files, not code—structured data in a database. Query, update, export like any other dataset.

One Spider Class

The DatabaseSpider loads any config at runtime. No code generation, no Python files per site.

Rich Metadata

Track creation time, update time, project, activity status. Impossible with files.

Batch Operations

Change settings across 100 spiders with one SQL query. Update, disable, export in bulk.

Database-First Spider Management

The Traditional Approach: File-Based Spiders

No Central Inventory

Hard to Batch Update

No Metadata

Code Drift

The ScrapAI Approach: Database-First

Benefits of Database-First

1. Central Inventory

2. Batch Updates

3. Rich Metadata

4. Import/Export as Data

5. Consistency and Validation

Database Schema Deep Dive

Spider Table

SpiderRule Table

SpiderSetting Table

Querying the Database

Real-World Patterns

Pattern 1: Fleet Health Check

Pattern 2: Bulk Configuration Changes

Pattern 3: Spider Versioning

Pattern 4: Multi-Project Management

PostgreSQL vs SQLite

Key Takeaways

Spiders are Data

One Spider Class

Rich Metadata

Batch Operations

Next Steps

Spider Schema

CLI Reference

Documentation Index

​The Traditional Approach: File-Based Spiders

No Central Inventory

Hard to Batch Update

No Metadata

Code Drift

​The ScrapAI Approach: Database-First

​Benefits of Database-First

​1. Central Inventory

​2. Batch Updates

​3. Rich Metadata

​4. Import/Export as Data

​5. Consistency and Validation

​Database Schema Deep Dive

​Spider Table

​SpiderRule Table

​SpiderSetting Table

​Querying the Database

​Real-World Patterns

​Pattern 1: Fleet Health Check

​Pattern 2: Bulk Configuration Changes

​Pattern 3: Spider Versioning

​Pattern 4: Multi-Project Management

​PostgreSQL vs SQLite

​Key Takeaways

Spiders are Data

One Spider Class

Rich Metadata

Batch Operations

​Next Steps

Spider Schema

CLI Reference

The Traditional Approach: File-Based Spiders

The ScrapAI Approach: Database-First

Benefits of Database-First

1. Central Inventory

2. Batch Updates

3. Rich Metadata

4. Import/Export as Data

5. Consistency and Validation

Database Schema Deep Dive

Spider Table

SpiderRule Table

SpiderSetting Table

Querying the Database

Real-World Patterns

Pattern 1: Fleet Health Check

Pattern 2: Bulk Configuration Changes

Pattern 3: Spider Versioning

Pattern 4: Multi-Project Management

PostgreSQL vs SQLite

Key Takeaways

Next Steps