ScrapAI stores spiders as database rows, not Python files. This architectural choice enables powerful management patterns impossible with file-based scrapers.
The Traditional Approach: File-Based Spiders
In a typical Scrapy project, each spider is its own Python file, which causes several problems:
No Central Inventory
Which spiders exist? Grep the filesystem. Which are active? Check each file.
Hard to Batch Update
Change DOWNLOAD_DELAY across 100 spiders? Edit 100 files or write a script.
No Metadata
When was this spider created? By whom? For what project? Add comments and hope.
Code Drift
5 developers write spiders in 5 different styles. No consistency, harder to review.
The ScrapAI Approach: Database-First
ScrapAI Database Schema
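A minimal sketch of how a database-first layout can work, using SQLite from Python's standard library. The table names, column names, and the `load_config` helper are illustrative assumptions, not ScrapAI's actual schema or API.

```python
import json
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE spiders (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL,
    allowed_domains TEXT NOT NULL,   -- JSON list
    start_urls TEXT NOT NULL,        -- JSON list
    active INTEGER NOT NULL DEFAULT 1,
    project TEXT
);
CREATE TABLE spider_settings (
    spider_id INTEGER NOT NULL REFERENCES spiders(id),
    key TEXT NOT NULL,
    value TEXT NOT NULL              -- JSON-encoded value
);
"""


def load_config(conn, name):
    """Build one spider's runtime config from its database rows."""
    row = conn.execute(
        "SELECT id, allowed_domains, start_urls FROM spiders WHERE name = ?",
        (name,),
    ).fetchone()
    if row is None:
        raise KeyError(f"no spider named {name!r}")
    spider_id, domains, urls = row
    settings = {
        key: json.loads(value)
        for key, value in conn.execute(
            "SELECT key, value FROM spider_settings WHERE spider_id = ?",
            (spider_id,),
        )
    }
    return {
        "name": name,
        "allowed_domains": json.loads(domains),
        "start_urls": json.loads(urls),
        "settings": settings,
    }


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO spiders (name, allowed_domains, start_urls, project) VALUES (?, ?, ?, ?)",
    ("bbc_co_uk", json.dumps(["bbc.co.uk"]), json.dumps(["https://www.bbc.co.uk/news"]), "news"),
)
conn.execute("INSERT INTO spider_settings VALUES (1, 'DOWNLOAD_DELAY', '2')")

# The same function serves every site; only the rows differ.
config = load_config(conn, "bbc_co_uk")
```

Adding a new site in this model means inserting rows, not writing a new class.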
One spider class, many configurations. The DatabaseSpider loads any config from the database at runtime.
Benefits of Database-First
1. Central Inventory
List all spiders with a single query, using either the CLI or SQL.
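As a rough illustration of the SQL side, the sketch below runs inventory queries against an in-memory SQLite database. The `spiders` table and its columns are assumed names, chosen to mirror the fields described on this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table layout, for illustration only.
conn.execute("CREATE TABLE spiders (name TEXT UNIQUE, project TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO spiders VALUES (?, ?, ?)",
    [
        ("bbc_co_uk", "news", 1),
        ("amazon_electronics", "ecommerce", 1),
        ("old_blog", "news", 0),
    ],
)

# One query answers both "which spiders exist?" and "which are active?".
active = conn.execute(
    "SELECT name, project FROM spiders WHERE active = 1 ORDER BY name"
).fetchall()

# Per-project counts come from the same table.
per_project = dict(
    conn.execute("SELECT project, COUNT(*) FROM spiders GROUP BY project")
)
```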
2. Batch Updates
Change settings across multiple spiders in one operation.
3. Rich Metadata
Every spider tracks:
- Temporal Data
- Project Organization
- Activity Status
Temporal Queries
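Because creation and update timestamps are stored per spider, time-based questions become ordinary queries. A small sketch, assuming hypothetical `created_at` and `updated_at` columns stored as ISO-8601 text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical columns; timestamps are stored as ISO-8601 strings.
conn.execute("CREATE TABLE spiders (name TEXT, created_at TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO spiders VALUES (?, ?, ?)",
    [
        ("bbc_co_uk", "2024-01-10 09:00:00", "2024-03-01 12:00:00"),
        ("amazon_electronics", "2024-02-20 15:30:00", "2024-02-20 15:30:00"),
    ],
)

# Spiders created on or after a cutoff (ISO-8601 strings sort chronologically).
recent = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM spiders WHERE created_at >= ? ORDER BY created_at",
        ("2024-02-01 00:00:00",),
    )
]

# Spiders that have been modified since they were created.
modified = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM spiders WHERE updated_at > created_at"
    )
]
```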
4. Import/Export as Data
Spiders are JSON-serializable data structures, so you can:
- Share spider configs between projects
- Version control configs (not code)
- Backup/restore spider configurations
- Distribute configs to other teams
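The uses listed above all rest on a lossless round trip between a config and its JSON text. A minimal sketch with the standard `json` module; the exact keys in a real ScrapAI export may differ from this assumed shape.

```python
import json

# Hypothetical config shape, assembled from the fields described on this page.
spider = {
    "name": "bbc_co_uk",
    "allowed_domains": ["bbc.co.uk", "www.bbc.co.uk"],
    "start_urls": ["https://www.bbc.co.uk/news"],
    "rules": [
        {"allow": ["/news/articles/.*"], "callback": "parse_article", "follow": False}
    ],
    "settings": {"DOWNLOAD_DELAY": 2},
}

# Export: sorted keys and indentation give a stable, diff-friendly file
# that works well under version control.
exported = json.dumps(spider, indent=2, sort_keys=True)

# Import: parsing the text yields an equal structure.
restored = json.loads(exported)
```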
5. Consistency and Validation
All configs go through the same validation pipeline, built on Pydantic.
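To show what a single shared validation step looks like, here is a standard-library sketch. It deliberately does not use Pydantic, and apart from the spider name pattern (which this page states), the individual checks are assumptions rather than ScrapAI's actual rules.

```python
import re
from urllib.parse import urlparse

# Name pattern as documented for the spider identifier.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")


def validate_config(config):
    """Return a list of problems; an empty list means the config passed."""
    errors = []
    if not NAME_PATTERN.fullmatch(config.get("name", "")):
        errors.append("name must match ^[a-zA-Z0-9_-]+$")
    # The two checks below are illustrative assumptions.
    if not config.get("allowed_domains"):
        errors.append("allowed_domains must not be empty")
    for url in config.get("start_urls", []):
        if urlparse(url).scheme not in ("http", "https"):
            errors.append(f"start_url is not http(s): {url}")
    return errors


ok = validate_config(
    {
        "name": "bbc_co_uk",
        "allowed_domains": ["bbc.co.uk"],
        "start_urls": ["https://www.bbc.co.uk/news"],
    }
)
bad = validate_config(
    {"name": "bbc co uk", "allowed_domains": [], "start_urls": ["ftp://example.com"]}
)
```

Every config passes through the same function, so a rule added once applies to the whole fleet.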
Database Schema Deep Dive
Spider Table
- Primary key, auto-increment.
- Unique spider identifier. Must match ^[a-zA-Z0-9_-]+$. Example: bbc_co_uk, amazon_electronics
- List of domains the spider can crawl (Scrapy's domain restriction). Example: ["bbc.co.uk", "www.bbc.co.uk"]
- Initial URLs to start crawling from. Example: ["https://www.bbc.co.uk/news"]
- The original URL provided by the user when creating the spider. Example: https://bbc.co.uk/
- Enable/disable spider without deletion. Inactive spiders are skipped.
- Project grouping for multi-project setups. Example: news, ecommerce, research
- Custom callback definitions for non-article content (products, jobs, listings).
- When the spider was created (auto-set).
- Last update timestamp (auto-updated on change).
SpiderRule Table
Maps to Scrapy's Rule and LinkExtractor:
Scrapy Equivalent
- Regex patterns to match URLs for crawling. Example: ["/news/articles/.*", "/sport/.*/articles/.*"]
- Regex patterns to exclude URLs. Example: ["/news/.*#comments", "/gallery/.*"]
- Callback method name (parse_article, parse_product, etc.) or null for link following only.
- Whether to follow links matched by this rule.
- Rule execution order (higher priority first).
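The interaction of allow patterns, deny patterns, and priority can be sketched in plain Python. This is an approximation of the matching semantics, not Scrapy's or ScrapAI's code, and the dictionary keys are assumed names for the fields described above.

```python
import re

# Hypothetical rule rows; key names are assumptions.
rules = [
    {"allow": ["/sport/.*/articles/.*"], "deny": [],
     "callback": None, "follow": True, "priority": 1},
    {"allow": ["/news/articles/.*"], "deny": ["/news/.*#comments"],
     "callback": "parse_article", "follow": False, "priority": 10},
]


def match_rule(url, rules):
    """Return the first rule, highest priority first, whose patterns accept the URL."""
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        # An empty allow list accepts everything, as in Scrapy's LinkExtractor.
        allowed = not rule["allow"] or any(re.search(p, url) for p in rule["allow"])
        denied = any(re.search(p, url) for p in rule["deny"])
        if allowed and not denied:
            return rule
    return None


article = match_rule("https://www.bbc.co.uk/news/articles/abc123", rules)
comments = match_rule("https://www.bbc.co.uk/news/articles/abc123#comments", rules)
sport = match_rule("https://www.bbc.co.uk/sport/football/articles/xyz", rules)
```

Here the article URL resolves to the `parse_article` rule, the comments URL is rejected by the deny pattern, and the sport URL matches a follow-only rule with no callback.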
SpiderSetting Table
Key-value pairs for Scrapy settings. Common settings:
| Key | Value Example | Description |
|---|---|---|
| EXTRACTOR_ORDER | ["newspaper", "trafilatura"] | Extraction fallback order |
| DOWNLOAD_DELAY | 2 | Seconds between requests |
| CONCURRENT_REQUESTS | 16 | Parallel request limit |
| CLOUDFLARE_ENABLED | true | Enable Cloudflare bypass |
| ROBOTSTXT_OBEY | true | Respect robots.txt |
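Key-value rows need decoding before they can be used as typed settings. The sketch below assumes values are stored as JSON text, which is a guess based on the example values in the table; `decode_settings` is a hypothetical helper.

```python
import json

# Rows as they might come back from a key/value settings table.
rows = [
    ("EXTRACTOR_ORDER", '["newspaper", "trafilatura"]'),
    ("DOWNLOAD_DELAY", "2"),
    ("CONCURRENT_REQUESTS", "16"),
    ("CLOUDFLARE_ENABLED", "true"),
    ("ROBOTSTXT_OBEY", "true"),
]


def decode_settings(rows):
    """Decode JSON values, keeping the raw string when a value is not valid JSON."""
    settings = {}
    for key, raw in rows:
        try:
            settings[key] = json.loads(raw)
        except json.JSONDecodeError:
            settings[key] = raw
    return settings


settings = decode_settings(rows)
```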
Querying the Database
ScrapAI provides a safe SQL query interface that supports read-only queries.
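One way to get read-only guarantees with SQLite is to open the database in read-only mode, so the engine itself rejects writes. This sketch shows that general technique with Python's `sqlite3`; how ScrapAI enforces safety internally is not documented here, and the file and table names are placeholders.

```python
import os
import sqlite3
import tempfile

# Create a throwaway database file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "scrapai_demo.db")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE spiders (name TEXT, active INTEGER)")
rw.execute("INSERT INTO spiders VALUES ('bbc_co_uk', 1)")
rw.commit()
rw.close()

# mode=ro in the URI makes SQLite reject any write on this connection.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
names = [name for (name,) in ro.execute("SELECT name FROM spiders WHERE active = 1")]

try:
    ro.execute("DELETE FROM spiders")
    write_blocked = False
except sqlite3.OperationalError:
    # SQLite reports an attempt to write a read-only database.
    write_blocked = True
ro.close()
```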
Real-World Patterns
Pattern 1: Fleet Health Check
Run test crawls on all spiders monthly.
Pattern 2: Bulk Configuration Changes
Rate Limit All Spiders
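A fleet-wide rate limit change reduces to one UPDATE statement when settings live in a table. A minimal sketch against an in-memory SQLite database, with an assumed `spider_settings` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical key/value settings table.
conn.executescript(
    """
    CREATE TABLE spider_settings (spider_id INTEGER, key TEXT, value TEXT);
    INSERT INTO spider_settings VALUES
        (1, 'DOWNLOAD_DELAY', '1'),
        (2, 'DOWNLOAD_DELAY', '1'),
        (2, 'CONCURRENT_REQUESTS', '16');
    """
)

# Raise the delay for every spider in a single statement.
with conn:  # commits on success, rolls back if the statement fails
    cur = conn.execute(
        "UPDATE spider_settings SET value = ? WHERE key = 'DOWNLOAD_DELAY'",
        ("3",),
    )
    changed = cur.rowcount

delays = conn.execute(
    "SELECT spider_id, value FROM spider_settings "
    "WHERE key = 'DOWNLOAD_DELAY' ORDER BY spider_id"
).fetchall()
```

Spiders without an existing `DOWNLOAD_DELAY` row are untouched by this statement, so a real rollout would also need to insert the setting where it is missing.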
Pattern 3: Spider Versioning
Export spider configs as a backup before making changes.
Pattern 4: Multi-Project Management
Project Isolation
PostgreSQL vs SQLite
- SQLite (Default)
- PostgreSQL (Production)
SQLite is best for:
- Single-user development
- Small to medium scale (< 100 spiders)
- Simple deployment (no external database)
ScrapAI auto-enables WAL mode for better concurrency. Performance with SQLite:
- Read-heavy: Excellent (with WAL mode)
- Write-heavy: Good (sequential writes)
- Concurrent access: Limited (single writer)
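For reference, WAL mode in SQLite is switched on with a single pragma. The sketch below shows the pragma itself using Python's `sqlite3`; ScrapAI applies it automatically, and the file path here is a throwaway placeholder.

```python
import os
import sqlite3
import tempfile

# WAL applies to file-backed databases, so use a temporary file.
path = os.path.join(tempfile.mkdtemp(), "scrapai_demo.db")
conn = sqlite3.connect(path)

# The pragma returns the journal mode now in effect.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
conn.close()
```

In WAL mode readers do not block the writer and the writer does not block readers, which matches the read-heavy profile above. There is still only one writer at a time.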
.env
Key Takeaways
Spiders are Data
Not files, not code—structured data in a database. Query, update, export like any other dataset.
One Spider Class
The DatabaseSpider loads any config at runtime. No code generation, no Python files per site.
Rich Metadata
Track creation time, update time, project, activity status. Impossible with files.
Batch Operations
Change settings across 100 spiders with one SQL query. Update, disable, export in bulk.
Next Steps
Spider Schema
Detailed schema reference for Spider, SpiderRule, and SpiderSetting
CLI Reference
Commands for spider management: list, import, export, delete