The spider management commands list, import, and delete spider configurations stored in the database. Spiders are imported from JSON files, and re-importing a file updates the existing spider.

spiders list

List all spiders in the database.

Syntax

./scrapai spiders list [--project <name>]

Options

  • --project (string): Filter by project name. If omitted, shows spiders from all projects.

Examples

# List all spiders across all projects
./scrapai spiders list

# List spiders in a specific project
./scrapai spiders list --project news

Output

$ ./scrapai spiders list --project news
📋 Available Spiders (DB) - Project: news:
 bbc_co_uk [news] (Active: True) - Created: 2026-02-28 14:30, Updated: 2026-02-28 15:45
    Source: https://bbc.co.uk
 cnn_com [news] (Active: True) - Created: 2026-02-27 09:15, Updated: 2026-02-27 09:15
    Source: https://cnn.com
 reuters_com [news] (Active: True) - Created: 2026-02-26 16:20, Updated: 2026-02-28 11:30
    Source: https://reuters.com

Fields Displayed

  • Name: Spider identifier (used in crawl and show commands)
  • Project: Project tag in brackets
  • Active: Whether the spider is enabled (currently always True)
  • Created: Initial import timestamp
  • Updated: Last modification timestamp
  • Source: Original website URL (if specified in config)
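
If the listing needs to be consumed by a script, the fields above can be pulled out of the text output. The following is a minimal sketch, assuming the output looks exactly like the sample shown earlier; the regular expression and the `parse_spider_list` helper are illustrative and not part of ScrapAI, and the real output format may differ between versions.

```python
import re
from datetime import datetime

# Pattern inferred from the sample output in this page (an assumption).
# Any leading symbol or whitespace before the spider name is skipped.
LINE_RE = re.compile(
    r"^[^\w-]*(?P<name>[A-Za-z0-9_-]+) \[(?P<project>[^\]]+)\] "
    r"\(Active: (?P<active>True|False)\) - "
    r"Created: (?P<created>\d{4}-\d{2}-\d{2} \d{2}:\d{2}), "
    r"Updated: (?P<updated>\d{4}-\d{2}-\d{2} \d{2}:\d{2})$"
)


def parse_spider_list(output):
    """Turn `spiders list` text output into a list of dicts (hypothetical helper)."""
    spiders = []
    for line in output.splitlines():
        match = LINE_RE.match(line)
        if match:
            row = match.groupdict()
            row["active"] = row["active"] == "True"
            for key in ("created", "updated"):
                row[key] = datetime.strptime(row[key], "%Y-%m-%d %H:%M")
            row["source"] = None  # filled in if a "Source:" line follows
            spiders.append(row)
        elif spiders and line.strip().startswith("Source:"):
            spiders[-1]["source"] = line.split("Source:", 1)[1].strip()
    return spiders


SAMPLE = """\
📋 Available Spiders (DB) - Project: news:
 bbc_co_uk [news] (Active: True) - Created: 2026-02-28 14:30, Updated: 2026-02-28 15:45
    Source: https://bbc.co.uk
 cnn_com [news] (Active: True) - Created: 2026-02-27 09:15, Updated: 2026-02-27 09:15
    Source: https://cnn.com
"""

for spider in parse_spider_list(SAMPLE):
    print(spider["name"], spider["project"], spider["source"])
```

Parsing human-readable output is brittle, so a sketch like this is best treated as a stopgap for quick scripts.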

spiders import

Import or update a spider from a JSON configuration file.

Syntax

./scrapai spiders import <file> --project <name> [--skip-validation]

Arguments

  • file (string, required): Path to JSON spider configuration file. Use - to read from stdin.

Options

  • --project (string, default: "default"): Project name to associate with this spider.
  • --skip-validation (flag): Skip Pydantic schema validation (not recommended). Use only for backward compatibility.

Examples

# Import spider from file
./scrapai spiders import bbc_spider.json --project news

# Import from stdin (useful in pipelines)
cat spider.json | ./scrapai spiders import - --project news

# Skip validation (backward compatibility)
./scrapai spiders import old_spider.json --project legacy --skip-validation
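
The stdin form also makes it possible to import a config generated in code without writing a temporary file. The sketch below assumes `./scrapai` is run from the repository root; `build_import_command` and `import_config` are hypothetical helper names, and only the flags documented above are used. How the CLI reports failure through its exit code is not documented here, so the caller inspects the result itself.

```python
import json
import subprocess


def build_import_command(project="default", skip_validation=False):
    """Build the argv for `./scrapai spiders import -` (config read from stdin)."""
    command = ["./scrapai", "spiders", "import", "-", "--project", project]
    if skip_validation:
        command.append("--skip-validation")
    return command


def import_config(config, project="default"):
    """Serialize a config dict to JSON and pipe it to the CLI on stdin."""
    return subprocess.run(
        build_import_command(project),
        input=json.dumps(config),
        text=True,
        capture_output=True,
        check=False,  # exit-code semantics are an unknown; inspect the result manually
    )


# Usage (requires the ScrapAI checkout, so it is left as a comment):
# result = import_config({"name": "bbc_co_uk", ...}, project="news")
# print(result.stdout)
print(build_import_command("news"))
```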

Spider Configuration Format

{
  "name": "bbc_co_uk",
  "allowed_domains": ["bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/news"],
  "source_url": "https://bbc.co.uk",
  "rules": [
    {
      "allow": ["/news/articles/[^/]+$"],
      "callback": "parse_article",
      "follow": false,
      "priority": 10
    },
    {
      "allow": ["/news/?$"],
      "follow": true,
      "priority": 5
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 2,
    "CONCURRENT_REQUESTS": 8
  },
  "callbacks": {
    "parse_article": {
      "extract": {
        "title": {"css": "h1.article-headline::text"},
        "author": {"css": "span.author-name::text"},
        "content": {"css": "div.article-body", "get": "all_text"}
      }
    }
  }
}

Configuration Fields

  • name (string, required): Spider name (letters, numbers, hyphens, underscores only). Must be unique per project.
  • allowed_domains (array, required): List of domains this spider can crawl. URLs outside these domains are filtered.
  • start_urls (array, required): Initial URLs to crawl. Must be valid HTTP/HTTPS URLs.
  • source_url (string): Original website URL (for documentation purposes).
  • rules (array): URL pattern matching rules. Each rule defines which URLs to follow and how to process them.
  • settings (object): Spider-specific settings that override defaults.
  • callbacks (object): Custom extraction callbacks with CSS/XPath selectors for non-article content.
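
As an illustration of the required fields, a minimal config can be generated from a single start URL. This is only a sketch: `minimal_spider_config` is a hypothetical helper, and the naming convention (domain with dots replaced by underscores) is inferred from the example names on this page rather than enforced by ScrapAI.

```python
import json
import re
from urllib.parse import urlparse


def minimal_spider_config(start_url):
    """Build a config containing only the required fields plus source_url."""
    host = urlparse(start_url).hostname or ""
    domain = host.removeprefix("www.")  # requires Python 3.9+
    return {
        # Assumed convention: bbc.co.uk -> bbc_co_uk (keeps the name within [a-zA-Z0-9_-])
        "name": re.sub(r"[^a-zA-Z0-9_-]", "_", domain),
        "allowed_domains": [domain],
        "start_urls": [start_url],
        "source_url": f"https://{domain}",
    }


config = minimal_spider_config("https://www.bbc.co.uk/news")
print(json.dumps(config, indent=2))
```

The optional rules, settings, and callbacks can then be added to the returned dict before it is written to a file or piped to the import command.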

Validation

Unless --skip-validation is passed, every spider config is validated against Pydantic schemas before import:

  • Spider names: Must match the pattern ^[a-zA-Z0-9_-]+$
  • URLs: HTTP/HTTPS only, no loopback or private IP addresses (127.0.0.1, 10.x, 172.16.x, 192.168.x), max 2048 chars
  • Callback names: Whitelisted names only, reserved names blocked
  • Settings: Bounded values (concurrency 1-32, delays 0-60s)
  • Extractor order: Valid extractor names only
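
To make some of these rules concrete, here is a rough standard-library approximation of the name, URL, and settings checks. It is not the Pydantic schema ScrapAI uses: the function names are hypothetical, the error wording is only modeled on the sample output in this page, and the assumption that the bounds apply to CONCURRENT_REQUESTS and DOWNLOAD_DELAY is taken from the example config.

```python
import ipaddress
import re
from urllib.parse import urlparse

NAME_RE = re.compile(r"^[a-zA-Z0-9_-]+$")
MAX_URL_LENGTH = 2048
# Assumed mapping of the documented bounds onto the settings in the example config.
BOUNDS = {"CONCURRENT_REQUESTS": (1, 32), "DOWNLOAD_DELAY": (0, 60)}


def check_url(url):
    """Return an error message for a bad URL, or None if it passes."""
    if len(url) > MAX_URL_LENGTH:
        return "URL longer than 2048 characters"
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return "URL scheme must be http or https"
    if not parsed.hostname:
        return "URL has no host"
    try:
        address = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        return None  # a hostname, not an IP literal; this sketch does no DNS lookup
    if address.is_private or address.is_loopback:
        return "private or loopback IP addresses are not allowed"
    return None


def validate_config(config):
    """Collect error strings for the name, start_urls, and bounded settings."""
    errors = []
    if not NAME_RE.fullmatch(str(config.get("name", ""))):
        errors.append('name: string does not match pattern "^[a-zA-Z0-9_-]+$"')
    for index, url in enumerate(config.get("start_urls", [])):
        problem = check_url(url)
        if problem:
            errors.append(f"start_urls -> {index}: {problem}")
    for key, (low, high) in BOUNDS.items():
        value = config.get("settings", {}).get(key)
        if value is not None and not low <= value <= high:
            errors.append(f"settings -> {key}: value must be between {low} and {high}")
    return errors


bad = {
    "name": "bbc news",
    "start_urls": ["ftp://example.com/feed", "http://192.168.1.10/"],
    "settings": {"CONCURRENT_REQUESTS": 64},
}
for error in validate_config(bad):
    print(error)
```

Callback-name and extractor-name checks are left out because the allowed values are not listed on this page.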

Output

Successful Import

$ ./scrapai spiders import bbc_spider.json --project news
 Spider 'bbc_co_uk' imported successfully!
   Project: news
   Domains: bbc.co.uk
   Start URLs: 1
   Rules: 2
   Callbacks: 1 (parse_article)

Update Existing Spider

$ ./scrapai spiders import bbc_spider.json --project news
⚠️  Spider 'bbc_co_uk' already exists. Updating...
 Spider 'bbc_co_uk' imported successfully!
   Project: news
   Domains: bbc.co.uk
   Start URLs: 1
   Rules: 2
   Callbacks: 1 (parse_article)

Re-importing a spider replaces its configuration entirely. All rules and settings are deleted and recreated.

Validation Failure

$ ./scrapai spiders import bad_spider.json --project news
 Spider configuration validation failed:
 name: string does not match pattern "^[a-zA-Z0-9_-]+$"
 start_urls -> 0: URL scheme must be http or https
 settings -> CONCURRENT_REQUESTS: value must be between 1 and 32

💡 Use --skip-validation to bypass validation (not recommended)

spiders delete

Delete a spider and all its associated data.

Syntax

./scrapai spiders delete <name> [--project <name>] [--force]

Arguments

  • name (string, required): Spider name to delete.

Options

  • --project (string): Project name. If specified, the spider is deleted only from that project.
  • --force (flag): Skip confirmation prompt.

Examples

# Delete spider with confirmation
./scrapai spiders delete bbc_co_uk --project news

# Delete without confirmation
./scrapai spiders delete old_spider --project archive --force

Output

With Confirmation

$ ./scrapai spiders delete bbc_co_uk --project news
Are you sure you want to delete spider 'bbc_co_uk' in project 'news'? (y/N): y
🗑️  Spider 'bbc_co_uk' in project 'news' deleted!

Force Delete

$ ./scrapai spiders delete bbc_co_uk --project news --force
🗑️  Spider 'bbc_co_uk' in project 'news' deleted!

Deleting a spider removes:
  • Spider configuration
  • All URL matching rules
  • All custom settings
  • All scraped items associated with this spider
This operation cannot be undone.
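
One way to picture this cascade is with foreign keys in an in-memory SQLite database. The sketch below is an assumption-heavy illustration, not ScrapAI's actual schema or delete logic: the columns are a reduced subset of those listed under Database Storage, `scraped_items` is a made-up table name (this page does not name the items table), and the real implementation may delete related rows explicitly instead of relying on ON DELETE CASCADE.

```python
import sqlite3

SCHEMA = """
CREATE TABLE spiders (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    project TEXT NOT NULL,
    UNIQUE (name, project)
);
CREATE TABLE spider_rules (
    id INTEGER PRIMARY KEY,
    spider_id INTEGER NOT NULL REFERENCES spiders(id) ON DELETE CASCADE,
    allow_patterns TEXT
);
CREATE TABLE spider_settings (
    id INTEGER PRIMARY KEY,
    spider_id INTEGER NOT NULL REFERENCES spiders(id) ON DELETE CASCADE,
    "key" TEXT,
    value TEXT
);
-- Hypothetical table name; the docs do not say where scraped items live.
CREATE TABLE scraped_items (
    id INTEGER PRIMARY KEY,
    spider_id INTEGER NOT NULL REFERENCES spiders(id) ON DELETE CASCADE,
    url TEXT
);
"""


def open_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
    conn.executescript(SCHEMA)
    return conn


def delete_spider(conn, name, project):
    """Delete one spider; child rows go with it via ON DELETE CASCADE."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "DELETE FROM spiders WHERE name = ? AND project = ?", (name, project)
        )
    return cursor.rowcount


conn = open_demo_db()
spider_id = conn.execute(
    "INSERT INTO spiders (name, project) VALUES (?, ?)", ("bbc_co_uk", "news")
).lastrowid
conn.execute(
    "INSERT INTO spider_rules (spider_id, allow_patterns) VALUES (?, ?)",
    (spider_id, '["/news/?$"]'),
)
conn.execute(
    'INSERT INTO spider_settings (spider_id, "key", value) VALUES (?, ?, ?)',
    (spider_id, "DOWNLOAD_DELAY", "2"),
)
conn.execute(
    "INSERT INTO scraped_items (spider_id, url) VALUES (?, ?)",
    (spider_id, "https://www.bbc.co.uk/news"),
)
conn.commit()

deleted = delete_spider(conn, "bbc_co_uk", "news")
remaining = {
    table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("spiders", "spider_rules", "spider_settings", "scraped_items")
}
print(deleted, remaining)
```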

Database Storage

Spiders are stored across multiple tables:

spiders Table

  • id: Primary key (auto-increment)
  • name: Spider name (unique per project)
  • project: Project name
  • allowed_domains: JSON array
  • start_urls: JSON array
  • source_url: Original website URL
  • active: Boolean (currently always true)
  • callbacks_config: JSON object with callback definitions
  • created_at: Timestamp
  • updated_at: Timestamp

spider_rules Table

  • spider_id: Foreign key to spiders
  • allow_patterns: JSON array of URL patterns to allow
  • deny_patterns: JSON array of URL patterns to deny
  • restrict_xpaths: JSON array of XPath restrictions
  • restrict_css: JSON array of CSS restrictions
  • callback: Callback function name
  • follow: Boolean (whether to follow links)
  • priority: Integer (higher = processed first)
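
The interaction between allow patterns, deny patterns, and priority can be sketched as follows. This assumes the patterns are regular expressions matched anywhere in the URL and that an empty allow list matches every URL, which is how Scrapy's LinkExtractor behaves; whether ScrapAI evaluates rules exactly this way is not stated here, and `first_matching_rule` is a hypothetical helper.

```python
import re

# Rules copied from the example config on this page, deliberately out of priority order.
RULES = [
    {"allow": [r"/news/?$"], "follow": True, "priority": 5},
    {
        "allow": [r"/news/articles/[^/]+$"],
        "callback": "parse_article",
        "follow": False,
        "priority": 10,
    },
]


def first_matching_rule(url, rules):
    """Return the highest-priority rule that allows the URL and does not deny it."""
    ordered = sorted(rules, key=lambda rule: rule.get("priority", 0), reverse=True)
    for rule in ordered:
        allow = rule.get("allow", [])
        deny = rule.get("deny", [])
        allowed = not allow or any(re.search(pattern, url) for pattern in allow)
        denied = any(re.search(pattern, url) for pattern in deny)
        if allowed and not denied:
            return rule
    return None


for url in (
    "https://www.bbc.co.uk/news/articles/abc123",
    "https://www.bbc.co.uk/news",
    "https://www.bbc.co.uk/sport",
):
    rule = first_matching_rule(url, RULES)
    print(url, "->", rule.get("callback", "follow only") if rule else "no rule")
```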

spider_settings Table

  • spider_id: Foreign key to spiders
  • key: Setting name
  • value: Setting value (as string)
  • type: Value type (str, int, bool, json)
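
Because every value is stored as a string alongside a type tag, reading a setting back means converting it according to that tag. A minimal sketch of such a round trip is shown below; the exact string encodings (for example "true"/"false" for booleans) are assumptions, as are the helper names `encode_setting` and `decode_setting`.

```python
import json


def encode_setting(value):
    """Return (value_as_string, type_tag) using the four documented type tags."""
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return ("true" if value else "false", "bool")
    if isinstance(value, int):
        return (str(value), "int")
    if isinstance(value, str):
        return (value, "str")
    return (json.dumps(value), "json")  # lists, dicts, floats, None


def decode_setting(raw, type_tag):
    """Convert a stored string back to a Python value based on its type tag."""
    if type_tag == "bool":
        return raw == "true"
    if type_tag == "int":
        return int(raw)
    if type_tag == "json":
        return json.loads(raw)
    return raw


settings = {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 2,
    "CONCURRENT_REQUESTS": 8,
}
for key, value in settings.items():
    raw, type_tag = encode_setting(value)
    print(key, repr(raw), type_tag)
```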

Working with Templates

ScrapAI includes example spider configs in templates/:

# Import example spider
./scrapai spiders import templates/bbc_spider.json --project examples

# View all templates
ls -la templates/*.json

Templates cover various site types:
  • News sites (BBC, Reuters)
  • E-commerce (product listings)
  • Forums (discussion threads)
  • Cloudflare-protected sites
