Rules control which URLs are followed and how they are processed. Each rule defines URL patterns and optional callbacks for extraction.

SpiderRuleSchema

allow
string[]
default:"null"
Regex patterns for URLs to allow.
Validation:
  • Must be a list of non-empty strings
  • Patterns are Python regular expressions
Example:
"allow": ["/news/articles/.*", "/blog/.*"]
deny
string[]
default:"null"
Regex patterns for URLs to deny (takes precedence over allow).
Validation:
  • Must be a list of non-empty strings
  • Patterns are Python regular expressions
Example:
"deny": ["/news/articles/.*#comments", ".*\\?page=.*"]
restrict_xpaths
string[]
default:"null"
Only follow links found within these XPath expressions.
Example:
"restrict_xpaths": ["//div[@class='main-content']//a"]
restrict_css
string[]
default:"null"
Only follow links found within these CSS selectors.
Example:
"restrict_css": ["div.article-list a", "nav.pagination a"]
callback
string
default:"null"
Name of the callback function that processes matched URLs.
Validation:
  • Must be a valid Python identifier (regex: ^[a-zA-Z_][a-zA-Z0-9_]*$)
  • Custom callback names cannot be reserved names
  • Must be defined in the callbacks object (or be the built-in parse_article)
Built-in callbacks:
  • parse_article - Extract article content using the configured extractors
Reserved names (cannot be used for custom callbacks):
  • parse_article, parse_start_url, start_requests, from_crawler, closed, parse
Example:
"callback": "parse_product"
Use null for navigation-only rules (follow links but don’t extract).
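The callback checks above can be sketched in a few lines of Python. This is an illustration only, not ScrapAI's actual validator: the function names `check_custom_callback_name` and `check_rule_callback` are hypothetical, the wording of the reserved-name message is invented, and the sketch assumes the reserved list applies to names defined in the callbacks object while a rule may still reference the built-in parse_article.

```python
import re

# Identifier pattern quoted in the validation rules above.
IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

# Names listed as reserved and built-in in this reference.
RESERVED = {"parse_article", "parse_start_url", "start_requests",
            "from_crawler", "closed", "parse"}
BUILT_IN = {"parse_article"}


def check_custom_callback_name(name):
    """Check a name a user wants to define in the callbacks object.

    Returns None when the name is acceptable, otherwise an error string.
    """
    if not IDENTIFIER.fullmatch(name):
        return f"Invalid callback name: '{name}'. Must be a valid Python identifier."
    if name in RESERVED:
        # Hypothetical message; the real wording is not documented here.
        return f"Callback name '{name}' is reserved."
    return None


def check_rule_callback(name, defined_callbacks):
    """Check the callback referenced by a rule (None means navigation-only)."""
    if name is None:
        return None
    if name in BUILT_IN or name in defined_callbacks:
        return None
    return f"Rule references undefined callback: '{name}'."


print(check_custom_callback_name("parse-product"))
print(check_rule_callback("parse_product", {"parse_review"}))
```

A hyphenated name fails the identifier check, and a rule that names a callback missing from the callbacks object fails the second check, which mirrors the two errors listed under Validation Errors.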
follow
boolean
default:"true"
Whether to follow links matching this rule.
Common patterns:
  • true - Follow and extract (e.g., category pages)
  • false - Extract only (e.g., article pages, product pages)
Example:
"follow": false
priority
integer
default:"0"
Rule priority (higher = processed first).
Validation:
  • Min: 0
  • Max: 1000
Use cases:
  • Prioritize important pages
  • Control crawl order
Example:
"priority": 10

Rule Matching

Allow/Deny Precedence

  1. If URL matches any deny pattern → rejected
  2. If URL matches any allow pattern → accepted
  3. If no allow patterns defined → accepted by default
  4. Otherwise → rejected
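The four precedence steps can be expressed as a small Python function. This is a minimal sketch, not the tool's implementation: the name `url_allowed` is hypothetical, and it assumes patterns are applied with `re.search` semantics (a match anywhere in the URL), which is how Scrapy's link extractors treat allow and deny patterns.

```python
import re


def url_allowed(url, allow=None, deny=None):
    """Apply the allow/deny precedence described above to a single URL."""
    # Step 1: any deny match rejects the URL, regardless of allow.
    if deny and any(re.search(pattern, url) for pattern in deny):
        return False
    # Step 2: any allow match accepts the URL.
    if allow and any(re.search(pattern, url) for pattern in allow):
        return True
    # Step 3: with no allow patterns, everything not denied is accepted.
    if not allow:
        return True
    # Step 4: allow patterns exist but none matched.
    return False


allow = ["/news/articles/.*"]
deny = [r".*\?page=.*"]

print(url_allowed("https://example.com/news/articles/budget", allow, deny))         # True
print(url_allowed("https://example.com/news/articles/budget?page=2", allow, deny))  # False
print(url_allowed("https://example.com/sport/", allow, deny))                       # False
print(url_allowed("https://example.com/sport/", None, deny))                        # True
```

The second URL matches both lists and is still rejected, which is the practical meaning of deny taking precedence.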

Restriction Scopes

  • restrict_xpaths and restrict_css limit where links are extracted from
  • Links outside these scopes are ignored, even if they match allow patterns
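To make the scoping concrete, the sketch below extracts links only from a restricted region and then applies an allow pattern. It uses the standard library's `xml.etree.ElementTree` on a small well-formed snippet as a stand-in for the real link extractor; the page content and the `links_in_scope` helper are hypothetical, and ElementTree supports only a subset of XPath (hence the leading `.` in the expression).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html><body>
  <div class="main-content">
    <a href="/article/one">One</a>
    <p><a href="/about">About</a></p>
  </div>
  <div class="sidebar">
    <a href="/article/two">Two</a>
  </div>
</body></html>
""".strip()


def links_in_scope(page, scope_xpath, allow):
    """Collect hrefs inside the scope, then keep those matching an allow pattern."""
    root = ET.fromstring(page)
    hrefs = [a.get("href") for a in root.findall(scope_xpath)]
    return [h for h in hrefs if h and any(re.search(p, h) for p in allow)]


followed = links_in_scope(PAGE, ".//div[@class='main-content']//a", ["/article/.*"])
print(followed)  # ['/article/one']
```

Note that /article/two matches the allow pattern but is never considered because it sits in the sidebar, outside the restricted scope.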

Examples

News Site

{
  "rules": [
    {
      "allow": ["/news/articles/.*"],
      "deny": ["/news/articles/.*#comments"],
      "callback": "parse_article",
      "follow": false
    },
    {
      "allow": ["/news/", "/sport/"],
      "callback": null,
      "follow": true
    }
  ]
}
Behavior:
  • Article pages (/news/articles/*) → Extract content, don’t follow links
  • Category pages (/news/, /sport/) → Follow links, don’t extract
  • Comments sections → Ignored

E-commerce

{
  "rules": [
    {
      "allow": ["/product/[^/]+$"],
      "callback": "parse_product",
      "follow": false,
      "priority": 5
    },
    {
      "allow": ["/products", "/category/"],
      "callback": null,
      "follow": true
    }
  ]
}
Behavior:
  • Product pages (/product/xyz) → Extract with custom callback, don’t follow
  • Listing pages → Follow links to discover products

Forum/Discussion

{
  "rules": [
    {
      "allow": ["/item\\?id=\\d+"],
      "deny": ["/vote", "/reply", "/user"],
      "callback": "parse_discussion",
      "follow": false
    }
  ]
}
Behavior:
  • Discussion threads → Extract
  • Vote/reply/user pages → Ignored

Restricted Link Scope

{
  "rules": [
    {
      "allow": ["/article/.*"],
      "restrict_css": ["div.main-content a", "nav.pagination a"],
      "callback": "parse_article"
    }
  ]
}
Behavior:
  • Only follow article links from main content and pagination
  • Ignore sidebar, footer, and navigation links

Common Patterns

Exact Path Match

"allow": ["/about$", "/contact$"]

Exclude Query Parameters

"deny": [".*\\?.*"]

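The doubled backslash in patterns such as the query-parameter rule is JSON escaping, not part of the regex itself. The short Python sketch below decodes that fragment and shows what the regex engine actually receives; it assumes patterns are applied with `re.search` semantics, and the sample URLs are made up.

```python
import json
import re

# The fragment as it would be written inside a JSON config file.
raw = r'{"deny": [".*\\?.*"]}'
rule = json.loads(raw)

# JSON decoding turns the doubled backslash into a single one, so the
# regex sees an escaped (literal) question mark.
pattern = rule["deny"][0]
print(pattern)  # .*\?.*

compiled = re.compile(pattern)
print(bool(compiled.search("https://example.com/list?sort=asc")))  # True
print(bool(compiled.search("https://example.com/list")))           # False
```

Writing a single backslash before the question mark in the JSON file would be rejected by a JSON parser as an invalid escape, which is why the doubled form is needed.
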
Match Numeric IDs

"allow": ["/product/\\d+$", "/job/\\d+$"]

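A quick way to sanity-check the `$`-anchored patterns is to run them against sample URLs. The sketch below again assumes `re.search` semantics; the helper `matches` and the URLs are hypothetical.

```python
import re


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


# "$" anchors the end of the URL, so deeper paths and query strings are excluded.
print(matches(r"/about$", "https://example.com/about"))          # True
print(matches(r"/about$", "https://example.com/about/team"))     # False
print(matches(r"/about$", "https://example.com/about?ref=nav"))  # False

# Without a start anchor, the pattern still matches nested paths.
print(matches(r"/about$", "https://example.com/company/about"))  # True

# Numeric IDs: digits only, up to the end of the URL.
print(matches(r"/product/\d+$", "https://example.com/product/42"))       # True
print(matches(r"/product/\d+$", "https://example.com/product/42-blue"))  # False
```

If nested paths such as /company/about should not match, anchor the start as well, for example with a full-URL pattern like the ones shown under Multiple Domains.
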
Multiple Domains

{
  "allowed_domains": ["example.com", "blog.example.com"],
  "rules": [
    {"allow": ["^https://example\\.com/products/.*"]},
    {"allow": ["^https://blog\\.example\\.com/posts/.*"]}
  ]
}

Pagination

{
  "allow": ["/page/\\d+$"],
  "callback": null,
  "follow": true
}

Rule Order

Rules are evaluated in the order they appear in the rules array. Use priority to control the order in which Scrapy processes the matched URLs (higher values first).
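One consequence of ordered evaluation is that a URL matching several rules is handled by the earliest one, so specific rules should come before broad ones. The sketch below illustrates this with the News Site rules; it assumes first-match-wins behavior (as in Scrapy's CrawlSpider) and `re.search` semantics, and `first_matching_rule` is a hypothetical helper rather than part of the tool.

```python
import re

# The News Site rules from the Examples section, as Python dictionaries.
RULES = [
    {"allow": ["/news/articles/.*"], "deny": ["/news/articles/.*#comments"],
     "callback": "parse_article", "follow": False},
    {"allow": ["/news/", "/sport/"], "callback": None, "follow": True},
]


def first_matching_rule(url, rules):
    """Return the index of the first rule that accepts the URL, or None."""
    for index, rule in enumerate(rules):
        deny = rule.get("deny") or []
        allow = rule.get("allow") or []
        if any(re.search(p, url) for p in deny):
            continue  # denied by this rule; try the next one
        if not allow or any(re.search(p, url) for p in allow):
            return index
    return None


print(first_matching_rule("https://example.com/news/articles/budget", RULES))  # 0
print(first_matching_rule("https://example.com/sport/", RULES))                # 1
print(first_matching_rule("https://example.com/weather/", RULES))              # None
```

The article URL also matches "/news/" in the second rule, but the first rule claims it, so it is extracted with parse_article rather than treated as a navigation page.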

Validation Errors

Undefined Callback

Rule 0 references undefined callback: 'parse_product'. 
Defined callbacks: parse_article
Fix: Add callback to callbacks object or use parse_article

Invalid Callback Name

Invalid callback name: 'parse-product'. 
Must be a valid Python identifier.
Fix: Use underscores instead of hyphens: parse_product

Empty Patterns

Patterns must be non-empty strings
Fix: Remove empty strings from allow/deny arrays
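
Pattern lists can be checked before a config is submitted. The following Python sketch applies the documented rule (a list of non-empty strings) and, as an additional assumption not stated in this reference, also verifies that each pattern compiles as a Python regex. The function `validate_patterns` and the wording of every message except "Patterns must be non-empty strings" are hypothetical.

```python
import re


def validate_patterns(patterns):
    """Return a list of error messages for an allow or deny value."""
    if patterns is None:  # null is the documented default
        return []
    if not isinstance(patterns, list):
        return ["Patterns must be a list of strings"]
    errors = []
    for index, pattern in enumerate(patterns):
        if not isinstance(pattern, str) or not pattern:
            errors.append("Patterns must be non-empty strings")
            continue
        try:
            re.compile(pattern)  # assumption: invalid regexes should be flagged too
        except re.error as exc:
            errors.append(f"Pattern {index} is not a valid regex: {exc}")
    return errors


print(validate_patterns(["/news/.*", ""]))  # ['Patterns must be non-empty strings']
print(validate_patterns(["/blog/.*"]))      # []
```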