Processors transform extracted values (strip whitespace, cast types, apply regex, etc.). They run sequentially in the order specified.
Available Processors
strip: Remove leading and trailing whitespace
replace: Replace substring in strings
regex: Extract substring using pattern
cast: Convert to specified type
join: Join list values to string
default: Return fallback value if empty
lowercase: Convert strings to lowercase
parse_datetime: Parse datetime to ISO format
Processor Reference
1. strip
Remove leading and trailing whitespace from strings.
Parameters: None
Example:
{
"css": "h1::text",
"processors": [{"type": "strip"}]
}
Transformation:
" Hello World " → "Hello World"
Works on: Strings and lists of strings
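The "Works on" rule can be pictured with a short Python sketch. This is not ScrapAI's implementation: the helper name `strip_value` is hypothetical, and passing other types through unchanged is an assumption based on the error-handling notes in this document.

```python
def strip_value(value):
    """Hypothetical helper mirroring the documented behavior of `strip`."""
    if isinstance(value, str):
        return value.strip()
    if isinstance(value, list):
        # Strip each string item; leave non-string items untouched (assumption).
        return [item.strip() if isinstance(item, str) else item for item in value]
    # Neither a string nor a list: return the value unchanged (assumption).
    return value


print(strip_value("  Hello World  "))    # Hello World
print(strip_value([" WiFi ", "GPS\n"]))  # ['WiFi', 'GPS']
```

The same string-or-list dispatch would apply to replace and lowercase, which are documented to work on the same input types.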
2. replace
Replace substring in strings.
Parameters:
old (required): Substring to replace
new (required): Replacement string
Example:
{
"css": "span.price::text",
"processors": [
{"type": "replace", "old": "$", "new": ""},
{"type": "replace", "old": ",", "new": ""}
]
}
Transformation:
"$1,299.99" → "1299.99"
Works on: Strings and lists of strings
3. regex
Extract substring using regular expression pattern.
Parameters:
pattern (required): Regex pattern to match
group (optional): Capture group to extract (default: 1)
Example:
{
"css": "span::text",
"processors": [
{"type": "regex", "pattern": "Price: \\$([\\d.]+)"}
]
}
Transformation:
"Price: $99.99" → "99.99"
Selecting a capture group explicitly:
{"type": "regex", "pattern": "(\\d+) items", "group": 1}
Returns original value if no match.
Works on: Strings only
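The matching rules above can be sketched in Python with the standard `re` module. The helper name `regex_value` is hypothetical, and the use of `re.search` (match anywhere in the string) is an assumption; the document does not say which matching function ScrapAI uses.

```python
import re


def regex_value(value, pattern, group=1):
    """Hypothetical helper mirroring the documented behavior of `regex`."""
    if not isinstance(value, str):
        return value  # assumption: non-strings pass through unchanged
    match = re.search(pattern, value)
    if match is None:
        return value  # documented: the original value is returned on no match
    return match.group(group)


print(regex_value("Price: $99.99", r"Price: \$([\d.]+)"))  # 99.99
print(regex_value("23 items", r"(\d+) items", group=1))    # 23
print(regex_value("Sold out", r"(\d+) items"))             # Sold out
```

Because a failed match hands back the unmodified input, a later cast may receive text it cannot convert.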
4. cast
Convert value to specified type.
Parameters:
to (required): Target type - "int", "float", "bool", or "str"
Example:
{
"css": "span.rating::attr(data-rating)",
"processors": [
{"type": "cast", "to": "float"}
]
}
Transformations:
"4.5" → 4.5 (float)
"42" → 42 (int)
"true" → True (bool)
Boolean conversion:
- true, 1, yes, on → True
- Everything else → False
Returns None if conversion fails.
Works on: Any type
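A minimal Python sketch of these conversion rules follows. The helper name `cast_value` is hypothetical; trimming and case-insensitive comparison for booleans, and returning None for an unknown target type, are assumptions not stated in this document.

```python
def cast_value(value, to):
    """Hypothetical helper mirroring the documented behavior of `cast`."""
    if to == "bool":
        # Documented rule: true, 1, yes, on -> True; everything else -> False.
        # Trimming and lowercasing before the comparison are assumptions.
        return str(value).strip().lower() in {"true", "1", "yes", "on"}
    converters = {"int": int, "float": float, "str": str}
    try:
        return converters[to](value)
    except (KeyError, TypeError, ValueError):
        return None  # documented: None when the conversion fails


print(cast_value("4.5", "float"))  # 4.5
print(cast_value("42", "int"))     # 42
print(cast_value("true", "bool"))  # True
print(cast_value("abc", "int"))    # None
```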
5. join
Join list values into a single string.
Parameters:
separator (optional): String to join with (default: " ")
Example:
{
"css": "li.feature::text",
"get_all": true,
"processors": [
{"type": "join", "separator": ", "}
]
}
Transformation:
["WiFi", "Bluetooth", "GPS"] → "WiFi, Bluetooth, GPS"
Filters out None values automatically.
Works on: Lists only
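The None filtering can be sketched in a few lines of Python. The helper name `join_value` is hypothetical, and converting each item with `str()` is an assumption.

```python
def join_value(value, separator=" "):
    """Hypothetical helper mirroring the documented behavior of `join`."""
    if not isinstance(value, list):
        return value  # documented elsewhere: non-list input is returned as-is
    # Drop None entries, then join; str() on each item is an assumption.
    return separator.join(str(item) for item in value if item is not None)


print(join_value(["WiFi", "Bluetooth", "GPS"], ", "))  # WiFi, Bluetooth, GPS
print(join_value(["WiFi", None, "GPS"], ", "))         # WiFi, GPS
print(join_value(["a", "b"]))                          # a b
```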
6. default
Return default value if input is None, empty string, or empty list.
Parameters:
default (required): Fallback value
Example:
{
"css": "span.optional::text",
"processors": [
{"type": "default", "default": "N/A"}
]
}
Transformations:
None → "N/A"
"" → "N/A"
[] → "N/A"
"actual value" → "actual value"
Works on: Any type
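The emptiness check can be sketched as below. The helper name `default_value` is hypothetical. Treating only None, "", and [] as empty follows the list above; that falsy values such as 0 or False are kept is an assumption drawn from that list.

```python
def default_value(value, default):
    """Hypothetical helper mirroring the documented behavior of `default`."""
    # Only None, the empty string, and the empty list count as empty.
    if value is None or value == "" or value == []:
        return default
    return value


print(default_value(None, "N/A"))            # N/A
print(default_value("", "N/A"))              # N/A
print(default_value([], "N/A"))              # N/A
print(default_value("actual value", "N/A"))  # actual value
print(default_value(0, "N/A"))               # 0
```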
7. lowercase
Convert strings to lowercase.
Parameters: None
Example:
{
"css": "span.status::text",
"processors": [
{"type": "strip"},
{"type": "lowercase"}
]
}
Transformation:
" In Stock " → "in stock"
Works on: Strings and lists of strings
8. parse_datetime
Parse datetime string into ISO format.
Parameters:
format (optional): strptime format string (if omitted, the dateutil parser is used for flexible parsing)
Example with format:
{
"css": "time.date::attr(datetime)",
"processors": [
{"type": "parse_datetime", "format": "%Y-%m-%d"}
]
}
Example without format (auto-detect):
{
"css": "span.date::text",
"processors": [
{"type": "parse_datetime"}
]
}
Transformations:
"2024-02-24" → "2024-02-24T00:00:00" (ISO format)
"February 24, 2024" → "2024-02-24T00:00:00"
"24/02/2024" → "2024-02-24T00:00:00" (auto-detected)
Stored as ISO string in database (automatically serialized).
Returns None if parsing fails.
Works on: Strings only
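The two parsing modes can be sketched in Python as follows. The helper name `parse_datetime_value` and its `fmt` argument are hypothetical, and returning None for non-string input is an assumption. The auto-detect branch relies on the third-party python-dateutil package, as the parameter description suggests.

```python
from datetime import datetime


def parse_datetime_value(value, fmt=None):
    """Hypothetical helper mirroring the documented behavior of `parse_datetime`."""
    if not isinstance(value, str):
        return None  # assumption: non-string input counts as a parse failure
    try:
        if fmt is not None:
            parsed = datetime.strptime(value, fmt)
        else:
            # Auto-detect branch; needs python-dateutil to be installed.
            from dateutil import parser as date_parser
            parsed = date_parser.parse(value)
    except (ValueError, OverflowError, ImportError):
        return None  # documented: None when parsing fails
    return parsed.isoformat()


print(parse_datetime_value("2024-02-24", "%Y-%m-%d"))  # 2024-02-24T00:00:00
print(parse_datetime_value("24.02.2024", "%Y-%m-%d"))  # None
```

Calling the helper without `fmt` exercises the dateutil branch, which is where inputs such as "February 24, 2024" would be handled.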
Processor Chaining
Processors run sequentially. Output of one becomes input to the next.
Example 1: Clean and Convert Price
{
"css": "span.price::text",
"processors": [
{"type": "strip"}, // " $99.99 " → "$99.99"
{"type": "replace", "old": "$", "new": ""}, // "$99.99" → "99.99"
{"type": "cast", "to": "float"} // "99.99" → 99.99
]
}
Example 2: Extract and Convert Rating
{
"css": "div.rating::text",
"processors": [
{"type": "strip"}, // " Rating: 4.5 stars " → "Rating: 4.5 stars"
{"type": "regex", "pattern": "([\\d.]+)"}, // "Rating: 4.5 stars" → "4.5"
{"type": "cast", "to": "float"} // "4.5" → 4.5
]
}
Example 3: Normalize Text
{
"css": "span.status::text",
"processors": [
{"type": "strip"},
{"type": "lowercase"},
{"type": "replace", "old": " ", "new": "_"}
]
}
Input: " In Stock "
Output: "in_stock"
Example 4: Handle Missing Values
{
"css": "span.optional-field::text",
"processors": [
{"type": "strip"},
{"type": "default", "default": "Not specified"}
]
}
Common Patterns
Price Fields
{
"price": {
"css": "span.price::text",
"processors": [
{"type": "strip"},
{"type": "regex", "pattern": "\\$([\\d,.]+)"},
{"type": "replace", "old": ",", "new": ""},
{"type": "cast", "to": "float"}
]
}
}
Handles: "$1,299.99", "Price: $99", " $42.50 "
Integer Fields
{
"quantity": {
"css": "div.quantity::text",
"processors": [
{"type": "regex", "pattern": "(\\d+)"},
{"type": "cast", "to": "int"}
]
}
}
Handles: "23 items", "Quantity: 5", "42"
Boolean Fields
{
"in_stock": {
"css": "span.availability::text",
"processors": [
{"type": "lowercase"},
{"type": "regex", "pattern": "(in stock|available)"},
{"type": "cast", "to": "bool"}
]
}
}
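Intended result: True if the text contains “in stock” or “available”, else False. Because the cast rules above map only true, 1, yes, and on to True, and regex returns the original value when nothing matches, verify this chain's output with a test crawl before relying on it.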
Date Fields
{
"published_date": {
"css": "time::attr(datetime)",
"processors": [
{"type": "parse_datetime"}
]
}
}
Auto-detects format, stores as ISO string.
Lists to Comma-Separated String
{
"tags": {
"css": "li.tag::text",
"get_all": true,
"processors": [
{"type": "join", "separator": ", "}
]
}
}
Input: ["Python", "Web Scraping", "Automation"]
Output: "Python, Web Scraping, Automation"
Complete Examples
E-commerce Product
{
"callbacks": {
"parse_product": {
"extract": {
"name": {
"css": "h1.product-name::text",
"processors": [{"type": "strip"}]
},
"price": {
"css": "span.price::text",
"processors": [
{"type": "strip"},
{"type": "regex", "pattern": "\\$([\\d,.]+)"},
{"type": "replace", "old": ",", "new": ""},
{"type": "cast", "to": "float"}
]
},
"rating": {
"css": "div.rating::attr(data-rating)",
"processors": [{"type": "cast", "to": "float"}]
},
"in_stock": {
"css": "span.availability::text",
"processors": [
{"type": "lowercase"},
{"type": "regex", "pattern": "(in stock|available)"},
{"type": "cast", "to": "bool"}
]
},
"features": {
"css": "li.feature::text",
"get_all": true,
"processors": [{"type": "join", "separator": ", "}]
}
}
}
}
}
Job Listing
{
"callbacks": {
"parse_job": {
"extract": {
"title": {
"css": "h1.job-title::text",
"processors": [{"type": "strip"}]
},
"salary_min": {
"css": "span.salary-min::text",
"processors": [
{"type": "strip"},
{"type": "replace", "old": "$", "new": ""},
{"type": "replace", "old": ",", "new": ""},
{"type": "cast", "to": "int"}
]
},
"salary_max": {
"css": "span.salary-max::text",
"processors": [
{"type": "strip"},
{"type": "replace", "old": "$", "new": ""},
{"type": "replace", "old": ",", "new": ""},
{"type": "cast", "to": "int"}
]
},
"posted_date": {
"css": "time.posted-date::attr(datetime)",
"processors": [{"type": "parse_datetime"}]
},
"remote": {
"css": "span.job-type::text",
"processors": [
{"type": "lowercase"},
{"type": "regex", "pattern": "(remote|work from home)"},
{"type": "cast", "to": "bool"}
]
},
"skills": {
"css": "span.skill::text",
"get_all": true,
"processors": [{"type": "join", "separator": ", "}]
}
}
}
}
}
Real Estate Listing
{
"callbacks": {
"parse_property": {
"extract": {
"address": {
"css": "h1.property-address::text",
"processors": [{"type": "strip"}]
},
"price": {
"css": "span.property-price::text",
"processors": [
{"type": "strip"},
{"type": "regex", "pattern": "\\$([\\d,.]+)"},
{"type": "replace", "old": ",", "new": ""},
{"type": "cast", "to": "float"}
]
},
"bedrooms": {
"css": "span.bedrooms::text",
"processors": [
{"type": "regex", "pattern": "(\\d+)"},
{"type": "cast", "to": "int"}
]
},
"bathrooms": {
"css": "span.bathrooms::text",
"processors": [
{"type": "regex", "pattern": "([\\d.]+)"},
{"type": "cast", "to": "float"}
]
},
"sqft": {
"css": "span.square-feet::text",
"processors": [
{"type": "regex", "pattern": "([\\d,]+)"},
{"type": "replace", "old": ",", "new": ""},
{"type": "cast", "to": "int"}
]
},
"amenities": {
"css": "li.amenity::text",
"get_all": true,
"processors": [{"type": "join", "separator": ", "}]
}
}
}
}
}
Error Handling
Processors handle errors gracefully:
Graceful Failures
- strip, replace, lowercase, join: Return original value if not applicable type
- regex: Returns original value if pattern doesn’t match
- cast: Returns None if conversion fails
- parse_datetime: Returns None if parsing fails
- Unknown processor type: Skipped, logs warning
Chain Behavior
If a processor fails mid-chain, subsequent processors receive the last valid value or None.
Example:
[
{"type": "strip"}, // " abc " → "abc"
{"type": "cast", "to": "int"}, // "abc" → None (conversion fails)
{"type": "default", "default": 0} // None → 0
]
Final output: 0
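The chain above can be traced with a small Python sketch of a sequential runner. It is an illustration under assumptions, not ScrapAI's code: `run_chain`, `REGISTRY`, the logger name, and the three simplified processors are all hypothetical.

```python
import logging

logger = logging.getLogger("processors_sketch")  # hypothetical logger name


def strip(value):
    return value.strip() if isinstance(value, str) else value


def cast(value, to):
    try:
        return {"int": int, "float": float, "str": str}[to](value)
    except (KeyError, TypeError, ValueError):
        return None  # failed conversion yields None


def default(value, default):
    return default if value is None or value == "" or value == [] else value


REGISTRY = {"strip": strip, "cast": cast, "default": default}


def run_chain(value, processors):
    """Apply each processor spec in order; output feeds the next step."""
    for spec in processors:
        params = dict(spec)
        func = REGISTRY.get(params.pop("type", None))
        if func is None:
            # Unknown processor type: skip it and log a warning.
            logger.warning("Unknown processor %r, skipping", spec.get("type"))
            continue
        value = func(value, **params)
    return value


chain = [
    {"type": "strip"},
    {"type": "cast", "to": "int"},
    {"type": "default", "default": 0},
]
print(run_chain(" abc ", chain))  # 0
print(run_chain(" 42 ", chain))   # 42
```

Keeping each processor as a plain function of `(value, **params)` means the runner needs no special cases: a None produced mid-chain simply becomes the input to the next step.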
Best Practices
Always strip text fields
Prevents whitespace issues:
{"processors": [{"type": "strip"}]}
Use regex before cast
Extract numeric part first, then convert type:
[
{"type": "regex", "pattern": "([\\d.]+)"},
{"type": "cast", "to": "float"}
]
Chain replace for complex cleaning
Multiple replace processors handle different cases:
[
{"type": "replace", "old": "$", "new": ""},
{"type": "replace", "old": ",", "new": ""}
]
Default at the end
Apply fallback after all transformations:
[
{"type": "strip"},
{"type": "cast", "to": "float"},
{"type": "default", "default": 0.0}
]
Test selectors first
Use analyze command before adding processors:
./scrapai analyze --test "selector"
Validate processor output
Run test crawl and check with show command:
./scrapai crawl spider --limit 5 --project proj
./scrapai show 1 --project proj
Troubleshooting
Processor Returns None
Check processor type
Verify processor name is correct (typo?)
Validate input type
Some processors only work on specific types:
regex: strings only
join: lists only
parse_datetime: strings only
Test without processors
Remove processors temporarily to see raw extracted value
Check logs
Look for processor warnings in crawl logs
Wrong Output Type
Add cast processor at the end:
{"processors": [{"type": "cast", "to": "float"}]}
Regex Not Matching
Test pattern separately
Use online regex tester (regex101.com)
Check escaping
Double backslashes in JSON:
{"pattern": "\\$([\\d.]+)"}
Add default fallback
[
{"type": "regex", "pattern": "([\\d.]+)"},
{"type": "default", "default": null}
]
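The escaping step can be checked outside a crawl. The Python sketch below uses only the standard `json` and `re` modules to show that the doubled backslashes in the JSON text decode to the single backslashes the regex engine expects; the sample input string and variable names are made up for illustration.

```python
import json
import re

# The JSON text as it would appear in a config file (raw string keeps it literal).
raw_config = r'{"type": "regex", "pattern": "\\$([\\d.]+)"}'

spec = json.loads(raw_config)
print(spec["pattern"])  # \$([\d.]+)

match = re.search(spec["pattern"], "Price: $99.99")
print(match.group(1))   # 99.99

# A single backslash before $ or d is not a valid JSON escape, so parsing fails.
try:
    json.loads(r'{"pattern": "\$([\d.]+)"}')
    single_backslash_parses = True
except json.JSONDecodeError:
    single_backslash_parses = False
print(single_backslash_parses)  # False
```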
Date Parsing Fails
Try without format
Let dateutil auto-detect:
{"type": "parse_datetime"}
Specify exact format
If auto-detect fails:
{"type": "parse_datetime", "format": "%Y-%m-%d"}
Check date format
View raw extracted value to understand format