Migrate existing scrapers from Scrapy, BeautifulSoup, Scrapling, or any Python scraping framework to ScrapAI’s database-driven architecture.
Overview
Migrating to ScrapAI means converting your Python scraping code into JSON configs. The process:
Analyze existing code to understand extraction logic
Map to ScrapAI concepts (rules, extractors, callbacks)
Generate JSON config with equivalent behavior
Test and verify extraction quality
Deploy to database and retire old code
Why Migrate?
From README.md:170-179:
Your existing scrapers keep running while you verify. No big bang migration required.
Benefits:
Database-first management: Change settings across 100 spiders with one SQL query
Uniform structure: Consistent schema, validation, naming conventions
Built-in features: Cloudflare bypass, checkpoint, proxy escalation, incremental crawling
Easy to review: JSON configs are easier to audit than Python code
AI-assisted updates: Point an agent at a broken spider to auto-fix extraction rules
Migration Workflow
Using an AI agent (Claude Code, Cursor, etc.):
You: "Migrate my spider at scripts/bbc_spider.py to ScrapAI"
Agent: [Reads Python, extracts URL patterns and selectors, writes JSON config, tests, saves to database]
Manual Migration
For direct control:
Read your existing spider code
Extract URL patterns, selectors, and extraction logic
Write equivalent JSON config (see examples below)
Import: ./scrapai spiders import config.json --project myproject
Test: ./scrapai crawl spider_name --project myproject --limit 5
Compare output with original spider
Iterate until quality matches
Scrapy Spider Migration
Original Scrapy Spider
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BBCSpider(CrawlSpider):
    name = 'bbc'
    allowed_domains = ['bbc.com', 'bbc.co.uk']
    start_urls = ['https://www.bbc.com/news']

    rules = (
        # Follow category pages
        Rule(
            LinkExtractor(allow=r'/news/[a-z_]+$'),
            follow=True,
        ),
        # Extract articles
        Rule(
            LinkExtractor(allow=r'/news/articles/[a-z0-9-]+$'),
            callback='parse_article',
            follow=False,
        ),
    )

    def parse_article(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1.article-headline::text').get(),
            'content': ' '.join(response.css('div.article-body p::text').getall()),
            'author': response.css('span.author-name::text').get(),
            'date': response.css('time.date-published::attr(datetime)').get(),
        }
Equivalent ScrapAI Config
{
  "name": "bbc_news",
  "source_url": "https://www.bbc.com/news",
  "allowed_domains": ["bbc.com", "bbc.co.uk"],
  "start_urls": ["https://www.bbc.com/news"],
  "rules": [
    {
      "allow": ["/news/[a-z_]+$"],
      "follow": true,
      "priority": 10
    },
    {
      "allow": ["/news/articles/[a-z0-9-]+$"],
      "callback": "parse_bbc_article",
      "follow": false,
      "priority": 100
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1
  },
  "callbacks": {
    "parse_bbc_article": {
      "extract": {
        "title": {
          "css": "h1.article-headline::text"
        },
        "content": {
          "css": "div.article-body p::text",
          "get_all": true,
          "processors": [
            { "type": "join", "separator": " " }
          ]
        },
        "author": {
          "css": "span.author-name::text"
        },
        "published_date": {
          "css": "time.date-published::attr(datetime)",
          "processors": [
            { "type": "parse_datetime" }
          ]
        }
      }
    }
  }
}
Key Mappings
| Scrapy Concept | ScrapAI Equivalent |
| --- | --- |
| name | name |
| allowed_domains | allowed_domains |
| start_urls | start_urls |
| LinkExtractor(allow=...) | rules[].allow |
| LinkExtractor(deny=...) | rules[].deny |
| Rule(follow=True) | rules[].follow: true |
| Rule(callback='parse') | rules[].callback: "parse" |
| response.css('selector::text').get() | css: "selector::text" |
| response.css('selector::text').getall() | css: "selector::text", get_all: true |
| response.xpath('//div') | xpath: "//div" |
| ' '.join(texts) | processors: [{"type": "join"}] |
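The mappings in the table above can be sketched as a small helper that turns the arguments you would pass to Scrapy's Rule and LinkExtractor into a ScrapAI-style rule entry. The helper name `scrapy_rule_to_scrapai`, the example patterns, and the callback name `parse_post` are hypothetical and are not part of either library; the function only builds a dictionary using the keys shown in this guide.

```python
import json


def scrapy_rule_to_scrapai(allow=(), deny=(), callback=None, follow=False, priority=None):
    """Build a ScrapAI-style rule dict from Scrapy Rule/LinkExtractor arguments.

    Hypothetical helper for illustration; key names follow the mapping table.
    """
    def as_list(patterns):
        # LinkExtractor accepts one pattern or several; the configs use lists.
        return [patterns] if isinstance(patterns, str) else list(patterns)

    rule = {"allow": as_list(allow), "follow": follow}
    if deny:
        rule["deny"] = as_list(deny)
    if callback:
        rule["callback"] = callback
    if priority is not None:
        rule["priority"] = priority
    return rule


rules = [
    scrapy_rule_to_scrapai(allow=r"/blog/[a-z_]+$", follow=True, priority=10),
    scrapy_rule_to_scrapai(
        allow=r"/blog/posts/[a-z0-9-]+$",
        deny=r"/blog/posts/draft-",
        callback="parse_post",
        priority=100,
    ),
]
print(json.dumps(rules, indent=2))
```

The printed JSON can be pasted into the "rules" array of a config and then adjusted by hand.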
BeautifulSoup Migration
Original BeautifulSoup Script
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import json

url = 'https://example.com/products'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

products = []
for item in soup.select('div.product-card'):
    product = {
        'name': item.select_one('h3.product-name').text.strip(),
        'price': item.select_one('span.price').text.strip().replace('$', ''),
        'url': urljoin(url, item.select_one('a')['href']),
        'image': item.select_one('img')['src'],
    }
    products.append(product)

with open('products.json', 'w') as f:
    json.dump(products, f, indent=2)
Equivalent ScrapAI Config
{
  "name": "example_products",
  "source_url": "https://example.com/products",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/products"],
  "rules": [
    {
      "allow": ["/products$"],
      "callback": "parse_products",
      "follow": false
    }
  ],
  "settings": {
    "DOWNLOAD_DELAY": 0.5
  },
  "callbacks": {
    "parse_products": {
      "extract": {
        "products": {
          "type": "nested_list",
          "selector": "div.product-card",
          "extract": {
            "name": {
              "css": "h3.product-name::text",
              "processors": [{ "type": "strip" }]
            },
            "price": {
              "css": "span.price::text",
              "processors": [
                { "type": "strip" },
                { "type": "replace", "old": "$", "new": "" },
                { "type": "cast", "to": "float" }
              ]
            },
            "url": {
              "css": "a::attr(href)"
            },
            "image": {
              "css": "img::attr(src)"
            }
          }
        }
      }
    }
  }
}
Key Differences
BeautifulSoup:
Manual HTTP requests
Manual link extraction
Manual JSON export
No retry logic
No rate limiting
ScrapAI:
Scrapy handles requests (retries, delays, middleware)
Automatic link extraction via rules
Automatic JSONL export
Built-in retry and error handling
Configurable rate limiting
Scrapling Migration
From README.md:32:
For single-site scraping with fine-grained control, use Scrapling. ScrapAI is for multi-site fleets.
When to migrate from Scrapling:
You have 10+ sites to scrape
Sites have similar structure (e.g., all news sites)
You want database-driven management
You need scheduling and monitoring
When to keep Scrapling:
Single site with complex interaction
Heavy JavaScript rendering
Fine-grained control needed
Login/auth flows
Example Migration
Original Scrapling Script
from scrapling import Fetcher

fetcher = Fetcher()
page = fetcher.get('https://news.ycombinator.com')
titles = page.css('span.titleline > a').text_content(all=True)
for title in titles:
    print(title)
Equivalent ScrapAI Config
{
  "name": "hackernews",
  "source_url": "https://news.ycombinator.com",
  "allowed_domains": ["news.ycombinator.com"],
  "start_urls": ["https://news.ycombinator.com"],
  "rules": [
    {
      "allow": ["/$"],
      "callback": "parse_frontpage",
      "follow": false
    }
  ],
  "callbacks": {
    "parse_frontpage": {
      "extract": {
        "stories": {
          "type": "nested_list",
          "selector": "span.titleline > a",
          "extract": {
            "title": {
              "css": "::text"
            },
            "url": {
              "css": "::attr(href)"
            }
          }
        }
      }
    }
  }
}
Processors for Data Cleaning
From core/schemas.py:131-156:
allowed = {
    "strip",
    "replace",
    "regex",
    "cast",
    "join",
    "default",
    "lowercase",
    "parse_datetime",
}
Common Processor Patterns
Strip whitespace:
{ "type": "strip" }
Remove characters:
{ "type": "replace", "old": "$", "new": "" }
Extract with regex:
{ "type": "regex", "pattern": "\\d+" }
Convert type:
{ "type": "cast", "to": "float" }
Join list:
{ "type": "join", "separator": " " }
Default value:
{ "type": "default", "value": "Unknown" }
Lowercase:
{ "type": "lowercase" }
Parse datetime:
{ "type": "parse_datetime" }
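To make the chaining behavior concrete, here is a minimal sketch of how a processor list could be applied in order, each step receiving the previous step's output. This is not ScrapAI's implementation: the function `apply_processors` is hypothetical, the supported cast targets are assumed, and `parse_datetime` is approximated with `datetime.fromisoformat`, which only handles ISO 8601 strings.

```python
import re
from datetime import datetime


def apply_processors(value, processors):
    """Apply processor dicts to a value in order (illustrative sketch only)."""
    for proc in processors:
        kind = proc["type"]
        if kind == "strip":
            value = value.strip()
        elif kind == "replace":
            value = value.replace(proc["old"], proc["new"])
        elif kind == "regex":
            # Keep the first match, or None when nothing matches.
            match = re.search(proc["pattern"], value)
            value = match.group(0) if match else None
        elif kind == "cast":
            # Assumed cast targets; the real set may differ.
            value = {"float": float, "int": int, "str": str}[proc["to"]](value)
        elif kind == "join":
            value = proc.get("separator", " ").join(value)
        elif kind == "default":
            value = proc["value"] if value in (None, "") else value
        elif kind == "lowercase":
            value = value.lower()
        elif kind == "parse_datetime":
            # Assumption: ISO 8601 input only.
            value = datetime.fromisoformat(value)
        else:
            raise ValueError(f"Unknown processor: {kind}")
    return value


price = apply_processors(
    "  $19.99 ",
    [
        {"type": "strip"},
        {"type": "replace", "old": "$", "new": ""},
        {"type": "cast", "to": "float"},
    ],
)
print(price)  # 19.99
```

Order matters in a chain like this: casting before the currency symbol is removed would raise an error, which is why the price example strips and replaces first.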
Validation During Migration
All configs go through strict validation before import. From core/schemas.py:215-402:
Spider Name Validation
@field_validator("name")
@classmethod
def validate_name(cls, v):
    if not re.match(r"^[a-zA-Z0-9_-]+$", v):
        raise ValueError(
            f"Invalid spider name: {v}. "
            "Only alphanumeric characters, underscores, and hyphens allowed."
        )
    return v
URL Validation (SSRF Protection)
@field_validator("source_url", "start_urls")
@classmethod
def validate_urls(cls, v):
    allowed_schemes = {"http", "https"}
    # The excerpt is abridged. The loop, the try/except, and the final return
    # are an assumed reconstruction so that `url` is defined for both a single
    # URL and a list of URLs.
    urls = [v] if isinstance(v, str) else v
    for url in urls:
        # Check scheme
        if not any(url.lower().startswith(f"{scheme}://") for scheme in allowed_schemes):
            raise ValueError(
                f"Invalid URL scheme: {url}. Only HTTP and HTTPS are allowed."
            )
        # Prevent SSRF to localhost/private IPs
        parsed = urlparse(url)
        hostname = parsed.hostname
        if hostname in ("localhost", "0.0.0.0"):
            raise ValueError(
                f"URL points to localhost: {url}. Blocked to prevent SSRF attacks."
            )
        # Check whether the hostname is a private or loopback IP literal
        try:
            ip = ipaddress.ip_address(hostname)
        except ValueError:
            continue  # a domain name, not an IP literal (assumed handling)
        if ip.is_private or ip.is_loopback:
            raise ValueError(
                f"URL points to private IP: {url}. Blocked to prevent SSRF attacks."
            )
    return v
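To see which URLs checks of this kind would reject, the following standalone sketch applies the same localhost and private-IP tests outside of Pydantic. The function name `is_blocked_host` is hypothetical, and the sketch only inspects IP literals; it does not resolve DNS names, so a domain that points at a private address is not detected here.

```python
import ipaddress
from urllib.parse import urlparse


def is_blocked_host(url):
    """Return True if the URL's host is localhost or a private/loopback IP literal."""
    hostname = urlparse(url).hostname
    if hostname in ("localhost", "0.0.0.0"):
        return True
    try:
        ip = ipaddress.ip_address(hostname)
    except ValueError:
        # Not an IP literal (for example a domain name); DNS is not resolved here.
        return False
    return ip.is_private or ip.is_loopback


for candidate in [
    "https://www.bbc.com/news",
    "http://127.0.0.1:8000/",
    "http://192.168.1.10/admin",
    "http://localhost/",
]:
    print(candidate, is_blocked_host(candidate))
```

Running this over the start_urls of an old spider before writing the config is a quick way to spot internal test URLs that would fail import.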
Callback Validation
@field_validator("callbacks")
@classmethod
def validate_callbacks(cls, v):
    reserved_names = {
        "parse_article",
        "parse_start_url",
        "start_requests",
        "from_crawler",
        "closed",
        "parse",
    }
    for callback_name in v.keys():
        # Must be valid Python identifier
        if not re.match(r"^[a-zA-Z_][a-zA-Z0-9_]*$", callback_name):
            raise ValueError(
                f"Invalid callback name: '{callback_name}'. "
                "Must be a valid Python identifier."
            )
        # Must not be reserved
        if callback_name in reserved_names:
            raise ValueError(
                f"Callback name '{callback_name}' is reserved and cannot be used."
            )
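Because migrated spiders often reuse Scrapy method names such as parse, it can help to check callback names before importing. The sketch below reuses the reserved names and identifier pattern from the excerpt above; the function `callback_name_problems` and the sample config are hypothetical and are not part of ScrapAI.

```python
import re

# Reserved names and identifier pattern copied from the validator excerpt.
RESERVED = {
    "parse_article",
    "parse_start_url",
    "start_requests",
    "from_crawler",
    "closed",
    "parse",
}
IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")


def callback_name_problems(config):
    """Return human-readable problems with the callback names in a config dict."""
    problems = []
    for name in config.get("callbacks", {}):
        if not IDENTIFIER.match(name):
            problems.append(f"{name!r} is not a valid Python identifier")
        elif name in RESERVED:
            problems.append(f"{name!r} is reserved")
    return problems


config = {"callbacks": {"parse_products": {}, "parse": {}, "parse-item": {}}}
print(callback_name_problems(config))
# ["'parse' is reserved", "'parse-item' is not a valid Python identifier"]
```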
Testing After Migration
Compare Output Quality
# Run old spider
python old_spider.py > old_output.json
# Run new ScrapAI spider
./scrapai crawl new_spider --project myproject --limit 10
./scrapai export new_spider --project myproject --format json > new_output.json
# Compare field coverage
python -c "
import json
old = json.load(open('old_output.json'))
new = json.load(open('new_output.json'))
old_fields = set(old[0].keys())
new_fields = set(new[0].keys())
print('Missing fields:', old_fields - new_fields)
print('Extra fields:', new_fields - old_fields)
"
# Inspect a sample page
./scrapai inspect https://example.com/article --project myproject
# Check if selectors match
grep 'title' output.html
grep 'content' output.html
# Time old spider
time python old_spider.py
# Time new spider
time ./scrapai crawl new_spider --project myproject
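The field comparison earlier in this section only looks at the first record. A slightly stronger check, sketched below, compares how often each field is actually filled across all records, which catches selectors that match on some pages but not others. The function names and the 5% tolerance are assumptions chosen for illustration, and both inputs are assumed to be non-empty lists of dictionaries.

```python
def fill_rates(records):
    """Map each field name to the fraction of records where it is non-empty."""
    fields = {key for record in records for key in record}
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "", [])) / total
        for field in sorted(fields)
    }


def compare_fill_rates(old, new, tolerance=0.05):
    """Return fields whose fill rate dropped by more than `tolerance` in `new`."""
    old_rates, new_rates = fill_rates(old), fill_rates(new)
    return {
        field: (rate, new_rates.get(field, 0.0))
        for field, rate in old_rates.items()
        if rate - new_rates.get(field, 0.0) > tolerance
    }


old = [{"title": "A", "author": "X"}, {"title": "B", "author": "Y"}]
new = [{"title": "A", "author": "X"}, {"title": "B", "author": None}]
print(compare_fill_rates(old, new))  # {'author': (1.0, 0.5)}
```

In practice `old` and `new` would be loaded from old_output.json and new_output.json with json.load.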
Incremental Migration Strategy
Phase 1: Pilot (1-2 weeks)
Pick 3-5 representative spiders
Migrate to ScrapAI
Run both old and new in parallel
Compare output quality
Tune extraction rules until quality matches
Phase 2: Batch Migration (2-4 weeks)
Group remaining spiders by similarity
Migrate one group at a time
Reuse patterns from pilot spiders
Test each batch before moving to next
Phase 3: Cutover (1 week)
Switch production traffic to ScrapAI
Keep old spiders as backup for 1 month
Monitor error rates and data quality
Retire old code once confident
Phase 4: Optimization (ongoing)
Tune DOWNLOAD_DELAY and CONCURRENT_REQUESTS
Enable DeltaFetch for incremental crawling
Set up Airflow for scheduling
Add custom callbacks for edge cases
Common Pitfalls
Regex Patterns
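Problem: Scrapy and ScrapAI both use Python regex syntax, but a pattern copied from a raw string into JSON needs its backslashes doubled (\d becomes \\d).
Solution: Copy patterns directly, escape backslashes for JSON, then test with ./scrapai crawl --limit 5.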
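Before running a crawl, a pattern can be checked against a few known URLs in plain Python. The sketch below assumes patterns are applied with `re.search`, as Scrapy's LinkExtractor does; the sample URLs are made up. It also uses `json.dumps` to show how a pattern containing a backslash has to be written inside a JSON config.

```python
import json
import re

pattern = r"/news/articles/[a-z0-9-]+$"

# Hypothetical sample URLs mapped to whether the pattern should match them.
samples = {
    "https://www.bbc.com/news/articles/c0abc123": True,
    "https://www.bbc.com/news/world": False,
    "https://www.bbc.com/news/articles/c0abc123?page=2": False,
}
for url, expected in samples.items():
    matched = bool(re.search(pattern, url))
    print(f"{url}: {matched}")
    assert matched == expected

# JSON strings need backslashes escaped, so \d+ is written as \\d+ in a config.
escaped = json.dumps({"type": "regex", "pattern": r"\d+"})
print(escaped)  # {"type": "regex", "pattern": "\\d+"}
```

The third sample is a reminder that a trailing `$` rejects URLs with query strings, which is easy to overlook when copying patterns.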
Relative vs. Absolute URLs
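Problem: Old spider might use urljoin() for relative URLs.
Solution: Scrapy resolves links followed through rules against the page URL automatically. For extracted fields, use css: "a::attr(href)" and confirm in a test crawl that the stored value is absolute.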
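For reference when comparing output, this is the resolution the original script performed by hand with `urljoin`. The hrefs below are made-up examples; whether a given extracted field in the new output has already been resolved this way is something to verify in a test crawl, not something this sketch demonstrates.

```python
from urllib.parse import urljoin

page_url = "https://example.com/products"

# Hypothetical href values as they might appear in the page source.
hrefs = ["/item/42", "item/42", "https://cdn.example.com/img.png"]
resolved = [urljoin(page_url, href) for href in hrefs]
for href, absolute in zip(hrefs, resolved):
    print(f"{href} -> {absolute}")
# /item/42 -> https://example.com/item/42
# item/42 -> https://example.com/item/42
# https://cdn.example.com/img.png -> https://cdn.example.com/img.png
```

If the new output contains values like the left-hand side while the old output contains the right-hand side, the field is not being resolved and the difference will show up in any URL-based comparison.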
Custom Middleware
Problem: Old spider uses custom Scrapy middleware.
Solution:
Proxy rotation: Use ScrapAI’s built-in proxy escalation
Cloudflare: Enable CLOUDFLARE_ENABLED: true
Custom headers: Add to spider settings
Other middleware: May require framework changes (contribute!)
Dynamic Content (JavaScript)
Problem: Old spider uses Selenium or Playwright.
Solution: Use ScrapAI’s Playwright extractor:
{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright"],
    "PLAYWRIGHT_WAIT_SELECTOR": "div.content"
  }
}
See Also
Custom Callbacks: Write custom extraction logic for complex sites
Security: Understanding config validation and SSRF protection