ScrapAI supports multiple extraction strategies that can be chained with fallback. Each extractor attempts to extract content; if it fails, the next one is tried.
Available Extractors
Newspaper4k
General-purpose article extractor for news and blogs
Trafilatura
Lightweight content extraction with high accuracy
Custom CSS
Site-specific CSS selectors for structured data
Playwright
Browser rendering for JavaScript-heavy sites
Extraction Order
Configure extraction order in spider settings. Extractors are then tried in sequence:
- Try the first extractor (e.g., newspaper)
- If extraction fails or the content is too short → try the next extractor
- Continue until extraction succeeds or all extractors are exhausted
- Return ScrapedArticle or None
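The steps above can be sketched as a simple loop. This is an illustrative sketch, not ScrapAI's implementation: the function name `extract_with_fallback`, the `registry` mapping, and the convention that an extractor returns plain text or None are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical extractor signature: takes HTML, returns text or None.
Extractor = Callable[[str], Optional[str]]

MIN_CONTENT_CHARS = 100  # content threshold stated on this page


def extract_with_fallback(
    html: str,
    order: List[str],
    registry: Dict[str, Extractor],
) -> Optional[Tuple[str, str]]:
    """Return (extractor_name, content) for the first success, else None."""
    for name in order:
        try:
            content = registry[name](html)
        except Exception:
            # A parser exception counts as a failed attempt.
            content = None
        if content is not None and len(content) >= MIN_CONTENT_CHARS:
            return name, content
    # Every extractor in the order was exhausted.
    return None


def _raises(html: str) -> Optional[str]:
    raise ValueError("parser error")


demo_registry: Dict[str, Extractor] = {
    "newspaper": lambda html: "too short",
    "trafilatura": lambda html: "x" * 150,
    "custom": _raises,
}
demo_result = extract_with_fallback(
    "<html></html>", ["custom", "newspaper", "trafilatura"], demo_registry
)
```

In the demo, the first extractor raises, the second returns content below the threshold, and the third succeeds, so the result names `trafilatura`.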
Strategy Selection
| Scenario | Recommended Order |
|---|---|
| Generic news/blog (clean HTML) | ["newspaper", "trafilatura"] |
| Generic extractors fail | ["custom", "newspaper", "trafilatura"] |
| JavaScript-rendered content (SPA) | ["playwright", "trafilatura"] |
| JS-rendered + custom structure | ["playwright", "custom"] |
| E-commerce, jobs, listings | ["custom"] (with callbacks) |
| Infinite scroll page | ["playwright"] (single extractor) |
Extractor Comparison
| Feature | Newspaper | Trafilatura | Custom | Playwright |
|---|---|---|---|---|
| Speed | Fast | Fast | Fast | Slow |
| Accuracy | Good | Excellent | Perfect (if configured) | Good |
| Setup | None | None | Requires CSS selectors | Requires wait config |
| Use Case | News articles | Any content | Structured data | JS content |
| Metadata | Keywords, summary, top_image | Description, tags, fingerprint | Custom fields | None (uses trafilatura) |
Content Validation
All extractors return ScrapedArticle, which validates:
Title:
- Must exist
- Min length: 5 characters
Content:
- Must exist
- Min length: 100 characters
If validation fails, the extractor returns None and the next strategy is tried.
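The validation rules can be sketched with a small dataclass. This is a stand-in for illustration, not the library's ScrapedArticle class: the class name `ArticleSketch`, the helper `build_article`, the field names other than `source`, and the choice to strip whitespace before measuring length are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

MIN_TITLE_CHARS = 5      # threshold stated on this page
MIN_CONTENT_CHARS = 100  # threshold stated on this page


@dataclass
class ArticleSketch:
    """Illustrative stand-in for ScrapedArticle (field names assumed)."""
    url: str
    title: str
    content: str
    source: str
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumption: surrounding whitespace does not count toward length.
        if not self.title or len(self.title.strip()) < MIN_TITLE_CHARS:
            raise ValueError("title missing or shorter than 5 characters")
        if not self.content or len(self.content.strip()) < MIN_CONTENT_CHARS:
            raise ValueError("content missing or shorter than 100 characters")


def build_article(**fields: Any) -> Optional[ArticleSketch]:
    """Return a validated article, or None so the next strategy can run."""
    try:
        return ArticleSketch(**fields)
    except ValueError:
        return None
```

Returning None on a validation error, rather than raising, lets the caller treat a too-short result exactly like an extractor that found nothing.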
ScrapedArticle Schema
Page URL
Article/page title (min 5 chars)
Main content text (min 100 chars)
Author name (if available)
Publication date (if available)
Extractor used: "newspaper4k", "trafilatura", "custom", "playwright"
Extraction timestamp (UTC)
Extractor-specific or custom fields
Newspaper metadata:
- top_image: Main image URL
- keywords: Extracted keywords
- summary: Auto-generated summary
Trafilatura metadata:
- description: Meta description
- sitename: Site name
- categories, tags, fingerprint, license
Custom metadata:
- Any fields from CUSTOM_SELECTORS (except title/content/author/date)
- Any fields from callback extract config
Raw HTML (only if include_html=True in export)
Configuration Examples
Generic News Site
- Try newspaper first (fast, good for news)
- Fall back to trafilatura if newspaper fails
Custom Selectors with Fallback
- Try custom selectors first (highest accuracy)
- Fall back to generic extractors if selectors fail
JavaScript-Rendered Site
- Render page with Playwright
- Wait for .article-content to appear
- Extract with trafilatura from rendered HTML
E-commerce (Custom Only)
- No generic extractors (not article content)
- Use callback-based extraction only
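The four examples above might translate into settings along these lines. The extractor orders come from the Strategy Selection table and `CUSTOM_SELECTORS` is named on this page, but the key names `EXTRACTOR_ORDER` and `PLAYWRIGHT_WAIT_SELECTOR`, the shape of the selector mapping, and the selectors `h1.headline` and `div.article-body` are assumptions for illustration. Check the Settings reference for the real names.

```python
from typing import Any, Dict, List

# Extractor names as they appear in the Strategy Selection table.
KNOWN_EXTRACTORS = {"newspaper", "trafilatura", "custom", "playwright"}

# NOTE: all setting keys below except CUSTOM_SELECTORS are assumed names.
GENERIC_NEWS: Dict[str, Any] = {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
}

CUSTOM_WITH_FALLBACK: Dict[str, Any] = {
    "EXTRACTOR_ORDER": ["custom", "newspaper", "trafilatura"],
    # Placeholder selectors; real ones are site-specific.
    "CUSTOM_SELECTORS": {
        "title": "h1.headline",
        "content": "div.article-body",
    },
}

JS_RENDERED: Dict[str, Any] = {
    "EXTRACTOR_ORDER": ["playwright", "trafilatura"],
    "PLAYWRIGHT_WAIT_SELECTOR": ".article-content",
}

ECOMMERCE: Dict[str, Any] = {
    "EXTRACTOR_ORDER": ["custom"],  # callback-based extraction only
}


def unknown_extractors(settings: Dict[str, Any]) -> List[str]:
    """List names in the order that are not known extractors."""
    order = settings.get("EXTRACTOR_ORDER", [])
    return [name for name in order if name not in KNOWN_EXTRACTORS]
```

A check like `unknown_extractors` is a cheap guard against typos in an order list before a crawl starts.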
Fallback Behavior
Extraction fails when:
- Selector returns no match
- Content/title too short (< 100 chars / < 5 chars)
- Parser exception
- Extractor returns None
- Validation fails on returned ScrapedArticle
If all extractors fail:
- Page is skipped
- Error logged
- No item saved to database
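The outcome when every extractor fails can be sketched as follows. The names `process_page`, `extract`, and `saved` are hypothetical, and a plain list stands in for the database. The sketch only illustrates the skip, log, and no-save behavior described above.

```python
import logging
from typing import Any, Callable, Dict, List, Optional

logger = logging.getLogger("extraction_sketch")


def process_page(
    url: str,
    html: str,
    extract: Callable[[str], Optional[Dict[str, Any]]],
    saved: List[Dict[str, Any]],
) -> bool:
    """Save the extracted item, or log and skip if extraction returned None."""
    article = extract(html)
    if article is None:
        # All extractors failed: log the error and save nothing.
        logger.error("All extractors failed for %s; page skipped", url)
        return False
    saved.append({"url": url, **article})
    return True
```

Here `extract` represents the whole fallback chain, so a None result means every configured extractor has already been tried.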
Performance Considerations
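Fast extractors (news/blogs): ["newspaper", "trafilatura"]. Playwright is the slow option because it renders each page in a browser.
Debugging Extraction
Check the source field in scraped items to see which extractor in the order succeeded:
source: newspaper4k or source: trafilatura, etc.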
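To see how often each extractor wins across a crawl, the source values can be tallied. This is a minimal sketch that assumes the scraped items are already loaded as dictionaries. The function name `count_sources` and the `"unknown"` fallback label are illustrative.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_sources(items: Iterable[Dict[str, Any]]) -> Counter:
    """Count items per extractor, using each item's source field."""
    return Counter(item.get("source", "unknown") for item in items)


sample_items = [
    {"title": "First article", "source": "newspaper4k"},
    {"title": "Second article", "source": "trafilatura"},
    {"title": "Third article", "source": "newspaper4k"},
]
source_counts = count_sources(sample_items)
```

If most items come from a later extractor in the order, moving that extractor to the front avoids wasted attempts.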
Related
- Newspaper Extractor - Newspaper4k configuration
- Trafilatura Extractor - Trafilatura options
- Custom Extractors - CSS selector syntax
- Playwright Extractor - Browser rendering
- Settings - Complete settings reference