Crawling¶
Batch-extract articles from multiple pages using the integrated crawler. The crawler supports BFS traversal from seed URLs, sitemap ingestion, URL filtering, and automatic Markdown output with manifest tracking.
Audience: Developers running multi-page extraction workflows.
Prerequisites: CLI installed and write access to the output directory.
Time: ~10-20 minutes for the first crawl run.
What you'll learn: How to seed, configure, and verify crawler output.
Quick Start¶
CLI¶
# Crawl from a seed URL
uv run article-extractor crawl \
--seed https://example.com/blog \
--output-dir ./crawl-output
# Crawl from a sitemap
uv run article-extractor crawl \
--sitemap https://example.com/sitemap.xml \
--output-dir ./crawl-output
# Combine seeds and sitemaps with filters
uv run article-extractor crawl \
--seed https://example.com/docs \
--sitemap https://example.com/sitemap.xml \
--allow-prefix https://example.com/docs/ \
--deny-prefix https://example.com/docs/private/ \
--max-pages 50 \
--output-dir ./docs-crawl
HTTP API¶
# Submit a crawl job
curl -X POST http://localhost:3000/crawl \
-H "Content-Type: application/json" \
-d '{
"output_dir": "/data/crawl-output",
"seeds": ["https://example.com/blog"],
"max_pages": 100
}'
# Response: {"job_id": "abc-123", "status": "queued", ...}
# Poll job status
curl http://localhost:3000/crawl/abc-123
# Download manifest when complete
curl http://localhost:3000/crawl/abc-123/manifest -o manifest.json
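The same flow can be scripted against the endpoints shown above. A minimal Python sketch; the `requests` dependency and the five-second polling interval are assumptions, not part of the API:

```python
# Submit a crawl job, poll until it finishes, then save the manifest.
# Uses only the documented endpoints: POST /crawl, GET /crawl/{job_id},
# and GET /crawl/{job_id}/manifest.
import time
import requests

BASE = "http://localhost:3000"

resp = requests.post(f"{BASE}/crawl", json={
    "output_dir": "/data/crawl-output",
    "seeds": ["https://example.com/blog"],
    "max_pages": 100,
})
resp.raise_for_status()
job_id = resp.json()["job_id"]

while True:
    status = requests.get(f"{BASE}/crawl/{job_id}").json()
    print(f"{status['status']}: {status['progress']}/{status['total']}")
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(5)  # polling interval is an arbitrary choice

if status["status"] == "completed":
    manifest = requests.get(f"{BASE}/crawl/{job_id}/manifest")
    manifest.raise_for_status()
    with open("manifest.json", "wb") as f:
        f.write(manifest.content)
```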
Configuration¶
CLI Options¶
| Option | Default | Description |
|---|---|---|
| `--seed URL` | — | Seed URL to start crawling (repeatable) |
| `--sitemap URL` | — | Sitemap URL or local file path (repeatable) |
| `--output-dir PATH` | `prompts` | Output directory for Markdown files |
| `--allow-prefix PREFIX` | all | Only crawl URLs starting with prefix (repeatable) |
| `--deny-prefix PREFIX` | none | Skip URLs starting with prefix (repeatable) |
| `--max-pages N` | `100` | Maximum pages to crawl |
| `--max-depth N` | `3` | Maximum BFS depth from seeds |
| `--concurrency N` | `5` | Concurrent requests |
| `--rate-limit SECONDS` | `1.0` | Delay between requests to the same host |
| `--follow-links` | `true` | Discover and follow links in pages |
| `--no-follow-links` | — | Only crawl seed/sitemap URLs |
Network options (`--headed`, `--storage-state`, `--prefer-playwright`, etc.) behave the same as in single-URL extraction.
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
| `ARTICLE_EXTRACTOR_CRAWLER_CONCURRENCY` | `5` | Default concurrent requests |
| `ARTICLE_EXTRACTOR_CRAWLER_RATE_LIMIT` | `1.0` | Default rate-limit delay (seconds) |
| `ARTICLE_EXTRACTOR_CRAWLER_MAX_PAGES` | `100` | Default page limit |
Output Structure¶
All extracted pages are saved as flat Markdown files in the output directory. Path separators (`/`) in URLs are replaced with double underscores (`__`) to create a flat structure:
output-dir/
├── manifest.json # Crawl metadata and results
├── example.com__blog.md
├── example.com__blog__post-1.md
├── example.com__blog__post-2.md
└── docs.example.com__getting-started.md
For deeply nested URLs (like wiki pages), the flat structure avoids excessive directory nesting:
# URL: https://wiki.example.com/spaces/DOCS/pages/12345678/GettingStarted
# File: wiki.example.com__spaces__DOCS__pages__12345678__GettingStarted.md
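The naming scheme is easy to reproduce when you need to map URLs back to files. A sketch assuming the filename is simply host plus path with `/` replaced by `__`; exact handling of trailing slashes and query strings may differ:

```python
# Reproduce the documented flat naming scheme: drop the scheme, then
# replace "/" in host+path with "__" and append ".md".
from urllib.parse import urlparse

def output_filename(url: str) -> str:
    parts = urlparse(url)
    flat = (parts.netloc + parts.path.rstrip("/")).replace("/", "__")
    return flat + ".md"

print(output_filename("https://example.com/blog/post-1"))
# example.com__blog__post-1.md
print(output_filename("https://wiki.example.com/spaces/DOCS/pages/12345678/GettingStarted"))
# wiki.example.com__spaces__DOCS__pages__12345678__GettingStarted.md
```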
Markdown Format¶
Each extracted page is saved as a Markdown file with YAML frontmatter:
---
url: "https://example.com/blog/post-1"
title: "My First Blog Post"
extracted_at: "2026-01-05T12:30:00Z"
word_count: 1500
---
# My First Blog Post
Article content in Markdown format...
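Downstream tooling can split the frontmatter from the body. A minimal sketch, assuming PyYAML is installed and files follow the layout above:

```python
# Split a crawled file into (frontmatter dict, Markdown body).
import yaml

def read_page(path: str) -> tuple[dict, str]:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Files start with "---\n<yaml>\n---\n<body>", as in the example above.
    _, frontmatter, body = text.split("---\n", 2)
    return yaml.safe_load(frontmatter), body.lstrip("\n")

meta, body = read_page("crawl-output/example.com__blog__post-1.md")
print(meta["title"], meta["word_count"])
```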
Manifest Schema¶
{
"job_id": "abc-123",
"started_at": "2026-01-05T12:00:00Z",
"completed_at": "2026-01-05T12:30:00Z",
"config": {
"seeds": ["https://example.com/blog"],
"max_pages": 100,
"concurrency": 5
},
"total_pages": 42,
"successful": 40,
"failed": 1,
"skipped": 1,
"duration_seconds": 1800.5,
"results": [
{
"url": "https://example.com/blog/post-1",
"file_path": "example.com__blog__post-1.md",
"status": "success",
"word_count": 1500,
"title": "My First Blog Post"
}
]
}
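Because the manifest is plain JSON, post-crawl reporting is straightforward. A sketch that summarizes a run and lists pages that did not extract cleanly, using only the fields shown above:

```python
# Summarize a crawl run from its manifest.
import json

with open("manifest.json", encoding="utf-8") as f:
    manifest = json.load(f)

print(f"{manifest['successful']}/{manifest['total_pages']} pages extracted "
      f"in {manifest['duration_seconds']:.0f}s")

# List every page that did not extract successfully.
for result in manifest["results"]:
    if result["status"] != "success":
        print(result["status"], result["url"])
```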
URL Filtering¶
The crawler applies filters in this order:
1. Allow prefixes: if any are specified, the URL must start with at least one of them
2. Deny prefixes: the URL is rejected if it starts with any deny prefix
3. Same-origin: by default, only URLs from the same origin as the seeds are followed
# Crawl only /docs/ pages, excluding /docs/internal/
uv run article-extractor crawl \
--seed https://example.com/docs \
--allow-prefix https://example.com/docs/ \
--deny-prefix https://example.com/docs/internal/ \
--output-dir ./docs
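The same three rules, sketched in Python for clarity. This mirrors the documented order rather than the crawler's actual implementation, and the same-origin check is simplified to a hostname comparison:

```python
# Apply the documented filter order: allow prefixes, deny prefixes,
# then same-origin (simplified here to matching hostnames).
from urllib.parse import urlparse

def is_allowed(url: str, seeds: list[str],
               allow: list[str], deny: list[str]) -> bool:
    # 1. If allow prefixes are given, the URL must match at least one.
    if allow and not any(url.startswith(p) for p in allow):
        return False
    # 2. Any matching deny prefix rejects the URL.
    if any(url.startswith(p) for p in deny):
        return False
    # 3. By default, only URLs from a seed's host are followed.
    return urlparse(url).netloc in {urlparse(s).netloc for s in seeds}

seeds = ["https://example.com/docs"]
print(is_allowed("https://example.com/docs/intro", seeds,
                 allow=["https://example.com/docs/"],
                 deny=["https://example.com/docs/internal/"]))  # True
```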
Rate Limiting¶
The crawler enforces per-host rate limiting to avoid overloading servers:
- Default: 1 second between requests to the same host
- Requests run concurrently only against different hosts; requests to the same host are serialized
- 429 responses trigger exponential backoff
# Slower crawl for sensitive servers
uv run article-extractor crawl \
--seed https://fragile-server.com \
--rate-limit 3.0 \
--concurrency 2 \
--output-dir ./output
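Conceptually, per-host rate limiting amounts to remembering the last request time for each host and sleeping out the remainder of the delay. An illustrative asyncio sketch, not the crawler's internals:

```python
# Per-host rate limiter: serialize requests to the same host and
# enforce a minimum delay between them.
import asyncio
import time
from urllib.parse import urlparse

class HostRateLimiter:
    def __init__(self, delay: float = 1.0):
        self.delay = delay
        self.last: dict[str, float] = {}
        self.locks: dict[str, asyncio.Lock] = {}

    async def wait(self, url: str) -> None:
        host = urlparse(url).netloc
        lock = self.locks.setdefault(host, asyncio.Lock())
        async with lock:  # one in-flight wait per host
            elapsed = time.monotonic() - self.last.get(host, 0.0)
            if elapsed < self.delay:
                await asyncio.sleep(self.delay - elapsed)
            self.last[host] = time.monotonic()
```

A real crawler would pair this with a global concurrency limit and, per the behavior described above, exponential backoff on 429 responses.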
Headed Mode for Protected Sites¶
For sites requiring authentication or CAPTCHA solving:
# Interactive crawl with stored session
uv run article-extractor crawl \
--seed https://protected-site.com \
--headed \
--storage-state ./session.json \
--user-interaction-timeout 60 \
--max-pages 20 \
--output-dir ./protected-output
The crawler pauses on the first page to let you log in, then continues automatically.
API Endpoints¶
POST /crawl¶
Submit a new crawl job.
Request:
{
"output_dir": "/data/output",
"seeds": ["https://example.com"],
"sitemaps": [],
"allow_prefixes": [],
"deny_prefixes": [],
"max_pages": 100,
"max_depth": 3,
"concurrency": 5,
"rate_limit_delay": 1.0,
"follow_links": true,
"network": {
"headed": false,
"storage_state": null
}
}
Response (202 Accepted):
{
"job_id": "abc-123",
"status": "queued",
"progress": 0,
"total": 0
}
GET /crawl/{job_id}¶
Poll job status.
Response:
{
"job_id": "abc-123",
"status": "running",
"progress": 25,
"total": 100,
"successful": 24,
"failed": 1,
"skipped": 0,
"started_at": "2026-01-05T12:00:00Z"
}
Status values: `queued`, `running`, `completed`, `failed`
GET /crawl/{job_id}/manifest¶
Download the manifest.json for a completed job.
Response: The manifest JSON file.
Errors:
- 404 if the job is not found
- 400 if the job is not completed
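A short sketch of handling those error codes when fetching the manifest (the job ID is hypothetical):

```python
# Fetch a manifest, handling the two documented error responses.
import requests

resp = requests.get("http://localhost:3000/crawl/abc-123/manifest")
if resp.status_code == 404:
    print("unknown job_id")
elif resp.status_code == 400:
    print("job not completed yet; keep polling GET /crawl/{job_id}")
else:
    resp.raise_for_status()
    with open("manifest.json", "wb") as f:
        f.write(resp.content)
```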
Troubleshooting¶
Crawl hangs or times out¶
- Reduce `--concurrency` to avoid overwhelming servers
- Increase `--rate-limit` for rate-limited sites
- Use `--headed` mode to debug JavaScript-heavy pages
Missing pages¶
- Check `manifest.json` for failed/skipped entries
- Verify URL filters aren't too restrictive
- Ensure `--max-depth` is sufficient for deep hierarchies
Disk space warnings¶
The crawler warns if the output directory has less than 100 MB free. For large crawls:
# Check available space before crawling
df -h /path/to/output
# Use --max-pages to limit scope
uv run article-extractor crawl \
--seed https://large-site.com \
--max-pages 500 \
--output-dir /data/large-crawl