Crawling

Batch-extract articles from multiple pages using the integrated crawler. The crawler supports BFS traversal from seed URLs, sitemap ingestion, URL filtering, and automatic Markdown output with manifest tracking.

Audience: Developers running multi-page extraction workflows.
Prerequisites: CLI installed and write access to the output directory.
Time: ~10-20 minutes for the first crawl run.
What you'll learn: How to seed, configure, and verify crawler output.

Quick Start

CLI

# Crawl from a seed URL
uv run article-extractor crawl \
  --seed https://example.com/blog \
  --output-dir ./crawl-output

# Crawl from a sitemap
uv run article-extractor crawl \
  --sitemap https://example.com/sitemap.xml \
  --output-dir ./crawl-output

# Combine seeds and sitemaps with filters
uv run article-extractor crawl \
  --seed https://example.com/docs \
  --sitemap https://example.com/sitemap.xml \
  --allow-prefix https://example.com/docs/ \
  --deny-prefix https://example.com/docs/private/ \
  --max-pages 50 \
  --output-dir ./docs-crawl

HTTP API

# Submit a crawl job
curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "output_dir": "/data/crawl-output",
    "seeds": ["https://example.com/blog"],
    "max_pages": 100
  }'

# Response: {"job_id": "abc-123", "status": "queued", ...}

# Poll job status
curl http://localhost:3000/crawl/abc-123

# Download manifest when complete
curl http://localhost:3000/crawl/abc-123/manifest -o manifest.json
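
If you prefer to script the submission step, here is a minimal Python equivalent of the curl call above, using only the standard library and assuming the API server is reachable at http://localhost:3000 as shown:

import json
import urllib.request

API = "http://localhost:3000"

# Submit a crawl job (same body as the curl example above).
payload = {
    "output_dir": "/data/crawl-output",
    "seeds": ["https://example.com/blog"],
    "max_pages": 100,
}
req = urllib.request.Request(
    f"{API}/crawl",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    job = json.load(resp)

print(job["job_id"], job["status"])  # e.g. "abc-123 queued"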

Configuration

CLI Options

Option                   Default   Description
--seed URL                         Seed URL to start crawling (repeatable)
--sitemap URL                      Sitemap URL or local file path (repeatable)
--output-dir PATH        prompts   Output directory for Markdown files
--allow-prefix PREFIX    all       Only crawl URLs starting with prefix (repeatable)
--deny-prefix PREFIX     none      Skip URLs starting with prefix (repeatable)
--max-pages N            100       Maximum pages to crawl
--max-depth N            3         Maximum BFS depth from seeds
--concurrency N          5         Concurrent requests
--rate-limit SECONDS     1.0       Delay between requests to same host
--follow-links           true      Discover and follow links in pages
--no-follow-links                  Only crawl seed/sitemap URLs

Network options (--headed, --storage-state, --prefer-playwright, etc.) behave the same as in single-URL extraction.

Environment Variables

Variable                                  Default   Description
ARTICLE_EXTRACTOR_CRAWLER_CONCURRENCY     5         Default concurrent requests
ARTICLE_EXTRACTOR_CRAWLER_RATE_LIMIT      1.0       Default rate limit delay (seconds)
ARTICLE_EXTRACTOR_CRAWLER_MAX_PAGES       100       Default page limit

Output Structure

All extracted pages are saved as Markdown files directly in the output directory. Path separators (/) in URLs are replaced with double underscores (__), so the layout stays flat:

output-dir/
├── manifest.json                           # Crawl metadata and results
├── example.com__blog.md
├── example.com__blog__post-1.md
├── example.com__blog__post-2.md
└── docs.example.com__getting-started.md

For deeply nested URLs (like wiki pages), the flat structure avoids excessive directory nesting:

# URL: https://wiki.example.com/spaces/DOCS/pages/12345678/GettingStarted
# File: wiki.example.com__spaces__DOCS__pages__12345678__GettingStarted.md
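
For reference, the sketch below reproduces this mapping (scheme dropped, host and path segments joined with double underscores, .md appended). url_to_filename is a hypothetical helper that illustrates the naming convention, not the crawler's actual implementation:

from urllib.parse import urlparse

def url_to_filename(url: str) -> str:
    # Hypothetical helper: drop the scheme, join host and path segments
    # with double underscores, append the .md suffix.
    parts = urlparse(url)
    segments = [parts.netloc] + [s for s in parts.path.split("/") if s]
    return "__".join(segments) + ".md"

print(url_to_filename("https://example.com/blog/post-1"))
# example.com__blog__post-1.md
print(url_to_filename(
    "https://wiki.example.com/spaces/DOCS/pages/12345678/GettingStarted"
))
# wiki.example.com__spaces__DOCS__pages__12345678__GettingStarted.md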

Markdown Format

Each extracted page is saved as a Markdown file with YAML frontmatter:

---
url: "https://example.com/blog/post-1"
title: "My First Blog Post"
extracted_at: "2026-01-05T12:30:00Z"
word_count: 1500
---

# My First Blog Post

Article content in Markdown format...
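
A minimal reader sketch for these files, assuming PyYAML is available to parse the frontmatter block (read_page is a hypothetical helper, not part of the package):

from pathlib import Path

import yaml  # PyYAML; assumed to be installed in your environment

def read_page(path: Path) -> tuple[dict, str]:
    # Split a crawled Markdown file into (frontmatter, body).
    text = path.read_text(encoding="utf-8")
    # Frontmatter sits between the first two '---' delimiters.
    _, frontmatter, body = text.split("---", 2)
    return yaml.safe_load(frontmatter), body.strip()

meta, body = read_page(Path("crawl-output/example.com__blog__post-1.md"))
print(meta["url"], meta["word_count"])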

Manifest Schema

{
  "job_id": "abc-123",
  "started_at": "2026-01-05T12:00:00Z",
  "completed_at": "2026-01-05T12:30:00Z",
  "config": {
    "seeds": ["https://example.com/blog"],
    "max_pages": 100,
    "concurrency": 5
  },
  "total_pages": 42,
  "successful": 40,
  "failed": 1,
  "skipped": 1,
  "duration_seconds": 1800.5,
  "results": [
    {
      "url": "https://example.com/blog/post-1",
      "file_path": "example.com__blog__post-1.md",
      "status": "success",
      "word_count": 1500,
      "title": "My First Blog Post"
    }
  ]
}
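
To check a finished crawl programmatically, a short sketch that reads manifest.json and reports any non-successful entries, using the fields shown above:

import json
from pathlib import Path

manifest = json.loads(Path("crawl-output/manifest.json").read_text())

print(f"{manifest['successful']}/{manifest['total_pages']} pages extracted "
      f"in {manifest['duration_seconds']:.0f}s")

# List anything that did not produce a Markdown file.
for result in manifest["results"]:
    if result["status"] != "success":
        print(result["status"], result["url"])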

URL Filtering

The crawler applies filters in order:

  1. Allow prefixes — If specified, URL must start with at least one prefix
  2. Deny prefixes — URL is rejected if it starts with any deny prefix
  3. Same-origin — By default, only URLs from the same origin as seeds are followed

# Crawl only /docs/ pages, excluding /docs/internal/
uv run article-extractor crawl \
  --seed https://example.com/docs \
  --allow-prefix https://example.com/docs/ \
  --deny-prefix https://example.com/docs/internal/ \
  --output-dir ./docs
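
The sketch below illustrates the same ordering in Python. should_crawl is a hypothetical function for illustration only, not the crawler's actual code:

from urllib.parse import urlparse

def should_crawl(url: str, seeds: list[str],
                 allow: list[str], deny: list[str]) -> bool:
    # 1. Allow prefixes: if any are given, the URL must match one of them.
    if allow and not any(url.startswith(p) for p in allow):
        return False
    # 2. Deny prefixes: reject the URL if it matches any of them.
    if any(url.startswith(p) for p in deny):
        return False
    # 3. Same-origin: by default, only follow URLs sharing an origin with a seed.
    seed_origins = {(urlparse(s).scheme, urlparse(s).netloc) for s in seeds}
    return (urlparse(url).scheme, urlparse(url).netloc) in seed_origins

seeds = ["https://example.com/docs"]
print(should_crawl("https://example.com/docs/intro", seeds,
                   allow=["https://example.com/docs/"],
                   deny=["https://example.com/docs/internal/"]))  # True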

Rate Limiting

The crawler enforces per-host rate limiting to avoid overloading servers:

  • Default: 1 second between requests to the same host
  • Concurrency is spread across hosts; requests to the same host are spaced by the rate limit
  • 429 responses trigger exponential backoff

# Slower crawl for sensitive servers
uv run article-extractor crawl \
  --seed https://fragile-server.com \
  --rate-limit 3.0 \
  --concurrency 2 \
  --output-dir ./output
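
As a rough illustration of this behavior (per-host spacing plus exponential backoff on 429), here is a standard-library sketch; it is not the crawler's internals:

import time
import urllib.error
import urllib.request
from urllib.parse import urlparse

RATE_LIMIT = 1.0                        # seconds between requests to the same host
_last_request: dict[str, float] = {}    # host -> time of last request

def polite_get(url: str, max_retries: int = 3) -> bytes:
    # Space out requests to the same host.
    host = urlparse(url).netloc
    wait = RATE_LIMIT - (time.monotonic() - _last_request.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)
    for attempt in range(max_retries + 1):
        _last_request[host] = time.monotonic()
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_retries:
                raise
            time.sleep(2 ** attempt)    # back off: 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")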

Headed Mode for Protected Sites

For sites requiring authentication or CAPTCHA solving:

# Interactive crawl with stored session
uv run article-extractor crawl \
  --seed https://protected-site.com \
  --headed \
  --storage-state ./session.json \
  --user-interaction-timeout 60 \
  --max-pages 20 \
  --output-dir ./protected-output

The crawler pauses on the first page to let you log in, then continues automatically.

API Endpoints

POST /crawl

Submit a new crawl job.

Request:

{
  "output_dir": "/data/output",
  "seeds": ["https://example.com"],
  "sitemaps": [],
  "allow_prefixes": [],
  "deny_prefixes": [],
  "max_pages": 100,
  "max_depth": 3,
  "concurrency": 5,
  "rate_limit_delay": 1.0,
  "follow_links": true,
  "network": {
    "headed": false,
    "storage_state": null
  }
}

Response (202 Accepted):

{
  "job_id": "abc-123",
  "status": "queued",
  "progress": 0,
  "total": 0
}

GET /crawl/{job_id}

Poll job status.

Response:

{
  "job_id": "abc-123",
  "status": "running",
  "progress": 25,
  "total": 100,
  "successful": 24,
  "failed": 1,
  "skipped": 0,
  "started_at": "2026-01-05T12:00:00Z"
}

Status values: queued, running, completed, failed
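
A minimal Python polling loop built on these endpoints, assuming the server address from Quick Start:

import json
import time
import urllib.request

API = "http://localhost:3000"
job_id = "abc-123"  # returned by POST /crawl

# Poll until the job reaches a terminal status.
while True:
    with urllib.request.urlopen(f"{API}/crawl/{job_id}") as resp:
        status = json.load(resp)
    print(f"{status['status']}: {status['progress']}/{status['total']}")
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(5)

# Download the manifest for a completed job.
if status["status"] == "completed":
    with urllib.request.urlopen(f"{API}/crawl/{job_id}/manifest") as resp:
        with open("manifest.json", "wb") as out:
            out.write(resp.read())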

GET /crawl/{job_id}/manifest

Download the manifest.json for a completed job.

Response: The manifest JSON file.

Errors:

  • 404 if the job is not found
  • 400 if the job is not completed

Troubleshooting

Crawl hangs or times out

  • Reduce --concurrency to avoid overwhelming servers
  • Increase --rate-limit for rate-limited sites
  • Use --headed mode to debug JavaScript-heavy pages

Missing pages

  • Check manifest.json for failed/skipped entries
  • Verify URL filters aren't too restrictive
  • Ensure --max-depth is sufficient for deep hierarchies

Disk space warnings

The crawler warns if the output directory has less than 100 MB free. For large crawls:

# Check available space before crawling
df -h /path/to/output

# Use --max-pages to limit scope
uv run article-extractor crawl \
  --seed https://large-site.com \
  --max-pages 500 \
  --output-dir /data/large-crawl