How-To: Debug Crawlers¶

Goal: Diagnose and fix synchronization failures for online and git tenants.
Prerequisites: Docker container running, basic command-line familiarity.
Time: ~10-30 minutes depending on issue

Quick Diagnosis¶

1. Check Container Health¶

curl -s http://localhost:42042/health | jq '{status, tenant_count}'

You should see "status": "healthy" and your configured tenant count.
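
If you want to script this check, the sketch below validates a `/health` response body. It assumes only the two fields used in the jq filter above (`status` and `tenant_count`); the function name `check_health` and the `expected_tenants` parameter are hypothetical, not part of the server.

```python
import json


def check_health(raw_body: str, expected_tenants: int) -> list:
    """Return a list of problems found in a /health response body (empty if none)."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        return [f"response is not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["response is not a JSON object"]

    problems = []
    status = payload.get("status")
    if status != "healthy":
        problems.append(f"status is {status!r}, expected 'healthy'")
    count = payload.get("tenant_count")
    if count != expected_tenants:
        problems.append(f"tenant_count is {count!r}, expected {expected_tenants}")
    return problems
```

Feed it the output of the curl command above; an empty list means both checks passed.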

2. Check Container Logs¶

# Recent logs for specific tenant
docker logs docs-mcp-server 2>&1 | grep -i "<tenant>" | tail -30

# All errors
docker logs docs-mcp-server 2>&1 | grep -iE "error|exception|failed" | tail -20
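
The two filters above can also be combined so that only error lines for one tenant are shown. This is a minimal sketch that works on log text you have already captured (for example by redirecting `docker logs` to a file); the helper name `tenant_errors` is hypothetical, and the log line format is not assumed beyond plain text.

```python
import re

# Same keywords as the grep -iE filter above
ERROR_PATTERN = re.compile(r"error|exception|failed", re.IGNORECASE)


def tenant_errors(log_text: str, tenant: str, limit: int = 20) -> list:
    """Return the last `limit` lines that mention the tenant and an error keyword."""
    tenant_lower = tenant.lower()
    matches = [
        line
        for line in log_text.splitlines()
        if tenant_lower in line.lower() and ERROR_PATTERN.search(line)
    ]
    return matches[-limit:] if limit > 0 else []
```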

3. Force Re-sync¶

uv run python trigger_all_syncs.py --tenants <tenant> --force

Common Issues¶

Sync stuck at syncing status¶

Symptoms

Status shows "syncing" for an extended period (more than 10 minutes for small tenants).

Diagnosis:

docker logs docs-mcp-server 2>&1 | grep -i "<tenant>" | tail -50

Fixes:

1. Rate limiting: The site may be throttling requests.
   - Set max_crawl_pages lower (e.g., 500)
   - Add delay between syncs via refresh_schedule
2. Container resource exhaustion:
   ```
   docker stats docs-mcp-server
   ```
   If memory or CPU is pegged, restart the container.
3. Network timeout:
   - Increase http_timeout in infrastructure settings
   - Check if the site is accessible from the container

0 documents after sync¶

Symptoms

Sync completes but reports documents_count: 0.

Diagnosis:

# Check if sitemap is accessible
curl -s "https://docs.example.com/sitemap.xml" | head -20

# Check whitelist matches actual URLs
curl -s "https://docs.example.com/sitemap.xml" | grep "<loc>" | head -5
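
To compare the sitemap against the whitelist without reading URLs by eye, a short script can count how many `<loc>` entries start with a configured prefix. This is a sketch under assumptions: the helper name `whitelist_coverage` is hypothetical, and the prefixes are passed as a Python list rather than in whatever form `url_whitelist_prefixes` takes in your configuration.

```python
import xml.etree.ElementTree as ET


def whitelist_coverage(sitemap_xml: str, prefixes: list) -> tuple:
    """Return (matched, total) counts of sitemap URLs that start with any prefix."""
    root = ET.fromstring(sitemap_xml)
    # Sitemap tags are namespaced, so compare only the local part of the tag name
    urls = [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]
    matched = sum(1 for url in urls if any(url.startswith(p) for p in prefixes))
    return matched, len(urls)
```

A result such as `(0, 2)` means the sitemap has URLs but none pass the whitelist, which points at a prefix mismatch rather than an unreachable sitemap.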

Fixes:

1. Wrong whitelist prefix:

// Wrong: sitemap has /en/stable/ but whitelist has /en/5.2/
"url_whitelist_prefixes": "https://docs.example.com/en/5.2/"

// Fix: match actual URL structure
"url_whitelist_prefixes": "https://docs.example.com/en/stable/"

2. All URLs blacklisted: Review url_blacklist_prefixes
3. JavaScript-rendered content: Enable Playwright in infrastructure:
```
"crawler_playwright_first": true
```

Git sync failed - repository not found¶

Symptoms

Git tenant sync fails with authentication or URL error.

Diagnosis:

# Test URL accessibility
git ls-remote https://github.com/org/repo.git

# Check if token is set (for private repos)
docker exec docs-mcp-server printenv | grep -i token

Fixes:

1. Invalid URL: Ensure URL ends with .git and uses HTTPS
2. Private repo without token:

"git_auth_token_env": "GH_TOKEN"

And pass token to container: docker run -e GH_TOKEN=... ...

3. Wrong branch: Verify branch exists in repository
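
The URL rule in this section (HTTPS scheme, path ending in `.git`) is easy to check before a sync is attempted. The sketch below is a hypothetical pre-flight helper, not something the server ships; it only inspects the string and does not contact the repository.

```python
from urllib.parse import urlparse


def git_url_problems(url: str) -> list:
    """Return reasons a git tenant URL breaks the HTTPS + .git rule (empty if none)."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "https":
        problems.append("URL does not use HTTPS")
    if not parsed.path.endswith(".git"):
        problems.append("URL does not end with .git")
    return problems
```

Note that an SSH-style address such as `git@github.com:org/repo.git` is reported as not using HTTPS.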

Search returns no results after sync¶

Symptoms

Sync shows documents_count > 0 but search returns no results.

Diagnosis:

# Check if index exists
ls -la mcp-data/<tenant>/__search_segments/

Fixes:

1. Index not built: Rebuild manually

uv run python trigger_all_indexing.py --tenants <tenant>

2. Stale segments: Clean and rebuild

uv run python cleanup_segments.py --tenant <tenant>
uv run python trigger_all_indexing.py --tenants <tenant>

Crawler getting blocked¶

Symptoms

HTTP 403/429 errors in logs, partial or no content.

Diagnosis:

# Check for rate limiting
docker logs docs-mcp-server 2>&1 | grep -E "429|403|blocked" | tail -10
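
To see whether blocking is dominated by 403s or 429s, you can tally the same tokens the grep above searches for. This is a rough sketch: the helper name `count_block_signals` is hypothetical, and because the log format is not assumed, a bare `403` or `429` anywhere in a line is counted even if it is not an HTTP status.

```python
import re
from collections import Counter

# Same tokens as the grep above; word boundaries avoid matching e.g. "14290"
BLOCK_PATTERN = re.compile(r"\b(?:403|429)\b|blocked")


def count_block_signals(log_text: str) -> Counter:
    """Count occurrences of 403, 429, and 'blocked' in captured log text."""
    counts = Counter()
    for line in log_text.splitlines():
        for match in BLOCK_PATTERN.finditer(line):
            counts[match.group(0)] += 1
    return counts
```

Mostly 429s suggests rate limiting that a slower schedule may resolve; mostly 403s suggests the site is rejecting the crawler outright, where switching to the sitemap is more likely to help.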

Fixes:

1. Reduce crawl rate:
   - Lower max_crawl_pages
   - Use sitemap instead of crawler when possible
   - Space out refresh_schedule (e.g., every 14 days instead of daily)
2. Use sitemap: Some sites block crawlers but provide sitemaps

"enable_crawler": false,
"docs_sitemap_url": "https://docs.example.com/sitemap.xml"

Crawler lock contention¶

Symptoms

/tenant/sync/status shows crawler_lock_status: contended for several minutes and no new crawl logs appear.

Diagnosis: Call /tenant/sync/status and inspect stats.crawler_lock_status, crawler_lock_owner, and crawler_lock_expires_at. If another worker is crawling, the status stays contended until the TTL expires (default 180 s).

Fixes:

1. Wait for TTL: The lease auto-expires based on crawler_lock_ttl_seconds (minimum 60 s). The next sync run rechecks freshness before crawling again.
2. Manual cleanup (advanced): Stop the server and delete mcp-data/<tenant>/__scheduler_meta/locks/crawler.lock only if you are sure no crawler is running.
3. Adjust TTL: Set "crawler_lock_ttl_seconds": 300 in infrastructure settings if crawls routinely exceed three minutes.
4. Verify freshness: If status stays stale, check last_sync_at. The scheduler skips reruns when the tenant already refreshed within one schedule interval.
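
When deciding whether to wait for the TTL or clean up manually, it helps to know how long the lease has left. The sketch below computes that from crawler_lock_expires_at. It assumes the field is an ISO 8601 timestamp, which this guide does not confirm, and the helper name `lock_seconds_remaining` is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def lock_seconds_remaining(expires_at: str, now: Optional[datetime] = None) -> float:
    """Seconds until the lock lease expires; 0.0 if it has already expired."""
    # Assumption: ISO 8601 input. A trailing "Z" is normalised for older Pythons.
    expiry = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)  # assume UTC when no offset given
    current = now if now is not None else datetime.now(timezone.utc)
    return max(0.0, (expiry - current).total_seconds())
```

If the result is 0.0 and the status is still contended, the lease is not being released on expiry and manual cleanup is worth considering.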

Adaptive concurrency behavior¶

Understanding Concurrency Stats

Crawler uses adaptive concurrency to maximize throughput while respecting rate limits. Check /tenant/sync/status to see current behavior.

Concurrency Stats (from /tenant/sync/status):

{
  "current_limit": 12,      // Current active worker ceiling
  "peak_limit": 20,         // Highest limit reached this session
  "active_workers": 8,      // Workers currently fetching pages
  "peak_active": 15         // Peak concurrency reached
}

How It Works:

- Starts at min: Initial concurrency = crawler_min_concurrency (default 5)
- Ramps up: After 25 successful fetches + 60s without 429s, adds 1 worker slot
- Backs off: On 429 response, immediately halves limit (min floor enforced)
- Caps at max: Never exceeds crawler_max_concurrency (default 20)
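
The rules above can be modelled in a few lines to get a feel for how quickly the limit recovers after a 429. This is an illustrative model only, not the server's implementation: the class name `AdaptiveLimit` is hypothetical, and it assumes the success counter resets after every adjustment and that the quiet window is measured from startup or the most recent 429.

```python
class AdaptiveLimit:
    """Toy model of the ramp-up and back-off rules; not the server's code."""

    def __init__(self, min_limit=5, max_limit=20, started_at=0.0):
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.current = min_limit        # starts at the floor
        self.successes = 0              # successes since the last adjustment
        self.last_429_at = started_at   # startup opens the first quiet window

    def on_success(self, now):
        """Record a successful fetch; add one slot once both conditions hold."""
        self.successes += 1
        quiet_for = now - self.last_429_at
        if self.successes >= 25 and quiet_for >= 60:
            self.current = min(self.current + 1, self.max_limit)
            self.successes = 0

    def on_429(self, now):
        """Halve the limit immediately, never dropping below the floor."""
        self.current = max(self.current // 2, self.min_limit)
        self.successes = 0
        self.last_429_at = now
```

Under these assumptions a single 429 at a limit of 12 drops the ceiling to 6, and each slot regained afterwards needs at least 25 further successes, so recovery is much slower than back-off.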

Tuning Environment Variables:

Set in deployment.json infrastructure section:

{
  "crawler_min_concurrency": 10,    // Floor (1-100)
  "crawler_max_concurrency": 30,    // Ceiling (1-100)
  "crawler_max_sessions": 50,       // Hard process limit (1-100)
  "crawler_lock_ttl_seconds": 240   // Lock TTL (≥60)
}
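
Before restarting with new values, it can save a cycle to check them against the ranges in the comments above. The sketch below does that for a settings dictionary. The helper name `validate_crawler_settings` is hypothetical, and the rule that the floor must not exceed the ceiling is an assumption; this guide does not say how the server handles that case.

```python
def validate_crawler_settings(settings: dict) -> list:
    """Return range violations for the crawler tuning keys (empty if none)."""
    errors = []
    bounded_keys = (
        "crawler_min_concurrency",
        "crawler_max_concurrency",
        "crawler_max_sessions",
    )
    for key in bounded_keys:
        value = settings.get(key)
        if value is not None and not 1 <= value <= 100:
            errors.append(f"{key}={value} is outside 1-100")

    ttl = settings.get("crawler_lock_ttl_seconds")
    if ttl is not None and ttl < 60:
        errors.append(f"crawler_lock_ttl_seconds={ttl} is below 60")

    # Assumption: a floor above the ceiling is treated as a configuration error
    floor = settings.get("crawler_min_concurrency")
    ceiling = settings.get("crawler_max_concurrency")
    if floor is not None and ceiling is not None and floor > ceiling:
        errors.append("crawler_min_concurrency exceeds crawler_max_concurrency")
    return errors
```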

Diagnosing Low Throughput:

# Check if stuck at min_limit
curl -s http://localhost:42042/<tenant>/sync/status | jq '{
  current_limit: .stats.current_limit, 
  active_workers: .stats.active_workers,
  urls_processed: .stats.urls_processed
}'

If current_limit == crawler_min_concurrency and no 429s in logs, possible causes:

1. Rate limiter aggressive: Check AdaptiveRateLimiter delays in logs
2. Host slow: Network latency prevents workers from saturating semaphore
3. Small queue: Frontier exhausted before adaptive ramp-up completes

Forcing Higher Concurrency (risky):

// Bypass gradual ramp-up by starting higher
"crawler_min_concurrency": 15,
"crawler_max_concurrency": 15  // Same value = no adaptation

⚠️ This disables adaptive throttling—use only when confident the host can handle it.

Debugging Tools¶

Test Specific Tenant Locally¶

uv run python debug_multi_tenant.py --tenant <tenant> --test all

Output shows search results, fetch tests, and any errors encountered.

Inspect Cached Documents¶

# List cached files
ls mcp-data/<tenant>/ | head -20

# View a cached document
cat "mcp-data/<tenant>/some-page.md" | head -50

Check Search Index¶

# List index segments
ls -lah mcp-data/<tenant>/__search_segments/

# Rebuild index
uv run python trigger_all_indexing.py --tenants <tenant>

Manual HTTP Test¶

# Test URL directly
curl -sI "https://docs.example.com/getting-started/"

# Test via container
docker exec docs-mcp-server curl -sI "https://docs.example.com/getting-started/"

Log Levels¶

Increase verbosity for detailed debugging:

# Set in deployment.json infrastructure section
"log_level": "debug"

# Or via environment variable
docker run -e LOG_LEVEL=debug ...

Then check logs:

docker logs -f docs-mcp-server 2>&1 | grep -i "<tenant>"

Recovery Steps¶

Full Tenant Reset¶

If all else fails, reset the tenant completely:

# 1. Stop container
docker stop docs-mcp-server

# 2. Remove tenant data
rm -rf mcp-data/<tenant>

# 3. Restart container
docker start docs-mcp-server

# 4. Trigger fresh sync
uv run python trigger_all_syncs.py --tenants <tenant> --force

# 5. Wait for sync, then rebuild index
uv run python trigger_all_indexing.py --tenants <tenant>

Related¶

- How-To: Trigger Syncs: Force refresh documentation
- How-To: Configure Online Tenant: Setup guidance
- How-To: Configure Git Tenant: Git-specific setup
- Reference: CLI Commands: Debug and sync scripts
