How-To: Configure Online Tenant

Goal: Add documentation from a website (via sitemap or crawler) to your deployment.
Prerequisites: Website URL with accessible documentation, deployment.json configured, Docker container running.
Time: ~15 minutes


When to Use Online Tenants

Choose online tenants when:

✅ Documentation is website-hosted (not a git repository)
✅ Rendered HTML matters (styled tables, syntax highlighting)
✅ Sitemap or crawlable structure exists
✅ Public access (no authentication required)

Examples: Django docs, FastAPI docs, FastMCP docs.

Trade-offs:

  • ⚡ Higher resource usage (crawler + HTML extraction)
  • ⏱️ Less control over update timing
  • 🌐 Requires network access during sync

When NOT to use:

  • Docs available in public git repo → Use git tenant (faster, lighter)
  • Content requires authentication
  • Site blocks crawlers (robots.txt)


Steps

1. Find the Sitemap URL

Most documentation sites have a sitemap. Common locations:

  • https://docs.example.com/sitemap.xml
  • https://example.com/docs/sitemap.xml

Check by visiting the URL directly. If no sitemap exists, you'll use the crawler (Step 2b).
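To see what a sitemap actually lists before you configure a tenant, a short script can help. The sketch below parses sitemap XML with the Python standard library and returns every <loc> entry. The function name sitemap_locations and the sample XML are illustrative only and are not part of docs-mcp-server.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locations(xml_text: str) -> list[str]:
    """Return every <loc> value in a sitemap.

    For a sitemap index the values are child sitemap URLs,
    for a regular sitemap they are page URLs.
    """
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if loc.text and loc.text.strip()
    ]


# Hypothetical sample. A real file would be downloaded first,
# for example with urllib.request.urlopen(url).read().
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/guide/</loc></url>
  <url><loc>https://docs.example.com/api/</loc></url>
</urlset>"""

urls = sitemap_locations(SAMPLE)
print(urls)
```

If the list is empty or contains only other sitemap files, the URL is probably a sitemap index, and one of its children is the better value for docs_sitemap_url.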

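2a. Configure with Sitemap

When a sitemap exists, add a tenant entry that points docs_sitemap_url at it:

{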
  "source_type": "online",
  "codename": "django",
  "docs_name": "Django Docs",
  "docs_sitemap_url": "https://docs.djangoproject.com/sitemap-en.xml",
  "url_whitelist_prefixes": "https://docs.djangoproject.com/en/5.2/",
  "url_blacklist_prefixes": "https://docs.djangoproject.com/en/5.2/releases/",
  "enable_crawler": false,
  "docs_root_dir": "./mcp-data/django",
  "refresh_schedule": "0 2 */14 * *",
  "test_queries": {
    "natural": ["How to create a Django model"],
    "phrases": ["model", "view"],
    "words": ["django", "queryset"]
  }
}
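The refresh_schedule value uses five-field cron syntax. Under standard cron semantics, a step such as */14 in the day-of-month field counts from the start of the range (day 1), so it does not mean "every 14 days". The sketch below expands a step field to show which days match. expand_step is a hypothetical helper, and the result assumes the server's scheduler follows standard cron rules.

```python
def expand_step(field: str, low: int, high: int) -> list[int]:
    """Expand a cron field of the form '*' or '*/N' over [low, high]."""
    if field == "*":
        return list(range(low, high + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be positive")
        # Steps are counted from the lowest value of the field's range.
        return list(range(low, high + 1, step))
    raise ValueError(f"unsupported field: {field!r}")


# Day-of-month field taken from "0 2 */14 * *".
days = expand_step("*/14", 1, 31)
print(days)  # [1, 15, 29]
```

Under that assumption, the schedule above runs at 02:00 on the 1st, 15th, and 29th of each month, and the 29th is skipped in a 28-day February.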

2b. Configure with Crawler (No Sitemap)

When no sitemap is available, enable the crawler:

{
  "source_type": "online",
  "codename": "custom-docs",
  "docs_name": "Custom Documentation",
  "docs_entry_url": "https://docs.example.com/",
  "url_whitelist_prefixes": "https://docs.example.com/",
  "enable_crawler": true,
  "max_crawl_pages": 500,
  "docs_root_dir": "./mcp-data/custom-docs",
  "refresh_schedule": "0 4 * * 1"
}

Crawler Behavior

The crawler follows links starting from docs_entry_url and only visits URLs that match url_whitelist_prefixes. max_crawl_pages caps the number of pages it collects.
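The general idea of a prefix-restricted crawl can be sketched as a breadth-first traversal. This is a simplified illustration, not the server's crawler: the crawl function and the SITE dictionary are hypothetical, and an in-memory link graph stands in for fetching and parsing real pages.

```python
from collections import deque


def crawl(entry_url, links, whitelist_prefixes, max_pages):
    """Breadth-first traversal over an in-memory link graph.

    `links` maps a URL to the URLs it links to.
    """
    def allowed(url):
        return any(url.startswith(p) for p in whitelist_prefixes)

    visited = []
    seen = {entry_url}
    queue = deque([entry_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for target in links.get(url, []):
            # Skip pages already queued and pages outside the whitelist.
            if target not in seen and allowed(target):
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical site graph.
SITE = {
    "https://docs.example.com/": [
        "https://docs.example.com/guide/",
        "https://blog.example.com/news/",
    ],
    "https://docs.example.com/guide/": [
        "https://docs.example.com/guide/install/",
        "https://docs.example.com/",
    ],
}

pages = crawl(
    "https://docs.example.com/",
    SITE,
    ["https://docs.example.com/"],
    max_pages=500,
)
print(pages)
```

The blog link is never visited because it falls outside the whitelist prefix, and lowering max_pages stops the traversal early.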

3. URL Filtering

Use prefixes to control what gets indexed:

{
  "url_whitelist_prefixes": "https://docs.djangoproject.com/en/5.2/",
  "url_blacklist_prefixes": "https://docs.djangoproject.com/en/5.2/releases/,https://docs.djangoproject.com/en/5.2/_"
}
  • Whitelist: Only URLs starting with these prefixes are indexed
  • Blacklist: URLs starting with these prefixes are excluded (even if whitelisted)
  • Multiple values: Comma-separated
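
The rules above can be sketched in a few lines of Python to test a prefix set before syncing. This is an illustration of the described semantics, not the server's implementation. is_indexed and split_prefixes are hypothetical names, and treating an empty whitelist as "allow everything" is an assumption.

```python
def split_prefixes(value: str) -> list[str]:
    """Split a comma-separated prefix string, dropping empty entries."""
    return [p.strip() for p in value.split(",") if p.strip()]


def is_indexed(url: str, whitelist: str, blacklist: str = "") -> bool:
    """Apply the whitelist first, then let the blacklist override it."""
    allow = split_prefixes(whitelist)
    deny = split_prefixes(blacklist)
    # Assumption: an empty whitelist admits every URL.
    if allow and not any(url.startswith(p) for p in allow):
        return False
    return not any(url.startswith(p) for p in deny)


WHITELIST = "https://docs.djangoproject.com/en/5.2/"
BLACKLIST = (
    "https://docs.djangoproject.com/en/5.2/releases/,"
    "https://docs.djangoproject.com/en/5.2/_"
)

for url in (
    "https://docs.djangoproject.com/en/5.2/topics/db/",
    "https://docs.djangoproject.com/en/5.2/releases/5.2/",
    "https://docs.djangoproject.com/en/4.2/topics/db/",
):
    print(url, is_indexed(url, WHITELIST, BLACKLIST))
```

Only the first URL passes: the second is whitelisted but blacklisted, and the third never matches the whitelist.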

Common Exclusions

  • /releases/ - Version changelogs
  • /_ - Internal/private pages
  • /api/ - Auto-generated API docs (if too verbose)

4. Redeploy and Sync

# Redeploy container
uv run python deploy_multi_tenant.py --mode online

# Trigger sync
uv run python trigger_all_syncs.py --tenants django --force

5. Monitor Sync Progress

Syncing large sites can take several minutes. Watch container logs:

docker logs -f docs-mcp-server 2>&1 | grep -i django

When sync completes, you'll see "Sync cycle completed" in the logs.

6. Verify Search Works

uv run python debug_multi_tenant.py --host localhost --port 42042 --tenant django --test search

7. (Optional) Enable Article Extractor Fallback

Some vendors expose JavaScript-heavy docs that Playwright still struggles to parse. You can turn on the shared fallback extractor by updating the infrastructure.article_extractor_fallback block in deployment.json:

{
  "infrastructure": {
    "article_extractor_fallback": {
      "enabled": true,
      "endpoint": "http://10.20.30.1:13005/",
      "timeout_seconds": 20,
      "max_retries": 2,
      "api_key_env": "DOCS_FALLBACK_EXTRACTOR_TOKEN"
    }
  }
}
  • endpoint points to the internal article-extractor service (it uses the same payload schema as the bundled library).
  • api_key_env names an environment variable that holds the credential when the upstream requires authentication. Set that variable before starting the container so secrets never land in deployment.json.
  • Leave enabled set to false if the service is unreachable. The server validates the endpoint at startup and fails fast when the network is misconfigured.

Verifying fallback usage: Check /sync/status or docker logs to confirm the fallback is working:

# Check fallback counters via sync status
curl -s http://localhost:42042/<tenant>/sync/status | jq '.fallback_extractor'

# Grep for fallback activity in container logs
docker logs docs-mcp-server 2>&1 | grep -i "fallback" | tail -10

# Check for successful extractions
docker logs docs-mcp-server 2>&1 | grep -iE "extracted|fetched" | tail -20

The /sync/status endpoint returns fallback_extractor.attempts, fallback_extractor.successes, and fallback_extractor.failures counters so you can confirm the fallback is rescuing pages that the primary Playwright pipeline cannot parse.
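
To turn those counters into a quick health readout, something like the following could work. It is a minimal sketch: fallback_summary is a hypothetical helper, and the sample payload is invented for illustration. Only the three counter names come from the description above.

```python
def fallback_summary(status: dict) -> str:
    """Summarize the fallback_extractor counters from a status payload."""
    counters = status.get("fallback_extractor") or {}
    attempts = counters.get("attempts", 0)
    successes = counters.get("successes", 0)
    failures = counters.get("failures", 0)
    if attempts == 0:
        return "fallback not used"
    rate = successes / attempts * 100
    return f"{successes}/{attempts} rescued ({rate:.0f}%), {failures} failed"


# Hypothetical payload; a real one would be the parsed JSON
# response from the tenant's /sync/status endpoint.
sample = {"fallback_extractor": {"attempts": 8, "successes": 6, "failures": 2}}
print(fallback_summary(sample))
```

A high attempt count with few successes suggests the fallback endpoint is reachable but cannot parse the pages either.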


Configuration Reference

| Field | Required | Description |
| --- | --- | --- |
| source_type | Yes | Must be "online" |
| codename | Yes | Unique lowercase identifier |
| docs_name | Yes | Human-readable name |
| docs_sitemap_url | Conditional | Sitemap URL(s), comma-separated |
| docs_entry_url | Conditional | Entry URL(s) for crawler |
| url_whitelist_prefixes | Recommended | Include only matching URLs |
| url_blacklist_prefixes | Optional | Exclude matching URLs |
| enable_crawler | Optional | Enable link-following crawler |
| max_crawl_pages | Optional | Page limit (default: 10000) |
| refresh_schedule | Optional | Cron schedule for auto-sync |

Note: At least one of docs_sitemap_url or docs_entry_url is required.
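
These rules can be checked before redeploying with a small pre-flight script. The sketch below encodes only the constraints listed in the table and note above. validate_online_tenant is a hypothetical helper, not a function shipped with the server, and the server's own validation may be stricter.

```python
REQUIRED = ("source_type", "codename", "docs_name")


def validate_online_tenant(tenant: dict) -> list[str]:
    """Return a list of problems found in an online tenant entry."""
    problems = []
    for field in REQUIRED:
        if not tenant.get(field):
            problems.append(f"missing required field: {field}")
    if tenant.get("source_type") not in (None, "", "online"):
        problems.append('source_type must be "online"')
    codename = tenant.get("codename", "")
    if codename and codename != codename.lower():
        problems.append("codename must be lowercase")
    # At least one source of URLs is needed.
    if not (tenant.get("docs_sitemap_url") or tenant.get("docs_entry_url")):
        problems.append("set docs_sitemap_url or docs_entry_url")
    return problems


good = {
    "source_type": "online",
    "codename": "fastapi",
    "docs_name": "FastAPI Docs",
    "docs_sitemap_url": "https://fastapi.tiangolo.com/sitemap.xml",
}
bad = {"source_type": "online", "codename": "FastAPI", "docs_name": "FastAPI Docs"}

print(validate_online_tenant(good))
print(validate_online_tenant(bad))
```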


Examples

FastAPI Documentation

{
  "source_type": "online",
  "codename": "fastapi",
  "docs_name": "FastAPI Docs",
  "docs_sitemap_url": "https://fastapi.tiangolo.com/sitemap.xml",
  "url_whitelist_prefixes": "https://fastapi.tiangolo.com/",
  "url_blacklist_prefixes": "https://fastapi.tiangolo.com/release-notes/",
  "docs_root_dir": "./mcp-data/fastapi",
  "refresh_schedule": "0 4 */14 * *"
}

Python Standard Library

{
  "source_type": "online",
  "codename": "python",
  "docs_name": "Python Docs",
  "docs_sitemap_url": "https://docs.python.org/sitemap.xml",
  "url_whitelist_prefixes": "https://docs.python.org/3.13/,https://docs.python.org/3.14/",
  "url_blacklist_prefixes": "https://docs.python.org/3.13/whatsnew/",
  "docs_root_dir": "./mcp-data/python",
  "refresh_schedule": "0 5 1,15 * *"
}

Pytest Documentation (Crawler-based)

{
  "source_type": "online",
  "codename": "pytest",
  "docs_name": "Pytest Docs",
  "docs_sitemap_url": "https://docs.pytest.org/sitemap.xml",
  "docs_entry_url": "https://docs.pytest.org/en/stable/",
  "url_whitelist_prefixes": "https://docs.pytest.org/en/stable/",
  "enable_crawler": true,
  "docs_root_dir": "./mcp-data/pytest",
  "refresh_schedule": "0 9 */14 * *"
}

Troubleshooting

Sync stuck or very slow

Cause: Large site or rate limiting.

Fix:

  1. Set max_crawl_pages to a reasonable limit
  2. Narrow url_whitelist_prefixes to essential sections
  3. Check container logs: docker logs docs-mcp-server 2>&1 | tail -50

No documents after sync

Cause: All URLs filtered out or site blocking requests.

Fix:

  1. Verify URL is accessible: curl -I https://docs.example.com/
  2. Check whitelist covers actual URLs in sitemap
  3. Try enabling crawler if sitemap URLs don't match content
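
One way to check whitelist coverage is to group the sitemap's URLs by their leading path segments and compare the result with url_whitelist_prefixes. The sketch below does that with the standard library. prefix_counts and the sample URLs are hypothetical and serve only as an illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def prefix_counts(urls, depth=2):
    """Count URLs by scheme, host, and the first `depth` directories."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        # Directory segments only: drop the leading "" and the final
        # component (a file name, or "" when the path ends with "/").
        dirs = parts.path.split("/")[1:-1][:depth]
        prefix = f"{parts.scheme}://{parts.netloc}/" + "".join(d + "/" for d in dirs)
        counts[prefix] += 1
    return counts


# Hypothetical URLs, as they might appear in a sitemap.
SITEMAP_URLS = [
    "https://docs.example.com/en/latest/intro/",
    "https://docs.example.com/en/latest/api/client.html",
    "https://docs.example.com/en/stable/intro/",
]

for prefix, count in prefix_counts(SITEMAP_URLS).most_common():
    print(count, prefix)
```

If the configured whitelist prefix does not appear in this output, every URL is filtered out and the sync produces no documents.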

JavaScript-rendered content missing

Cause: Some sites render content with JavaScript.

Fix: Ensure crawler_playwright_first is set to true in the infrastructure settings (it is enabled by default).

Search returns irrelevant results

Cause: Too much content indexed, including low-value pages.

Fix:

  1. Add more url_blacklist_prefixes for changelogs, API refs, etc.
  2. Re-sync: uv run python trigger_all_syncs.py --tenants <tenant> --force
  3. Rebuild index: uv run python trigger_all_indexing.py --tenants <tenant>