Explanation: Sync Strategies¶
The Problem¶
Documentation sources have different update characteristics:
| Source Type | Update Pattern | Challenge |
|---|---|---|
| Websites | Unpredictable changes | Need periodic crawls |
| Git repos | Explicit commits/tags | Sync on push events |
| Filesystem | Local dev files | Rarely change during session |
Forcing one sync approach wastes resources: - Over-crawling static content - Stale git data - User-facing "resync" buttons (poor UX)
Each source type needs a tailored strategy.
Strategy Overview¶
flowchart TB
subgraph Sources["Documentation Sources"]
W[("đ Website")]
G[("đŚ Git Repo")]
F[("đ Local Files")]
end
subgraph Strategies["Sync Strategies"]
W --> |"Sitemap/Crawler"| O[Online Sync]
G --> |"Sparse Checkout"| GS[Git Sync]
F --> |"Direct Read"| FS[Filesystem]
end
subgraph Pipeline["Common Pipeline"]
O --> M[Markdown Cache]
GS --> M
FS --> M
M --> I[BM25 Index]
I --> S[Search API]
end
style O fill:#4051b5,color:#fff
style GS fill:#2e7d32,color:#fff
style FS fill:#f57c00,color:#fff
style I fill:#7b1fa2,color:#fff
All strategies feed into the same pipeline: markdown cache â BM25 index â search API.
Three Sync Strategies¶
1. Online Sync (Web Crawling)¶
For websites with published documentation.
How it works: 1. Discover URLs from sitemap.xml or by crawling links 2. Fetch each page via HTTP 3. Extract content using article-extractor 4. Cache locally as markdown 5. Build search index
Best for Online Sync
- Official documentation sites (Django, FastAPI, Python stdlib)
- Sites with sitemap.xml
- Rendered HTML content
Trade-offs:
| Pros | Cons |
|---|---|
| Gets final rendered content | Rate limiting by site |
| Handles JavaScript-rendered pages | Slower sync (minutes to hours) |
| No repo access needed | May miss content behind auth |
Configuration:
{
"source_type": "online",
"docs_sitemap_url": "https://docs.example.com/sitemap.xml",
"enable_crawler": false,
"refresh_schedule": "0 2 */14 * *"
}
2. Git Sync (Repository Checkout)¶
For documentation stored in git repositories.
How it works: 1. Sparse checkout specified paths from repository 2. Copy markdown files to tenant storage 3. Track commit hash for change detection 4. Build search index from files
Best for Git Sync
- Project documentation in repos (MkDocs projects, READMEs)
- Internal documentation in private repos
- Docs that update frequently
Trade-offs:
| Pros | Cons |
|---|---|
| Fast, deterministic syncs | Only gets markdown source |
| Works offline after initial clone | Need repo access for private repos |
| Tracks exact commit | Doesn't render includes/templates |
Configuration:
{
"source_type": "git",
"git_repo_url": "https://github.com/org/repo.git",
"git_branch": "main",
"git_subpaths": ["docs"],
"refresh_schedule": "0 */6 * * *"
}
3. Filesystem (Local Files)¶
For documentation already on disk.
How it works: 1. Read markdown files from specified directory 2. No sync neededâfiles are already local 3. Build search index from files
Best for Filesystem Sync
- Your own project's documentation
- Pre-processed or generated content
- Air-gapped environments
Trade-offs:
| Pros | Cons |
|---|---|
| Instantâno network needed | You manage file updates |
| Works completely offline | Limited to local files |
| No external dependencies | No automatic refresh |
Configuration:
Decision Matrix¶
| Factor | Online | Git | Filesystem |
|---|---|---|---|
| Speed | Slow (minutes) | Fast (seconds) | Instant |
| Freshness | Depends on schedule | Tracks commits | Manual |
| Network | Required | Required (initial) | None |
| JavaScript | Supported | N/A | N/A |
| Private content | Needs auth | PAT token | Direct access |
| Best for | Public docs sites | Repo-based docs | Local docs |
When to Use Each¶
Use Online When...¶
- Documentation is on a public website
- You need the rendered, final content
- Site has a sitemap.xml
- Content includes diagrams, images rendered by the site
Example: Django official docs, FastAPI docs
Use Git When...¶
- Documentation lives in a GitHub/GitLab repo
- You want deterministic, version-tracked syncs
- Content is primarily markdown
- You're indexing many small projects
Example: MkDocs projects, AIDLC rules, any docs/ folder
Use Filesystem When...¶
- You already have the files locally
- You're building a CI/CD pipeline
- You need air-gapped operation
- You're developing documentation locally
Example: Your own project's docs during development
Hybrid Approach¶
You can mix strategies in one deployment:
{
"tenants": [
{
"source_type": "online",
"codename": "django",
"docs_sitemap_url": "https://docs.djangoproject.com/sitemap-en.xml"
},
{
"source_type": "git",
"codename": "internal-docs",
"git_repo_url": "https://github.com/org/internal.git"
},
{
"source_type": "filesystem",
"codename": "my-project",
"docs_root_dir": "./mcp-data/my-project"
}
]
}
Each tenant uses its optimal strategy independently.
Sync Scheduling¶
Online Tenants¶
Cron-based scheduling prevents overloading source sites:
Recommended intervals: - Large sites (Django, Python stdlib): Every 14 days - Medium sites: Weekly - Small sites: Every 2-3 days
Git Tenants¶
Git sync is lightweightâcan run more frequently:
Or use interval-based:
Filesystem Tenants¶
No scheduling neededâfiles are always current. If you update files externally, rebuild the index:
Implementation Details¶
Online: Two Discovery Methods¶
- Sitemap parsing: Reads sitemap.xml for all URLs
- Crawler: Follows links from entry URL
Sitemap is preferred when availableâfaster and respects site structure.
Git: Sparse Checkout¶
Uses git's sparse checkout to fetch only specified paths:
# Equivalent to:
git clone --depth 1 --filter=blob:none --sparse <url>
git sparse-checkout set docs/
Minimizes bandwidth and storage.
Filesystem: Direct Read¶
Reads all .md files from the specified directory. No network, no cachingâalways current.
Further Reading¶
- How-To: Configure Online Tenant
- How-To: Configure Git Tenant
- How-To: Debug Crawlers
- Reference: deployment.json Schema