# Tutorial: Custom Search Configuration
**Time:** ~20 minutes

**Prerequisites:** Completed Getting Started, at least one tenant synced

**What you'll learn:** Configure BM25 parameters, test search quality, and optimize for your documentation
## Read This First
The default BM25 configuration works globally:

- 7 tenants tested across diverse content types
- 150K+ documents indexed
- 210 test queries validated
Only tune if you have specific ranking problems.
Custom search logic adds maintenance burden. Verify that the gains justify the effort.
Good reasons to tune:

- Test queries consistently rank poorly (<0.3 precision@3)
- Domain-specific terms need special handling (medical, legal)
- Your corpus is uniquely structured (all 1-line refs vs. all long tutorials)
Bad reasons to tune:

- "Feels like it could be better" without metrics
- One query out of 50 fails (fix content, not algorithm)
- Cargo-culting params from another project
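If you want a number to back up "consistently rank poorly," precision@3 is simple to compute yourself. A minimal sketch (the document names below are made up for illustration):

```python
def precision_at_k(ranked_results, relevant, k=3):
    """Fraction of the top-k results that appear in the relevant set."""
    top = ranked_results[:k]
    return sum(1 for doc in top if doc in relevant) / k

# Example: only one of the top three results is relevant -> 1/3
score = precision_at_k(
    ["serializers.md", "routers.md", "fields.md"],
    relevant={"serializers.md", "validation.md"},
)
```

Track this per query; a sustained average below 0.3 is the kind of signal worth tuning for.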
## Overview
docs-mcp-server uses BM25 for ranking search results. While defaults work well for most documentation, you may want to tune parameters for specific use cases.
In this tutorial, you'll:

1. Understand how BM25 ranking works
2. Test your current search quality
3. Adjust parameters to improve results
4. Validate improvements with test queries
## Step 1: Understand Current Behavior
First, establish a baseline by running a handful of test queries against your tenant.
Note the results:

- Which documents rank 1-3?
- What are the scores?
- Are the top results actually the most relevant?
Run 5-10 different queries (edit `test_queries` in `deployment.json`) to document which results seem "off".
## Step 2: Understand BM25 Parameters
BM25 has two main tuning knobs:
### k1 (Term Saturation)
Controls how much repeated occurrences of the same term matter.
- Low (0.5-1.0): First occurrence of a term is most important; more occurrences add diminishing value
- High (1.5-3.0): Documents with many keyword occurrences rank higher
When to adjust:

- Lower `k1` if short documents with one mention should rank high
- Raise `k1` if comprehensive documents covering a topic should rank first
### b (Length Normalization)
Controls how document length affects ranking.
- Low (0.1-0.4): Long and short documents treated more equally
- High (0.7-1.0): Short documents are boosted relative to long ones
When to adjust:

- Lower `b` if your best content is in detailed, long documents
- Raise `b` if concise docs (quick references, cheat sheets) should rank first
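To build intuition for both knobs, here is a self-contained sketch of the standard Okapi BM25 per-term score (not the server's actual implementation; BM25 variants differ slightly in the IDF term):

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Okapi BM25 contribution of one query term to one document's score."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * (doc_len / avg_doc_len)  # b controls this penalty
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# A long doc with 10 mentions vs. a short doc with 1 mention of the same term:
long_doc = dict(tf=10, doc_len=2000, avg_doc_len=500, n_docs=1000, doc_freq=50)
short_doc = dict(tf=1, doc_len=100, avg_doc_len=500, n_docs=1000, doc_freq=50)

# With b=1.0 length is fully normalized and the short doc wins;
# with b=0.0 length is ignored and the term-heavy long doc wins.
high_b = (bm25_term_score(**long_doc, b=1.0), bm25_term_score(**short_doc, b=1.0))
low_b = (bm25_term_score(**long_doc, b=0.0), bm25_term_score(**short_doc, b=0.0))
```

Playing with `k1` the same way shows saturation: at `k1=0.5` the tenth occurrence adds almost nothing beyond the first; at `k1=2.0` it still moves the score noticeably.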
## Step 3: Create a Test Query Set
Before changing anything, create a structured test file:
```bash
cat > search_tests.json << 'EOF'
{
  "tenant": "drf",
  "queries": [
    {
      "query": "serializer validation",
      "expected_top_3": [
        "serializers.md",
        "validation.md"
      ]
    },
    {
      "query": "viewset permissions",
      "expected_top_3": [
        "permissions.md",
        "viewsets.md"
      ]
    }
  ]
}
EOF
```
Run baseline tests using the debug script, which reads `test_queries` from your tenant config.
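The evaluation loop itself is straightforward to sketch. Below, `search()` is a placeholder that returns canned rankings (the real call would go through the server or debug script); the scoring logic is what matters:

```python
import json

# Placeholder: in practice this would query the running server or debug
# script; canned rankings stand in so the loop is runnable here.
def search(tenant, query, top_k=3):
    canned = {
        "serializer validation": ["serializers.md", "validation.md", "fields.md"],
        "viewset permissions": ["authentication.md", "viewsets.md", "routers.md"],
    }
    return canned.get(query, [])[:top_k]

def run_baseline(test_set):
    """Per query, the fraction of expected docs found in the top 3."""
    report = {}
    for case in test_set["queries"]:
        top = search(test_set["tenant"], case["query"])
        hits = sum(1 for doc in case["expected_top_3"] if doc in top)
        report[case["query"]] = hits / len(case["expected_top_3"])
    return report

test_set = json.loads("""{
  "tenant": "drf",
  "queries": [
    {"query": "serializer validation", "expected_top_3": ["serializers.md", "validation.md"]},
    {"query": "viewset permissions", "expected_top_3": ["permissions.md", "viewsets.md"]}
  ]
}""")
report = run_baseline(test_set)
```

Save the report; you will rerun the same loop after changing parameters in Step 6.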
## Step 4: Modify Search Configuration

Edit `deployment.json` to add a search configuration for your tenant:
```json
{
  "source_type": "online",
  "codename": "drf",
  "docs_name": "Django REST Framework Docs",
  "docs_sitemap_url": "https://www.django-rest-framework.org/sitemap.xml",
  "url_whitelist_prefixes": "https://www.django-rest-framework.org/",
  "docs_root_dir": "./mcp-data/drf",
  "search": {
    "ranking": {
      "bm25_k1": 1.5,
      "bm25_b": 0.6
    },
    "boosts": {
      "title": 3.0,
      "headings_h1": 2.5,
      "code": 1.5
    }
  }
}
```
Changes made:
- bm25_k1: 1.5 — Slightly favor documents with more keyword occurrences
- bm25_b: 0.6 — Reduce penalty for long documents
- title: 3.0 — Boost title matches more heavily
- code: 1.5 — Give more weight to code examples
## Step 5: Apply Changes

```bash
# Redeploy with new configuration
uv run python deploy_multi_tenant.py --mode online

# Rebuild search index (required after parameter changes)
uv run python trigger_all_indexing.py --tenants drf
```
## Step 6: Compare Results

Run the same test queries as before.
Compare to baseline:

- Did relevant documents move up?
- Are scores more differentiated?
- Did any good results drop?
## Step 7: Field Boosts Explained
Control which parts of documents matter most:
```json
{
  "search": {
    "boosts": {
      "title": 2.5,
      "headings_h1": 2.5,
      "headings_h2": 2.0,
      "headings": 1.5,
      "body": 1.0,
      "code": 1.2,
      "path": 1.5,
      "url": 1.5
    }
  }
}
```
### Boost Strategies
**For API Reference Docs:**

**For Tutorial Content:**
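As a sketch, the two strategies above might look like this (illustrative boost values, not tested defaults). For API reference docs, weight code, paths, and titles, since queries are usually identifier lookups:

```json
{
  "search": {
    "boosts": {
      "title": 3.0,
      "code": 2.0,
      "path": 2.0,
      "body": 1.0
    }
  }
}
```

For tutorial content, favor headings and body prose, where the explanatory matches live:

```json
{
  "search": {
    "boosts": {
      "title": 2.0,
      "headings_h2": 2.0,
      "body": 1.5,
      "code": 1.0
    }
  }
}
```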
## Step 8: Snippet Configuration
Adjust how search result snippets appear:
```json
{
  "search": {
    "snippet": {
      "style": "plain",
      "fragment_char_limit": 300,
      "max_fragments": 3,
      "surrounding_context_chars": 150
    }
  }
}
```
Test snippet quality by running the debug script and examining the `snippet` field in the results.
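To see what those numbers control, here is a toy extractor in the spirit of the settings above (not the server's actual snippet code): it takes a window of `surrounding_context_chars` around each match, caps each window at `fragment_char_limit`, and stops after `max_fragments`:

```python
def make_snippet(text, term, surrounding_context_chars=150,
                 fragment_char_limit=300, max_fragments=3):
    """Collect context windows around matches of term, joined with ellipses."""
    fragments, start = [], 0
    lowered, needle = text.lower(), term.lower()
    while len(fragments) < max_fragments:
        idx = lowered.find(needle, start)
        if idx == -1:
            break
        lo = max(0, idx - surrounding_context_chars)
        hi = min(len(text), idx + len(needle) + surrounding_context_chars)
        fragments.append(text[lo:hi][:fragment_char_limit])
        start = hi
    return " … ".join(fragments)
```

Larger context windows make each snippet more readable but fit fewer results on screen.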
## Step 9: Analyzer Profiles

Choose a tokenization strategy:
| Profile | Use When |
|---|---|
| `default` | General documentation |
| `aggressive-stem` | Many word variations (run/running/runs) |
| `code-friendly` | Technical docs with identifiers |
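The difference between profiles comes down to tokenization. A rough sketch of what a code-friendly analyzer does (the real analyzer is more involved; this only shows the identifier-splitting idea):

```python
import re

def code_friendly_tokens(text):
    """Emit each identifier plus its snake_case / camelCase parts,
    so 'serializer' can match 'ModelSerializer' or 'get_serializer'."""
    tokens = []
    for word in re.findall(r"\w+", text):
        tokens.append(word.lower())
        parts = re.split(r"_|(?<=[a-z])(?=[A-Z])", word)
        tokens.extend(p.lower() for p in parts
                      if p and p.lower() != word.lower())
    return tokens
```

A stemming profile would instead fold run/running/runs into one token, which helps prose queries but can mangle identifiers.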
## Step 10: Validate with Test Queries

Use `test_queries` in your tenant config for automated validation:
```json
{
  "test_queries": {
    "natural": [
      "How to create a serializer",
      "Viewset permissions tutorial"
    ],
    "phrases": ["serializer", "viewset", "permission"],
    "words": ["drf", "api", "rest"]
  }
}
```
Run the automated tests using the debug script.
## Verification
You should now have:
- [x] Baseline search quality documented
- [x] Custom BM25 parameters configured
- [x] Field boosts adjusted for your content type
- [x] Snippet settings optimized
- [x] Test queries validating improvements
## Troubleshooting
**Scores all look the same**

Cause: Default parameters may already be close to optimal.

Fix: Temporarily try more extreme values to see the effect.
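For example, pushing both knobs to the high ends of their ranges (temporary, diagnostic values only):

```json
{
  "search": {
    "ranking": {
      "bm25_k1": 3.0,
      "bm25_b": 1.0
    }
  }
}
```

If rankings still barely move, the bottleneck is likely content or analyzer choice, not the ranking parameters.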
**Short documents always win**

Cause: The `b` parameter is too high.

Fix: Lower `bm25_b` to 0.4-0.5.
**Code searches don't work well**

Cause: Code tokens are not weighted enough.

Fix: Use the `code-friendly` analyzer and raise the `code` boost.
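A sketch of that combination (this assumes the analyzer profile is selected with an `analyzer` key under `search`; check the deployment.json schema reference for the exact key name):

```json
{
  "search": {
    "analyzer": "code-friendly",
    "boosts": {
      "code": 2.0
    }
  }
}
```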
## Next Steps
- How-To: Tune Search Ranking — Quick reference for tuning
- Explanation: Search Ranking (BM25) — Deep dive into BM25
- Reference: deployment.json Schema — All search options