Problem
Search lives in a dropdown. You type, you get title matches, you pick one. It works for "I know the name of the thing I want." It falls apart for everything else.
Try searching for a concept that spans multiple documents. Or finding something you wrote last week but can't remember the title of. Or figuring out which version of a doc had that specific paragraph.
We need a dedicated search page, but the current search API wasn't built for one. Now that semantic search and RRF (reciprocal rank fusion) result combining are available, a full search experience can be genuinely useful.
Solution
1. IRI Filter — Scope Search to a Document or Subpath
Currently, search is scoped to either an entire account (for web search) or the whole library. With IRI filtering, users can narrow search to a specific document or folder subpath.
Examples:
Search only within a single document: hm://<account>/cars/honda
Search within a subpath: hm://<account>/cars/* (all documents under "cars")
Leave empty to search the entire account (current behavior)
This lets the search page offer a "search within" dropdown scoped to the user's current location in the document tree.
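To make the matching rules concrete, here is a minimal sketch of how an IRI filter could be evaluated. The function name `iri_matches`, the `alice` account, and the choice that `/*` matches only documents strictly below the prefix (not the `cars` document itself) are assumptions for illustration, not the backend's actual behavior.

```python
def iri_matches(iri_filter: str, iri: str) -> bool:
    """Return True if `iri` falls inside the scope described by `iri_filter`."""
    if not iri_filter:
        # Empty filter: no narrowing, so the whole account stays in scope.
        return True
    if iri_filter.endswith("/*"):
        # Subpath filter: match documents strictly below the prefix.
        prefix = iri_filter[:-1]  # drop "*", keep the trailing "/"
        return iri.startswith(prefix) and len(iri) > len(prefix)
    # Exact filter: match a single document.
    return iri == iri_filter


# Hypothetical documents in a hypothetical "alice" account.
docs = [
    "hm://alice/cars",
    "hm://alice/cars/honda",
    "hm://alice/cars/honda/civic",
    "hm://alice/carsharing",
]
in_scope = [d for d in docs if iri_matches("hm://alice/cars/*", d)]
```

Keeping the trailing slash in the prefix is what stops `hm://alice/carsharing` from leaking into a search scoped to `cars`.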
2. Content-Type Filter — Choose What Gets Searched
Today the search either looks at titles only, or titles + document bodies. The new content-type filter gives fine-grained control over which content types are included:
Document — document body content
Comment — comments on documents
Contact — contact/profile information
Title — document titles
The search page can expose these as checkboxes or a filter bar. When no filter is selected, the existing behavior is preserved.
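As a rough sketch of the filter semantics, the four content types could be modeled as an enum, with an empty selection passing results through untouched so that whatever the search does today still applies. The `ContentType` enum, the `filter_hits` helper, and the dictionary shape of a hit are all hypothetical.

```python
from enum import Enum


class ContentType(Enum):
    DOCUMENT = "document"  # document body content
    COMMENT = "comment"    # comments on documents
    CONTACT = "contact"    # contact/profile information
    TITLE = "title"        # document titles


def filter_hits(hits, selected):
    """Keep hits whose type is in `selected`; an empty selection filters nothing."""
    if not selected:
        # No filter chosen: leave the result set exactly as the caller produced it.
        return list(hits)
    return [h for h in hits if h["type"] in selected]


# Hypothetical hits, shaped as plain dicts for brevity.
hits = [
    {"id": "a", "type": ContentType.TITLE},
    {"id": "b", "type": ContentType.COMMENT},
    {"id": "c", "type": ContentType.DOCUMENT},
]
titles_and_bodies = filter_hits(hits, {ContentType.TITLE, ContentType.DOCUMENT})
```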
3. Authority Ranking — Citation-based Result Quality
Opt-in ranking signal that uses citation data to surface more authoritative results. Two signals are blended into the existing search ranking:
Document authority — how many other documents cite/link to this document
Author authority — how many external citations the document's author has received across all their work (self-citations excluded)
When enabled, the ranking weights become:
Semantic similarity 35%
Keyword match 35%
Document citations 20%
Author citations 10%
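The weights above amount to a weighted sum. A minimal sketch follows, assuming each signal has already been normalized to the range 0 to 1; how the backend actually normalizes citation counts is not specified here, and the names `WEIGHTS` and `blended_score` are illustrative.

```python
from typing import Mapping

# Weights from the list above; they sum to 1.0.
WEIGHTS = {
    "semantic": 0.35,
    "keyword": 0.35,
    "doc_citations": 0.20,
    "author_citations": 0.10,
}


def blended_score(signals: Mapping[str, float]) -> float:
    """Weighted sum of signals, each assumed to be pre-normalized to [0, 1]."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


# Hypothetical results: one matches the text well, the other is well cited.
strong_text_match = {"semantic": 0.9, "keyword": 0.8}
well_cited = {
    "semantic": 0.7,
    "keyword": 0.7,
    "doc_citations": 1.0,
    "author_citations": 0.5,
}
ranked_first = max([strong_text_match, well_cited], key=blended_score)
```

In this example the well-cited result ranks first even though its text match is weaker. That is the intended effect of the authority signals, and a reason to keep them opt-in.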
Why exclude self-citations? Testing showed one author had 98% self-citations, inflating their score from 4 to 227. Filtering self-citations keeps the signal honest.
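The self-citation filter can be sketched as a count over (citing author, cited author) pairs that skips pairs where both sides are the same. The pair representation, the `author_authority` helper, and the author names are assumptions made for illustration.

```python
from collections import Counter


def author_authority(citations):
    """Count citations received per author, ignoring self-citations.

    `citations` is an iterable of (citing_author, cited_author) pairs.
    """
    counts = Counter()
    for citing, cited in citations:
        if citing != cited:  # skip authors citing their own work
            counts[cited] += 1
    return counts


# Hypothetical citation pairs: alice cites herself twice.
citations = [
    ("alice", "alice"),
    ("alice", "alice"),
    ("bob", "alice"),
    ("alice", "bob"),
    ("carol", "bob"),
]
authority = author_authority(citations)
```

Here alice ends up with one external citation rather than three, and bob with two.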
Performance: Authority scores are computed on-the-fly from existing indexed data — ~7ms for 200 documents. No precomputation or caching needed.
4. Semantic Dedup — Remove Near-Duplicate Versions
Problem: When a document has multiple versions with minor edits (e.g., "cars" changed to "cars."), search returns both versions as separate results even though they're semantically identical.
Solution: For semantic and hybrid search modes, group results by document + block + content type, then compare how similarly each version matches the query. If two consecutive versions score within 20% of each other, only the newest version is kept.
Versions with meaningfully different content (>20% score difference) are both preserved
Keyword-only search keeps the existing exact-match dedup (appropriate since it's character-level matching)
This reduces clutter from minor edits without hiding genuinely different content across versions.
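A sketch of the grouping and threshold logic described above. The `Hit` fields, the integer version ordering, and reading "within 20%" as a relative difference against the higher of the two scores are assumptions; the real implementation may define these differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc: str
    block: str
    content_type: str
    version: int  # hypothetical ordering key: higher means newer
    score: float  # query-match score, assumed non-negative


THRESHOLD = 0.20


def within_threshold(a: float, b: float) -> bool:
    """True if the two scores differ by at most 20% of the larger one."""
    top = max(a, b)
    return top == 0 or abs(a - b) / top <= THRESHOLD


def dedup(hits):
    """Collapse consecutive near-identical versions, keeping the newest."""
    groups = defaultdict(list)
    for h in hits:
        groups[(h.doc, h.block, h.content_type)].append(h)

    kept = []
    for group in groups.values():
        survivors = []
        for h in sorted(group, key=lambda hit: hit.version):
            if survivors and within_threshold(survivors[-1].score, h.score):
                survivors[-1] = h  # newer version replaces the older one
            else:
                survivors.append(h)
        kept.extend(survivors)
    return kept


hits = [
    Hit("cars", "b1", "document", 1, 0.80),
    Hit("cars", "b1", "document", 2, 0.82),  # minor edit of version 1
    Hit("cars", "b1", "document", 3, 0.40),  # meaningfully different
    Hit("bikes", "b1", "document", 1, 0.50),
]
deduped = dedup(hits)
```

In this example version 1 of the `cars` block is dropped in favor of version 2, while version 3 survives because its score differs by far more than the threshold.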
Results Visualization
The dedicated search page displays results as a vertical scrollable list of cards. Each card contains:
Document title — clickable, navigates to the document
Full path breadcrumb — e.g. My Account / cars / honda / civic showing the document's position in the hierarchy
Version indicator — which version matched (timestamp or version label)
Content snippet — preview of the matched block with query terms highlighted inline
Cards stack vertically in a single column, optimized for scanability and variable-length snippets.
Title-only matches show the first content block as a fallback snippet.
Scope
4 backward-compatible additions to the existing search API
Backend-only changes — no database migrations, no new tables
Rabbit Holes
Materialized authority caches — not needed, batch queries are fast enough on indexed data
Configurable ranking weights or A/B testing — premature; hardcoded constants for now
Complex dedup strategies (e.g., per-paragraph diffing) — percentage threshold on query score is simpler and sufficient
No Gos
Open Question
How should search type (keyword / semantic / hybrid) be exposed on the search page?
Explicit toggle — full control, potentially confusing for non-technical users
Smart defaults with override — hybrid for search page, keyword for dropdown
Backend heuristic — auto-select based on query length
Always hybrid — simplest?
This doesn't block backend work since the search type selection already exists in the API.