Concepts
Optional reading. The library works fine if you skip this page — but understanding the moving parts helps when you want to tune results.
Asset types
Agent Library tags every file with an asset type:
| Asset type | What's in it | Parser |
|---|---|---|
| text | Markdown, plain text | built-in |
| code | Source code (Python, JS, TS, Go, Rust, …) | regex-based symbol extractor |
| pdf | PDF documents | requires pypdf (in [all] extras) |
| image | PNG, JPG, GIF, WEBP | requires Pillow (in [all] extras) |
| multimodal | Reserved for documents that mix modalities (none of today's parsers emit this, but the enum value exists for forward compatibility, and --asset_type=multimodal is accepted as a filter) | none |
When you search, the asset type is preserved on every result. You can filter on it with --asset_type, for example --asset_type=code.
Search modes
Three modes, chosen with --mode:
- keyword: pure full-text search, BM25-ranked. Best for exact phrases or unique tokens. Fast.
- semantic: pure embedding similarity (cosine distance against a sentence-transformer model). Finds meaning matches even when the wording differs. Slower; loads ~100 MB of model on first use.
- hybrid (default): runs both and merges. Each modality normalizes its scores to [0, 1]; the merger gives a small overlap bonus to chunks that match across modalities. This is what you want most of the time.
When ENABLE_CROSS_MODAL_SEARCH=true (the default), hybrid also runs separate embedding models for code (CodeBERT) and images (CLIP) when those extras are installed.
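The normalize-then-merge step described for hybrid mode can be pictured with a small sketch. This is not the library's code: the min-max normalization, the choice to keep the best per-modality score, and the overlap_bonus value of 0.05 are all assumptions made for illustration.

```python
from typing import Dict


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max scale raw scores into [0, 1]; a constant set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {chunk_id: 1.0 for chunk_id in scores}
    return {chunk_id: (s - lo) / (hi - lo) for chunk_id, s in scores.items()}


def merge(
    keyword: Dict[str, float],
    semantic: Dict[str, float],
    overlap_bonus: float = 0.05,  # hypothetical value, not the library's setting
) -> Dict[str, float]:
    """Merge two result sets, rewarding chunks that both modalities found."""
    kw, sem = normalize(keyword), normalize(semantic)
    merged: Dict[str, float] = {}
    for chunk_id in kw.keys() | sem.keys():
        # Assumption: keep the best normalized score across modalities.
        score = max(kw.get(chunk_id, 0.0), sem.get(chunk_id, 0.0))
        if chunk_id in kw and chunk_id in sem:
            score += overlap_bonus
        merged[chunk_id] = min(score, 1.0)  # stay inside [0, 1]
    return merged
```

Normalizing first matters because raw BM25 scores and cosine similarities live on different scales, so they cannot be compared directly.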
Chunking
Documents are split into chunks before indexing. Each chunk is what the search returns — a passage, not a whole file. This keeps results focused and gives you snippet-level scores instead of file-level.
The chunker is asset-type aware:
- Markdown is split by headers (H1/H2) and paragraphs
- Code is split by symbol (function, class, method)
- PDFs are split by page
- Images become a single chunk with metadata
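To make the Markdown rule concrete, here is a minimal sketch of a header-and-paragraph splitter. It is an illustration under stated assumptions, not the library's chunker: chunk_markdown is a hypothetical name, headers are detected with a simple regex, and fenced code blocks are not treated specially.

```python
import re
from typing import List

# Matches H1 ("# ") and H2 ("## ") lines only; "### " and deeper do not match.
HEADER = re.compile(r"^#{1,2} ")


def chunk_markdown(text: str) -> List[str]:
    """Split Markdown at H1/H2 headers, then split each section at blank lines."""
    sections: List[List[str]] = [[]]
    for line in text.splitlines():
        # Start a new section at each header, unless the current one is empty.
        if HEADER.match(line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)

    chunks: List[str] = []
    for section in sections:
        for paragraph in re.split(r"\n\s*\n", "\n".join(section)):
            if paragraph.strip():
                chunks.append(paragraph.strip())
    return chunks
```

A real chunker would likely also attach the enclosing header to each paragraph as metadata, so a result can show where in the file it came from.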
Scoring & MMR
Results are scored 0 → 1. The blend is controlled by two knobs:
- HYBRID_ALPHA (default 0.7): in non-cross-modal hybrid, the formula is alpha * vector_score + (1 - alpha) * keyword_score. Higher = lean on semantic match more.
- MMR_LAMBDA (default 0.7): after blending, Maximal Marginal Relevance picks the top-K with a diversity bias. The formula is lambda * relevance - (1 - lambda) * max_similarity_to_already_selected. Lower = more diverse top-K (might miss the second-best answer if it looks too much like the first); higher = more relevance-focused.
You can tweak both via environment variables or librarian config.
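The two formulas above can be tried out with a short sketch. The function names, the greedy selection loop, and the use of cosine similarity between chunk vectors are assumptions for illustration; only the two formulas and the 0.7 defaults come from this page.

```python
import math
from typing import Dict, List, Sequence


def blend(vector_score: float, keyword_score: float, alpha: float = 0.7) -> float:
    """HYBRID_ALPHA blend: alpha * vector_score + (1 - alpha) * keyword_score."""
    return alpha * vector_score + (1 - alpha) * keyword_score


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(
    relevance: Dict[str, float],
    vectors: Dict[str, Sequence[float]],
    k: int,
    lam: float = 0.7,
) -> List[str]:
    """Greedy MMR: repeatedly pick the chunk with the best
    lam * relevance - (1 - lam) * max_similarity_to_already_selected."""
    selected: List[str] = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def mmr_score(chunk_id: str) -> float:
            redundancy = max(
                (cosine(vectors[chunk_id], vectors[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[chunk_id] - (1 - lam) * redundancy

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate chunks at the top of the ranking, lam=0.7 lets a less relevant but different chunk take second place, while lam=1.0 reduces to plain relevance ordering.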
What lives where
| File | What it is |
|---|---|
| ~/.librarian/index.db | SQLite database with the FTS5 index, vector embeddings, and document metadata. Survives across sessions. |
| ~/.librarian/sources.json | The list of registered sources (managed by librarian add / rm). |
| ~/.librarian/documents/ | Default location for content created via add_to_library from inside the MCP server (when no directory is given). |
Delete index.db and re-run librarian add ... to rebuild from scratch.
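Because index.db is an ordinary SQLite file, it can be inspected with Python's standard sqlite3 module before you decide to delete it. The sketch below only lists whatever tables it finds; the schema is not documented on this page, so no table names are assumed, and list_tables is a hypothetical helper.

```python
import sqlite3
from pathlib import Path
from typing import List


def list_tables(db_path: Path) -> List[str]:
    """Return the names of all tables in an SQLite file, sorted by name."""
    # Open read-only so that inspection can never modify the index.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

For example, list_tables(Path.home() / ".librarian" / "index.db") prints the table names of the default index, provided the file exists.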
MCP under the hood
When an AI assistant calls Agent Library, it speaks the Model Context Protocol (MCP), an open protocol built on JSON-RPC 2.0 and introduced by Anthropic. librarian serve stdio talks MCP over stdin/stdout; librarian serve http talks MCP over HTTP streaming.
The server advertises 9 tools:
| Tool | Purpose |
|---|---|
| Librarian_SearchLibrary | The main tool: find content |
| Librarian_ReadFromLibrary | Read a full document by path |
| Librarian_AddToLibrary | Save new content into the library |
| Librarian_UpdateLibraryDoc | Replace a document's content |
| Librarian_RemoveFromLibrary | Drop a document from the index |
| Librarian_ListLibraryContents | List indexed documents |
| Librarian_IndexDirectoryToLibrary | Bulk-index a directory |
| Librarian_GetLibraryOverview | Inspect the library (sections / stats / tree) |
| Librarian_SuggestLibraryLocation | Recommend where new content belongs |
Each tool takes typed arguments and returns typed JSON. The MCP host (Claude, Cursor) shows them under the server's name in its tool picker.
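On the wire, a tool invocation is a JSON-RPC 2.0 request whose method is tools/call. The sketch below builds one such message for Librarian_SearchLibrary. The argument name "query" is hypothetical, since the tool's actual input schema is not listed on this page, and a real session also needs the MCP initialize handshake before any tool call.

```python
import json
from typing import Any, Dict


def tools_call_request(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# "query" is a hypothetical argument name; check the schema the server advertises.
print(tools_call_request(1, "Librarian_SearchLibrary", {"query": "chunking"}))
```

In practice the MCP host builds these messages for you; the sketch is only meant to show what "typed arguments" look like in transit.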