
Concepts

Optional reading. The library works fine if you skip this page — but understanding the moving parts helps when you want to tune results.

Asset types

Agent Library tags every file with an asset type:

| Asset type | What's in it | Parser |
| --- | --- | --- |
| text | Markdown, plain text | built-in |
| code | Source code (Python, JS, TS, Go, Rust, …) | regex-based symbol extractor |
| pdf | PDF documents | requires pypdf (in [all] extras) |
| image | PNG, JPG, GIF, WEBP | requires Pillow (in [all] extras) |
| multimodal | Reserved for documents that mix modalities | none yet (the enum value exists for forward compatibility, and --asset_type=multimodal is accepted as a filter) |

When you search, the asset type is preserved on every result. You can filter on it:

librarian search "encrypt" --type code

Search modes

Three modes, chosen with --mode:

  • keyword: pure full-text search, BM25-ranked. Best for exact phrases or unique tokens. Fast.
  • semantic: pure embedding similarity (cosine similarity between the query embedding and chunk embeddings from a sentence-transformer model). Finds matches in meaning even when the wording differs. Slower; loads a model of roughly 100 MB on first use.
  • hybrid (default): runs both and merges. Each modality normalizes its scores to [0, 1]; the merger gives a small overlap bonus to chunks that match across modalities. This is what you want most of the time.

When ENABLE_CROSS_MODAL_SEARCH=true (the default), hybrid also runs separate embedding models for code (CodeBERT) and images (CLIP) when those extras are installed.
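
The normalize-then-merge step in hybrid mode can be pictured with a small sketch. This is not the library's actual merger: min-max scaling, combining by max(), and the overlap_bonus value of 0.05 are all assumptions chosen for illustration, since the page only states that scores are normalized to [0, 1] and that overlapping chunks get a small bonus.

```python
def normalize(scores):
    """Min-max scale a {chunk_id: raw_score} dict into [0, 1] (assumed scheme)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All raw scores are equal: treat every hit as a full match.
        return {cid: 1.0 for cid in scores}
    return {cid: (s - lo) / (hi - lo) for cid, s in scores.items()}


def merge(keyword_raw, semantic_raw, overlap_bonus=0.05):
    """Merge two modalities; chunks found by both get a small bonus.

    overlap_bonus and the max() combination are hypothetical choices.
    """
    kw, sem = normalize(keyword_raw), normalize(semantic_raw)
    merged = {}
    for cid in kw.keys() | sem.keys():
        base = max(kw.get(cid, 0.0), sem.get(cid, 0.0))
        bonus = overlap_bonus if cid in kw and cid in sem else 0.0
        merged[cid] = min(1.0, base + bonus)  # keep the result inside [0, 1]
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)


# Raw BM25 scores and raw cosine similarities live on different scales.
keyword_hits = {"a": 10.0, "b": 6.0, "d": 0.0}
semantic_hits = {"c": 0.9, "b": 0.6, "e": 0.3}
ranked = merge(keyword_hits, semantic_hits)
```

Chunk "b" is the only one found by both modalities, so it is the only one that receives the bonus on top of its best normalized score.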

Chunking

Documents are split into chunks before indexing. Each chunk is what the search returns — a passage, not a whole file. This keeps results focused and gives you snippet-level scores instead of file-level.

The chunker is asset-type aware:

  • Markdown is split by headers (H1/H2) and paragraphs
  • Code is split by symbol (function, class, method)
  • PDFs are split by page
  • Images become a single chunk with metadata
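
To make the Markdown rule concrete, here is a minimal sketch of header-and-paragraph chunking. The function name chunk_markdown and the chunk dictionary layout are hypothetical, and the real chunker may handle cases this one ignores (for example, "#" lines inside fenced code blocks).

```python
import re

# Matches H1 ("# ") and H2 ("## ") headings only; "###" and deeper do not match.
HEADING = re.compile(r"^#{1,2} +\S")


def chunk_markdown(text):
    """Split Markdown into paragraph chunks tagged with the nearest H1/H2."""
    chunks = []
    heading = None
    buffer = []

    def flush():
        # Emit the buffered lines as one chunk, if there is anything to emit.
        paragraph = "\n".join(buffer).strip()
        if paragraph:
            chunks.append({"heading": heading, "text": paragraph})
        buffer.clear()

    for line in text.splitlines():
        if HEADING.match(line):
            flush()
            heading = line.lstrip("#").strip()
        elif not line.strip():
            flush()  # a blank line ends the current paragraph
        else:
            buffer.append(line)
    flush()
    return chunks


sample = "# Title\n\nIntro para.\n\n## Section\nLine one\nline two\n\n### Sub\nmore"
chunks = chunk_markdown(sample)
```

In this sketch an H3 line is treated as ordinary paragraph text, which follows the stated H1/H2 rule.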

Scoring & MMR

Results are scored on a scale from 0 to 1. Blending and final selection are controlled by two knobs:

  • HYBRID_ALPHA (default 0.7): in non-cross-modal hybrid, the formula is alpha * vector_score + (1 - alpha) * keyword_score. Higher = lean on semantic match more.
  • MMR_LAMBDA (default 0.7): after blending, Maximal Marginal Relevance picks the top-K with a diversity bias. The formula is lambda * relevance - (1 - lambda) * max_similarity_to_already_selected. Lower = more diverse top-K (might miss the second-best answer if it looks too much like the first); higher = more relevance-focused.
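
The two formulas above can be tried out in a few lines. This is a sketch under stated assumptions rather than the library's code: the names blend and mmr_select, the greedy selection loop, and the caller-supplied similarity function are all illustrative.

```python
def blend(vector_score, keyword_score, alpha=0.7):
    """alpha * vector_score + (1 - alpha) * keyword_score."""
    return alpha * vector_score + (1 - alpha) * keyword_score


def mmr_select(candidates, similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance.

    candidates: {chunk_id: relevance in [0, 1]}
    similarity: function (chunk_id, chunk_id) -> value in [0, 1]
    """
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def mmr_score(cid):
            # Penalize a candidate by its closest match among chunks already picked.
            redundancy = max((similarity(cid, s) for s in selected), default=0.0)
            return lam * remaining[cid] - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy data: "a" and "b" are near-duplicates, "c" is different but less relevant.
relevance = {"a": 0.9, "b": 0.85, "c": 0.6}
pair_sim = {frozenset(("a", "b")): 0.95}


def sim(x, y):
    return pair_sim.get(frozenset((x, y)), 0.1)


diverse = mmr_select(relevance, sim, k=2, lam=0.7)
relevance_only = mmr_select(relevance, sim, k=2, lam=1.0)
```

With lam=0.7 the toy data selects "a" then "c", skipping the near-duplicate "b"; with lam=1.0 it selects "a" then "b". That is the trade-off described above: a lower lambda buys diversity at the risk of dropping the second-best answer.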

You can tweak both via environment variables or librarian config.

What lives where

| File | What it is |
| --- | --- |
| ~/.librarian/index.db | SQLite database with the FTS5 index, vector embeddings, and document metadata. Survives across sessions. |
| ~/.librarian/sources.json | The list of registered sources (managed by librarian add / rm). |
| ~/.librarian/documents/ | Default location for content created via add_to_library from inside the MCP server (when no directory is given). |

Delete index.db and re-run librarian add ... to rebuild from scratch.
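
Before deleting the index, it can be useful to look at what it holds. The sketch below assumes only that index.db is an ordinary SQLite file, as described above. Table names are not documented on this page, so the code lists whatever objects exist instead of guessing at a schema; list_schema_objects is a hypothetical helper, not part of the librarian CLI.

```python
import sqlite3
from pathlib import Path


def list_schema_objects(db_path):
    """Return (type, name) pairs for every table, index, etc. in a SQLite file."""
    # Open read-only so that inspecting the index cannot modify it.
    uri = Path(db_path).expanduser().resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT type, name FROM sqlite_master ORDER BY type, name"
        ).fetchall()
    finally:
        conn.close()
    return rows
```

Calling list_schema_objects("~/.librarian/index.db") on an existing index returns the object list; on a missing file the read-only open raises sqlite3.OperationalError instead of creating an empty database.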

MCP under the hood

When an AI assistant calls Agent Library, it speaks the Model Context Protocol — a JSON-RPC convention defined by Anthropic. librarian serve stdio talks MCP over stdin/stdout; librarian serve http talks MCP over HTTP streaming.

The server advertises 9 tools:

| Tool | Purpose |
| --- | --- |
| Librarian_SearchLibrary | The main thing: find content |
| Librarian_ReadFromLibrary | Read a full document by path |
| Librarian_AddToLibrary | Save new content into the library |
| Librarian_UpdateLibraryDoc | Replace a document's content |
| Librarian_RemoveFromLibrary | Drop a document from the index |
| Librarian_ListLibraryContents | List indexed documents |
| Librarian_IndexDirectoryToLibrary | Bulk-index a directory |
| Librarian_GetLibraryOverview | Inspect the library (sections / stats / tree) |
| Librarian_SuggestLibraryLocation | Recommend where new content belongs |

Each tool takes typed arguments and returns typed JSON. The MCP host (for example, Claude or Cursor) shows the tools under the server's name in its tool picker.
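
For a sense of what travels over the wire, the sketch below builds the JSON-RPC 2.0 message an MCP client sends to invoke a tool. The envelope (jsonrpc, id, method "tools/call", params with name and arguments) follows the MCP specification. The "query" argument name is an assumption made for illustration; the tool's real argument schema is whatever the server advertises in its tool list.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "query" is a hypothetical argument name, not taken from the server's schema.
message = build_tool_call(1, "Librarian_SearchLibrary", {"query": "encrypt"})

# Serialized as a single line of JSON with no embedded newlines.
wire = json.dumps(message)
print(wire)
```

A real client performs the MCP initialize handshake before sending tool calls, so hosts normally handle this step for you.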