Building a Semantic Search Engine for Claude Code History
The Problem
After a few months of using Claude Code daily, I had hundreds of session files sitting in ~/.claude/projects/. Every session is a JSONL file full of messages, tool calls, file edits, and bash commands. Grep works if I remember the exact command I ran or the exact filename I edited, but it falls apart for anything conceptual — "that time I debugged the Docker networking issue" or "the session where I set up the Traefik reverse proxy."
I also started using OpenCode, which stores its sessions in a SQLite database at a completely different path with a different schema. Now my history was split across two tools with no unified way to search either of them.
So I built a semantic search tool that indexes everything and lets me search by meaning instead of keywords.
Architecture Overview
The pipeline looks like this:
Session Sources
├── Claude Code (~/.claude/projects/*/*.jsonl)
└── OpenCode (~/.local/share/opencode/opencode.db)
│
▼
Parser Plugins (one per source, common Turn interface)
│
▼
Two-Level Document Building
├── Session docs (all user messages, token-aware chunks)
└── Turn docs (individual Q&A exchanges)
│
▼
Embedding (Qwen3-Embedding-0.6B, 1024-dim, 8K ctx)
├── Local: llama-cpp-python + Metal (default on macOS)
└── Remote: any OpenAI-compatible /v1/embeddings endpoint
│
▼
Index on Disk (~/.cache/session-search/index/)
├── session_embeddings.npy + sessions.json (vector)
├── turn_embeddings.npy + turns.json (vector)
├── index.db (SQLite FTS5 for BM25)
└── index-state.json (staleness tracking)
│
▼
Hybrid Search (Relative Score Fusion, α=0.7 semantic / 0.3 keyword)
No vector database, no framework. Just numpy arrays for vectors, SQLite FTS5 for keyword search, and JSON metadata — all under ~/.cache/session-search/. The only heavy dependency is llama-cpp-python, which embeds the model in-process with Metal acceleration on macOS so there's no separate server to run.
Three Search Levels
The core feature is a three-level drill-down that matches how I actually think about finding past work.
Level 1: Session Search
"Which conversation was it?" This searches session-level embeddings — each session is represented by all its user messages concatenated together. Good for finding which project and which session touched a topic.
$ scripts/search.sh search "Docker networking bridge configuration"
SESSION SEARCH: "Docker networking bridge configuration"
Searched 342 sessions across 28 projects (claude+opencode)
============================================================
[0.724] [claude] session=a1b2c3d4e5f6 project=homelab date=2026-01-15 turns=23
slug: docker-network-debug
preview: The containers on the backend network can't reach each other...
files: docker-compose.yaml, nginx.conf, traefik.toml
[0.681] [opencode] session=f7e8d9c0b1a2 project=infra-setup date=2026-01-08 turns=12
preview: I need to set up a bridge network that allows...
files: docker-compose.yaml, .env
Drill deeper: search "<query>" --depth 2 --session <session_id>
Level 2: Turn Search
"What was the exact exchange?" This searches individual turns — each turn is a user message paired with the assistant's response. You can scope it to a specific session from Level 1, or search globally across all turns.
$ scripts/search.sh search "Docker networking" --depth 2 --session a1b2c3d4e5f6
TURN SEARCH: "Docker networking"
Scope: session a1b2c3d4e5f6
============================================================
[0.812] [claude] turn=a1b2c3d4e5f6:7 (8/23) project=homelab date=2026-01-15
Q: The containers still can't ping each other. I verified the network exists...
A: The issue is that your containers are on the default bridge network, not your
custom one. The default bridge doesn't support DNS resolution between containers...
tools: Bash, Read, Edit
Deep context: show a1b2c3d4:5-9
Level 3: Show
"Give me the full context." This isn't a search — it re-parses the raw session and shows the complete conversation including tool calls, file edits, bash commands, and their outputs.
$ scripts/search.sh show a1b2c3d4:5-9
============================================================
DEEP CONTEXT: homelab / docker-network-debug
Showing turns 5-9 of 23 total
============================================================
--- Turn 5 (2026-01-15) ---
USER: Can you check what networks Docker currently has?
ASSISTANT: Let me check the current Docker networks.
[Bash] docker network ls
-> NETWORK ID NAME DRIVER SCOPE
a1b2c3d4e5f6 bridge bridge local
f7e8d9c0b1a2 host host local
...
--- Turn 6 (2026-01-15) ---
USER: None of my containers are on the custom network...
The escalation strategy is: Level 1 to find the session, Level 2 to find the exact turn, Level 3 to see everything that happened around it.
Hybrid Search: Vector + BM25
Pure vector search turned out to miss too many literal terms — proper nouns, CLI flag names, exact config keys. "Traefik" or --no-verify would get semantically paraphrased into irrelevant neighbors. So every document that gets embedded also lands in a SQLite FTS5 table for BM25 keyword search, and every query runs both searches in parallel.
The two result sets get combined with Relative Score Fusion: vector cosine similarity and BM25 scores are each min-max normalized to 0–1, then blended with a tunable weight α.
ALPHA = 0.7  # 70% semantic, 30% keyword

def relative_score_fusion(vector_results, bm25_results, id_key_fn, alpha=ALPHA):
    norm_vec = _min_max_normalize(vector_results)
    norm_bm25 = _min_max_normalize(bm25_results)
    all_ids = set(norm_vec) | set(norm_bm25)
    return sorted(
        ((doc_id,
          alpha * norm_vec.get(doc_id, 0.0) + (1 - alpha) * norm_bm25.get(doc_id, 0.0))
         for doc_id in all_ids),
        key=lambda x: x[1], reverse=True,
    )
Unlike Reciprocal Rank Fusion (which throws away score magnitude and only uses rank positions), RSF preserves how confident each backend is, so the fused number is actually meaningful for thresholding — above 0.65 is a strong match, 0.45–0.65 is moderate, below that is weak. A --vector-only flag disables BM25 if you want to see what pure semantic search looks like.
Embedding Pipeline
The tool uses Qwen3-Embedding-0.6B — a 600M parameter embedding model that produces 1024-dimensional vectors with an 8K token context window. By default it runs in-process via llama-cpp-python with Metal acceleration, and the GGUF file is auto-downloaded from HuggingFace on first run and cached under ~/.cache/session-search/models/. No separate server, no network hop.
For non-Mac machines, or anyone who'd rather hit an existing embedding server, setting EMBEDDINGS_URL switches the tool to remote mode and posts batches to any OpenAI-compatible /v1/embeddings endpoint. A .env file in the skill directory holds the URL, API key, model name, and EMBEDDINGS_MAX_TOKENS — the last one drives chunk sizing automatically at 80% of the model's context window, so swapping models doesn't require code changes.
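A remote-mode `.env` might look like the following — `EMBEDDINGS_URL` and `EMBEDDINGS_MAX_TOKENS` come from the text above, while the API-key and model variable names here are illustrative guesses, as are the example values:

```shell
# Remote embedding endpoint (OpenAI-compatible /v1/embeddings)
EMBEDDINGS_URL=https://embeddings.example.com
EMBEDDINGS_API_KEY=sk-example-key
EMBEDDINGS_MODEL=Qwen3-Embedding-0.6B
# Drives chunk sizing: chunks target 80% of this value
EMBEDDINGS_MAX_TOKENS=8192
```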
Token-Aware Chunking and Normalization
Long sessions get split into chunks sized to 80% of the model's context window (MAX_TOKENS = int(EMBEDDINGS_MAX_TOKENS * 0.8)), split at turn boundaries so no single Q/A exchange ever gets cut in half. That 80% ceiling leaves headroom for the query instruction prefix and a safety margin against token-counting drift between the tokenizer and the embedding server.
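The turn-boundary splitting described above can be sketched as a greedy packer — whole turns accumulate into a chunk until the next one would exceed the budget. Function and parameter names here are illustrative, and `count_tokens` stands in for whatever tokenizer the pipeline uses:

```python
MAX_TOKENS = int(8192 * 0.8)  # 80% of the model's context window

def chunk_turns(turns, count_tokens, max_tokens=MAX_TOKENS):
    """Greedily pack whole turns into token-bounded chunks.

    A turn is never split; a single oversized turn still becomes
    its own chunk rather than being truncated here.
    """
    chunks, current, current_tokens = [], [], 0
    for turn in turns:
        n = count_tokens(turn)
        if current and current_tokens + n > max_tokens:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(turn)
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks
```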
Remote mode batches requests concurrently via ThreadPoolExecutor and checkpoints progress every 100 documents so an interrupted indexing run resumes where it left off instead of starting over. If a batch fails with a 400 or 500, the tool falls back to embedding texts individually, substituting zero vectors for any that still fail. After all embeddings are collected, they get L2-normalized so that cosine similarity becomes a simple dot product:
arr = np.array(all_embeddings, dtype=np.float32)
norms = np.linalg.norm(arr, axis=1, keepdims=True)
norms = np.maximum(norms, 1e-10)
return arr / norms
Search then just needs a matrix multiply:
def search_vectors(query_vec, embeddings, top_k=5, mask=None):
    scores = embeddings @ query_vec
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # restrict to scoped docs
    top_idx = np.argpartition(scores, -top_k)[-top_k:]
    top_idx = top_idx[np.argsort(scores[top_idx])[::-1]]
    return [(int(i), float(scores[i])) for i in top_idx]
Queries get a special instruction prefix before embedding — this is how Qwen3's asymmetric retrieval works. Documents are embedded as-is, but queries get prepended with an instruction that tells the model what kind of retrieval to do:
QUERY_INSTRUCTION = "Instruct: Given a search query, retrieve relevant conversation excerpts that answer the query\nQuery: "
Incremental Indexing
Rebuilding the entire index every time would be slow, so the tool tracks what's already been indexed and only processes changes.
For Claude Code, it stores the mtime and file size of each JSONL file in index-state.json:
stat = jsonl_path.stat()
file_info = {"mtime": stat.st_mtime, "size": stat.st_size}
key = str(jsonl_path)  # state is keyed by file path
if not args_full and key in state and state[key] == file_info:
    unchanged_keys.add(key)
else:
    files_to_process.append(jsonl_path)
    state[key] = file_info
When merging, the tool keeps the existing embeddings for unchanged sessions and only embeds the new/changed ones, then merges them with np.vstack:
def merge_arrays(kept_idx, existing, new):
    if kept_idx and existing is not None:
        kept = existing[kept_idx]
        return np.vstack([kept, new]) if new.size else kept
    return new
Before every search, there's an auto-staleness check. If anything has changed since the last index, it runs an incremental re-index automatically:
def auto_reindex(local=False):
    if not index_is_stale():
        return
    print("Index is stale, updating...", file=sys.stderr)
    cmd_index(IndexArgs())
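The staleness check itself can be sketched as a comparison of the recorded mtime/size entries against what's on disk. This is a simplified version under assumed state-file structure — the real check also covers the OpenCode database:

```python
import json
from pathlib import Path

def index_is_stale(state_path, session_files):
    """True if any session file is new, changed, or deleted since last index."""
    try:
        state = json.loads(Path(state_path).read_text())
    except FileNotFoundError:
        return True  # never indexed
    seen = set()
    for path in session_files:
        st = path.stat()
        key = str(path)
        seen.add(key)
        if state.get(key) != {"mtime": st.st_mtime, "size": st.st_size}:
            return True  # new or modified file
    # Any recorded file that no longer exists also marks the index stale.
    return any(key not in seen for key in state)
```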
This means you never have to think about whether the index is current — it just stays up to date.
Supporting Both Claude Code and OpenCode
This was the trickiest part. Claude Code stores sessions as JSONL files — one JSON object per line, with each line containing a message and metadata. OpenCode stores sessions in a SQLite database with a completely different schema: separate tables for sessions, messages, and parts (sub-message content blocks).
Rather than branching on source type throughout the codebase, each tool gets its own parser module under scripts/parsers/. They share a common BaseParser interface that yields Turn objects with a normalized shape, so the indexer, chunker, and search engine don't know or care where a turn came from. Adding support for a new AI coding tool is a matter of dropping in a new parser file — no changes anywhere else.
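The shared interface might look roughly like this — field and method names are illustrative, not the tool's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One normalized Q&A exchange, regardless of source tool."""
    session_id: str
    index: int
    user_text: str
    assistant_text: str
    timestamp: str
    tools: list = field(default_factory=list)   # e.g. ["Bash", "Edit"]
    files: list = field(default_factory=list)   # files touched in this turn

class BaseParser:
    """Each source (Claude Code, OpenCode, ...) subclasses this."""

    def sessions(self):
        """Yield session metadata dicts for this source."""
        raise NotImplementedError

    def turns(self, session_id):
        """Yield Turn objects for one session."""
        raise NotImplementedError
```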
Schema Differences
Claude Code messages look like:
{"type": "human", "message": {"role": "user", "content": [{"type": "text", "text": "..."}]}, "timestamp": "2026-01-15T10:30:00Z"}
OpenCode has a normalized relational structure: a session table with project references, a message table with role and timing, and a part table where actual content lives (text blocks, tool calls with their state). Tool inputs use camelCase (filePath) instead of snake_case (file_path), and timestamps are millisecond integers instead of ISO strings.
The tool normalizes these differences during parsing. For example, OpenCode tool names are lowercased single words (bash, read, edit) while Claude Code uses capitalized names (Bash, Read, Edit):
OPENCODE_TOOL_MAP = {
    "bash": "Bash", "read": "Read", "write": "Write", "edit": "Edit",
    "glob": "Glob", "grep": "Grep", "list": "LS", "fetch": "WebFetch",
}
And input keys get normalized inline:
if isinstance(inp, dict) and "filePath" in inp:
    inp = dict(inp)  # copy before mutating
    inp["file_path"] = inp.pop("filePath")
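If more camelCase keys turned up, the per-key renames could generalize into a small converter — a hypothetical helper, not something the tool currently does:

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each capital that follows a lowercase
    letter or digit, then lowercase: filePath -> file_path."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()
```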
Incremental Indexing Per Source
The incremental indexing strategy differs by source. Claude Code is simple — compare file mtime and size. OpenCode is trickier because all sessions live in one SQLite file. The tool tracks both the database's overall mtime and each session's time_updated field:
if args_full or db_mtime != oc_state.get("db_mtime"):
    for sess in all_sessions:
        key = sess["id"]
        if (not args_full
                and key in oc_session_state
                and sess["time_updated"] == oc_session_state[key]):
            unchanged_keys.add(key)
        else:
            sessions_to_process.append(sess)
If the DB file hasn't changed at all, skip everything. If it has, check individual sessions by their time_updated timestamp to avoid re-parsing sessions that weren't touched.
Filtering Noise
Both sources produce a lot of noise that would pollute the index. The tool skips:
- Tool-only messages with no real user text (just tool results being passed back)
- Short messages under 30 characters (MIN_USER_CHARS = 30)
- JSON blob messages (system metadata that starts with { and contains "type")
- Internal Claude Code event types like file-history-snapshot, progress, queue-operation
For mega-sessions that exceed the model's context window, session documents get split at turn boundaries into token-bounded chunks (~6500 tokens by default). Each chunk gets its own embedding but shares the same session metadata, and results are deduplicated by session ID during search so one big conversation doesn't flood the top results.
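The session-level dedup can be sketched like this, assuming fused results arrive sorted by score and chunk IDs encode their session (the "session#chunk" ID scheme here is an assumption for illustration):

```python
def dedupe_by_session(results, top_k):
    """Keep only the best-scoring chunk per session.

    results: [(doc_id, score), ...] sorted descending by score,
    where doc_id looks like "<session_id>#<chunk_n>" (assumed format).
    """
    seen, out = set(), []
    for doc_id, score in results:
        session_id = doc_id.split("#")[0]
        if session_id in seen:
            continue  # a higher-scoring chunk of this session already won
        seen.add(session_id)
        out.append((doc_id, score))
        if len(out) == top_k:
            break
    return out
```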
Integration as a Claude Code Skill
The tool is wired into Claude Code as a custom skill, packaged as a self-contained directory that anyone can drop into ~/.claude/skills/:
session-search/
├── SKILL.md
├── scripts/
│ ├── search.sh # dispatcher — sources .env, activates venv
│ ├── session_search.py # CLI entry
│ ├── search_engine.py # hybrid search + RSF fusion
│ ├── index_store.py # embeddings + FTS5 persistence
│ ├── embeddings.py # local llama-cpp-python + remote client
│ └── parsers/
│ ├── base.py
│ ├── claude_code.py
│ └── opencode.py
└── references/
├── HOW_IT_WORKS.md
├── CONFIGURATION.md
├── FIRST_RUN.md
├── ADDING_PARSERS.md
└── TROUBLESHOOTING.md
The SKILL.md frontmatter gives the skill its name and description — the metadata Claude uses to decide when to load it:
---
name: session-search
description: Hybrid semantic + keyword search across AI coding tool session history
---
Invoking /session-search <query> triggers the skill, which walks through the escalation levels. The references/ docs are loaded lazily by Claude only when the user asks a matching question — "how does scoring work?" pulls in HOW_IT_WORKS.md, "how do I add a new source?" pulls in ADDING_PARSERS.md. That way the main SKILL.md stays small and doesn't burn context on every invocation.
On first run, the skill walks the user through a short setup: local vs remote embedding model, which endpoint to use if remote, and whether to disable Claude Code's 30-day session cleanup so older history stays searchable.
Closing Thoughts
This has become one of my most-used tools. Being able to search hundreds of sessions by meaning — "how did I fix that CORS issue" or "the session where I configured Tailscale" — and have BM25 quietly catch the exact proper nouns semantic search would have missed is a huge quality-of-life improvement.
The full index across ~660 sessions and ~5500 turns weighs in around 25 MB on disk. Building it from scratch takes a few minutes on a MacBook Air running the model locally via Metal, and incremental re-indexes after that take well under a second. The only hard dependency is llama-cpp-python — everything else is standard library plus numpy — and the remote-endpoint path still works for anyone who'd rather offload embedding to a GPU elsewhere.