Building a Semantic Search Engine for Claude Code History
The Problem
After a few months of using Claude Code daily, I had hundreds of session files sitting in ~/.claude/projects/. Every session is a JSONL file full of messages, tool calls, file edits, and bash commands. Grep works if I remember the exact command I ran or the exact filename I edited, but it falls apart for anything conceptual — "that time I debugged the Docker networking issue" or "the session where I set up the Traefik reverse proxy."
I also started using OpenCode, which stores its sessions in a SQLite database at a completely different path with a different schema. Now my history was split across two tools with no unified way to search either of them.
So I built a semantic search tool that indexes everything and lets me search by meaning instead of keywords.
Architecture Overview
The pipeline looks like this:
Session Sources
├── Claude Code (~/.claude/projects/*/*.jsonl)
└── OpenCode (~/.local/share/opencode/opencode.db)
│
▼
Parser Plugins (one per source, common Turn interface)
│
▼
Two-Level Document Building
├── Session docs (all user messages, token-aware chunks)
└── Turn docs (individual Q&A exchanges)
│
▼
Embedding (Qwen3-Embedding-0.6B, 1024-dim, 8K ctx)
├── Local: llama-cpp-python + Metal (default on macOS)
└── Remote: any OpenAI-compatible /v1/embeddings endpoint
│
▼
Index on Disk (~/.cache/session-search/index/)
├── session_embeddings.npy + sessions.json (vector)
├── turn_embeddings.npy + turns.json (vector)
├── index.db (SQLite FTS5 for BM25)
└── index-state.json (staleness tracking)
│
▼
Hybrid Search (Relative Score Fusion, α=0.7 semantic / 0.3 keyword)
No vector database, no framework. Just numpy arrays for vectors, SQLite FTS5 for keyword search, and JSON metadata — all under ~/.cache/session-search/. The only heavy dependency is llama-cpp-python, which embeds the model in-process with Metal acceleration on macOS so there's no separate server to run.
Three Search Levels
The core feature is a three-level drill-down that matches how I actually think about finding past work.
Level 1: Session Search
"Which conversation was it?" This searches session-level embeddings — each session is represented by all its user messages concatenated together. Good for finding which project and which session touched a topic.
$ scripts/search.sh search "Docker networking bridge configuration"
SESSION SEARCH: "Docker networking bridge configuration"
Searched 342 sessions across 28 projects (claude+opencode)
============================================================
[0.724] [claude] session=a1b2c3d4e5f6 project=homelab date=2026-01-15 turns=23
slug: docker-network-debug
preview: The containers on the backend network can't reach each other...
files: docker-compose.yaml, nginx.conf, traefik.toml
[0.681] [opencode] session=f7e8d9c0b1a2 project=infra-setup date=2026-01-08 turns=12
preview: I need to set up a bridge network that allows...
files: docker-compose.yaml, .env
Drill deeper: search "<query>" --depth 2 --session <session_id>
Level 2: Turn Search
"What was the exact exchange?" This searches individual turns — each turn is a user message paired with the assistant's response. You can scope it to a specific session from Level 1, or search globally across all turns.
$ scripts/search.sh search "Docker networking" --depth 2 --session a1b2c3d4e5f6
TURN SEARCH: "Docker networking"
Scope: session a1b2c3d4e5f6
============================================================
[0.812] [claude] turn=a1b2c3d4e5f6:7 (8/23) project=homelab date=2026-01-15
Q: The containers still can't ping each other. I verified the network exists...
A: The issue is that your containers are on the default bridge network, not your
custom one. The default bridge doesn't support DNS resolution between containers...
tools: Bash, Read, Edit
Deep context: show a1b2c3d4:5-9
Level 3: Show
"Give me the full context." This isn't a search — it re-parses the raw session and shows the complete conversation including tool calls, file edits, bash commands, and their outputs.
$ scripts/search.sh show a1b2c3d4:5-9
============================================================
DEEP CONTEXT: homelab / docker-network-debug
Showing turns 5-9 of 23 total
============================================================
--- Turn 5 (2026-01-15) ---
USER: Can you check what networks Docker currently has?
ASSISTANT: Let me check the current Docker networks.
[Bash] docker network ls
-> NETWORK ID NAME DRIVER SCOPE
a1b2c3d4e5f6 bridge bridge local
f7e8d9c0b1a2 host host local
...
--- Turn 6 (2026-01-15) ---
USER: None of my containers are on the custom network...
The escalation strategy is: Level 1 to find the session, Level 2 to find the exact turn, Level 3 to see everything that happened around it.
Hybrid Search: Vector + BM25
Pure vector search turned out to miss too many literal terms — proper nouns, CLI flag names, exact config keys. "Traefik" or --no-verify would get semantically paraphrased into irrelevant neighbors. So every document that gets embedded also lands in a SQLite FTS5 table for BM25 keyword search, and every query runs both searches in parallel.
The two result sets get combined with Relative Score Fusion: vector cosine similarity and BM25 scores are each min-max normalized to 0–1, then blended with a tunable weight α.
ALPHA = 0.7  # 70% semantic, 30% keyword

def relative_score_fusion(vector_results, bm25_results, id_key_fn, alpha=ALPHA):
    norm_vec = _min_max_normalize(vector_results)
    norm_bm25 = _min_max_normalize(bm25_results)
    all_ids = set(norm_vec) | set(norm_bm25)
    return sorted(
        ((doc_id,
          alpha * norm_vec.get(doc_id, 0.0) + (1 - alpha) * norm_bm25.get(doc_id, 0.0))
         for doc_id in all_ids),
        key=lambda x: x[1], reverse=True,
    )
Unlike Reciprocal Rank Fusion (which throws away score magnitude and only uses rank positions), RSF preserves how confident each backend is, so the fused number is actually meaningful for thresholding — above 0.65 is a strong match, 0.45–0.65 is moderate, below that is weak. A --vector-only flag disables BM25 if you want to see what pure semantic search looks like.
Embedding Pipeline
The tool uses Qwen3-Embedding-0.6B — a 600M parameter embedding model that produces 1024-dimensional vectors with an 8K token context window. By default it runs in-process via llama-cpp-python with Metal acceleration, and the GGUF file is auto-downloaded from HuggingFace on first run and cached under ~/.cache/session-search/models/. No separate server, no network hop.
For non-Mac machines, or anyone who'd rather hit an existing embedding server, setting EMBEDDINGS_URL switches the tool to remote mode and posts batches to any OpenAI-compatible /v1/embeddings endpoint. A .env file in the skill directory holds the URL, API key, model name, and EMBEDDINGS_MAX_TOKENS — the last one drives chunk sizing automatically at 80% of the model's context window, so swapping models doesn't require code changes.
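A remote-mode `.env` might look like the following — `EMBEDDINGS_URL` and `EMBEDDINGS_MAX_TOKENS` come from the text above, while the API-key and model variable names here are illustrative guesses, as are the example values:

```shell
# Remote embedding endpoint (OpenAI-compatible /v1/embeddings)
EMBEDDINGS_URL=https://embeddings.example.com
EMBEDDINGS_API_KEY=sk-example-key
EMBEDDINGS_MODEL=Qwen3-Embedding-0.6B
# Drives chunk sizing: chunks target 80% of this value
EMBEDDINGS_MAX_TOKENS=8192
```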
Token-Aware Chunking and Normalization
Long sessions get split into chunks sized to 80% of the model's context window (MAX_TOKENS = int(EMBEDDINGS_MAX_TOKENS * 0.8)), split at turn boundaries so no single Q/A exchange ever gets cut in half. That 80% ceiling leaves headroom for the query instruction prefix and a safety margin against token-counting drift between the tokenizer and the embedding server.
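The turn-boundary splitting described above can be sketched as a greedy packer — whole turns accumulate into a chunk until the next one would exceed the budget. Function and parameter names here are illustrative, and `count_tokens` stands in for whatever tokenizer the pipeline uses:

```python
MAX_TOKENS = int(8192 * 0.8)  # 80% of the model's context window

def chunk_turns(turns, count_tokens, max_tokens=MAX_TOKENS):
    """Greedily pack whole turns into token-bounded chunks.

    A turn is never split; a single oversized turn still becomes
    its own chunk rather than being truncated here.
    """
    chunks, current, current_tokens = [], [], 0
    for turn in turns:
        n = count_tokens(turn)
        if current and current_tokens + n > max_tokens:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(turn)
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks
```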
Remote mode batches requests concurrently via ThreadPoolExecutor and checkpoints progress every 100 documents so an interrupted indexing run resumes where it left off instead of starting over. If a batch fails with a 400 or 500, the tool falls back to embedding texts individually, substituting zero vectors for any that still fail. After all embeddings are collected, they get L2-normalized so that cosine similarity becomes a simple dot product:
arr = np.array(all_embeddings, dtype=np.float32)
norms = np.linalg.norm(arr, axis=1, keepdims=True)
norms = np.maximum(norms, 1e-10)
return arr / norms
Search then just needs a matrix multiply:
def search_vectors(query_vec, embeddings, top_k=5, mask=None):
    scores = embeddings @ query_vec
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # restrict to scoped docs
    top_idx = np.argpartition(scores, -top_k)[-top_k:]
    top_idx = top_idx[np.argsort(scores[top_idx])[::-1]]
    return [(int(i), float(scores[i])) for i in top_idx]
Queries get a special instruction prefix before embedding — this is how Qwen3's asymmetric retrieval works. Documents are embedded as-is, but queries get prepended with an instruction that tells the model what kind of retrieval to do:
QUERY_INSTRUCTION = "Instruct: Given a search query, retrieve relevant conversation excerpts that answer the query\nQuery: "
Incremental Indexing
Rebuilding the entire index every time would be slow, so the tool tracks what's already been indexed and only processes changes.
For Claude Code, it stores the mtime and file size of each JSONL file in index-state.json:
stat = jsonl_path.stat()
file_info = {"mtime": stat.st_mtime, "size": stat.st_size}
key = str(jsonl_path)  # state is keyed by file path
if not args_full and key in state and state[key] == file_info:
    unchanged_keys.add(key)
else:
    files_to_process.append(jsonl_path)
    state[key] = file_info
When merging, the tool keeps the existing embeddings for unchanged sessions and only embeds the new/changed ones, then merges them with np.vstack:
def merge_arrays(kept_idx, existing, new):
    if kept_idx and existing is not None:
        kept = existing[kept_idx]
        return np.vstack([kept, new]) if new.size else kept
    return new
Before every search, there's an auto-staleness check. If anything has changed since the last index, it runs an incremental re-index automatically:
def auto_reindex(local=False):
    if not index_is_stale():
        return
    print("Index is stale, updating...", file=sys.stderr)
    cmd_index(IndexArgs())
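The staleness check itself can be sketched as a comparison of the recorded mtime/size entries against what's on disk. This is a simplified version under assumed state-file structure — the real check also covers the OpenCode database:

```python
import json
from pathlib import Path

def index_is_stale(state_path, session_files):
    """True if any session file is new, changed, or deleted since last index."""
    try:
        state = json.loads(Path(state_path).read_text())
    except FileNotFoundError:
        return True  # never indexed
    seen = set()
    for path in session_files:
        st = path.stat()
        key = str(path)
        seen.add(key)
        if state.get(key) != {"mtime": st.st_mtime, "size": st.st_size}:
            return True  # new or modified file
    # Any recorded file that no longer exists also marks the index stale.
    return any(key not in seen for key in state)
```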
This means you never have to think about whether the index is current — it just stays up to date.
Supporting Both Claude Code and OpenCode
This was the trickiest part. Claude Code stores sessions as JSONL files — one JSON object per line, with each line containing a message and metadata. OpenCode stores sessions in a SQLite database with a completely different schema: separate tables for sessions, messages, and parts (sub-message content blocks).
Rather than branching on source type throughout the codebase, each tool gets its own parser module under scripts/parsers/. They share a common BaseParser interface that yields Turn objects with a normalized shape, so the indexer, chunker, and search engine don't know or care where a turn came from. Adding support for a new AI coding tool is a matter of dropping in a new parser file — no changes anywhere else.
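The shared interface might look roughly like this — field and method names are illustrative, not the tool's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One normalized Q&A exchange, regardless of source tool."""
    session_id: str
    index: int
    user_text: str
    assistant_text: str
    timestamp: str
    tools: list = field(default_factory=list)   # e.g. ["Bash", "Edit"]
    files: list = field(default_factory=list)   # files touched in this turn

class BaseParser:
    """Each source (Claude Code, OpenCode, ...) subclasses this."""

    def sessions(self):
        """Yield session metadata dicts for this source."""
        raise NotImplementedError

    def turns(self, session_id):
        """Yield Turn objects for one session."""
        raise NotImplementedError
```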
Schema Differences
Claude Code messages look like:
{"type": "human", "message": {"role": "user", "content": [{"type": "text", "text": "..."}]}, "timestamp": "2026-01-15T10:30:00Z"}
OpenCode has a normalized relational structure: a session table with project references, a message table with role and timing, and a part table where actual content lives (text blocks, tool calls with their state). Tool inputs use camelCase (filePath) instead of snake_case (file_path), and timestamps are millisecond integers instead of ISO strings.
The tool normalizes these differences during parsing. For example, OpenCode tool names are lowercased single words (bash, read, edit) while Claude Code uses capitalized names (Bash, Read, Edit):
OPENCODE_TOOL_MAP = {
    "bash": "Bash", "read": "Read", "write": "Write", "edit": "Edit",
    "glob": "Glob", "grep": "Grep", "list": "LS", "fetch": "WebFetch",
}
And input keys get normalized inline:
if isinstance(inp, dict) and "filePath" in inp:
    inp = dict(inp)  # copy before mutating
    inp["file_path"] = inp.pop("filePath")
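If more camelCase keys turned up, the per-key renames could generalize into a small converter — a hypothetical helper, not something the tool currently does:

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each capital that follows a lowercase
    letter or digit, then lowercase: filePath -> file_path."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()
```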
Incremental Indexing Per Source
The incremental indexing strategy differs by source. Claude Code is simple — compare file mtime and size. OpenCode is trickier because all sessions live in one SQLite file. The tool tracks both the database's overall mtime and each session's time_updated field:
if args_full or db_mtime != oc_state.get("db_mtime"):
    for sess in all_sessions:
        key = sess["id"]
        if (not args_full
                and key in oc_session_state
                and sess["time_updated"] == oc_session_state[key]):
            unchanged_keys.add(key)
        else:
            sessions_to_process.append(sess)
If the DB file hasn't changed at all, skip everything. If it has, check individual sessions by their time_updated timestamp to avoid re-parsing sessions that weren't touched.
Filtering Noise
Both sources produce a lot of noise that would pollute the index. The tool skips:
- Tool-only messages with no real user text (just tool results being passed back)
- Short messages under 30 characters (MIN_USER_CHARS = 30)
- JSON blob messages (system metadata that starts with { and contains "type")
- Internal Claude Code event types like file-history-snapshot, progress, queue-operation
For mega-sessions that exceed the model's context window, session documents get split at turn boundaries into token-bounded chunks (~6500 tokens by default). Each chunk gets its own embedding but shares the same session metadata, and results are deduplicated by session ID during search so one big conversation doesn't flood the top results.
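The session-level dedup can be sketched like this, assuming fused results arrive sorted by score and chunk IDs encode their session (the "session#chunk" ID scheme here is an assumption for illustration):

```python
def dedupe_by_session(results, top_k):
    """Keep only the best-scoring chunk per session.

    results: [(doc_id, score), ...] sorted descending by score,
    where doc_id looks like "<session_id>#<chunk_n>" (assumed format).
    """
    seen, out = set(), []
    for doc_id, score in results:
        session_id = doc_id.split("#")[0]
        if session_id in seen:
            continue  # a higher-scoring chunk of this session already won
        seen.add(session_id)
        out.append((doc_id, score))
        if len(out) == top_k:
            break
    return out
```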
Integration as a Claude Code Skill
The tool is wired into Claude Code as a custom skill, packaged as a self-contained directory that anyone can drop into ~/.claude/skills/:
session-search/
├── SKILL.md
├── scripts/
│ ├── search.sh # dispatcher — sources .env, activates venv
│ ├── session_search.py # CLI entry
│ ├── search_engine.py # hybrid search + RSF fusion
│ ├── index_store.py # embeddings + FTS5 persistence
│ ├── embeddings.py # local llama-cpp-python + remote client
│ └── parsers/
│ ├── base.py
│ ├── claude_code.py
│ └── opencode.py
└── references/
├── HOW_IT_WORKS.md
├── CONFIGURATION.md
├── FIRST_RUN.md
├── ADDING_PARSERS.md
└── TROUBLESHOOTING.md
The SKILL.md frontmatter gives the skill its name and description — the metadata Claude uses to decide when to load it:
---
name: session-search
description: Hybrid semantic + keyword search across AI coding tool session history
---
Invoking /session-search <query> triggers the skill, which walks through the escalation levels. The references/ docs are loaded lazily by Claude only when the user asks a matching question — "how does scoring work?" pulls in HOW_IT_WORKS.md, "how do I add a new source?" pulls in ADDING_PARSERS.md. That way the main SKILL.md stays small and doesn't burn context on every invocation.
On first run, the skill walks the user through a short setup: local vs remote embedding model, which endpoint to use if remote, and whether to disable Claude Code's 30-day session cleanup so older history stays searchable.
Closing Thoughts
This has become one of my most-used tools. Being able to search hundreds of sessions by meaning — "how did I fix that CORS issue" or "the session where I configured Tailscale" — and have BM25 quietly catch the exact proper nouns semantic search would have missed is a huge quality-of-life improvement.
The full index across ~660 sessions and ~5500 turns weighs in around 25 MB on disk. Building it from scratch takes a few minutes on a MacBook Air running the model locally via Metal, and incremental re-indexes after that take well under a second. The only hard dependency is llama-cpp-python — everything else is standard library plus numpy — and the remote-endpoint path still works for anyone who'd rather offload embedding to a GPU elsewhere.