Four Costs of Building Full-Text Search Into SQLite

I added full-text search to Mission Control's activity log last week. The activity log records every mutation across the system: task moves, plan generations, tool calls, review decisions. Nineteen distinct action types from multiple agents and surfaces, all flowing through a single writer and indexed for search.

SQLite's FTS5 looked like the simple path. One virtual table declaration, three columns to index, done. It took about 30 minutes to write the CREATE VIRTUAL TABLE statement. It took the rest of the week to deal with what the docs don't prepare you for.

FTS5's dotted tokens break MATCH parsing

Every action in the log follows a domain.verb convention. task.create, task.move, proposal.decide, tool.use. The full planning and execution layer runs through these 19 dotted action names.

FTS5's default unicode61 tokenizer treats . as a token separator. So task.move becomes two separate tokens: task and move. That part is documented, if you read the right section.

Here's what catches you. An unquoted MATCH 'task.move' doesn't search for the compound action name. The tokenizer splits it into two separate tokens at query time too, and FTS5 runs an implicit AND search: any row containing both task and move somewhere in the indexed columns. For my activity log, that's nearly every row. Not a parse error. Worse: silently wrong results that look plausible until you check the count.

The fix is uniform double-quoting. Every MATCH query, no exceptions:

javascript

const q = (term) => '"' + String(term).replace(/"/g, '""') + '"';

This wraps every search term in a quoted phrase. Not because some queries need it, but because all 19 action names are dotted, and an unquoted search on any of them fails.

There's an alternative: tokenchars '.' in the tokenizer config keeps dots as part of tokens. I didn't use it. The FTS table also indexes detail_json, which contains error messages, stack traces, and JSON payloads. For that content, punctuation-as-separator is exactly right. tokenchars is global to the entire FTS table. You can't split behavior per column.

FTS5 external content tables mean manual sync

The FTS5 external content table declaration includes content='activity_log', which tells the index to read data from the source table instead of storing its own copy. Saves disk. Sounds like it would handle synchronization for you.

It doesn't. "External content" means "I'll point queries at your table." It does not mean "I'll watch your table for changes." You write the INSERT, DELETE, and UPDATE triggers yourself. FTS5 does zero bookkeeping on the content table.

Three triggers, each wiring a different mutation path. The INSERT trigger is straightforward. The DELETE trigger is where it gets strange:

sql

CREATE TRIGGER activity_log_ad AFTER DELETE ON activity_log BEGIN
  INSERT INTO activity_log_fts(activity_log_fts, rowid, action, actor_id, detail_json)
  VALUES ('delete', old.rowid, old.action, old.actor_id, old.detail_json);
END;

Read that again. It's an INSERT that deletes. The first column in the list is the table name itself, used as a column. The value 'delete' is a command token telling FTS5 to remove the entry. This syntax is specific to FTS5. It's different from FTS4, different from anything else in SQLite.

The docs mention this once, in the "External Content Tables" section, without much explanation of why the syntax looks the way it does.

The UPDATE trigger combines both operations: delete the old entry, insert the new one. Two FTS writes for every row update.

The unicode61 tokenizer rewrites your search model

The unicode61 tokenizer lowercases everything and strips punctuation. For prose (blog posts, chat messages, notes), this is desirable. Case-insensitive, punctuation-agnostic search is the whole point of full-text search.

For structured event names, it breaks the model:

task.create becomes tokens task, create
task.move becomes tokens task, move
operator.abort becomes tokens operator, abort

Fourteen of my 19 actions share a domain prefix: task.*, tool.*, operator.*, review.*. A bare MATCH for "task" returns every task-related row. That's the busiest category in the log.

The fix, again, is phrase matching through the q() helper. Searching for "task.create" matches the token sequence task followed immediately by create at adjacent positions. Position-adjacent matching works correctly even with the dot stripped. But you have to know this. And you have to apply it to every query.

I considered tokenchars '.' here too. Same tradeoff: it would fix the domain prefix problem for action columns but break tokenization for detail_json, where dots in error messages and JSON keys need to be separators.

Sync triggers double your write cost

External-content FTS5 saves disk because the index stores no duplicate text. The cost shows up on the write path instead.

INSERT: 1 source write + 1 FTS insert = 2 table writes
UPDATE: 1 source write + 1 FTS delete + 1 FTS insert = 3 table writes
DELETE: 1 source write + 1 FTS delete = 2 table writes

For an activity log that captures every mutation in the app, this adds up. Every task move, every tool call, every review decision generates a row. Each row fires a trigger.

The alternative is a regular (non-external-content) FTS5 table that auto-syncs but stores a full copy of every indexed column. On a $7 VPS where disk is finite, I chose write amplification over storage duplication. Worth it here. Probably not worth it if your FTS table indexes short, rarely-updated records.

The 23 tests in the suite validate sync correctness on every code path. They run against the real migration SQL file, not a re-typed inline copy. If the DDL ever drifts from what the tests exercise, the tests break immediately. A missed sync trigger means phantom search results or stale matches, the kind of gap between "it works" and production that bites you at the worst time.

The pattern

These aren't bugs. FTS5 does exactly what its documentation says, if you read all the right sections in the right order and connect the implications yourself. The real issue is that FTS5 was designed for document search: long text, natural language, case-insensitive matching. When you feed it structured data with dotted tokens and domain-prefixed event names, every default needs revisiting.

The CREATE VIRTUAL TABLE statement took 30 minutes. The quoting layer, the sync triggers, the test coverage to validate all of it, and the tradeoff analysis for tokenchars vs phrase matching took the rest of the week. That ratio is the cost nobody warns you about.

What I'm changing

I'm keeping FTS5. Search performance is good, the disk savings from external content are real, and better-sqlite3 handles it without friction. But I'm treating it as infrastructure with its own maintenance cost now, not a convenience feature. The virtual table is the invitation. The three triggers and the quoting layer are where you actually live.

Four Costs of Building Full-Text Search Into SQLite

FTS5's dotted tokens break MATCH parsing

FTS5 external content tables mean manual sync

The unicode61 tokenizer rewrites your search model

Sync triggers double your write cost

The pattern

What I'm changing

Get new posts in your inbox

Keep reading

32% of managers cut a role for AI, then rehired for the same one

What broke the AI coding productivity narrative in a week?

The AI coding subsidy era ended today