Add keywords_tsv column and tsvector indexing pipeline #31
Goal
Store a PostgreSQL `tsvector` on each `pages` row so full-text search queries can run efficiently via a GIN index.
Acceptance criteria
- `pages.keywords_tsv`: nullable `tsvector` column plus a GIN index.
- `Page` model casts/exposes the column (raw `tsvector` string; no PHP-side casting needed, queried directly in SQL).
- `KeywordsIndexedListener` (or inline in a new `IndexPageAction`) receives body text + detected language → calls `to_tsvector(pg_config, body)` → writes `keywords_tsv` on the page row.
- `en` / `en-*` → `english`, `pt` / `pt-*` → `portuguese`, `fr` / `fr-*` → `french`, etc. Unmapped or null → `simple`. Routing lives in a dedicated `LanguageConfigResolver` class (testable independently).
- `ProcessCrawlJob`, after a successful fetch, passes the extracted body text and page language to the indexing action.
- `LanguageConfigResolver` unit tests for all supported languages + fallback; integration test confirming a crawled page ends up with a non-null `keywords_tsv`; the test requires real PostgreSQL (tag with `@group pgsql` or similar so SQLite runs skip it).
Notes
- No `body_text` column; `keywords_tsv` is the only persistent artifact of the body.
- `to_tsvector` is PG-only, so this ticket depends on #22 (Postgres test runs in CI).
- Region-qualified language tags (`en-US`, `pt-BR`) need prefix matching, not exact match, for config routing.
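The language routing rules above could be sketched roughly as follows. This is a minimal illustration, not the final implementation: the `resolve()` method name, the constant names, and the underscore normalisation (`pt_BR` treated like `pt-BR`) are assumptions, and the map lists only the three languages named in this issue. It matches on the primary language subtag, which is a strict form of the prefix matching the notes ask for.

```php
<?php

declare(strict_types=1);

/**
 * Maps a page language tag to a PostgreSQL text search configuration name.
 * Sketch only: method and constant names are hypothetical.
 */
final class LanguageConfigResolver
{
    /** Primary language subtag => PostgreSQL text search configuration. */
    private const CONFIGS = [
        'en' => 'english',
        'pt' => 'portuguese',
        'fr' => 'french',
    ];

    /** Used for null, empty, or unmapped languages. */
    private const FALLBACK = 'simple';

    public function resolve(?string $language): string
    {
        if ($language === null) {
            return self::FALLBACK;
        }

        // Normalise " PT_br " to "pt-br" (underscore handling is an assumption),
        // then keep only the primary subtag so "pt-BR" routes like "pt".
        $tag = strtolower(str_replace('_', '-', trim($language)));
        $primary = explode('-', $tag, 2)[0];

        return self::CONFIGS[$primary] ?? self::FALLBACK;
    }
}
```

With this sketch, `(new LanguageConfigResolver())->resolve('pt-BR')` returns `portuguese`, and `resolve(null)` or `resolve('de')` returns `simple`. One possible way to use the result is to bind it as a query parameter, e.g. `to_tsvector(?::regconfig, ?)`, rather than interpolating it into the SQL string.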