Add keywords_tsv column and tsvector indexing pipeline #31

Closed
opened 2026-04-29 23:56:04 +02:00 by myrmidex · 0 comments
Owner

Goal

Store a PostgreSQL tsvector on each pages row so full-text search queries can run efficiently via a GIN index.

Acceptance criteria

  • Migration adds pages.keywords_tsv tsvector nullable + GIN index
  • Page model casts/exposes the column (raw tsvector string — no PHP-side casting needed, queried directly in SQL)
  • A KeywordsIndexedListener (or inline in a new IndexPageAction) receives body text + detected language → calls to_tsvector(pg_config, body) → writes keywords_tsv on the page row
  • Language → PostgreSQL text search config routing: en/en-* → english, pt/pt-* → portuguese, fr/fr-* → french, etc. Unmapped or null → simple. Routing lives in a dedicated LanguageConfigResolver class (testable independently).
  • The listener is triggered from ProcessCrawlJob after a successful fetch — passes the extracted body text and page language to the indexing action
  • Tests: LanguageConfigResolver unit tests for all supported languages + fallback; integration test confirming a crawled page ends up with a non-null keywords_tsv; test requires real PostgreSQL (tag with @group pgsql or similar so SQLite runs skip it)
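
A minimal sketch of the write step the indexing action would perform, in Python for illustration only (the project's own listener/action would be PHP). The helper name `build_index_statement` and the `id` primary-key column are assumptions, not part of this ticket. The point is that both the text search config and the body are bound as parameters, and the body only ever appears as an argument to to_tsvector:

```python
from typing import Any, Tuple


def build_index_statement(page_id: int, pg_config: str, body: str) -> Tuple[str, Tuple[Any, ...]]:
    """Return a parameterized UPDATE that stores to_tsvector(config, body).

    Hypothetical helper: assumes the pages table has an `id` primary key.
    The %s placeholders follow the DB-API "format" paramstyle; the driver
    binds the values, so the body text is never spliced into the SQL string.
    """
    sql = (
        "UPDATE pages "
        "SET keywords_tsv = to_tsvector(%s::regconfig, %s) "
        "WHERE id = %s"
    )
    # The body is used once as a bound parameter and is not stored anywhere else.
    return sql, (pg_config, body, page_id)
```

The `::regconfig` cast is there because the config name arrives as a bound text value; without it PostgreSQL may not be able to resolve which to_tsvector overload is meant.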

Notes

  • Body text is transient — passed through the event/action, never persisted to a body_text column. keywords_tsv is the only persistent artifact of the body.
  • to_tsvector is PG-only — this ticket depends on #22 (Postgres test runs in CI).
  • BCP-47 regional subtags (en-US, pt-BR) need prefix matching, not exact match, for config routing.
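
The routing rule can be sketched as follows, in Python for illustration only (the real LanguageConfigResolver would be a PHP class). The function name `resolve_pg_config` and the three-entry map are assumptions; the ticket lists en, pt and fr and leaves the rest as "etc.". Matching is done on the primary subtag (the part before the first hyphen) rather than with a raw string prefix, so `en-US` routes to english while an unrelated tag such as `eng` does not:

```python
from typing import Optional

# Partial map: only the languages named in the ticket. Extend as needed.
CONFIG_BY_PRIMARY_SUBTAG = {
    "en": "english",
    "pt": "portuguese",
    "fr": "french",
}
FALLBACK_CONFIG = "simple"


def resolve_pg_config(language: Optional[str]) -> str:
    """Map a BCP-47 language tag to a PostgreSQL text search config name."""
    if not language:
        return FALLBACK_CONFIG
    # BCP-47 tags are case-insensitive; keep only the primary language subtag.
    primary = language.strip().split("-", 1)[0].lower()
    return CONFIG_BY_PRIMARY_SUBTAG.get(primary, FALLBACK_CONFIG)
```

Returning a value from a fixed allow-list (plus simple) also means the config name passed to to_tsvector never comes straight from crawled input.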
myrmidex added this to the v0.2 milestone 2026-04-29 23:56:04 +02:00
myrmidex self-assigned this 2026-04-29 23:56:04 +02:00
Reference: lvl0/trove#31