Crawler Data Model #7

New issue

Closed

opened 2026-04-23 02:27:21 +02:00 by myrmidex · 0 comments

myrmidex commented

2026-04-23 02:27:21 +02:00

Owner

Context

Rewritten 2026-04-26. Original text proposed a crawled_pages table, text_content column, and outbound_links json column — all conflict with locked-in decisions from #4 and "Trove stores signals not content":

crawled_pages would duplicate pages (the canonical URL row from #4 with PageStatusEnum).
text_content violates "page body is never persisted; tokenization writes keywords_tsv and drops the body".
outbound_links json would denormalize the page_links citation graph from #4.

This rewrite delivers crawler-pipeline scaffolding without re-litigating those decisions. Cache columns on pages were further trimmed after design discussion: only fields with a real v0.1 consumer survive on pages; everything per-attempt lives on page_crawls.

Schema design

Single table: `page_crawls`

One row per crawl attempt. Rows are never deleted — the table is an append-only history. The same row is also the work signal: a row with outcome IS NULL means "this crawl is pending or in flight." Re-crawl after success = NEW row, same page_id. Transient retries that don't reach completed_at are also new rows — there is no per-row retry counter.

id              bigint PK
page_id         bigint FK → pages.id, cascadeOnDelete
domain          string, indexed (per-domain politeness; populated at insert time from page URL)
priority        smallInteger default 0
scheduled_for   timestampTz default NOW() via useCurrent(), indexed
locked_at       timestampTz nullable    (worker claims by stamping this; null = available)
completed_at    timestampTz nullable    (worker stamps when finished, success or fail)
outcome         CrawlOutcomeEnum nullable (success / failed / timeout / blocked_robots / blocked_4xx / blocked_5xx; null = pending or in-flight)
status_code     smallint nullable
error_message   text nullable
created_at, updated_at  timestampTz

No attempts column — count attempts per page via PageCrawl::where('page_id', $id)->count(). Application logic (worker #14) handles retry/backoff by querying recent rows; no per-row counter needed.

No attempted_at column — locked_at already means "worker claimed this attempt." created_at already means "this attempt was scheduled/inserted." Together they cover all the lifecycle questions without a third timestamp.

No locked_by column — v0.1 will run a single worker process; identifying which worker holds a lock is YAGNI until multi-worker setups exist (post-v0.1). locked_at != null is sufficient to mean "in flight." Reclaiming abandoned locks uses WHERE locked_at < NOW() - INTERVAL '15 min', no worker-identity check needed.

No unique(page_id) — multiple rows per page is the point. Filter by outcome IS NULL in worker queries to ignore completed history.

Indexes

(domain) partial WHERE outcome IS NULL — politeness lookups (#11)
(scheduled_for, locked_at) partial WHERE outcome IS NULL — worker poll
(page_id, created_at DESC) — per-page crawl history view (rows ordered by insertion = attempt time)

`pages` extension — minimal

Only one new column on pages because that's the only one with a real v0.1 consumer:

pages.language string(35) nullable, indexed — detected language (BCP-47, e.g. en, pt-BR, zh-Hant-TW). 35 chars covers RFC 5646's full subtag combination ceiling. Sticky per page (doesn't change between fetches of the same URL). Read by the keywords/tokenization listener (#11) to pick the right tsvector config (english, simple, etc.) and by the eventual search UI for query routing (WHERE language = 'en' filter).

Explicitly NOT added (deferred to the tickets that actually need them):

pages.html_hash — change-detection optimization. No re-crawl scheduling in v0.1, so no consumer reads this. The ticket that adds re-crawl decides where the hash lives.
pages.status_code / pages.last_error — one-query savings on a low-traffic admin view. Not worth the duplication. Live on page_crawls.status_code / page_crawls.error_message; reachable via Page::latestCrawl() when actually needed.

State machine

URL discovered → pages row exists, no page_crawls row
                ↓ enqueue (#8 Queue population)
page_crawls row inserted, scheduled_for=now, outcome=null   (= pending)
                ↓ worker claims
locked_at set                                               (= in flight)
                ↓ fetch finishes
completed_at, outcome, status_code/error_message set; locked_at cleared
pages.status flipped (fetched | failed); pages.fetched_at | failed_at set
                ↓ retry (transient failure) OR re-crawl scheduled (post-v0.1)
NEW page_crawls row inserted (same page_id, fresh scheduled_for, outcome=null)
                ↓ worker claims, etc.

Models

App\Models\PageCrawl — $fillable, $casts (outcome → CrawlOutcomeEnum, all timestamps → datetime). belongsTo(Page::class).
App\Enums\CrawlOutcomeEnum — backed string enum (success, failed, timeout, blocked_robots, blocked_4xx, blocked_5xx). Already committed in 9dd6d84.
PageCrawlFactory — InstanceFactory pattern: minimal defaults, named state methods (->successful(), ->failed(string $message), ->scheduledAt(Carbon), ->locked()). The locked() state method takes no arguments now (no locked_by).
App\Models\Page — extend $fillable with language (already done in b1b7ade). hasMany(PageCrawl::class) for full history; latestCrawl() for "give me the most recent attempt" via hasOne(PageCrawl::class)->latestOfMany('created_at').

Tests

PageCrawlTest (unit) — fillable, casts, belongsTo(Page) relationship
PageCrawlFactoryTest (unit) — state methods produce the right shapes
PageTest additions — crawls() and latestCrawl() relationships

Out of scope (later tickets)

Worker logic claiming/processing rows → #14 Queue worker
Populating page_crawls from pages WHERE status=Discovered → #8 Queue population
Per-domain politeness using the domain column → #11 Per-domain politeness
HTTP fetch writing back status_code / error_message / language → #12 HTTP fetcher
Tokenization writing keywords_tsv, dropping body → #13 Language detection (also reads pages.language)
Re-crawl scheduling + hash-based change detection → post-v0.1
Multi-worker locking with worker identity → post-v0.1

Acceptance

Migration adding language to pages (commit b1b7ade)
App\Enums\CrawlOutcomeEnum with the 6 cases (commit 9dd6d84)
Migration creating page_crawls with columns + partial indexes above
App\Models\PageCrawl with factory + tests
Page::crawls() (HasMany) + Page::latestCrawl() (HasOne via latestOfMany on created_at) + tests
PLATFORM.md updated with the schema and the rationale for "history-as-queue, no retry counter, single-worker v0.1" design

## Context **Rewritten 2026-04-26.** Original text proposed a `crawled_pages` table, `text_content` column, and `outbound_links json` column — all conflict with locked-in decisions from #4 and "Trove stores signals not content": - `crawled_pages` would duplicate `pages` (the canonical URL row from #4 with `PageStatusEnum`). - `text_content` violates "page body is never persisted; tokenization writes `keywords_tsv` and drops the body". - `outbound_links json` would denormalize the `page_links` citation graph from #4. This rewrite delivers crawler-pipeline scaffolding without re-litigating those decisions. Cache columns on `pages` were further trimmed after design discussion: only fields with a real v0.1 consumer survive on `pages`; everything per-attempt lives on `page_crawls`. ## Schema design ### Single table: `page_crawls` **One row per crawl attempt.** Rows are **never deleted** — the table is an append-only history. The same row is also the work signal: a row with `outcome IS NULL` means "this crawl is pending or in flight." Re-crawl after success = NEW row, same `page_id`. Transient retries that don't reach `completed_at` are also new rows — there is no per-row retry counter. ``` id bigint PK page_id bigint FK → pages.id, cascadeOnDelete domain string, indexed (per-domain politeness; populated at insert time from page URL) priority smallInteger default 0 scheduled_for timestampTz default NOW() via useCurrent(), indexed locked_at timestampTz nullable (worker claims by stamping this; null = available) completed_at timestampTz nullable (worker stamps when finished, success or fail) outcome CrawlOutcomeEnum nullable (success / failed / timeout / blocked_robots / blocked_4xx / blocked_5xx; null = pending or in-flight) status_code smallint nullable error_message text nullable created_at, updated_at timestampTz ``` **No `attempts` column** — count attempts per page via `PageCrawl::where('page_id', $id)->count()`. Application logic (worker #14) handles retry/backoff by querying recent rows; no per-row counter needed. **No `attempted_at` column** — `locked_at` already means "worker claimed this attempt." `created_at` already means "this attempt was scheduled/inserted." Together they cover all the lifecycle questions without a third timestamp. **No `locked_by` column** — v0.1 will run a single worker process; identifying which worker holds a lock is YAGNI until multi-worker setups exist (post-v0.1). `locked_at != null` is sufficient to mean "in flight." Reclaiming abandoned locks uses `WHERE locked_at < NOW() - INTERVAL '15 min'`, no worker-identity check needed. **No `unique(page_id)`** — multiple rows per page is the point. Filter by `outcome IS NULL` in worker queries to ignore completed history. ### Indexes - `(domain)` partial `WHERE outcome IS NULL` — politeness lookups (#11) - `(scheduled_for, locked_at)` partial `WHERE outcome IS NULL` — worker poll - `(page_id, created_at DESC)` — per-page crawl history view (rows ordered by insertion = attempt time) ### `pages` extension — minimal Only one new column on `pages` because that's the only one with a real v0.1 consumer: - `pages.language` `string(35)` nullable, **indexed** — detected language (BCP-47, e.g. `en`, `pt-BR`, `zh-Hant-TW`). 35 chars covers RFC 5646's full subtag combination ceiling. Sticky per page (doesn't change between fetches of the same URL). Read by the keywords/tokenization listener (#11) to pick the right `tsvector` config (`english`, `simple`, etc.) and by the eventual search UI for query routing (`WHERE language = 'en'` filter). **Explicitly NOT added** (deferred to the tickets that actually need them): - `pages.html_hash` — change-detection optimization. No re-crawl scheduling in v0.1, so no consumer reads this. The ticket that adds re-crawl decides where the hash lives. - `pages.status_code` / `pages.last_error` — one-query savings on a low-traffic admin view. Not worth the duplication. Live on `page_crawls.status_code` / `page_crawls.error_message`; reachable via `Page::latestCrawl()` when actually needed. ### State machine ``` URL discovered → pages row exists, no page_crawls row ↓ enqueue (#8 Queue population) page_crawls row inserted, scheduled_for=now, outcome=null (= pending) ↓ worker claims locked_at set (= in flight) ↓ fetch finishes completed_at, outcome, status_code/error_message set; locked_at cleared pages.status flipped (fetched | failed); pages.fetched_at | failed_at set ↓ retry (transient failure) OR re-crawl scheduled (post-v0.1) NEW page_crawls row inserted (same page_id, fresh scheduled_for, outcome=null) ↓ worker claims, etc. ``` ## Models - `App\Models\PageCrawl` — `$fillable`, `$casts` (outcome → CrawlOutcomeEnum, all timestamps → datetime). `belongsTo(Page::class)`. - `App\Enums\CrawlOutcomeEnum` — backed string enum (`success`, `failed`, `timeout`, `blocked_robots`, `blocked_4xx`, `blocked_5xx`). **Already committed** in `9dd6d84`. - `PageCrawlFactory` — `InstanceFactory` pattern: minimal defaults, named state methods (`->successful()`, `->failed(string $message)`, `->scheduledAt(Carbon)`, `->locked()`). The `locked()` state method takes no arguments now (no `locked_by`). - `App\Models\Page` — extend `$fillable` with `language` (already done in `b1b7ade`). `hasMany(PageCrawl::class)` for full history; `latestCrawl()` for "give me the most recent attempt" via `hasOne(PageCrawl::class)->latestOfMany('created_at')`. ## Tests - `PageCrawlTest` (unit) — fillable, casts, `belongsTo(Page)` relationship - `PageCrawlFactoryTest` (unit) — state methods produce the right shapes - `PageTest` additions — `crawls()` and `latestCrawl()` relationships ## Out of scope (later tickets) - Worker logic claiming/processing rows → **#14 Queue worker** - Populating `page_crawls` from `pages WHERE status=Discovered` → **#8 Queue population** - Per-domain politeness using the `domain` column → **#11 Per-domain politeness** - HTTP fetch writing back `status_code` / `error_message` / language → **#12 HTTP fetcher** - Tokenization writing `keywords_tsv`, dropping body → **#13 Language detection** (also reads `pages.language`) - Re-crawl scheduling + hash-based change detection → **post-v0.1** - Multi-worker locking with worker identity → **post-v0.1** ## Acceptance - [x] Migration adding `language` to `pages` *(commit b1b7ade)* - [x] `App\Enums\CrawlOutcomeEnum` with the 6 cases *(commit 9dd6d84)* - [ ] Migration creating `page_crawls` with columns + partial indexes above - [ ] `App\Models\PageCrawl` with factory + tests - [ ] `Page::crawls()` (HasMany) + `Page::latestCrawl()` (HasOne via latestOfMany on `created_at`) + tests - [ ] PLATFORM.md updated with the schema and the rationale for "history-as-queue, no retry counter, single-worker v0.1" design