Crawler Data Model #7

Closed
opened 2026-04-23 02:27:21 +02:00 by myrmidex · 0 comments
Owner

Context

Rewritten 2026-04-26. Original text proposed a crawled_pages table, text_content column, and outbound_links json column — all conflict with locked-in decisions from #4 and "Trove stores signals not content":

  • crawled_pages would duplicate pages (the canonical URL row from #4 with PageStatusEnum).
  • text_content violates "page body is never persisted; tokenization writes keywords_tsv and drops the body".
  • outbound_links json would denormalize the page_links citation graph from #4.

This rewrite delivers crawler-pipeline scaffolding without re-litigating those decisions. Cache columns on pages were further trimmed after design discussion: only fields with a real v0.1 consumer survive on pages; everything per-attempt lives on page_crawls.

Schema design

Single table: page_crawls

One row per crawl attempt. Rows are never deleted — the table is an append-only history. The same row is also the work signal: a row with outcome IS NULL means "this crawl is pending or in flight." Re-crawl after success = NEW row, same page_id. Transient retries that don't reach completed_at are also new rows — there is no per-row retry counter.

id              bigint PK
page_id         bigint FK → pages.id, cascadeOnDelete
domain          string, indexed (per-domain politeness; populated at insert time from page URL)
priority        smallInteger default 0
scheduled_for   timestampTz default NOW() via useCurrent(), indexed
locked_at       timestampTz nullable    (worker claims by stamping this; null = available)
completed_at    timestampTz nullable    (worker stamps when finished, success or fail)
outcome         CrawlOutcomeEnum nullable (success / failed / timeout / blocked_robots / blocked_4xx / blocked_5xx; null = pending or in-flight)
status_code     smallint nullable
error_message   text nullable
created_at, updated_at  timestampTz

No attempts column — count attempts per page via PageCrawl::where('page_id', $id)->count(). Application logic (worker #14) handles retry/backoff by querying recent rows; no per-row counter needed.

No attempted_at columnlocked_at already means "worker claimed this attempt." created_at already means "this attempt was scheduled/inserted." Together they cover all the lifecycle questions without a third timestamp.

No locked_by column — v0.1 will run a single worker process; identifying which worker holds a lock is YAGNI until multi-worker setups exist (post-v0.1). locked_at != null is sufficient to mean "in flight." Reclaiming abandoned locks uses WHERE locked_at < NOW() - INTERVAL '15 min', no worker-identity check needed.

No unique(page_id) — multiple rows per page is the point. Filter by outcome IS NULL in worker queries to ignore completed history.

Indexes

  • (domain) partial WHERE outcome IS NULL — politeness lookups (#11)
  • (scheduled_for, locked_at) partial WHERE outcome IS NULL — worker poll
  • (page_id, created_at DESC) — per-page crawl history view (rows ordered by insertion = attempt time)

pages extension — minimal

Only one new column on pages because that's the only one with a real v0.1 consumer:

  • pages.language string(35) nullable, indexed — detected language (BCP-47, e.g. en, pt-BR, zh-Hant-TW). 35 chars covers RFC 5646's full subtag combination ceiling. Sticky per page (doesn't change between fetches of the same URL). Read by the keywords/tokenization listener (#11) to pick the right tsvector config (english, simple, etc.) and by the eventual search UI for query routing (WHERE language = 'en' filter).

Explicitly NOT added (deferred to the tickets that actually need them):

  • pages.html_hash — change-detection optimization. No re-crawl scheduling in v0.1, so no consumer reads this. The ticket that adds re-crawl decides where the hash lives.
  • pages.status_code / pages.last_error — one-query savings on a low-traffic admin view. Not worth the duplication. Live on page_crawls.status_code / page_crawls.error_message; reachable via Page::latestCrawl() when actually needed.

State machine

URL discovered → pages row exists, no page_crawls row
                ↓ enqueue (#8 Queue population)
page_crawls row inserted, scheduled_for=now, outcome=null   (= pending)
                ↓ worker claims
locked_at set                                               (= in flight)
                ↓ fetch finishes
completed_at, outcome, status_code/error_message set; locked_at cleared
pages.status flipped (fetched | failed); pages.fetched_at | failed_at set
                ↓ retry (transient failure) OR re-crawl scheduled (post-v0.1)
NEW page_crawls row inserted (same page_id, fresh scheduled_for, outcome=null)
                ↓ worker claims, etc.

Models

  • App\Models\PageCrawl$fillable, $casts (outcome → CrawlOutcomeEnum, all timestamps → datetime). belongsTo(Page::class).
  • App\Enums\CrawlOutcomeEnum — backed string enum (success, failed, timeout, blocked_robots, blocked_4xx, blocked_5xx). Already committed in 9dd6d84.
  • PageCrawlFactoryInstanceFactory pattern: minimal defaults, named state methods (->successful(), ->failed(string $message), ->scheduledAt(Carbon), ->locked()). The locked() state method takes no arguments now (no locked_by).
  • App\Models\Page — extend $fillable with language (already done in b1b7ade). hasMany(PageCrawl::class) for full history; latestCrawl() for "give me the most recent attempt" via hasOne(PageCrawl::class)->latestOfMany('created_at').

Tests

  • PageCrawlTest (unit) — fillable, casts, belongsTo(Page) relationship
  • PageCrawlFactoryTest (unit) — state methods produce the right shapes
  • PageTest additions — crawls() and latestCrawl() relationships

Out of scope (later tickets)

  • Worker logic claiming/processing rows → #14 Queue worker
  • Populating page_crawls from pages WHERE status=Discovered#8 Queue population
  • Per-domain politeness using the domain column → #11 Per-domain politeness
  • HTTP fetch writing back status_code / error_message / language → #12 HTTP fetcher
  • Tokenization writing keywords_tsv, dropping body → #13 Language detection (also reads pages.language)
  • Re-crawl scheduling + hash-based change detection → post-v0.1
  • Multi-worker locking with worker identity → post-v0.1

Acceptance

  • Migration adding language to pages (commit b1b7ade)
  • App\Enums\CrawlOutcomeEnum with the 6 cases (commit 9dd6d84)
  • Migration creating page_crawls with columns + partial indexes above
  • App\Models\PageCrawl with factory + tests
  • Page::crawls() (HasMany) + Page::latestCrawl() (HasOne via latestOfMany on created_at) + tests
  • PLATFORM.md updated with the schema and the rationale for "history-as-queue, no retry counter, single-worker v0.1" design
## Context **Rewritten 2026-04-26.** Original text proposed a `crawled_pages` table, `text_content` column, and `outbound_links json` column — all conflict with locked-in decisions from #4 and "Trove stores signals not content": - `crawled_pages` would duplicate `pages` (the canonical URL row from #4 with `PageStatusEnum`). - `text_content` violates "page body is never persisted; tokenization writes `keywords_tsv` and drops the body". - `outbound_links json` would denormalize the `page_links` citation graph from #4. This rewrite delivers crawler-pipeline scaffolding without re-litigating those decisions. Cache columns on `pages` were further trimmed after design discussion: only fields with a real v0.1 consumer survive on `pages`; everything per-attempt lives on `page_crawls`. ## Schema design ### Single table: `page_crawls` **One row per crawl attempt.** Rows are **never deleted** — the table is an append-only history. The same row is also the work signal: a row with `outcome IS NULL` means "this crawl is pending or in flight." Re-crawl after success = NEW row, same `page_id`. Transient retries that don't reach `completed_at` are also new rows — there is no per-row retry counter. ``` id bigint PK page_id bigint FK → pages.id, cascadeOnDelete domain string, indexed (per-domain politeness; populated at insert time from page URL) priority smallInteger default 0 scheduled_for timestampTz default NOW() via useCurrent(), indexed locked_at timestampTz nullable (worker claims by stamping this; null = available) completed_at timestampTz nullable (worker stamps when finished, success or fail) outcome CrawlOutcomeEnum nullable (success / failed / timeout / blocked_robots / blocked_4xx / blocked_5xx; null = pending or in-flight) status_code smallint nullable error_message text nullable created_at, updated_at timestampTz ``` **No `attempts` column** — count attempts per page via `PageCrawl::where('page_id', $id)->count()`. Application logic (worker #14) handles retry/backoff by querying recent rows; no per-row counter needed. **No `attempted_at` column** — `locked_at` already means "worker claimed this attempt." `created_at` already means "this attempt was scheduled/inserted." Together they cover all the lifecycle questions without a third timestamp. **No `locked_by` column** — v0.1 will run a single worker process; identifying which worker holds a lock is YAGNI until multi-worker setups exist (post-v0.1). `locked_at != null` is sufficient to mean "in flight." Reclaiming abandoned locks uses `WHERE locked_at < NOW() - INTERVAL '15 min'`, no worker-identity check needed. **No `unique(page_id)`** — multiple rows per page is the point. Filter by `outcome IS NULL` in worker queries to ignore completed history. ### Indexes - `(domain)` partial `WHERE outcome IS NULL` — politeness lookups (#11) - `(scheduled_for, locked_at)` partial `WHERE outcome IS NULL` — worker poll - `(page_id, created_at DESC)` — per-page crawl history view (rows ordered by insertion = attempt time) ### `pages` extension — minimal Only one new column on `pages` because that's the only one with a real v0.1 consumer: - `pages.language` `string(35)` nullable, **indexed** — detected language (BCP-47, e.g. `en`, `pt-BR`, `zh-Hant-TW`). 35 chars covers RFC 5646's full subtag combination ceiling. Sticky per page (doesn't change between fetches of the same URL). Read by the keywords/tokenization listener (#11) to pick the right `tsvector` config (`english`, `simple`, etc.) and by the eventual search UI for query routing (`WHERE language = 'en'` filter). **Explicitly NOT added** (deferred to the tickets that actually need them): - `pages.html_hash` — change-detection optimization. No re-crawl scheduling in v0.1, so no consumer reads this. The ticket that adds re-crawl decides where the hash lives. - `pages.status_code` / `pages.last_error` — one-query savings on a low-traffic admin view. Not worth the duplication. Live on `page_crawls.status_code` / `page_crawls.error_message`; reachable via `Page::latestCrawl()` when actually needed. ### State machine ``` URL discovered → pages row exists, no page_crawls row ↓ enqueue (#8 Queue population) page_crawls row inserted, scheduled_for=now, outcome=null (= pending) ↓ worker claims locked_at set (= in flight) ↓ fetch finishes completed_at, outcome, status_code/error_message set; locked_at cleared pages.status flipped (fetched | failed); pages.fetched_at | failed_at set ↓ retry (transient failure) OR re-crawl scheduled (post-v0.1) NEW page_crawls row inserted (same page_id, fresh scheduled_for, outcome=null) ↓ worker claims, etc. ``` ## Models - `App\Models\PageCrawl` — `$fillable`, `$casts` (outcome → CrawlOutcomeEnum, all timestamps → datetime). `belongsTo(Page::class)`. - `App\Enums\CrawlOutcomeEnum` — backed string enum (`success`, `failed`, `timeout`, `blocked_robots`, `blocked_4xx`, `blocked_5xx`). **Already committed** in `9dd6d84`. - `PageCrawlFactory` — `InstanceFactory` pattern: minimal defaults, named state methods (`->successful()`, `->failed(string $message)`, `->scheduledAt(Carbon)`, `->locked()`). The `locked()` state method takes no arguments now (no `locked_by`). - `App\Models\Page` — extend `$fillable` with `language` (already done in `b1b7ade`). `hasMany(PageCrawl::class)` for full history; `latestCrawl()` for "give me the most recent attempt" via `hasOne(PageCrawl::class)->latestOfMany('created_at')`. ## Tests - `PageCrawlTest` (unit) — fillable, casts, `belongsTo(Page)` relationship - `PageCrawlFactoryTest` (unit) — state methods produce the right shapes - `PageTest` additions — `crawls()` and `latestCrawl()` relationships ## Out of scope (later tickets) - Worker logic claiming/processing rows → **#14 Queue worker** - Populating `page_crawls` from `pages WHERE status=Discovered` → **#8 Queue population** - Per-domain politeness using the `domain` column → **#11 Per-domain politeness** - HTTP fetch writing back `status_code` / `error_message` / language → **#12 HTTP fetcher** - Tokenization writing `keywords_tsv`, dropping body → **#13 Language detection** (also reads `pages.language`) - Re-crawl scheduling + hash-based change detection → **post-v0.1** - Multi-worker locking with worker identity → **post-v0.1** ## Acceptance - [x] Migration adding `language` to `pages` *(commit b1b7ade)* - [x] `App\Enums\CrawlOutcomeEnum` with the 6 cases *(commit 9dd6d84)* - [ ] Migration creating `page_crawls` with columns + partial indexes above - [ ] `App\Models\PageCrawl` with factory + tests - [ ] `Page::crawls()` (HasMany) + `Page::latestCrawl()` (HasOne via latestOfMany on `created_at`) + tests - [ ] PLATFORM.md updated with the schema and the rationale for "history-as-queue, no retry counter, single-worker v0.1" design
myrmidex added this to the v0.1 milestone 2026-04-23 02:27:21 +02:00
myrmidex added the
enhancement
label 2026-04-26 01:28:09 +02:00
myrmidex self-assigned this 2026-04-26 12:17:47 +02:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: lvl0/trove#7
No description provided.