Crawler Data Model #7
Context
Rewritten 2026-04-26. The original text proposed a crawled_pages table, a text_content column, and an outbound_links json column. All of these conflict with locked-in decisions from #4 and with "Trove stores signals not content":
crawled_pages would duplicate pages (the canonical URL row from #4 with PageStatusEnum).
text_content violates "page body is never persisted; tokenization writes keywords_tsv and drops the body".
outbound_links json would denormalize the page_links citation graph from #4.
This rewrite delivers crawler-pipeline scaffolding without re-litigating those decisions. Cache columns on pages were further trimmed after design discussion: only fields with a real v0.1 consumer survive on pages; everything per-attempt lives on page_crawls.
Schema design
Single table: page_crawls
One row per crawl attempt. Rows are never deleted; the table is an append-only history. The same row is also the work signal: a row with outcome IS NULL means "this crawl is pending or in flight." A re-crawl after success is a NEW row with the same page_id. Transient retries that don't reach completed_at are also new rows; there is no per-row retry counter.
No attempts column: count attempts per page via PageCrawl::where('page_id', $id)->count(). Application logic (worker #14) handles retry/backoff by querying recent rows; no per-row counter is needed.
No attempted_at column: locked_at already means "worker claimed this attempt," and created_at already means "this attempt was scheduled/inserted." Together they cover all the lifecycle questions without a third timestamp.
No locked_by column: v0.1 will run a single worker process; identifying which worker holds a lock is YAGNI until multi-worker setups exist (post-v0.1). Among rows with outcome IS NULL, locked_at IS NOT NULL is sufficient to mean "in flight." Reclaiming abandoned locks uses WHERE locked_at < NOW() - INTERVAL '15 min', with no worker-identity check needed.
No unique(page_id): multiple rows per page is the point. Filter by outcome IS NULL in worker queries to ignore completed history.
Indexes
(domain) partial WHERE outcome IS NULL: politeness lookups (#11)
(scheduled_for, locked_at) partial WHERE outcome IS NULL: worker poll
(page_id, created_at DESC): per-page crawl history view (rows ordered by insertion = attempt time)
pages extension (minimal)
Only one new column on pages because that's the only one with a real v0.1 consumer:
pages.language string(35) nullable, indexed: detected language (BCP-47, e.g. en, pt-BR, zh-Hant-TW). 35 chars covers RFC 5646's full subtag combination ceiling. Sticky per page (doesn't change between fetches of the same URL). Read by the keywords/tokenization listener (#11) to pick the right tsvector config (english, simple, etc.) and by the eventual search UI for query routing (WHERE language = 'en' filter).
Explicitly NOT added (deferred to the tickets that actually need them):
pages.html_hash: change-detection optimization. No re-crawl scheduling in v0.1, so no consumer reads this. The ticket that adds re-crawl decides where the hash lives.
pages.status_code / pages.last_error: one-query savings on a low-traffic admin view. Not worth the duplication. These values live on page_crawls.status_code / page_crawls.error_message and are reachable via Page::latestCrawl() when actually needed.
State machine
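A minimal sketch of the row lifecycle implied by the Schema design rules, written as plain PHP 8.1 with no framework dependency. The state labels (pending, in_flight, completed), the helper names crawlState and isAbandoned, and the enum case names are assumptions for illustration; only the six backing values, the "outcome IS NULL" work signal, and the 15-minute reclaim threshold come from this issue.

```php
<?php

declare(strict_types=1);

// Stand-in for App\Enums\CrawlOutcomeEnum. Backing values are from this issue;
// the case names are assumed.
enum CrawlOutcomeEnum: string
{
    case Success = 'success';
    case Failed = 'failed';
    case Timeout = 'timeout';
    case BlockedRobots = 'blocked_robots';
    case Blocked4xx = 'blocked_4xx';
    case Blocked5xx = 'blocked_5xx';
}

// Hypothetical helper: derives a lifecycle label from two page_crawls columns.
function crawlState(?CrawlOutcomeEnum $outcome, ?DateTimeImmutable $lockedAt): string
{
    if ($outcome !== null) {
        return 'completed'; // any outcome makes the row history
    }

    // outcome IS NULL: the row is still a work signal.
    return $lockedAt === null ? 'pending' : 'in_flight';
}

// Hypothetical helper: mirrors WHERE locked_at < NOW() - INTERVAL '15 min'
// for rows that have no outcome yet.
function isAbandoned(
    ?CrawlOutcomeEnum $outcome,
    ?DateTimeImmutable $lockedAt,
    DateTimeImmutable $now
): bool {
    return $outcome === null
        && $lockedAt !== null
        && $lockedAt < $now->modify('-15 minutes');
}
```

Under these assumptions a row never moves backwards: a retry or re-crawl is a new pending row for the same page_id, not a reset of an existing one.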
Models
App\Models\PageCrawl: $fillable, $casts (outcome → CrawlOutcomeEnum, all timestamps → datetime). belongsTo(Page::class).
App\Enums\CrawlOutcomeEnum: backed string enum (success, failed, timeout, blocked_robots, blocked_4xx, blocked_5xx). Already committed in 9dd6d84.
PageCrawlFactory: follows the InstanceFactory pattern, with minimal defaults and named state methods (->successful(), ->failed(string $message), ->scheduledAt(Carbon), ->locked()). The locked() state method takes no arguments now (no locked_by).
App\Models\Page: extend $fillable with language (already done in b1b7ade). hasMany(PageCrawl::class) for full history; latestCrawl() for "give me the most recent attempt" via hasOne(PageCrawl::class)->latestOfMany('created_at').
Tests
PageCrawlTest (unit): fillable, casts, belongsTo(Page) relationship
PageCrawlFactoryTest (unit): state methods produce the right shapes
PageTest additions: crawls() and latestCrawl() relationships
Out of scope (later tickets)
page_crawls from pages WHERE status=Discovered → #8 Queue population
domain column → #11 Per-domain politeness
status_code / error_message / language → #12 HTTP fetcher
keywords_tsv, dropping body → #13 Language detection (also reads pages.language)
Acceptance
language to pages (commit b1b7ade)
App\Enums\CrawlOutcomeEnum with the 6 cases (commit 9dd6d84)
page_crawls with columns + partial indexes above
App\Models\PageCrawl with factory + tests
Page::crawls() (HasMany) + Page::latestCrawl() (HasOne via latestOfMany on created_at) + tests
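The page_crawls item above could be sketched as a Laravel migration along the following lines. This is an illustration under stated assumptions, not the committed migration: the column types, nullability, and index names are guesses, since this issue names the columns but not their definitions. The partial and DESC indexes are created with raw PostgreSQL statements rather than through the schema builder.

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::create('page_crawls', function (Blueprint $table) {
            $table->id();
            // Deliberately no unique(page_id): one row per crawl attempt.
            $table->foreignId('page_id')->constrained('pages');
            $table->string('domain');                       // assumed type
            $table->timestamp('scheduled_for')->nullable(); // assumed nullable
            $table->timestamp('locked_at')->nullable();     // set when a worker claims the row
            $table->timestamp('completed_at')->nullable();
            $table->string('outcome')->nullable();          // NULL = pending or in flight
            $table->unsignedSmallInteger('status_code')->nullable(); // assumed type
            $table->text('error_message')->nullable();               // assumed type
            $table->timestamps();                           // created_at = attempt inserted
        });

        // Index names below are hypothetical.
        DB::statement(
            'CREATE INDEX page_crawls_pending_domain_idx
             ON page_crawls (domain) WHERE outcome IS NULL'
        );
        DB::statement(
            'CREATE INDEX page_crawls_pending_poll_idx
             ON page_crawls (scheduled_for, locked_at) WHERE outcome IS NULL'
        );
        DB::statement(
            'CREATE INDEX page_crawls_page_history_idx
             ON page_crawls (page_id, created_at DESC)'
        );
    }

    public function down(): void
    {
        // Dropping the table also drops its indexes.
        Schema::dropIfExists('page_crawls');
    }
};
```

Keeping outcome as a nullable string leaves the enum cast on the model as the only place the six values are enforced; a database-level check constraint would be a separate decision.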