Re-crawl scheduler for stale and failed pages #34
Context
Pages are currently crawled once and never revisited. Content changes, pages go offline, servers recover from outages. Without re-crawling:
- keywords_tsv goes stale
- Pages with status=failed stay failed forever
Acceptance criteria
- Re-crawl status=fetched pages older than N days (configurable via CRAWLER_RECRAWL_INTERVAL_DAYS, default 30)
- Re-crawl status=failed pages with failed_at older than M days (configurable, default 7), giving transient failures a retry window
- Each re-crawl creates a new page_crawls row (existing pattern: one row per attempt)
- Pages with status=rejected are NOT re-crawled (non-HTML content is unlikely to change type)
- Skip pages that already have an in-flight page_crawls row (outcome IS NULL)
- Schedule with daily()->withoutOverlapping(60); re-crawling is background work, not time-sensitive
Notes
ProcessCrawlJob already updates pages.status, fetched_at, and will write new keywords_tsv (once that ticket lands). No special re-crawl handling is needed in the job.
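The selection rules in the acceptance criteria can be sketched as a single eligibility check. This is a minimal sketch in plain PHP, not the project's implementation: the function name isDueForRecrawl, the array shape of $page, and the has_pending_crawl flag (standing in for "a page_crawls row with outcome IS NULL exists") are all assumptions made for illustration. Only the status values, the fetched_at and failed_at columns, and the 30 and 7 day defaults come from the ticket.

```php
<?php
declare(strict_types=1);

/**
 * Decide whether a page is due for a re-crawl.
 * Hypothetical helper: the keys mirror column names from the ticket,
 * but the array shape and the has_pending_crawl flag are assumptions.
 */
function isDueForRecrawl(
    array $page,
    DateTimeImmutable $now,
    int $recrawlIntervalDays = 30, // default of CRAWLER_RECRAWL_INTERVAL_DAYS
    int $failedRetryDays = 7       // default retry window for failed pages
): bool {
    // An attempt with outcome IS NULL is still in flight; do not queue another.
    if ($page['has_pending_crawl'] ?? false) {
        return false;
    }

    switch ($page['status']) {
        case 'fetched':
            // Due once fetched_at is older than the re-crawl interval.
            $cutoff = $now->sub(new DateInterval("P{$recrawlIntervalDays}D"));
            return new DateTimeImmutable($page['fetched_at']) < $cutoff;
        case 'failed':
            // Due once failed_at is older than the retry window.
            $cutoff = $now->sub(new DateInterval("P{$failedRetryDays}D"));
            return new DateTimeImmutable($page['failed_at']) < $cutoff;
        default:
            // status=rejected (and anything else) is never re-crawled.
            return false;
    }
}
```

In the real command the same conditions would more likely be expressed as a database query, so that only due pages are loaded; the function form is used here only to make each rule easy to read and test in isolation.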