Re-crawl scheduler for stale and failed pages #34

Open
opened 2026-04-29 23:56:33 +02:00 by myrmidex · 0 comments
Owner

Context

Pages are currently crawled once and never revisited. Content changes, pages go offline, servers recover from outages. Without re-crawling:

  • keywords_tsv goes stale
  • Pages that were temporarily down stay status=failed forever
  • Pages that are permanently gone (404/410) stay in results

Acceptance criteria

  • Scheduled command (or extension of existing queue logic) that re-enqueues pages for crawling:
    • status=fetched pages older than N days (configurable via CRAWLER_RECRAWL_INTERVAL_DAYS, default 30)
    • status=failed pages with failed_at older than M days (configurable, default 7) — give transient failures a retry window
  • Re-crawl inserts a new page_crawls row (existing pattern — one row per attempt)
  • Pages with status=rejected are NOT re-crawled (non-HTML content is unlikely to change type)
  • Deduplication: don't enqueue a page that already has a pending page_crawls row (outcome IS NULL)
  • Schedule: daily()->withoutOverlapping(60) — re-crawling is background work, not time-sensitive
  • Tests: pages past the stale threshold get enqueued; pages within threshold don't; rejected pages skipped; pending pages skipped (idempotent)
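
The selection rules above can be sketched independently of the framework. The Python below only illustrates the filtering logic and is not the scheduled command itself. The `Page` record, `select_pages_to_recrawl`, and the `retry_failed_days` parameter name are hypothetical. It also assumes that "older than N days" for fetched pages is measured from `fetched_at`, which the issue does not state explicitly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Page:
    """Hypothetical stand-in for a row in the pages table."""
    id: int
    status: str  # "fetched", "failed", or "rejected"
    fetched_at: Optional[datetime] = None
    failed_at: Optional[datetime] = None


def select_pages_to_recrawl(
    pages: Iterable[Page],
    pending_page_ids: Set[int],
    now: datetime,
    recrawl_interval_days: int = 30,  # CRAWLER_RECRAWL_INTERVAL_DAYS default
    retry_failed_days: int = 7,       # hypothetical name; default from the issue
) -> List[int]:
    """Return ids of pages that should get a new page_crawls row."""
    stale_before = now - timedelta(days=recrawl_interval_days)
    retry_before = now - timedelta(days=retry_failed_days)
    selected: List[int] = []
    for page in pages:
        # Deduplication: a pending crawl (outcome IS NULL) already exists.
        if page.id in pending_page_ids:
            continue
        if page.status == "fetched":
            # Assumption: staleness is measured from fetched_at.
            if page.fetched_at is not None and page.fetched_at < stale_before:
                selected.append(page.id)
        elif page.status == "failed":
            if page.failed_at is not None and page.failed_at < retry_before:
                selected.append(page.id)
        # Any other status, including "rejected", is never re-crawled.
    return selected


if __name__ == "__main__":
    now = datetime(2026, 5, 1, tzinfo=timezone.utc)
    pages = [
        Page(1, "fetched", fetched_at=now - timedelta(days=31)),   # stale
        Page(2, "fetched", fetched_at=now - timedelta(days=29)),   # fresh
        Page(3, "failed", failed_at=now - timedelta(days=8)),      # retry
        Page(4, "failed", failed_at=now - timedelta(days=6)),      # too recent
        Page(5, "rejected", fetched_at=now - timedelta(days=400)), # skipped
    ]
    print(select_pages_to_recrawl(pages, pending_page_ids=set(), now=now))
```

Running the selection a second time with the just-enqueued ids added to `pending_page_ids` returns an empty list, which is the idempotency the tests are meant to check. In the actual command the same filter would more likely be one database query with a "no pending `page_crawls` row" condition rather than a loop in application code.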

Notes

  • On re-crawl success: ProcessCrawlJob already updates pages.status and fetched_at, and will write the new keywords_tsv once that ticket lands. No special re-crawl handling is needed in the job.
  • Start conservative on intervals — 30 days for fetched, 7 days for failed. Tune based on real-world queue depth once live.
myrmidex added this to the v0.2 milestone 2026-04-29 23:56:33 +02:00
myrmidex self-assigned this 2026-04-29 23:56:33 +02:00
myrmidex added the enhancement label 2026-05-01 01:02:00 +02:00
Reference: lvl0/trove#34