Re-crawl scheduler for stale and failed pages #34

Open

opened 2026-04-29 23:56:33 +02:00 by myrmidex · 0 comments

Owner

Context

Pages are currently crawled once and never revisited. Content changes, pages go offline, servers recover from outages. Without re-crawling:

- keywords_tsv goes stale
- Pages that were temporarily down stay status=failed forever
- Pages that are permanently gone (404/410) stay in results

Acceptance criteria

- Scheduled command (or extension of existing queue logic) that re-enqueues pages for crawling:
  - status=fetched pages older than N days (configurable via CRAWLER_RECRAWL_INTERVAL_DAYS, default 30)
  - status=failed pages with failed_at older than M days (configurable, default 7), to give transient failures a retry window
- Re-crawl inserts a new page_crawls row (existing pattern: one row per attempt)
- Pages with status=rejected are NOT re-crawled (non-HTML content is unlikely to change type)
- Deduplication: don't enqueue a page that already has a pending page_crawls row (outcome IS NULL)
- Schedule: daily()->withoutOverlapping(60); re-crawling is background work, not time-sensitive
- Tests: pages past the stale threshold get enqueued; pages within threshold don't; rejected pages skipped; pending pages skipped (idempotent)
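
To make the selection rules above concrete, here is a minimal, language-neutral sketch of the eligibility check, written in Python purely for illustration (the real command would live in the existing Laravel codebase). Several things here are assumptions, not part of the issue: the `Page` class, the `should_recrawl` function, the `FAILED_RETRY_DAYS` name, the `has_pending_crawl` flag (standing in for "a page_crawls row with outcome IS NULL exists"), and the choice to measure "older than N days" from `fetched_at`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Defaults taken from the issue. FAILED_RETRY_DAYS is a hypothetical name:
# the issue only says "configurable, default 7".
RECRAWL_INTERVAL_DAYS = 30  # mirrors CRAWLER_RECRAWL_INTERVAL_DAYS
FAILED_RETRY_DAYS = 7


@dataclass
class Page:
    """Hypothetical in-memory stand-in for a row in the pages table."""
    status: str
    fetched_at: Optional[datetime] = None
    failed_at: Optional[datetime] = None
    # Assumption: True when a page_crawls row with outcome IS NULL exists.
    has_pending_crawl: bool = False


def should_recrawl(
    page: Page,
    now: datetime,
    stale_days: int = RECRAWL_INTERVAL_DAYS,
    retry_days: int = FAILED_RETRY_DAYS,
) -> bool:
    """Return True if the page should get a new crawl attempt."""
    # Deduplication: a pending attempt means the page is already enqueued.
    if page.has_pending_crawl:
        return False
    # Fetched pages: re-crawl once the last fetch is older than the interval.
    # Assumption: staleness is measured from fetched_at.
    if page.status == "fetched":
        cutoff = now - timedelta(days=stale_days)
        return page.fetched_at is not None and page.fetched_at < cutoff
    # Failed pages: retry once the failure is older than the retry window.
    if page.status == "failed":
        cutoff = now - timedelta(days=retry_days)
        return page.failed_at is not None and page.failed_at < cutoff
    # Rejected (and any other status) is never re-crawled.
    return False
```

The four test cases in the last criterion map directly onto this predicate: a page past its threshold returns True, a page within it returns False, a rejected page returns False regardless of age, and a page with a pending attempt returns False, so running the command twice enqueues nothing new. In the actual implementation the same conditions would more likely be expressed as a single database query than as a per-row check.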

Notes

- On re-crawl success: ProcessCrawlJob already updates pages.status, fetched_at, and will write new keywords_tsv (once that ticket lands). No special re-crawl handling needed in the job.
- Start conservative on intervals: 30 days for fetched, 7 days for failed. Tune based on real-world queue depth once live.
