Exclude permanently failed pages from search results #35

Closed
opened 2026-04-29 23:57:01 +02:00 by myrmidex · 0 comments
Owner

Context

A page that returns 404 or fails repeatedly will have status=failed and null keywords_tsv. The search query already filters on status=fetched and non-null keywords_tsv, so these pages won't appear in results today. However, there is no mechanism to mark a page as permanently gone rather than transiently failed, and no admin visibility into the failure rate.

Acceptance criteria

  • Define "permanently failed" threshold: N consecutive failed crawls (e.g. 3) with outcome IN (blocked_4xx) — 4xx means gone, not transient
  • New status=dead on PageStatusEnum for pages that have crossed the permanent failure threshold
  • ProcessCrawlJob promotes the page to status=dead when the threshold is crossed (check whether the last N page_crawls rows for this page all have the blocked_4xx outcome)
  • dead pages are excluded from search. The search query already filters on status=fetched, so dead pages stay out of results as long as they never reach fetched, but add an explicit status != dead guard for clarity
  • Admin instances page (or a new admin pages view) shows count of dead pages per instance
  • Tests: page crosses 4xx threshold → status promoted to dead; 5xx does NOT trigger dead (transient); dead pages excluded from search query
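
The threshold and promotion criteria above can be sketched as follows. This is a minimal illustration in Python, not the project's actual code: the names `PageStatus`, `DEAD_THRESHOLD`, `crossed_dead_threshold`, and `next_status` are hypothetical, the threshold of 3 is the example value from the first criterion, and the sketch assumes "consecutive" means that any outcome other than blocked_4xx breaks the streak.

```python
from enum import Enum


class PageStatus(Enum):
    # Only the statuses named in this issue; the real enum may have more.
    FETCHED = "fetched"
    FAILED = "failed"
    DEAD = "dead"


# Hypothetical constants: N = 3 is the example value from the issue.
DEAD_THRESHOLD = 3
PERMANENT_OUTCOMES = frozenset({"blocked_4xx"})


def crossed_dead_threshold(outcomes_newest_first, threshold=DEAD_THRESHOLD):
    """Return True if the most recent `threshold` crawls all ended in a
    permanent outcome. Assumes the list is ordered newest first."""
    recent = outcomes_newest_first[:threshold]
    return len(recent) == threshold and all(
        outcome in PERMANENT_OUTCOMES for outcome in recent
    )


def next_status(current, outcomes_newest_first):
    """Promote to DEAD once the threshold is crossed, else keep the status."""
    if crossed_dead_threshold(outcomes_newest_first):
        return PageStatus.DEAD
    return current
```

With these definitions, three blocked_4xx outcomes in a row promote a failed page to dead, while a blocked_5xx or timeout among the last three crawls leaves the status unchanged.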

Notes

  • blocked_robots outcomes should NOT count toward dead — robots.txt can change
  • timeout and blocked_5xx are transient — don't count toward dead
  • Only blocked_4xx is semantically "this page is gone"
  • Re-crawl scheduler (previous ticket) should skip dead pages entirely
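
The search guard, the per-instance dead count, and the scheduler skip described in this issue could look roughly like the queries below. This is a sketch under assumptions: the `pages` table and its `instance_id` column are hypothetical, and an in-memory SQLite database with a plain text `keywords_tsv` column stands in for the real schema so the example runs on its own.

```python
import sqlite3

# In-memory stand-in for the real database; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pages (
        id INTEGER PRIMARY KEY,
        instance_id INTEGER NOT NULL,
        status TEXT NOT NULL,
        keywords_tsv TEXT
    );
    INSERT INTO pages (id, instance_id, status, keywords_tsv) VALUES
        (1, 1, 'fetched', 'alpha'),
        (2, 1, 'failed', NULL),
        (3, 1, 'dead', NULL),
        (4, 2, 'dead', NULL),
        (5, 2, 'dead', NULL);
    """
)

# Search: the existing filters plus the explicit (redundant) dead guard.
SEARCHABLE = """
    SELECT id FROM pages
    WHERE status = 'fetched'
      AND status != 'dead'
      AND keywords_tsv IS NOT NULL
    ORDER BY id
"""

# Admin view: number of dead pages per instance.
DEAD_PER_INSTANCE = """
    SELECT instance_id, COUNT(*) FROM pages
    WHERE status = 'dead'
    GROUP BY instance_id
    ORDER BY instance_id
"""

# Re-crawl scheduler: dead pages are never candidates.
RECRAWL_CANDIDATES = "SELECT id FROM pages WHERE status != 'dead' ORDER BY id"

searchable_ids = [row[0] for row in conn.execute(SEARCHABLE)]
dead_counts = dict(conn.execute(DEAD_PER_INSTANCE).fetchall())
recrawl_ids = [row[0] for row in conn.execute(RECRAWL_CANDIDATES)]
```

The explicit `status != 'dead'` condition adds nothing logically next to `status = 'fetched'`; it is kept only because the acceptance criteria ask for it as a readability guard.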
myrmidex added this to the v0.2 milestone 2026-04-29 23:57:01 +02:00
myrmidex self-assigned this 2026-04-29 23:57:01 +02:00
myrmidex added the enhancement label 2026-05-01 01:02:00 +02:00
Reference: lvl0/trove#35