Exclude permanently failed pages from search results #35
Context
A page that returns 404 or fails repeatedly will have `status=failed` and a null `keywords_tsv`. The search query already filters on `status=fetched` and a non-null `keywords_tsv`, so these pages won't appear in results today. However, there's no mechanism to mark a page as permanently gone vs. transiently failed, and no admin visibility into the failure rate.

Acceptance criteria
- `outcome IN (blocked_4xx)`: 4xx means gone, not transient
- `status=dead` on `PageStatusEnum` for pages that have crossed the permanent failure threshold
- `ProcessCrawlJob` promotes the page to `status=dead` when the threshold is crossed (check the `page_crawls` count with `blocked_4xx` outcome for this page)
- `dead` pages are excluded from search. The search query already filters on `status=fetched`, so dead pages are safe as long as they never reach fetched, but add an explicit `status != dead` guard for clarity.

Notes
- `blocked_robots` outcomes should NOT count toward dead: robots.txt can change
- `timeout` and `blocked_5xx` are transient: don't count toward dead
- `blocked_4xx` is semantically "this page is gone"
- `dead` pages entirely
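
The promotion rule and the search guard described above could be sketched roughly as follows. This is a minimal, framework-free illustration in Python, not the project's actual code: `DEAD_THRESHOLD`, `count_permanent_failures`, `next_status`, and `is_searchable` are hypothetical names, and the threshold value of 3 is an assumption, since the issue does not fix a number. Only `blocked_4xx` counts toward the threshold; `blocked_robots`, `timeout`, and `blocked_5xx` are ignored.

```python
from enum import Enum
from typing import Iterable, Optional

# Assumption: the issue does not specify the threshold; 3 is a placeholder.
DEAD_THRESHOLD = 3

# Only 4xx outcomes are treated as permanent. blocked_robots, timeout,
# and blocked_5xx are deliberately excluded, per the notes above.
PERMANENT_OUTCOMES = frozenset({"blocked_4xx"})


class PageStatus(str, Enum):
    """Illustrative stand-in for PageStatusEnum (members beyond these are unknown)."""
    FETCHED = "fetched"
    FAILED = "failed"
    DEAD = "dead"


def count_permanent_failures(outcomes: Iterable[str]) -> int:
    """Count crawl outcomes for one page that indicate the page is gone."""
    return sum(1 for outcome in outcomes if outcome in PERMANENT_OUTCOMES)


def next_status(
    current: PageStatus,
    outcomes: Iterable[str],
    threshold: int = DEAD_THRESHOLD,
) -> PageStatus:
    """Promote to DEAD once enough blocked_4xx outcomes exist; else keep status."""
    if count_permanent_failures(outcomes) >= threshold:
        return PageStatus.DEAD
    return current


def is_searchable(status: PageStatus, keywords_tsv: Optional[str]) -> bool:
    """Mirror the search filter, with the explicit (redundant) dead guard."""
    return (
        status is PageStatus.FETCHED
        and status is not PageStatus.DEAD  # explicit guard, for clarity only
        and keywords_tsv is not None
    )
```

In a real job, the `outcomes` argument would presumably come from the page's `page_crawls` rows, and the guard in `is_searchable` would be an extra condition in the search query rather than a Python function.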